Python Data Science: A Concise Tutorial
Python - Reading HTML Pages
HTML pages can be read and parsed with a library known as Beautiful Soup. Using this library, we can search for the values of HTML tags and extract specific data, such as the title of the page and the list of headers in the page.
Install Beautiful Soup
Use the Anaconda package manager to install the required package and its dependencies.
conda install beautifulsoup4
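Once the package is installed, it is imported under the name bs4. The following is a minimal sketch of a smoke test, assuming Python 3 and a working installation; the one-line HTML string is a made-up sample.

```python
# Beautiful Soup 4 is imported under the package name bs4.
import bs4
from bs4 import BeautifulSoup

# Parse a trivial, made-up document with Python's built-in parser.
soup = BeautifulSoup("<p>ok</p>", "html.parser")

# The installed version is exposed as a string on the package.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
print(soup.p.string)
```

If the import fails, the package is probably not installed in the active environment.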
Reading the HTML file
In the example below, we request a URL and load its contents into the Python environment. We then pass the 'html.parser' argument to BeautifulSoup to parse the entire HTML file. Finally, we print the first few lines of the formatted page.
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Fetch the HTML file (in Python 2, use urllib2.urlopen instead)
response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
# Parse the HTML file
soup = BeautifulSoup(html_doc, 'html.parser')
# Format the parsed HTML file
strhtm = soup.prettify()
# Print the first 225 characters
print(strhtm[:225])
Executing the above code produces the following result.
<!DOCTYPE html>
<!--[if IE 8]><html class="ie ie8"> <![endif]-->
<!--[if IE 9]><html class="ie ie9"> <![endif]-->
<!--[if gt IE 9]><!-->
<html>
<!--<![endif]-->
<head>
<!-- Basic -->
<meta charset="utf-8"/>
<title>
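The same parse-and-format steps can be tried without a network connection. The following is a minimal sketch, assuming Python 3; the HTML string is a hand-written sample standing in for a downloaded page, not the tutorialspoint document.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
html_doc = (
    "<html><head><title>Sample Page</title></head>"
    "<body><p>Hello</p></body></html>"
)

# Parse with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns a string with each tag and text node on its own line,
# indented according to nesting depth.
strhtm = soup.prettify()

# Slice by lines instead of by a fixed character count.
first_lines = strhtm.splitlines()[:3]
print("\n".join(first_lines))
```

Slicing by lines avoids cutting the output in the middle of a tag, which a fixed character count can do.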
Extracting Tag Value
We can extract the value of the first instance of a tag using the following code.
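from urllib.request import urlopen
from bs4 import BeautifulSoup
# Fetch and parse the page (in Python 2, use urllib2.urlopen instead)
response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')
# soup.<tag> returns the first instance of that tag
print(soup.title)
# .string returns the text inside the tag, or None if the tag has no single text child
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)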
Executing the above code produces the following result.
<title>Python Overview</title>
Python Overview
None
Python is Interpreted
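The None in the output deserves a note. The .string attribute returns text only when a tag has exactly one child; for a tag with several children, or with none, it returns None. One likely reason for the None here is that the first link on the page wraps something other than plain text. The following is a minimal sketch of this behavior, assuming Python 3; the HTML snippet is hypothetical and not taken from the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <a> wraps an image plus text,
# while each <b> wraps plain text only.
html_doc = """
<html><head><title>Sample Page</title></head>
<body>
<a href="/home"><img src="logo.png"/> Home</a>
<b>First bold</b>
<b>Second bold</b>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access (soup.title, soup.a, soup.b) returns the first matching tag.
title_text = soup.title.string           # single text child, so its text is returned
link_string = soup.a.string              # two children (<img> and text), so None
link_text = soup.a.get_text(strip=True)  # all nested text, whitespace trimmed
first_bold = soup.b.string               # first <b> only

print(title_text)
print(link_string)
print(link_text)
print(first_bold)
```

When the structure of a tag is not known in advance, get_text() is usually the safer choice because it collects nested text instead of returning None.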
Extracting All Tags
We can extract the values of all instances of a tag using the following code.
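from urllib.request import urlopen
from bs4 import BeautifulSoup
# Fetch and parse the page (in Python 2, use urllib2.urlopen instead)
response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')
# find_all returns every <b> tag in document order
for x in soup.find_all('b'):
    print(x.string)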
Executing the above code produces the following result.
Python is Interpreted
Python is Interactive
Python is Object-Oriented
Python is a Beginner's Language
Easy-to-learn
Easy-to-read
Easy-to-maintain
A broad standard library
Interactive Mode
Portable
Extendable
Databases
GUI Programming
Scalable
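The introduction mentions collecting the list of headers in a page. Since find_all also accepts a list of tag names, that can be sketched as follows, assuming Python 3; the sample HTML and the choice of h1 to h3 are illustrative assumptions, not taken from the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with a few headings.
html_doc = """
<html><body>
<h1>Python Overview</h1>
<h2>History</h2>
<p>Some <b>bold</b> text.</p>
<h2>Features</h2>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Passing a list matches any of the named tags, in document order.
headers = [tag.get_text(strip=True) for tag in soup.find_all(["h1", "h2", "h3"])]
print(headers)

# The single-tag form used above, collected into a list instead of printed.
bold = [b.string for b in soup.find_all("b")]
print(bold)
```

Collecting the results into lists makes them easy to pass on to later processing steps.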