Beautiful Soup: A Concise Tutorial
Beautiful Soup - Parsing Tables
Besides free text, an HTML document may hold structured data in the form of HTML tables. With Beautiful Soup, we can extract tabular data into Python objects such as lists or dictionaries, store it in a database or spreadsheet if required, and process it further. In this chapter, we shall parse an HTML table using Beautiful Soup.
Although Beautiful Soup doesn't have any special function or method for extracting table data, we can achieve it with simple scraping techniques. Like any table, say in SQL or a spreadsheet, an HTML table consists of rows and columns.
HTML uses the <table> tag to build a tabular structure. Nested inside it are one or more <tr> tags, one for each row. Each row consists of <td> tags that hold the data in each cell of the row. The first row is usually used for column headings, which are placed in <th> tags instead of <td>.
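The nesting described above can be inspected directly. The following is a minimal sketch, assuming a small hypothetical table held in a string named `sample` (not part of the original chapter); it records the tag name and text of every cell, row by row.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table used only for illustration
sample = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Ravi</td><td>23</td></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")
table = soup.find("table")

# Each <tr> is one row; its cells are <th> (headings) or <td> (data)
structure = []
for tr in table.find_all("tr"):
    cells = tr.find_all(["th", "td"])
    structure.append([(cell.name, cell.get_text(strip=True)) for cell in cells])

for row in structure:
    print(row)
```

The first printed row contains only `th` cells and the second only `td` cells, which is the distinction the parsing steps in this chapter rely on.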
The following HTML script renders a simple table in the browser window −
<html>
<body>
<h2>Beautiful Soup - Parse Table</h2>
<table border="1">
<tr>
<th>Name</th>
<th>Age</th>
<th>Marks</th>
</tr>
<tr class='data'>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr class='data'>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>
Note that the data rows carry the CSS class data, which distinguishes them from the header row.
We shall now see how to parse the table data. First, we obtain the document tree in a BeautifulSoup object. Then we collect all the column headers in a list.
Example
from bs4 import BeautifulSoup

# markup is a string holding the HTML document shown above
soup = BeautifulSoup(markup, "html.parser")
tbltag = soup.find('table')

# Collect the text of every <th> cell as the column headers
headers = [h.get_text(strip=True) for h in tbltag.find_all('th')]
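Cell text can be read with either `.string` or `get_text()`, and the two differ when a cell contains nested tags. The sketch below illustrates the difference, assuming a hypothetical header row in which one heading wraps part of its text in a `<small>` tag.

```python
from bs4 import BeautifulSoup

# Hypothetical header row: the second heading contains a nested tag
sample = "<table><tr><th>Name</th><th>Marks <small>(%)</small></th></tr></table>"
soup = BeautifulSoup(sample, "html.parser")
cells = soup.find_all("th")

# .string is None when a tag has more than one child node
by_string = [th.string for th in cells]

# get_text() joins all nested strings; " " is the separator between pieces
by_get_text = [th.get_text(" ", strip=True) for th in cells]

print(by_string)
print(by_get_text)
```

For plain cells such as those in this chapter's table, both approaches give the same text; `get_text()` is the safer choice when the markup is not under your control.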
The data rows, marked with the class='data' attribute and following the header row, are then fetched. For each row, a dictionary is built with the column headers as keys and the corresponding cell values as values, and it is appended to a list of dictionaries.
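# Select only the data rows that belong to this table
rows = tbltag.find_all('tr', class_='data')
trows = []
for tr in rows:
    cells = tr.find_all('td')
    # Pair each header with the cell in the same column
    row = {h: td.get_text(strip=True) for h, td in zip(headers, cells)}
    trows.append(row)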
The list of dictionaries is collected in trows. You can then use it for different purposes, such as storing it in a SQL table or saving it as JSON or as a pandas DataFrame object.
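As an illustration of these uses, the sketch below serializes the list to JSON and stores it in an in-memory SQLite database using only the standard library. The `trows` literal is assumed to match the result of the parsing step, and the table name `students` and its column types are hypothetical choices.

```python
import json
import sqlite3

# Assumed to be the result of the parsing step above
trows = [
    {"Name": "Ravi", "Age": "23", "Marks": "67"},
    {"Name": "Anil", "Age": "27", "Marks": "84"},
]

# Serialize to a JSON string (json.dump would write it to a file instead)
as_json = json.dumps(trows, indent=2)

# Store in an in-memory SQLite table; the name "students" is hypothetical
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (Name TEXT, Age INTEGER, Marks INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (:Name, :Age, :Marks)",
    trows,  # each dict supplies the named parameters for one row
)
conn.commit()

stored = conn.execute(
    "SELECT Name, Age, Marks FROM students ORDER BY Name"
).fetchall()
print(as_json)
print(stored)
conn.close()
```

Because the parsed values are strings, any numeric conversion has to happen either in the database, as the INTEGER columns do here, or in Python before saving. If pandas is installed, the same list can be passed to `pandas.DataFrame(trows)`.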
The complete code is given below −
markup = """
<html>
<body>
<p>Beautiful Soup - Parse Table</p>
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>Marks</th>
</tr>
<tr class='data'>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr class='data'>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "html.parser")
tbltag = soup.find('table')
# Collect the text of every <th> cell as the column headers
headers = [h.get_text(strip=True) for h in tbltag.find_all('th')]
print(headers)
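# Select only the data rows that belong to this table
rows = tbltag.find_all('tr', class_='data')
trows = []
for tr in rows:
    cells = tr.find_all('td')
    # Pair each header with the cell in the same column
    row = {h: td.get_text(strip=True) for h, td in zip(headers, cells)}
    trows.append(row)
print(trows)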
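Many real pages do not mark their data rows with a class. The same steps can be wrapped in a function that skips any row without <td> cells instead. This is a sketch, not part of the original chapter: the name `parse_table` is hypothetical, and it assumes that <th> cells appear only in the header row and that the table has no merged cells (colspan or rowspan) or nested tables.

```python
from bs4 import BeautifulSoup

def parse_table(table):
    """Return the rows of a <table> tag as a list of dicts keyed by header text."""
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    records = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:  # header row or empty row: nothing to record
            continue
        values = [td.get_text(strip=True) for td in cells]
        records.append(dict(zip(headers, values)))
    return records

# Hypothetical markup without any class attribute on the rows
sample = """
<table>
  <tr><th>Name</th><th>Age</th><th>Marks</th></tr>
  <tr><td>Ravi</td><td>23</td><td>67</td></tr>
  <tr><td>Anil</td><td>27</td><td>84</td></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")
records = parse_table(soup.find("table"))
print(records)
```

Note that `zip` stops at the shorter sequence, so a row with fewer cells than headers yields a dictionary with fewer keys, and extra cells are ignored.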