Beautiful Soup: A Concise Tutorial
Beautiful Soup - Searching the Tree
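In this chapter, we discuss the methods Beautiful Soup provides for navigating the HTML document tree in different directions: going down, going up, sideways, and back and forth.
All the examples in this chapter use the following HTML string: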
html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>
<p class="tutorial">...</p>
"""
The name of the tag you want can be used to navigate the parse tree. For example, soup.head fetches the <head> element:
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print (soup.head.prettify())
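Tag names can also be chained, and looking up a tag by name returns only the first matching tag. The sketch below illustrates this under the assumption of a shortened copy of the chapter's HTML; the variable names doc, title_tag, first_link and missing are introduced here for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the chapter's HTML, used only for this illustration
doc = """<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a>
</p>
</body></html>"""

soup = BeautifulSoup(doc, 'html.parser')

# Chain tag names to step down the tree one level at a time
title_tag = soup.head.title
print(title_tag)

# A tag name gives only the first matching tag in the document
first_link = soup.a
print(first_link['id'])

# A tag name that does not occur in the document gives None
missing = soup.table
print(missing)
```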
Going down
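A tag may contain strings or other tags nested inside it. The .contents property of a Tag object returns a list of all its child elements; wrapping the .children iterator in list(), as the examples here do, produces the same list.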
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.head
print (list(tag.children))
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.body
print (list(tag.children))
Output
['\n', <p class="title"><b>Online Tutorials Library</b></p>, '\n',
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
Enhance your Programming skills.</p>, '\n', <p class="tutorial">...</p>, '\n']
Instead of getting the children as a list, you can iterate over them with the .children generator:
Output
<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
Enhance your Programming skills.</p>
<p class="tutorial">...</p>
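The loop that produces output like the above can be sketched as follows. It assumes a shortened copy of the chapter's HTML, and skipping the newline strings between tags is a choice made for this sketch rather than something the original states; the names doc and child_tags are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened copy of the chapter's HTML, used only for this illustration
doc = """<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>
<p class="tutorial">...</p>
</body></html>"""

soup = BeautifulSoup(doc, 'html.parser')
tag = soup.body

# .children yields the direct children one at a time
child_tags = []
for child in tag.children:
    # Keep only Tag objects; the newline strings between the <p> tags are skipped
    if isinstance(child, Tag):
        child_tags.append(child)
        print(child)
```

Without the isinstance check, the loop would also print the newline strings that sit between the <p> tags.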
.descendants
The BeautifulSoup object sits at the top of the hierarchy of all tags. Its .descendants property therefore includes every element in the HTML string.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print (soup.descendants)
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.head
for element in tag.descendants:
    print (element)
Output
<title>TutorialsPoint</title>
TutorialsPoint
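The head tag contains a title tag, which in turn contains a NavigableString object, TutorialsPoint. The <head> tag has only one child but two descendants: the <title> tag and the <title> tag's child. The BeautifulSoup object, by contrast, has only one direct child tag (the <html> tag) but many descendants.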
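The difference between children and descendants can be checked by counting them. This is a minimal sketch that assumes a one-line HTML string with no whitespace between tags (with the chapter's multi-line string, the newline strings would be counted as well); all variable names are illustrative.

```python
from bs4 import BeautifulSoup

# One-line HTML without whitespace between tags, so only tags and text are counted
doc = "<html><head><title>TutorialsPoint</title></head><body><p>...</p></body></html>"
soup = BeautifulSoup(doc, 'html.parser')

# <head> has one child (<title>) and two descendants (<title> and its string)
head_children = list(soup.head.children)
head_descendants = list(soup.head.descendants)
print(len(head_children), len(head_descendants))

# The BeautifulSoup object has one child (<html>) but many descendants
soup_children = list(soup.children)
soup_descendants = list(soup.descendants)
print(len(soup_children), len(soup_descendants))
```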
Going Up
Just as the children and descendants properties navigate down the document, Beautiful Soup provides the .parent and .parents properties to navigate up from a tag.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.title
print (tag.parent)
Output
<head><title>TutorialsPoint</title></head>
Because the title tag contains a string (a NavigableString), the parent of that string is the title tag itself.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.title
string = tag.string
print (string.parent)
.parents
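You can iterate over all of an element's parents with .parents. Starting from an <a> tag buried deep in the document, .parents travels all the way to the top of the document. For the first <a> tag in the example HTML string, the parents are the enclosing <p> tag, then <body>, then <html>, and finally the BeautifulSoup object itself.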
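A minimal sketch of this upward walk, assuming a shortened copy of the chapter's HTML (the names doc and parent_names are introduced here for illustration):

```python
from bs4 import BeautifulSoup

# Shortened copy of the chapter's HTML, used only for this illustration
doc = """<html><head><title>TutorialsPoint</title></head>
<body>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a>
</p>
</body></html>"""

soup = BeautifulSoup(doc, 'html.parser')
tag = soup.a  # the first <a> tag in the document

# Walk from the <a> tag up to the top of the tree, recording each parent's name
parent_names = []
for parent in tag.parents:
    parent_names.append(parent.name)
    print(parent.name)
```

The last name printed is [document], which is the name Beautiful Soup gives to the BeautifulSoup object at the top of the tree.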
Sideways
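Tags that share the same parent, and so appear at the same indentation level when the HTML is pretty-printed, are called siblings. Consider the following HTML snippet:
<p>
   <b>
      Hello
   </b>
   <i>
      Python
   </i>
</p>
Inside the outer <p> tag, the <b> and <i> tags are at the same indentation level, so they are siblings. Beautiful Soup makes it possible to navigate between tags at the same level.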
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
tag1 = soup.b
print ("next:",tag1.next_sibling)
tag2 = soup.i
print ("previous:",tag2.previous_sibling)
Output
next: <i>Python</i>
previous: <b>Hello</b>
Because the <b> tag has no sibling to its left and the <i> tag has no sibling to its right, None is returned in both cases.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
tag1 = soup.b
print ("previous:",tag1.previous_sibling)
tag2 = soup.i
print ("next:",tag2.next_sibling)
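The examples above use HTML written on a single line. When the markup is indented, the whitespace between tags is itself a sibling string, which can be surprising. The sketch below assumes a small indented snippet (the names snippet, raw_sibling and next_tag are illustrative) and uses find_next_sibling() with a tag name to skip over such strings.

```python
from bs4 import BeautifulSoup

# Indented markup: the whitespace between the tags becomes string nodes
snippet = """<p>
  <b>Hello</b>
  <i>Python</i>
</p>"""

soup = BeautifulSoup(snippet, 'html.parser')
tag = soup.b

# The immediate sibling is the whitespace string between </b> and <i>
raw_sibling = tag.next_sibling
print(repr(raw_sibling))

# Asking for the next sibling named "i" skips the whitespace string
next_tag = tag.find_next_sibling("i")
print(next_tag)
```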
.next_siblings and .previous_siblings
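If a tag has two or more siblings to its right or left, you can navigate them with the .next_siblings and .previous_siblings properties, respectively. Both return generator objects, so you can iterate over them with a for loop.
Let us use the following HTML snippet for this purpose:
<p>
   <b>
      Excellent
   </b>
   <i>
      Python
   </i>
   <u>
      Tutorial
   </u>
</p>
Use the following code to traverse the following and preceding sibling tags.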
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')
tag1 = soup.b
print ("next siblings:")
for tag in tag1.next_siblings:
    print (tag)
print ("previous siblings:")
tag2 = soup.u
for tag in tag2.previous_siblings:
    print (tag)
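The order in which the two generators yield siblings is worth noting: .next_siblings moves away from the tag to the right, while .previous_siblings moves away from it to the left, so the nearest sibling always comes first. A minimal sketch using the same one-line markup (the names following and preceding are illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

# Siblings to the right of <b>, nearest first
following = [sibling.name for sibling in soup.b.next_siblings]
print(following)   # ['i', 'u']

# Siblings to the left of <u>, nearest first (reverse document order)
preceding = [sibling.name for sibling in soup.u.previous_siblings]
print(preceding)   # ['i', 'b']
```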
Back and forth
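In Beautiful Soup, the next_element property returns the next string or tag in the parse tree, while the previous_element property returns the string or tag that comes immediately before. Sometimes next_element and previous_element return the same values as next_sibling and previous_sibling, but not always, because they follow the order in which the document was parsed and can step into a tag's own children.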
Example
html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>
<p class="tutorial">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find("a", id="link3")
print (tag.next_element)
tag = soup.find("a", id="link1")
print (tag.previous_element)
Output
PHP
TutorialsPoint has an excellent collection of tutorials on:
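The next_element of the <a> tag with id="link3" is the string PHP, which is the tag's own first child. Similarly, previous_element returns the string that comes immediately before the <a> tag with id="link1".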
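To make the difference from the sibling properties concrete, the sketch below compares next_element with next_sibling for the same tag. It assumes only the story paragraph from the chapter's HTML; the names doc, inner and after are introduced for illustration.

```python
from bs4 import BeautifulSoup

# Only the relevant part of the chapter's HTML, used for this illustration
doc = """<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>"""

soup = BeautifulSoup(doc, 'html.parser')
tag = soup.find("a", id="link3")

# next_element steps into the tag: it is the string inside <a>
inner = tag.next_element
print(repr(inner))

# next_sibling stays on the same level: it is the text after </a>
after = tag.next_sibling
print(repr(after))
```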
.next_elements and .previous_elements
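These properties of a Tag object return generators that yield, respectively, all the tags and strings that come after it and all those that come before it.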
Next elements example
tag = soup.find("a", id="link1")
for element in tag.next_elements:
    print (element)
Output
Python
,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a>
Java
and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>
PHP
;
Enhance your Programming skills.
<p class="tutorial">...</p>
...
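Because .next_elements yields strings as well as tags, it is often useful to filter the results. The following sketch keeps only the Tag objects that come after the first link; it assumes the two closing paragraphs of the chapter's HTML, and the names doc and later_tags are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# The last two paragraphs of the chapter's HTML, used for this illustration
doc = """<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>
<p class="tutorial">...</p>"""

soup = BeautifulSoup(doc, 'html.parser')
tag = soup.find("a", id="link1")

# Keep only the tags that follow the first link, dropping the strings
later_tags = [el for el in tag.next_elements if isinstance(el, Tag)]
for el in later_tags:
    print(el.name, el.get("id"))
```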
Previous elements example
tag = soup.find("body")
for element in tag.previous_elements:
    print (element)
Output
<html><head><title>TutorialsPoint</title></head>