Beautiful Soup: A Concise Tutorial
Beautiful Soup - Searching the Tree
In this chapter, we discuss the methods Beautiful Soup provides for navigating the HTML document tree in different directions: up and down, sideways, and back and forth.
All the examples in this chapter use the following HTML string −
html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>
<p class="tutorial">...</p>
"""
You can navigate the parse tree by using the name of the tag you want. For example, soup.head fetches the <head> element −
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print (soup.head.prettify())
Going down
A tag may contain strings or other tags. The .contents property of a Tag object returns a list of all its direct children.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.head
print (tag.contents)
Output
[<title>TutorialsPoint</title>]
The returned object is a list, even though in this case the head element encloses only a single child tag.
.children
The .children property gives access to the same direct children, but as an iterator rather than a list. Below, it is wrapped in list() so that all the children of the body tag are shown at once.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.body
print (list(tag.children))
Output
['\n', <p class="title"><b>Online Tutorials Library</b></p>, '\n',
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
Enhance your Programming skills.</p>, '\n', <p class="tutorial">...</p>, '\n']
Instead of getting them as a list, you can iterate over a tag's children with a for loop over .children −
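A minimal sketch of such a loop, using the chapter's HTML string. Skipping whitespace-only strings is an assumption made here so that the newlines between tags do not print as blank lines; the names child and printed are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>
<p class="tutorial">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')
tag = soup.body

printed = []  # keeps what was printed, for inspection
for child in tag.children:
    # Newlines between tags are children too; skip whitespace-only strings.
    if str(child).strip():
        printed.append(str(child))
        print(child)
```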
Output
<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a class="lang" href="https://tutorialspoint.com/Python" id="link1">Python</a>,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a> and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>;
Enhance your Programming skills.</p>
<p class="tutorial">...</p>
.descendants
The .contents and .children attributes only consider a tag’s direct children. The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on.
The BeautifulSoup object sits at the top of the tag hierarchy. Hence its .descendants property includes all the elements in the HTML string.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print (soup.descendants)
The .descendants attribute returns a generator, which can be iterated with a for loop. Here, we list the descendants of the head tag.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.head
for element in tag.descendants:
    print (element)
Output
<title>TutorialsPoint</title>
TutorialsPoint
The head tag contains a title tag, which in turn encloses the NavigableString object TutorialsPoint. So the <head> tag has only one child but two descendants: the <title> tag and the <title> tag's child string. Similarly, the BeautifulSoup object has only one direct child tag (the <html> tag) but many descendants.
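The difference between children and descendants can be checked by counting them. The sketch below uses a compact, hypothetical document (doc is not the chapter's HTML string) with no whitespace between tags, so that newline strings do not affect the counts.

```python
from bs4 import BeautifulSoup

# Hypothetical compact document: no whitespace between tags.
doc = "<html><head><title>TutorialsPoint</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(doc, 'html.parser')

head_children = len(soup.head.contents)               # <title> only
head_descendants = len(list(soup.head.descendants))   # <title> and its string
soup_children = len(soup.contents)                    # <html> only
soup_descendants = len(list(soup.descendants))        # every tag and string

print(head_children, head_descendants)
print(soup_children, soup_descendants)
```

With the chapter's own HTML string the counts are higher, since the newlines between tags are counted as string nodes as well.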
Going Up
Just as the children and descendants properties let you move down the document, BeautifulSoup offers the .parent and .parents properties to move up from a tag.
.parent
Every tag and every string has a parent: the tag that contains it. You can access an element's parent with the .parent attribute. In our example, the <head> tag is the parent of the <title> tag.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.title
print (tag.parent)
Output
<head><title>TutorialsPoint</title></head>
The title tag contains a string (a NavigableString), and the parent of that string is the title tag itself.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.title
string = tag.string
print (string.parent)
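As a small check of how .parent behaves at the top of the tree, the sketch below uses a shortened, hypothetical document (doc is an assumption, not the chapter's string). It shows that the parent of the top-level <html> tag is the BeautifulSoup object itself, and that the BeautifulSoup object has no parent.

```python
from bs4 import BeautifulSoup

# Hypothetical shortened document.
doc = "<html><head><title>TutorialsPoint</title></head></html>"
soup = BeautifulSoup(doc, 'html.parser')

title_string = soup.title.string
print(title_string.parent)        # the <title> tag that holds the string
print(soup.html.parent is soup)   # the <html> tag's parent is the soup object
print(soup.parent)                # the soup object itself has no parent
```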
.parents
You can iterate over all of an element's parents with .parents. Starting from an <a> tag buried deep within the document, such as the first <a> tag in the example HTML string, .parents travels all the way up to the top of the document.
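A minimal sketch of such a traversal, assuming a shortened version of the example document (doc and parent_names are illustrative names). It collects the name of each parent of the first <a> tag, from the nearest one up to the BeautifulSoup object, whose name is reported as [document].

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the chapter's HTML string.
doc = """<html><head><title>TutorialsPoint</title></head>
<body>
<p class="story">Tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>
</p>
</body></html>"""

soup = BeautifulSoup(doc, 'html.parser')
tag = soup.a  # the first <a> tag in the document

# Walk upward: nearest parent first, the BeautifulSoup object last.
parent_names = [parent.name for parent in tag.parents]
for name in parent_names:
    print(name)
```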
Sideways
Nodes that are direct children of the same parent are called siblings; in pretty-printed HTML they appear at the same indentation level. Consider the following HTML snippet −
<p>
<b>
Hello
</b>
<i>
Python
</i>
</p>
In the outer <p> tag, the <b> and <i> tags sit at the same level, so they are siblings. BeautifulSoup makes it possible to navigate between nodes at the same level.
.next_sibling and .previous_sibling
These attributes return, respectively, the next and the previous node (a tag or a string) at the same level of the tree.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
tag1 = soup.b
print ("next:",tag1.next_sibling)
tag2 = soup.i
print ("previous:",tag2.previous_sibling)
Output
next: <i>Python</i>
previous: <b>Hello</b>
Since the <b> tag has no sibling to its left and the <i> tag has no sibling to its right, previous_sibling and next_sibling return None in these two cases.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
tag1 = soup.b
print ("previous:",tag1.previous_sibling)
tag2 = soup.i
print ("next:",tag2.next_sibling)
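In documents that contain whitespace between tags, the nearest sibling is often a string rather than a tag. The sketch below uses a hypothetical variant of the snippet with a newline between <b> and <i> to show this, and uses find_next_sibling() with a tag name to jump to the next matching tag instead.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the snippet: a newline separates <b> and <i>.
soup = BeautifulSoup("<p><b>Hello</b>\n<i>Python</i></p>", 'html.parser')

b_tag = soup.b
nearest = b_tag.next_sibling              # the newline string, not <i>
next_i = b_tag.find_next_sibling("i")     # the next sibling that is an <i> tag

print(repr(nearest))
print(next_i)
print(b_tag.previous_sibling)             # nothing to the left of <b>
```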
.next_siblings and .previous_siblings
If a tag has two or more siblings to its right or left, you can navigate them with the .next_siblings and .previous_siblings attributes respectively. Both return a generator, so you can iterate over them with a for loop.
Let us use the following HTML snippet for this purpose −
<p>
<b>
Excellent
</b>
<i>
Python
</i>
<u>
Tutorial
</u>
</p>
Use the following code to traverse the next and previous sibling tags.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')
tag1 = soup.b
print ("next siblings:")
for tag in tag1.next_siblings:
    print (tag)
print ("previous siblings:")
tag2 = soup.u
for tag in tag2.previous_siblings:
    print (tag)
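The order in which the two generators yield siblings is worth noting: .next_siblings moves forward through the document, while .previous_siblings moves backward, starting with the nearest sibling. A short sketch on the same snippet, with the illustrative names after_b and before_u:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')

# Siblings to the right of <b>, in document order.
after_b = [sibling.name for sibling in soup.b.next_siblings]
# Siblings to the left of <u>, nearest first (reverse document order).
before_u = [sibling.name for sibling in soup.u.previous_siblings]

print(after_b)
print(before_u)
```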
Back and forth
In Beautiful Soup, the next_element property returns the string or tag that was parsed immediately after the current one, and the previous_element property returns the one parsed immediately before it. Sometimes these are the same as next_sibling and previous_sibling, but often they differ: a tag's next_element is usually its first child rather than its next sibling.
Example
html = """
<html><head><title>TutorialsPoint</title></head>
<body>
<p class="title"><b>Online Tutorials Library</b></p>
<p class="story">TutorialsPoint has an excellent collection of tutorials on:
<a href="https://tutorialspoint.com/Python" class="lang" id="link1">Python</a>,
<a href="https://tutorialspoint.com/Java" class="lang" id="link2">Java</a> and
<a href="https://tutorialspoint.com/PHP" class="lang" id="link3">PHP</a>;
Enhance your Programming skills.</p>
<p class="tutorial">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find("a", id="link3")
print (tag.next_element)
tag = soup.find("a", id="link1")
print (tag.previous_element)
Output
PHP
TutorialsPoint has an excellent collection of tutorials on:
The next_element after the <a> tag with id="link3" is the string PHP, which is the tag's own content. Similarly, previous_element returns the string that precedes the <a> tag with id="link1".
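To make the contrast with sibling navigation concrete, the sketch below compares next_element and next_sibling on the same tag. It uses a shortened, hypothetical snippet rather than the full example document.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet modelled on the last link in the example.
soup = BeautifulSoup("<p><a id='link3'>PHP</a>; Enhance your skills.</p>", 'html.parser')
tag = soup.a

inner = tag.next_element    # parsed right after the <a> start tag: its own string
beside = tag.next_sibling   # the node that follows the whole <a> element

print(repr(inner))
print(repr(beside))
```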
.next_elements and .previous_elements
These attributes of a Tag object return generators over all the tags and strings that come after and before it in the parse tree, respectively.
Next elements example
tag = soup.find("a", id="link1")
for element in tag.next_elements:
    print (element)
Output
Python
,
<a class="lang" href="https://tutorialspoint.com/Java" id="link2">Java</a>
Java
and
<a class="lang" href="https://tutorialspoint.com/PHP" id="link3">PHP</a>
PHP
;
Enhance your Programming skills.
<p class="tutorial">...</p>
...
Previous elements example
tag = soup.find("body")
for element in tag.previous_elements:
    print (element)
Output
<html><head><title>TutorialsPoint</title></head>