Beautiful Soup: A Concise Tutorial
Beautiful Soup - Modifying the Tree
One of the powerful features of the Beautiful Soup library is the ability to manipulate a parsed HTML or XML document and modify its contents.
The Beautiful Soup library provides functions to perform the following operations:
- Add contents or a new tag to an existing tag of the document
- Insert contents before or after an existing tag or string
- Clear the contents of an already existing tag
- Modify the contents of a tag element
Add content
You can add to the content of an existing tag by calling the append() method on a Tag object. It works like the append() method of a Python list.
In the following example, the HTML markup has a single <p> tag, and append() adds more text to the end of it.
Example
from bs4 import BeautifulSoup
markup = '<p>Hello</p>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup)
tag = soup.p
tag.append(" World")
print(soup)
Output
<p>Hello</p>
<p>Hello World</p>
With the append() method, you can also add a new tag at the end of an existing tag. First create a new Tag object with the new_tag() method, then pass it to append().
Example
from bs4 import BeautifulSoup
markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
# Create a new <i> tag, give it text, and append it inside <b>
tag1 = soup.new_tag('i')
tag1.string = 'World'
tag.append(tag1)
print(soup.prettify())
Output
<b>
 Hello
 <i>
  World
 </i>
</b>
If you need to add a string to the document, you can append a NavigableString object.
Example
from bs4 import BeautifulSoup, NavigableString
markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
new_string = NavigableString(" World")
tag.append(new_string)
print(soup.prettify())
Output
<b>
 Hello
 World
</b>
The extend() method was added to the Tag class in Beautiful Soup 4.7. It adds all the elements of a list to the tag, in order.
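As a minimal sketch of extend(), assuming Beautiful Soup 4.7 or later is installed, the snippet below appends two strings to a <p> tag in one call. The markup and the strings are illustrative only.

```python
from bs4 import BeautifulSoup

markup = '<p>Hello</p>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.p

# extend() appends each element of the list, in order
tag.extend([' World', '!'])
print(soup)
# <p>Hello World!</p>
```

This is equivalent to calling append() once for each item in the list.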
Insert Contents
Instead of adding a new element at the end, you can use the insert() method to add an element at a given position in the list of children of a Tag element. The insert() method in Beautiful Soup behaves like insert() on a Python list.
In the following example, a new string is inserted into the <b> tag at position 1, and the printed document shows the result.
Example
from bs4 import BeautifulSoup
markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
# Position 0 holds the existing string, so position 1 is the end of <b>
tag.insert(1, "Tutorial ")
print(soup.prettify())
Output
<b>
 Excellent
 Tutorial
</b>
<u>
 from TutorialsPoint
</u>
Beautiful Soup also has insert_before() and insert_after() methods, which insert a tag or a string immediately before or after a given Tag object. The following code adds the string "Python Tutorial" after the <b> tag.
Example
from bs4 import BeautifulSoup
markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
# The new string becomes a sibling of <b>, placed between <b> and <u>
tag.insert_after("Python Tutorial")
print(soup.prettify())
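The following sketch puts both methods side by side on the same markup and prints the tree without prettify(), so the position of each inserted node is easy to see. The <i> tag and its "Note: " text are hypothetical additions made for illustration.

```python
from bs4 import BeautifulSoup

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')

# Insert a string right after <b>
soup.b.insert_after("Python Tutorial ")
print(soup)
# <b>Excellent </b>Python Tutorial <u>from TutorialsPoint</u>

# Insert a new tag right before <b> (the tag and its text are illustrative)
note = soup.new_tag('i')
note.string = 'Note: '
soup.b.insert_before(note)
print(soup)
# <i>Note: </i><b>Excellent </b>Python Tutorial <u>from TutorialsPoint</u>
```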
Clear the Contents
Beautiful Soup provides more than one way to remove the contents of an element from the document tree. Each method behaves differently.
The clear() method is the most straightforward. It simply removes the contents of the specified Tag element, as the following example shows.
Example
from bs4 import BeautifulSoup
markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.find('u')
# Remove everything inside <u>, leaving the empty tag in place
tag.clear()
print(soup.prettify())
Output
<b>
 Excellent
</b>
<u>
</u>
As you can see, the clear() method removes the contents but keeps the tag itself intact.
For the next example, we parse the following HTML document (saved as index.html) and call the clear() method on all tags.
<html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
Here is the Python code that uses the clear() method:
Example
from bs4 import BeautifulSoup
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')
# find_all() with no arguments returns every tag, starting with <html>
tags = soup.find_all()
for tag in tags:
    tag.clear()
print(soup.prettify())
Output
<html>
</html>
The extract() method removes either a tag or a string from the document tree, and returns the object that was removed.
Example
from bs4 import BeautifulSoup
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')
tags = soup.find_all()
for tag in tags:
    # extract() detaches the tag from its parent and returns it
    obj = tag.extract()
    print("Extracted:", obj)
print(soup)
Output
Extracted: <html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
Extracted: <body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
Extracted: <p> The quick, brown fox jumps over a lazy dog.</p>
Extracted: <p> DJs flock by when MTV ax quiz prog.</p>
Extracted: <p> Junk MTV quiz graced by fox whelps.</p>
Extracted: <p> Bawds jog, flick quartz, vex nymphs.</p>
You can extract either a tag or a string. The following example shows a tag being extracted.
Example
from bs4 import BeautifulSoup
html = '''
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
'''
soup = BeautifulSoup(html, 'html.parser')
obj = soup.find('ol')
# find_next() returns the first tag after <ol>, which is the first <li>
obj.find_next().extract()
print(soup)
Output
<ol id="HR">
<li>Ankita</li>
</ol>
To remove only the inner text of the first <li> element, change the extract() statement to obj.find_next().string.extract().
Output
<ol id="HR">
<li></li>
<li>Ankita</li>
</ol>
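Because extract() returns the removed object, the same element can be attached again somewhere else. The sketch below, which reuses the list from the example above in a compact single-line form, moves the first <li> to the end of the list. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<ol id="HR"><li>Rani</li><li>Ankita</li></ol>'
soup = BeautifulSoup(html, 'html.parser')

# Detach the first <li> and keep a reference to it
first = soup.ol.li.extract()
detached_parent = first.parent
print(detached_parent)
# None

# Re-attach the same object at the end of the list
soup.ol.append(first)
print(soup)
# <ol id="HR"><li>Ankita</li><li>Rani</li></ol>
```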
Another method, decompose(), removes a tag from the tree and then completely destroys it and its contents.
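A minimal sketch of decompose() follows; the markup is made up for illustration. Unlike extract(), it is meant for elements you do not intend to use again. The final line relies on the decomposed property, which is available from Beautiful Soup 4.9.0 onwards, so treat it as optional on older versions.

```python
from bs4 import BeautifulSoup

markup = '<p>Keep this <i>but drop this</i></p>'
soup = BeautifulSoup(markup, 'html.parser')

i_tag = soup.i
# Remove <i> from the tree and destroy it together with its contents
i_tag.decompose()
print(soup)
# <p>Keep this </p>

# Requires Beautiful Soup 4.9.0 or later
print(i_tag.decomposed)
# True
```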
Modify the Contents
In this section we look at the replace_with() method, which replaces a tag or a string in the tree with another.
Like a Python string, which is immutable, a NavigableString cannot be modified in place. However, you can use replace_with() to replace the inner string of a tag with another string.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')
tag = soup.h2
tag.string.replace_with("OnLine Tutorials Library")
print(tag.string)
Output
OnLine Tutorials Library
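replace_with() works on tags as well as strings, and it returns the element that was replaced. The sketch below swaps a <b> tag for a newly created <i> tag; the markup and the text are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Learn <b>Python</b> today</p>', 'html.parser')

# Build the replacement tag
new_tag = soup.new_tag('i')
new_tag.string = 'Beautiful Soup'

# Swap <b> for <i>; the old tag is returned, detached from the tree
old_tag = soup.b.replace_with(new_tag)
print(soup)
# <p>Learn <i>Beautiful Soup</i> today</p>
print(old_tag)
# <b>Python</b>
```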
Here is another example of replace_with(). You can combine two parsed documents by passing a BeautifulSoup object as the argument to a method such as replace_with().
Example
from bs4 import BeautifulSoup
# Both the "xml" and "lxml" parsers require the lxml package to be installed
obj1 = BeautifulSoup("<book><title>Python</title></book>", features="xml")
obj2 = BeautifulSoup("<b>Beautiful Soup parser</b>", "lxml")
obj2.find('b').replace_with(obj1)
print(obj2)
Output
<html><body><book><title>Python</title></book></body></html>
The wrap() method wraps an element in the tag you specify. It returns the new wrapper.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>Hello Python</p>", 'html.parser')
tag = soup.p
newtag = soup.new_tag('b')
# Wrap the string inside <p> in the new <b> tag
tag.string.wrap(newtag)
print(soup)
Output
<p><b>Hello Python</b></p>
The unwrap() method does the opposite: it replaces a tag with whatever is inside that tag. It is useful for stripping out markup.
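As a short sketch, the snippet below reverses the wrap() example above by unwrapping the <b> tag. unwrap() returns the tag that was removed, which is empty by then because its contents stay in the tree.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>Hello Python</b></p>', 'html.parser')

# Replace <b> with its own contents; the emptied tag is returned
removed = soup.b.unwrap()
print(soup)
# <p>Hello Python</p>
print(removed)
# <b></b>
```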