Beautiful Soup: A Concise Tutorial
Beautiful Soup - Kinds of objects
When an HTML document or string is passed to the BeautifulSoup constructor, Beautiful Soup converts the page into a tree of Python objects. This section covers the four main kinds of objects defined in the bs4 package:
- Tag
- NavigableString
- BeautifulSoup
- Comment
Tag Object
An HTML tag defines a piece of content of a particular type. A Tag object in BeautifulSoup corresponds to an HTML or XML tag in the original page or document.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
print (type(tag))
Output
<class 'bs4.element.Tag'>
Tags have many attributes and methods. The two most important features of a tag are its name and its attributes.
Name (tag.name)
Every tag has a name, which can be read through the '.name' suffix. tag.name returns the type of the tag. In the example below, the lxml parser wraps the fragment in <html> and <body> elements, so soup.html is the outermost tag and its name is html.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
print (tag.name)
Output
html
If we change a tag's name, the change is reflected in the HTML markup that BeautifulSoup generates.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
tag.name = "strong"
print (tag)
Attributes (tag.attrs)
A tag object can have any number of attributes. In the above example, the tag <b class="boldest"> has an attribute 'class' whose value is "boldest". Each attribute is a name/value pair written inside the opening tag. "attrs" returns a dictionary of the attributes and their values. You can also read a single attribute by indexing the tag with the attribute name as a key.
In the example below, the string passed to the BeautifulSoup() constructor contains an HTML input tag. The attributes of the input tag are returned by "attrs".
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input
print (tag.attrs)
Output
{'type': 'text', 'name': 'name', 'value': 'Raju'}
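Individual attributes can also be read by key, as mentioned above. The following is a minimal sketch, assuming the built-in 'html.parser' backend rather than lxml; the variable names are illustrative only. It uses dictionary-style indexing, the get() method for attributes that may be absent, and has_attr() as a membership check.

```python
from bs4 import BeautifulSoup

# 'html.parser' is assumed here so the sketch needs no extra parser package
soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'html.parser')
tag = soup.input

# Dictionary-style access raises KeyError if the attribute is missing
type_value = tag['type']

# get() returns None (or a supplied default) instead of raising
missing_id = tag.get('id')
fallback_id = tag.get('id', 'no-id')

# has_attr() reports whether the attribute is present at all
has_value = tag.has_attr('value')

print(type_value, missing_id, fallback_id, has_value)
```

Indexing suits attributes that are known to exist, while get() is the safer choice when scraping pages whose markup varies.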
We can add, remove, or modify a tag's attributes using dictionary operators or methods.
In the following example, the value attribute is updated, and the printed HTML string shows the change.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input
print (tag.attrs)
tag['value']='Ravi'
print (soup)
Output
{'type': 'text', 'name': 'name', 'value': 'Raju'}
<html><body><input name="name" type="text" value="Ravi"/></body></html>
Next, we add a new id attribute and delete the value attribute.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input
tag['id']='nm'
del tag['value']
print (soup)
Multi-valued attributes
Some HTML attributes can hold multiple values. The most common is the class attribute, which can list several CSS classes. Others include 'rel', 'rev', 'headers', 'accesskey' and 'accept-charset'. Beautiful Soup presents the value of a multi-valued attribute as a list.
Example
from bs4 import BeautifulSoup
css_soup = BeautifulSoup('<p class="body"></p>', 'lxml')
print ("css_soup.p['class']:", css_soup.p['class'])
css_soup = BeautifulSoup('<p class="body bold"></p>', 'lxml')
print ("css_soup.p['class']:", css_soup.p['class'])
Output
css_soup.p['class']: ['body']
css_soup.p['class']: ['body', 'bold']
However, if an attribute looks like it has more than one value but is not defined as multi-valued by any version of the HTML standard, Beautiful Soup leaves the attribute alone and returns its value as a single string.
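The contrast can be illustrated with a short sketch. It assumes the built-in 'html.parser' backend, and the variable names are illustrative only. The id attribute is not multi-valued in HTML, so a value containing a space comes back as one string, whereas class is split into a list.

```python
from bs4 import BeautifulSoup

# id is not a multi-valued attribute, so the space is kept inside one string
id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_value = id_soup.p['id']
print("id_soup.p['id']:", id_value)

# class is multi-valued, so the same kind of value is split into a list
css_soup = BeautifulSoup('<p class="body bold"></p>', 'html.parser')
class_value = css_soup.p['class']
print("css_soup.p['class']:", class_value)
```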
NavigableString object
Usually, a string is placed between the opening and closing tags of an element. The browser's HTML engine applies the intended effect to the string when rendering the element. For example, in <b>Hello World</b>, the string sits between the <b> and </b> tags, so it is rendered in bold.
The NavigableString object represents the text content of a tag. It is an instance of the bs4.element.NavigableString class. To access it, use ".string" on the tag.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')
print (soup.string)
print (type(soup.string))
Output
Hello, Tutorialspoint!
<class 'bs4.element.NavigableString'>
A NavigableString object behaves like a Python Unicode string, and it also supports some of the features used for navigating and searching the tree. A NavigableString can be converted to a plain Unicode string with the str() function.
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>",'html.parser')
tag = soup.h2
string = str(tag.string)
print (string)
Output
Hello, Tutorialspoint!
Like a Python string, a NavigableString is immutable and can't be modified in place. However, replace_with() can replace the inner string of a tag with another string.
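A minimal sketch of replace_with() follows, reusing the h2 markup from the earlier examples; the replacement text is made up for illustration. The call is made on the NavigableString itself, and the tree is updated so that the tag holds the new string.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')
tag = soup.h2

# Swap the tag's inner string for a new one; the old string is detached from the tree
tag.string.replace_with("Welcome to Beautiful Soup!")

print(tag)
```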
BeautifulSoup object
The BeautifulSoup object represents the parsed document as a whole. It is the object created when we parse a web resource, and for most purposes it can be treated like a Tag object, so it supports the functionality needed to navigate and search the document tree.
Example
from bs4 import BeautifulSoup
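# index.html is expected in the current directory; the with block closes the file after parsing
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')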
print (soup)
print (soup.name)
print ('type:',type(soup))
Output
<html>
<head>
<title>TutorialsPoint</title>
</head>
<body>
<h2>Departmentwise Employees</h2>
<ul>
<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>
</body>
</html>
[document]
type: <class 'bs4.BeautifulSoup'>
The name property of a BeautifulSoup object always returns [document].
Two parsed documents can be combined by passing a BeautifulSoup object as an argument to a tree-modifying method such as replace_with().
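As a rough sketch of combining documents, the snippet below parses two separate fragments and replaces a placeholder string in the first with the whole second document. The markup, the placeholder text, and the variable names are all made up for illustration, and 'html.parser' is assumed as the backend.

```python
from bs4 import BeautifulSoup

# Two independently parsed documents (hypothetical markup)
doc = BeautifulSoup("<div><p>INSERT FOOTER HERE</p></div>", "html.parser")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "html.parser")

# Locate the placeholder string in the first document
placeholder = doc.find(string="INSERT FOOTER HERE")

# Replace it with the second document, passed in just as a Tag would be
placeholder.replace_with(footer)

print(doc)
```

After the call, the footer markup is part of the first document's tree and appears in its printed output in place of the placeholder.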
Comment object
Any text written between <!-- and --> in an HTML or XML document is treated as a comment. BeautifulSoup represents such text as a Comment object.
Example
from bs4 import BeautifulSoup
markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print (comment, type(comment))
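Because Comment is a subclass of NavigableString, an isinstance() check is one way to tell comments apart from ordinary text. The sketch below assumes the 'html.parser' backend and uses made-up markup; it collects every comment in a document by passing a filter function to find_all().

```python
from bs4 import BeautifulSoup, Comment, NavigableString

markup = "<p>Visible text<!--a hidden note--></p>"
soup = BeautifulSoup(markup, 'html.parser')

# Keep only the strings in the tree that are Comment objects
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

for comment in comments:
    # A Comment is also a NavigableString, so both checks are True
    print(comment, isinstance(comment, Comment), isinstance(comment, NavigableString))
```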