Beautiful Soup Concise Tutorial

Beautiful Soup - Souping the Page

It is time to try the Beautiful Soup package on a real HTML page and extract some information from it. This example uses https://www.tutorialspoint.com/index.htm, but any other web page works just as well.

The following code extracts the title from the web page −

Example

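from bs4 import BeautifulSoup
import requests

url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url, timeout=10)
req.raise_for_status()  # stop early on an HTTP error status

# Parse the downloaded bytes with Python's built-in HTML parser
soup = BeautifulSoup(req.content, "html.parser")

print(soup.title)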

Output

<title>Online Courses and eBooks Library</title>
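Note that soup.title is a Tag object, so printing it shows the element together with its tags. The sketch below shows one way to get only the text. It uses a small inline HTML string in place of the live page, an assumption made purely so that it runs without network access.

```python
from bs4 import BeautifulSoup

# Inline markup stands in for the downloaded page (illustrative only).
html = "<html><head><title>Hello World</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title  # a bs4 Tag, or None if the page has no <title>
title_text = title_tag.string if title_tag is not None else None

print(title_tag)   # the whole element, including its tags
print(title_text)  # only the text inside the element
```

Checking for None first avoids an AttributeError on pages that have no title element.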

One common task is to extract all the URLs within a web page. To do that, add the following lines of code −

for link in soup.find_all('a'):
    print(link.get('href'))

Output

Part of the output of the above loop is shown below −

https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/business/index.asp
https://www.tutorialspoint.com/market/teach_with_us.jsp
https://www.facebook.com/tutorialspointindia
https://www.instagram.com/tutorialspoint_/
https://twitter.com/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/categories/development
https://www.tutorialspoint.com/categories/it_and_software
https://www.tutorialspoint.com/categories/data_science_and_ai_ml
https://www.tutorialspoint.com/categories/cyber_security
https://www.tutorialspoint.com/categories/marketing
https://www.tutorialspoint.com/categories/office_productivity
https://www.tutorialspoint.com/categories/business
https://www.tutorialspoint.com/categories/lifestyle
https://www.tutorialspoint.com/latest/prime-packs
https://www.tutorialspoint.com/market/index.asp
https://www.tutorialspoint.com/latest/ebooks
…
…
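On many pages some anchors have no href attribute (link.get('href') returns None for them), and others use relative paths. The following is a minimal sketch of handling both cases. The base URL and the markup are hypothetical and stand in for a real page so that the example runs offline.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical base URL and markup, used only for illustration.
base_url = "https://www.example.com/docs/index.htm"
html = '''
<a href="/about.htm">About</a>
<a href="guide.htm">Guide</a>
<a name="top">Anchor without href</a>
<a href="https://www.example.org/">External</a>
'''
soup = BeautifulSoup(html, "html.parser")

links = []
# href=True matches only <a> tags that actually carry an href attribute.
for link in soup.find_all("a", href=True):
    # urljoin resolves relative paths against the page URL;
    # absolute URLs are returned unchanged.
    links.append(urljoin(base_url, link["href"]))

for url in links:
    print(url)
```

Filtering with href=True inside find_all keeps the loop body free of None checks.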

To parse a web page stored locally in the current working directory, open the HTML file and pass the resulting file object as the argument to the BeautifulSoup() constructor.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

print(soup)

Output

<html>
<head>
<title>Hello World</title>
</head>
<body>
<h1 style="text-align:center;">Hello World</h1>
</body>
</html>
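The example above assumes that index.html already exists. The self-contained sketch below writes a throwaway file first and then parses it. The temporary directory and the explicit utf-8 encoding are choices made for this illustration, not requirements of Beautiful Soup.

```python
import os
import tempfile
from bs4 import BeautifulSoup

html = "<html><head><title>Hello World</title></head><body><h1>Hello World</h1></body></html>"

# Write a throwaway file so the example does not depend on an existing index.html.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "index.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(html)

    # Passing the encoding explicitly avoids relying on the platform default.
    with open(path, encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, "html.parser")

# The parsed tree stays usable after the file has been closed and removed.
heading = soup.h1.string
print(heading)
```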

You can also pass a string containing HTML markup as the constructor's argument, as follows −

from bs4 import BeautifulSoup

html = '''
<html>
   <head>
      <title>Hello World</title>
   </head>
   <body>
      <h1 style="text-align:center;">Hello World</h1>
   </body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup)

If no parser is specified, Beautiful Soup uses the best HTML parser that is installed: lxml if available, then html5lib, then Python's built-in html.parser. Passing a parser name as the second argument, as the examples above do with 'html.parser', overrides this choice.
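Naming the parser explicitly keeps results consistent across machines, since different parsers can build different trees from malformed markup. The sketch below shows one possible way to prefer lxml and fall back to the built-in parser. The helper make_soup is a hypothetical name introduced for this illustration and is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<html><head><title>Hello World</title></head></html>"


def make_soup(markup):
    """Hypothetical helper: prefer lxml when installed, else use html.parser."""
    try:
        return BeautifulSoup(markup, "lxml"), "lxml"
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser"), "html.parser"


soup, parser_used = make_soup(html)
print(parser_used)
print(soup.title.string)
```

Which parser name is printed depends on whether lxml is installed in the current environment.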