Beautiful Soup: A Concise Tutorial

Beautiful Soup - Souping the Page

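It is time to test the Beautiful Soup package on a real HTML page and extract some information from it. We use https://www.tutorialspoint.com/index.htm here, but you can choose any other web page you like.

The following code extracts the title from the web page: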

Example

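from bs4 import BeautifulSoup
import requests

# Page to download; any other URL can be substituted here.
url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)

# Parse the raw response bytes with Python's built-in HTML parser.
soup = BeautifulSoup(req.content, "html.parser")

# Print the page's <title> tag.
print(soup.title)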

Output

<title>Online Courses and eBooks Library</title>
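
Note that soup.title returns the whole tag, not just its text. The following is a minimal sketch of the difference, using a small inline HTML string (hypothetical sample markup, not the live page) so that it runs without network access:

```python
from bs4 import BeautifulSoup

# Inline stand-in for a downloaded page (hypothetical sample markup).
html = "<html><head><title>Hello World</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title          # the whole <title> Tag object
title_text = soup.title.string  # only the text inside the tag

print(title_tag)
print(title_text)
```

Use the .string form when only the title text is needed, for example to store it or compare it with another value.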

A common task is to extract all the URLs from a web page. To do that, we only need to add the following lines of code:

for link in soup.find_all('a'):
   print(link.get('href'))

Output

Part of the output of the above loop is shown below:

https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/business/index.asp
https://www.tutorialspoint.com/market/teach_with_us.jsp
https://www.facebook.com/tutorialspointindia
https://www.instagram.com/tutorialspoint_/
https://twitter.com/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/categories/development
https://www.tutorialspoint.com/categories/it_and_software
https://www.tutorialspoint.com/categories/data_science_and_ai_ml
https://www.tutorialspoint.com/categories/cyber_security
https://www.tutorialspoint.com/categories/marketing
https://www.tutorialspoint.com/categories/office_productivity
https://www.tutorialspoint.com/categories/business
https://www.tutorialspoint.com/categories/lifestyle
https://www.tutorialspoint.com/latest/prime-packs
https://www.tutorialspoint.com/market/index.asp
https://www.tutorialspoint.com/latest/ebooks
…
…
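
On many pages some anchors carry relative paths, and some have no href attribute at all, in which case link.get('href') returns None. The sketch below shows one way to handle both cases: it selects only anchors that have an href and resolves each against a base URL with urljoin from the standard library. The inline markup and the /about/index.htm path are made up for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Base URL used to resolve relative links (the page the markup came from).
base_url = "https://www.tutorialspoint.com/index.htm"

# Hypothetical sample markup: one relative link, one absolute link,
# and one anchor without an href attribute.
html = '''
<a href="/about/index.htm">About</a>
<a href="https://twitter.com/tutorialspoint">Twitter</a>
<a name="top">No href here</a>
'''
soup = BeautifulSoup(html, "html.parser")

links = []
# href=True keeps only <a> tags that actually have an href attribute.
for link in soup.find_all("a", href=True):
    links.append(urljoin(base_url, link["href"]))

print(links)
```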

To parse a web page stored in the current working directory, open the HTML file and pass the resulting file object as the argument to the BeautifulSoup() constructor.

Example

from bs4 import BeautifulSoup

# Open the saved page and hand the file object to the constructor.
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Print the parsed document.
print(soup)

Output

<html>
<head>
<title>Hello World</title>
</head>
<body>
<h1 style="text-align:center;">Hello World</h1>
</body>
</html>
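
To try the file-based approach without an existing index.html, the sketch below first writes a small page to a temporary directory and then parses it. The temporary path and the explicit encoding="utf-8" argument are assumptions made for this illustration; the tutorial's own example relies on the platform's default encoding.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Hypothetical sample page, similar to the one shown in the output above.
html = "<html><head><title>Hello World</title></head><body><h1>Hello World</h1></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")

    # Write the sample page to disk.
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(html)

    # Pass the open file object to the constructor; the document is
    # fully parsed before the file is closed.
    with open(path, encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, "html.parser")

heading = soup.h1.get_text()
print(heading)
```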

You can also pass a string containing HTML markup as the argument to the constructor, as follows:

from bs4 import BeautifulSoup

html = '''
<html>
   <head>
      <title>Hello World</title>
   </head>
   <body>
      <h1 style="text-align:center;">Hello World</h1>
   </body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup)

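Beautiful Soup parses the document with the parser you name in the second argument; the examples above request Python's built-in html.parser. If you do not specify a parser, Beautiful Soup picks the best HTML parser that is installed.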
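
Because the available parsers depend on what is installed, it can help to name the parser explicitly and fall back to the built-in one. The sketch below assumes you would prefer the third-party lxml parser when it is present; make_soup is a hypothetical helper written for this illustration, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<p>Hello <b>World</b>"

def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: try the preferred parser, else use html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")

soup = make_soup(html)
text = soup.get_text()
print(text)
```

Different parsers can build slightly different trees from invalid markup, so naming the parser keeps results consistent across machines.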