Beautiful Soup: A Concise Tutorial
Beautiful Soup - Souping the Page
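It is time to test the Beautiful Soup package on an HTML page and extract some information from it. We use https://www.tutorialspoint.com/index.htm here, but you can choose any other web page you like.
The following code extracts the title from the web page: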
Example
from bs4 import BeautifulSoup
import requests
url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
print(soup.title)
Output
<title>Online Courses and eBooks Library</title>
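Note that soup.title is a Tag object rather than plain text. The following is a minimal sketch of how the text inside the tag might be read. It assumes an inline HTML string in place of the downloaded page so that it runs without network access, and the names title_tag and title_text are illustrative only.

```python
from bs4 import BeautifulSoup

# Inline markup standing in for a downloaded page (an assumption for illustration).
html = "<html><head><title>Online Courses and eBooks Library</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# soup.title is the whole <title> element (a Tag), or None if the page has no title.
title_tag = soup.title

# .string gives the text inside the tag; guard against a missing title.
title_text = title_tag.string if title_tag is not None else None

print(title_tag)
print(title_text)
```

Checking for None first avoids an AttributeError on pages that have no title element.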
A common task is extracting all the URLs from a web page. To do that, we only need to add the following lines of code:
for link in soup.find_all('a'):
    print(link.get('href'))
Output
Part of the output of the loop above is shown below:
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/business/index.asp
https://www.tutorialspoint.com/market/teach_with_us.jsp
https://www.facebook.com/tutorialspointindia
https://www.instagram.com/tutorialspoint_/
https://twitter.com/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/categories/development
https://www.tutorialspoint.com/categories/it_and_software
https://www.tutorialspoint.com/categories/data_science_and_ai_ml
https://www.tutorialspoint.com/categories/cyber_security
https://www.tutorialspoint.com/categories/marketing
https://www.tutorialspoint.com/categories/office_productivity
https://www.tutorialspoint.com/categories/business
https://www.tutorialspoint.com/categories/lifestyle
https://www.tutorialspoint.com/latest/prime-packs
https://www.tutorialspoint.com/market/index.asp
https://www.tutorialspoint.com/latest/ebooks
…
…
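On many pages some anchors have no href attribute, so link.get('href') returns None, and others use relative paths. The following is a rough sketch of one way to handle both cases. It assumes a small inline HTML snippet and a base_url chosen for illustration, and it uses urljoin from the standard library to turn relative paths into absolute URLs.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical base URL and markup, used only for illustration.
base_url = "https://www.tutorialspoint.com/index.htm"
html = """
<a href="/codingground.htm">Coding Ground</a>
<a href="https://twitter.com/tutorialspoint">Twitter</a>
<a name="top">No href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
# href=True keeps only <a> tags that actually carry an href attribute.
for link in soup.find_all("a", href=True):
    # urljoin leaves absolute URLs unchanged and resolves relative ones.
    links.append(urljoin(base_url, link["href"]))

print(links)
```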
To parse a web page stored in the current working directory, open the HTML file to get a file object and pass it as the argument to the BeautifulSoup() constructor.
Example
from bs4 import BeautifulSoup
with open("index.html") as fp:
    # The file is read and parsed while it is still open.
    soup = BeautifulSoup(fp, 'html.parser')
print(soup)
Output
<html>
<head>
<title>Hello World</title>
</head>
<body>
<h1 style="text-align:center;">Hello World</h1>
</body>
</html>
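The following self-contained sketch does the same thing without needing an existing index.html. It assumes a temporary directory and writes a small sample file there first. Opening the file with an explicit encoding="utf-8" is also an assumption about how the file was saved.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Sample markup written to disk so the example does not depend on an existing file.
html = "<html><head><title>Hello World</title></head><body><h1>Hello World</h1></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(html)

    # Pass the open file object straight to the constructor.
    with open(path, encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, "html.parser")

# The document is fully parsed by now, so the file can already be closed.
heading = soup.h1.string
print(heading)
```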
You can also pass a string containing HTML markup as the constructor argument, as follows:
from bs4 import BeautifulSoup
html = '''
<html>
<head>
<title>Hello World</title>
</head>
<body>
<h1 style="text-align:center;">Hello World</h1>
</body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup)
Beautiful Soup parses the document using the best available parser. It uses an HTML parser unless you specifically tell it to use a different one.
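To make the parser choice explicit, name it as the second argument. The sketch below assumes you would prefer the optional third-party lxml parser when it is installed and otherwise fall back to the built-in html.parser. The helper make_soup is a hypothetical name and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately unclosed markup; both parsers close the open tags.
html = "<p>Hello<b>World"

def make_soup(markup):
    """Hypothetical helper: try lxml first, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")

soup = make_soup(html)
print(soup.b.string)
```

Naming the parser explicitly keeps results consistent across machines, since different parsers can build slightly different trees from invalid markup.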