Beautiful Soup Tutorial
BeautifulSoup - Scraping Links from HTML
While scraping and analysing content from websites, you are often required to extract all the links that a certain page contains. In this chapter, we shall find out how to extract links from an HTML document.
HTML has the anchor tag <a> to insert a hyperlink. The href attribute of the anchor tag lets you establish the link. It uses the following syntax −
<a href="web page URL">hypertext</a>
With the find_all() method we can collect all the anchor tags in a document and then print the value of the href attribute of each of them.
In the example below, we extract all the links found on Google's home page. We use the requests library to fetch the HTML content of https://google.com, parse it into a soup object, and then collect all <a> tags. Finally, we print the href attributes.
Example
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com/"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
tags = soup.find_all('a')
links = [tag['href'] for tag in tags]
for link in links:
    print(link)
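Note that tag['href'] raises a KeyError if an anchor has no href attribute (for example, a named anchor). A safer sketch uses the .get() accessor that Beautiful Soup tags support, which returns None for missing attributes; the small HTML sample here is hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second anchor has no href attribute
html = '<a href="https://example.com">ok</a><a name="top">no href</a>'
soup = BeautifulSoup(html, "html.parser")

# tag.get('href') returns None instead of raising KeyError,
# so anchors without href are simply filtered out
links = [tag.get('href') for tag in soup.find_all('a') if tag.get('href')]
print(links)
```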
Here's the partial output when the above program is run −
Output
https://www.google.co.in/imghp?hl=en&tab=wi
https://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/&ec=GAZAAQ
/advanced_search?hl=en-IN&authuser=0
https://www.google.com/url?q=https://io.google/2023/%3Futm_source%3Dgoogle-hpp%26utm_medium%3Dembedded_marketing%26utm_campaign%3Dhpp_watch_live%26utm_content%3D&source=hpp&id=19035434&ct=3&usg=AOvVaw0qzqTkP5AEv87NM-MUDd_u&sa=X&ved=0ahUKEwiPzpjku-z-AhU1qJUCHVmqDJoQ8IcBCAU
However, an HTML document may contain hyperlinks with different protocol schemes, such as the mailto: protocol for links to an email ID, the tel: scheme for links to a telephone number, or the file:// scheme for links to a local file. In such a case, if we are interested in extracting only the links with the https:// scheme, we can do so as in the following example. We have an HTML document that consists of hyperlinks of different types, out of which only the ones with the https:// prefix are extracted.
html = '''
<p><a href="https://www.tutorialspoint.com">Web page link </a></p>
<p><a href="https://www.example.com">Web page link </a></p>
<p><a href="mailto:nowhere@mozilla.org">Email link</a></p>
<p><a href="tel:+4733378901">Telephone link</a></p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all('a')
links = [tag['href'] for tag in tags]
for link in links:
    if link.startswith("https"):
        print(link)
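Prefix matching with startswith() works here, but a slightly more robust sketch is to parse each link's scheme with the standard library's urllib.parse.urlparse, which also classifies relative URLs (empty scheme) correctly. The sample links below are drawn from the examples above, plus a relative URL for illustration:

```python
from urllib.parse import urlparse

links = [
    "https://www.tutorialspoint.com",
    "mailto:nowhere@mozilla.org",
    "tel:+4733378901",
    "/preferences?hl=en",   # relative URL: urlparse reports an empty scheme
]

# urlparse splits a URL into components; .scheme is 'https', 'mailto',
# 'tel', etc., or '' for a relative link
https_links = [link for link in links if urlparse(link).scheme == "https"]
print(https_links)
```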