Beautiful Soup Concise Tutorial
Beautiful Soup - Extract Email IDs
Extracting email addresses from a web page is a common application of a web scraping library such as BeautifulSoup. On a web page, email IDs usually appear in the href attribute of an anchor <a> tag, written using the mailto URL scheme. An email address may also appear in the page content as plain text, without any hyperlink. In this chapter, we use the BeautifulSoup library to fetch email IDs from an HTML page with simple techniques.
A typical use of an email ID in the href attribute looks like this:
<a href = "mailto:xyz@abc.com">test link</a>
In the first example, we consider the following HTML document and extract the email IDs from its hyperlinks:
<html>
<head>
<title>BeautifulSoup - Scraping Email IDs</title>
</head>
<body>
<h2>Contact Us</h2>
<ul>
<li><a href = "mailto:sales@company.com">Sales Enquiries</a></li>
<li><a href = "mailto:careers@company.com">Careers</a></li>
<li><a href = "mailto:partner@company.com">Partner with us</a></li>
</ul>
</body>
</html>
Here is the Python code that finds the email IDs. We collect all the <a> tags in the document and check whether each tag has an href attribute whose value begins with mailto:. If it does, the part of the value after the first seven characters (the length of the mailto: prefix) is the email ID.
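from bs4 import BeautifulSoup

# Parse the saved HTML document; the with block closes the file afterwards.
with open("contact.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Collect every <a> tag and print the address from each mailto: link.
for tag in soup.find_all("a"):
    if tag.has_attr("href") and tag["href"].startswith("mailto:"):
        print(tag["href"][7:])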
For the given HTML document, the following email IDs are extracted:
sales@company.com
careers@company.com
partner@company.com
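The same filtering can also be expressed with a CSS attribute selector. The sketch below is one possible variant, not part of the original listing: it assumes the markup is supplied as an inline string, and the `?subject=` query on the second link and the non-mailto third link are hypothetical additions made to show why slicing alone may not be enough.

```python
from urllib.parse import urlsplit, unquote

from bs4 import BeautifulSoup

# Inline sample markup (an assumption for this sketch). The second link
# carries a hypothetical ?subject= query; the third is not a mailto link.
html = """
<ul>
<li><a href="mailto:sales@company.com">Sales Enquiries</a></li>
<li><a href="mailto:careers@company.com?subject=Application">Careers</a></li>
<li><a href="https://company.com/partners">Partner with us</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

emails = []
# a[href^="mailto:"] selects only anchors whose href starts with mailto:.
for tag in soup.select('a[href^="mailto:"]'):
    # urlsplit separates the address (path) from any ?subject=... query.
    parts = urlsplit(tag["href"])
    emails.append(unquote(parts.path))

print(emails)
```

Splitting the URL instead of slicing at a fixed offset keeps query parameters such as a subject line out of the extracted address.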
In the second example, we assume that the email IDs may appear anywhere in the text. To extract them, we use regular expression (regex) searching. A regex is a pattern that describes a set of character sequences, and Python's re module provides functions for matching such patterns against text. The following regex pattern is used to search for email addresses:
pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'
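To see how this pattern behaves before applying it to HTML, it can be tried on a plain string. This is a small sketch; the sample sentence and the names `text`, `matches`, and `cleaned` are made up for illustration.

```python
import re

pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'

# Hypothetical sample sentence that ends with a full stop.
text = "Write to sales@company.com or john.doe+jobs@mail.company.org."

# findall returns every non-overlapping match as a list of strings.
matches = re.findall(pat, text)
print(matches)

# The last character class also accepts dots, so a sentence-ending period
# is captured as part of the final address; strip it afterwards.
cleaned = [m.rstrip(".") for m in matches]
print(cleaned)
```

The pattern is deliberately simple. It does not validate addresses against the full email syntax, so some post-processing of the matches, as in the last step, may be needed on real pages.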
For this exercise, we use the following HTML document, in which the email IDs appear as plain text inside <li> tags.
<html>
<head>
<title>BeautifulSoup - Scraping Email IDs</title>
</head>
<body>
<h2>Contact Us</h2>
<ul>
<li>Sales Enquiries: sales@company.com</li>
<li>Careers: careers@company.com</li>
<li>Partner with us: partner@company.com</li>
</ul>
</body>
</html>
Using the email regex, we search the text of each <li> tag for occurrences of the pattern. Here is the Python code:
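A minimal sketch of this step, assuming the markup is supplied as an inline string rather than read from a file (the names `html` and `emails` are illustrative):

```python
import re

from bs4 import BeautifulSoup

# Inline copy of the list items from the sample document (an assumption
# made so the sketch runs without a separate HTML file).
html = """
<ul>
<li>Sales Enquiries: sales@company.com</li>
<li>Careers: careers@company.com</li>
<li>Partner with us: partner@company.com</li>
</ul>
"""

pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'

soup = BeautifulSoup(html, "html.parser")

emails = []
for li in soup.find_all("li"):
    # get_text() returns the text content of the <li> tag; findall then
    # collects every substring of it that matches the email pattern.
    emails.extend(re.findall(pat, li.get_text()))

for email in emails:
    print(email)
```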