Beautiful Soup: A Concise Tutorial
Beautiful Soup - Scrape HTML Content
The process of extracting data from websites is called web scraping. A web page may contain URLs, email addresses, images, or any other content, which can be stored in a file or a database. Searching a website manually is a cumbersome process. There are various web scraping tools that automate it.
Some websites disallow web scraping through a 'robots.txt' file. Some popular sites provide APIs to access their data in a structured way. Unethical web scraping may result in your IP address being blocked.
Python is widely used for web scraping. The Python standard library includes the urllib package, which can be used to retrieve HTML pages. Because urllib is bundled with the standard library, it does not need to be installed separately.
The urllib package is an HTTP client for the Python programming language. The urllib.request module is useful when we want to open and read URLs. The other modules in the urllib package are −
- urllib.error defines the exceptions raised by the urllib.request module.
- urllib.parse is used for parsing URLs.
- urllib.robotparser is used for parsing robots.txt files.
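As a brief illustration of the two helper modules listed above, the following sketch splits a URL with urllib.parse and checks a set of robots.txt rules with urllib.robotparser. The example.com URLs, the rules, and the "my-scraper" user agent are made up for illustration, and the rules are supplied as text so that no network access is needed.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Split a URL into its components.
parts = urlparse("https://www.example.com/docs/index.html?page=2")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Resolve a relative link against the page it was found on.
absolute = urljoin("https://www.example.com/docs/index.html", "../images/logo.png")
print(absolute)

# Parse robots.txt rules given as lines of text (hypothetical rules).
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Ask whether a given user agent may fetch each URL.
blocked = rules.can_fetch("my-scraper", "https://www.example.com/private/data.html")
allowed = rules.can_fetch("my-scraper", "https://www.example.com/docs/index.html")
print(blocked, allowed)
```

For a live site, the rules would normally be loaded with set_url() and read() on the RobotFileParser object instead of being passed in as a list.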
Use the urlopen() function in the urllib.request module to read the content of a web page.
import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
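The snippet above returns raw bytes and does not handle failures. One possible way to wrap it is sketched below; the fetch_html name, the 10-second timeout, and the UTF-8 fallback are assumptions made for this example and are not part of urllib.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the decoded body at url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        print(f"Server returned HTTP {err.code} for {url}")
    except urllib.error.URLError as err:
        # The request never completed (DNS failure, refused connection, ...).
        print(f"Could not reach {url}: {err.reason}")
    return None
```

Calling fetch_html('http://python.org/') would then give the page source as a string, or None if the request failed. HTTPError is caught first because it is a subclass of URLError.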
You can also use the requests library for this purpose. It must be installed before use.
pip3 install requests
In the code below, the home page of https://www.tutorialspoint.com is fetched −
from bs4 import BeautifulSoup
import requests

url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)
html = req.text  # decoded page source, ready to be parsed
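A slightly more defensive variant of the requests call might look like the sketch below. The fetch_with_requests name and the 10-second timeout are choices made for this example; raise_for_status() and requests.exceptions.RequestException are part of the requests library.

```python
import requests


def fetch_with_requests(url, timeout=10):
    """Return the page text at url, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx and 5xx status codes into exceptions.
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as err:
        # Base class for connection errors, timeouts, bad URLs and HTTP errors.
        print(f"Request for {url} failed: {err}")
        return None
```

Without a timeout, requests.get() can wait indefinitely for a server that never responds, which is why one is passed explicitly here.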
The content obtained by either of the above two methods is then parsed with Beautiful Soup.
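As a minimal sketch of that parsing step, the example below feeds a small hand-written HTML string to Beautiful Soup in place of a downloaded page. The markup is invented for illustration, and "html.parser" is Python's built-in parser, chosen here so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Stand-in for the page source fetched with urllib or requests (hypothetical markup).
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <a href="https://www.example.com/one">One</a>
    <a href="/two">Two</a>
  </body>
</html>
"""

# Build the parse tree.
soup = BeautifulSoup(html, "html.parser")

# Read the page title and collect the target of every link.
title = soup.title.string
links = [a["href"] for a in soup.find_all("a", href=True)]

print(title)
print(links)
```

With a real page, the html variable would hold the text returned by urlopen() or requests.get() instead of a literal string.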