Python Web Scraping Quick Guide
Python Modules for Web Scraping
In this chapter, let us learn about the various Python modules that we can use for web scraping.
Python Development Environments using virtualenv
Virtualenv is a tool to create isolated Python environments. With the help of virtualenv, we can create a folder that contains all the necessary executables to use the packages that our Python project requires. It also allows us to add and modify Python modules without access to the global installation.
You can use the following command to install virtualenv −
(base) D:\ProgramData>pip install virtualenv
Collecting virtualenv
Downloading
https://files.pythonhosted.org/packages/b6/30/96a02b2287098b23b875bc8c2f58071c3
5d2efe84f747b64d523721dc2b5/virtualenv-16.0.0-py2.py3-none-any.whl
(1.9MB)
100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 1.9MB 86kB/s
Installing collected packages: virtualenv
Successfully installed virtualenv-16.0.0
Now, we need to create a directory that will represent the project, with the help of the following command −
(base) D:\ProgramData>mkdir webscrap
Now, enter that directory with the help of the following command −
(base) D:\ProgramData>cd webscrap
Now, we need to initialize a virtual environment folder of our choice as follows −
(base) D:\ProgramData\webscrap>virtualenv websc
Using base prefix 'd:\\programdata'
New python executable in D:\ProgramData\webscrap\websc\Scripts\python.exe
Installing setuptools, pip, wheel...done.
Now, activate the virtual environment with the command given below. Once it is successfully activated, you will see its name on the left-hand side in brackets.
(base) D:\ProgramData\webscrap>websc\scripts\activate
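Note that the activation command shown above is for Windows. On Linux or macOS the virtual environment is laid out slightly differently, and (assuming the same folder name websc) it would be activated as follows −

source websc/bin/activate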
We can install any module in this environment as follows −
(websc) (base) D:\ProgramData\webscrap>pip install requests
Collecting requests
Downloading
https://files.pythonhosted.org/packages/65/47/7e02164a2a3db50ed6d8a6ab1d6d60b69
c4c3fdf57a284257925dfc12bda/requests-2.19.1-py2.py3-none-any.whl (9
1kB)
100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 92kB 148kB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests)
Downloading
https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca
55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133
kB)
100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 143kB 369kB/s
Collecting certifi>=2017.4.17 (from requests)
Downloading
https://files.pythonhosted.org/packages/df/f7/04fee6ac349e915b82171f8e23cee6364
4d83663b34c539f7a09aed18f9e/certifi-2018.8.24-py2.py3-none-any.whl
(147kB)
100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 153kB 527kB/s
Collecting urllib3<1.24,>=1.21.1 (from requests)
Downloading
https://files.pythonhosted.org/packages/bd/c9/6fdd990019071a4a32a5e7cb78a1d92c5
3851ef4f56f62a3486e6a7d8ffb/urllib3-1.23-py2.py3-none-any.whl (133k
B)
100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 143kB 517kB/s
Collecting idna<2.8,>=2.5 (from requests)
Downloading
https://files.pythonhosted.org/packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746
a97fab05a372e4a2c6a6b876165/idna-2.7-py2.py3-none-any.whl (58kB)
100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 61kB 339kB/s
Installing collected packages: chardet, certifi, urllib3, idna, requests
Successfully installed certifi-2018.8.24 chardet-3.0.4 idna-2.7 requests-2.19.1
urllib3-1.23
To deactivate the virtual environment, we can use the following command −
(websc) (base) D:\ProgramData\webscrap>deactivate
(base) D:\ProgramData\webscrap>
You can see that (websc) has been deactivated.
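Optionally, we can record the modules installed in this environment so that it can be recreated later. The commands below are a standard pip workflow, not specific to this tutorial: the first saves the installed packages to a file, and the second reinstalls them from that file −

pip freeze > requirements.txt
pip install -r requirements.txt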
Python Modules for Web Scraping
Web scraping is the process of constructing an agent that can automatically extract, parse, download and organize useful information from the web. In other words, instead of manually saving data from websites, web scraping software will automatically load and extract data from multiple websites as per our requirement.
In this section, we are going to discuss useful Python libraries for web scraping.
Requests
Requests is a simple Python web scraping library. It is an efficient HTTP library used for accessing web pages. With the help of Requests, we can get the raw HTML of web pages, which can then be parsed to retrieve the data. Before using requests, let us understand its installation.
Installing Requests
We can install it either in our virtual environment or in the global installation. With the help of the pip command, we can easily install it as follows −
(base) D:\ProgramData> pip install requests
Collecting requests
Using cached
https://files.pythonhosted.org/packages/65/47/7e02164a2a3db50ed6d8a6ab1d6d60b69
c4c3fdf57a284257925dfc12bda/requests-2.19.1-py2.py3-none-any.whl
Requirement already satisfied: idna<2.8,>=2.5 in d:\programdata\lib\site-packages (from requests) (2.6)
Requirement already satisfied: urllib3<1.24,>=1.21.1 in d:\programdata\lib\site-packages (from requests) (1.22)
Requirement already satisfied: certifi>=2017.4.17 in d:\programdata\lib\site-packages (from requests) (2018.1.18)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in d:\programdata\lib\site-packages (from requests) (3.0.4)
Installing collected packages: requests
Successfully installed requests-2.19.1
Example
In this example, we are making a GET HTTP request for a web page. For this, we first need to import the requests library as follows −
In [1]: import requests
In the following line of code, we use requests to make a GET HTTP request for the url https://authoraditiagarwal.com/ −
In [2]: r = requests.get('https://authoraditiagarwal.com/')
Now we can retrieve the content by using the .text property as follows −
In [5]: r.text[:200]
请注意,在以下输出中,我们获得了前 200 个字符。
Observe that in the following output, we got the first 200 characters.
Out[5]: '<!DOCTYPE html>\n<html lang="en-US"\n\titemscope
\n\titemtype="http://schema.org/WebSite" \n\tprefix="og: http://ogp.me/ns#"
>\n<head>\n\t<meta charset
="UTF-8" />\n\t<meta http-equiv="X-UA-Compatible" content="IE'
Urllib3
It is another Python library that can be used for retrieving data from URLs, similar to the requests library. You can read more about it in its technical documentation at https://urllib3.readthedocs.io/en/latest/.
Installing Urllib3
Using the pip command, we can install urllib3 either in our virtual environment or in the global installation.
(base) D:\ProgramData>pip install urllib3
Collecting urllib3
Using cached
https://files.pythonhosted.org/packages/bd/c9/6fdd990019071a4a32a5e7cb78a1d92c5
3851ef4f56f62a3486e6a7d8ffb/urllib3-1.23-py2.py3-none-any.whl
Installing collected packages: urllib3
Successfully installed urllib3-1.23
Example: Scraping using Urllib3 and BeautifulSoup
In the following example, we scrape a web page by using Urllib3 and BeautifulSoup. We use Urllib3 in place of the requests library to get the raw data (HTML) from the web page. Then we use BeautifulSoup to parse that HTML data.
import urllib3
from bs4 import BeautifulSoup

# Create a PoolManager, which handles connection pooling for us
http = urllib3.PoolManager()

# Fetch the raw HTML of the page with a GET request
r = http.request('GET', 'https://authoraditiagarwal.com')

# Parse the returned bytes with BeautifulSoup, using the lxml parser
soup = BeautifulSoup(r.data, 'lxml')
print(soup.title)
print(soup.title.text)
This is the output you will observe when you run this code −
<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal
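Building on the same approach, BeautifulSoup can extract other elements as well. The following sketch (assuming the page contains anchor tags) collects the href attribute of every link on the page −

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
r = http.request('GET', 'https://authoraditiagarwal.com')
soup = BeautifulSoup(r.data, 'lxml')

# find_all returns every matching tag; get reads an attribute of that tag
for link in soup.find_all('a'):
   print(link.get('href'))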
Selenium
It is an open source automated testing suite for web applications across different browsers and platforms. It is not a single tool but a suite of software. We have selenium bindings for Python, Java, C#, Ruby and JavaScript. Here we are going to perform web scraping by using selenium and its Python bindings. You can learn more about Selenium with Java at the link Selenium.
Selenium Python bindings provide a convenient API to access Selenium WebDrivers like Firefox, IE, Chrome, Remote etc. The currently supported Python versions are 2.7, 3.5 and above.
Installing Selenium
Using the pip command, we can install selenium either in our virtual environment or in the global installation.
pip install selenium
As selenium requires a driver to interface with the chosen browser, we need to download it. Drivers are available for the following browsers, each from its official download page −

Chrome
Edge
Firefox
Safari
Example
This example shows web scraping using selenium. It can also be used for testing, which is called selenium testing.
After downloading the particular driver for the specified version of the browser, we need to do the programming in Python.
First, we need to import webdriver from selenium as follows −
from selenium import webdriver
Now, provide the path of the web driver which we have downloaded as per our requirement −
path = r'C:\Users\gaurav\Desktop\Chromedriver'
browser = webdriver.Chrome(executable_path = path)
Now, provide the url which we want to open in the web browser that is now controlled by our Python script.
browser.get('https://authoraditiagarwal.com/leadershipmanagement')
We can also scrape a particular element by providing its xpath, much as we would with lxml.
browser.find_element_by_xpath('/html/body').click()
You can check the browser, controlled by the Python script, for the output.
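As an illustrative extension of the steps above (reusing the same xpath), we can also read the text of a matched element instead of clicking it, and close the browser once we are done −

# Read the visible text of the page body instead of clicking it
element = browser.find_element_by_xpath('/html/body')
print(element.text)

# Close the browser when finished
browser.quit()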
Scrapy
Scrapy is a fast, open-source web crawling framework written in Python, used to extract data from web pages with the help of selectors based on XPath. Scrapy was first released on June 26, 2008 under a BSD license, with the milestone 1.0 release following in June 2015. It provides us all the tools we need to extract, process and structure data from websites.
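A minimal spider gives a feel for how Scrapy structures this work. The sketch below is illustrative rather than taken from this tutorial (the spider name and the XPath expression are assumptions); it fetches a page and yields its title using an XPath-based selector −

import scrapy

class TitleSpider(scrapy.Spider):
   # A hypothetical spider that fetches one page and yields its <title> text
   name = 'title_spider'
   start_urls = ['https://authoraditiagarwal.com/']

   def parse(self, response):
      # Scrapy extracts data with XPath-based selectors, as noted above
      yield {'title': response.xpath('//title/text()').get()}

Such a spider can be saved to a file and run with the scrapy runspider command.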