Python Web Scraping 简明教程

Python Web Scraping - Quick Guide

Python Web Scraping - Introduction

Web 抓取是从 Web 中提取信息的自动化过程。本章将向您提供有关 Web 抓取、它与 Web 爬网的比较以及为什么您应该选择 Web 抓取的深入了解。您还将了解 Web 抓取工具的组成部分和工作原理。

Web scraping is an automatic process of extracting information from web. This chapter will give you an in-depth idea of web scraping, its comparison with web crawling, and why you should opt for web scraping. You will also learn about the components and working of a web scraper.

What is Web Scraping?

“抓取”一词的字典含义是指从 Web 上获取某些内容。这里出现两个问题:我们可以从网络上获取什么以及如何获取。

The dictionary meaning of word ‘Scrapping’ implies getting something from the web. Here two questions arise: What we can get from the web and How to get that.

第一个问题的答案是 ‘data’ 。数据对于任何程序员来说都是必不可少的,每个编程项目的基本要求是大量有用的数据。

The answer to the first question is ‘data’. Data is indispensable for any programmer and the basic requirement of every programming project is the large amount of useful data.

第二个问题的答案有点棘手,因为获取数据的方法有很多。通常,我们可能从数据库或数据文件和其他来源获取数据。但是,如果我们需要大量可在线获取的数据怎么办?获取此类数据的一种方法是手动搜索(在 Web 浏览器中点击)并保存(复制粘贴到电子表格或文件)所需的数据。这种方法相当乏味且耗时。获取此类数据的另一种方法是使用 web scraping

The answer to the second question is a bit tricky, because there are lots of ways to get data. In general, we may get data from a database or data file and other sources. But what if we need large amount of data that is available online? One way to get such kind of data is to manually search (clicking away in a web browser) and save (copy-pasting into a spreadsheet or file) the required data. This method is quite tedious and time consuming. Another way to get such data is using web scraping.

Web scraping ,也称为 web data miningweb harvesting ,是构建一个代理的过程,该代理可以自动从 Web 中提取、解析、下载和组织有用的信息。换句话说,我们可以说,Web 抓取软件将自动加载和提取来自多个网站的数据,而不是手动从网站中保存数据,并根据我们的要求进行提取。

Web scraping, also called web data mining or web harvesting, is the process of constructing an agent which can extract, parse, download and organize useful information from the web automatically. In other words, we can say that instead of manually saving the data from websites, the web scraping software will automatically load and extract data from multiple websites as per our requirement.

Origin of Web Scraping

Web 抓取的起源是屏幕抓取,它用于集成非基于 Web 的应用程序或本机 Windows 应用程序。最初,屏幕抓取是在万维网 (WWW) 广泛使用之前使用的,但它无法扩展 WWW。这使得有必要将屏幕抓取方法自动化, ‘Web Scraping’ 技术应运而生。

The origin of web scraping is screen scrapping, which was used to integrate non-web based applications or native windows applications. Originally screen scraping was used prior to the wide use of World Wide Web (WWW), but it could not scale up WWW expanded. This made it necessary to automate the approach of screen scraping and the technique called ‘Web Scraping’ came into existence.

Web Crawling v/s Web Scraping

术语 Web 爬网和抓取经常互换使用,因为它们的基本概念是提取数据。但是,它们是彼此不同的。我们可以从它们的定义中理解基本差异。

The terms Web Crawling and Scraping are often used interchangeably as the basic concept of them is to extract data. However, they are different from each other. We can understand the basic difference from their definitions.

网络爬取基本上用于使用机器人(又名爬虫)对页面上的信息进行索引。它也称为 indexing 。另一方面,网络抓取是一种使用机器人(又名抓取器)提取信息自动化方式。它也称为 data extraction

Web crawling is basically used to index the information on the page using bots aka crawlers. It is also called indexing. On the hand, web scraping is an automated way of extracting the information using bots aka scrapers. It is also called data extraction.

为了理解这两个术语之间的差异,让我们看看下面给出的比较表 -

To understand the difference between these two terms, let us look into the comparison table given hereunder −

Web Crawling

Web Scraping

Refers to downloading and storing the contents of a large number of websites.

Refers to extracting individual data elements from the website by using a site-specific structure.

Mostly done on large scale.

Can be implemented at any scale.

Yields generic information.

Yields specific information.

Used by major search engines like Google, Bing, Yahoo. Googlebot is an example of a web crawler.

The information extracted using web scraping can be used to replicate in some other website or can be used to perform data analysis. For example the data elements can be names, address, price etc.

Uses of Web Scraping

使用网络抓取的原因和用途和万维网的用途一样无穷无尽。网络抓取器可以像人类一样执行任何操作,比如在线订餐、为你扫描在线购物网站,并在它们可用时购买比赛门票。下面讨论了网络抓取的一些重要用途 -

The uses and reasons for using web scraping are as endless as the uses of the World Wide Web. Web scrapers can do anything like ordering online food, scanning online shopping website for you and buying ticket of a match the moment they are available etc. just like a human can do. Some of the important uses of web scraping are discussed here −

  1. E-commerce Websites − Web scrapers can collect the data specially related to the price of a specific product from various e-commerce websites for their comparison.

  2. Content Aggregators − Web scraping is used widely by content aggregators like news aggregators and job aggregators for providing updated data to their users.

  3. Marketing and Sales Campaigns − Web scrapers can be used to get the data like emails, phone number etc. for sales and marketing campaigns.

  4. Search Engine Optimization (SEO) − Web scraping is widely used by SEO tools like SEMRush, Majestic etc. to tell business how they rank for search keywords that matter to them.

  5. Data for Machine Learning Projects − Retrieval of data for machine learning projects depends upon web scraping.

Data for Research - 研究人员可以通过这种自动化过程节省时间,为其研究工作收集有用的数据。

Data for Research − Researchers can collect useful data for the purpose of their research work by saving their time by this automated process.

Components of a Web Scraper

网络抓取器包括以下组件 -

A web scraper consists of the following components −

Web Crawler Module

网络抓取器的一个非常必要的组件网络爬虫模块,用于通过对 URL 发出 HTTP 或 HTTPS 请求在目标网站上导航。抓取器下载非结构化数据(HTML 内容)并将其传递给提取器(下一个模块)。

A very necessary component of web scraper, web crawler module, is used to navigate the target website by making HTTP or HTTPS request to the URLs. The crawler downloads the unstructured data (HTML contents) and passes it to extractor, the next module.

Extractor

提取器处理提取的 HTML 内容并将数据提取为半结构化格式。这也称为解析器模块,并使用正则表达式、HTML 解析、DOM 解析或人工智能等不同的解析技术来运行。

The extractor processes the fetched HTML content and extracts the data into semistructured format. This is also called as a parser module and uses different parsing techniques like Regular expression, HTML Parsing, DOM parsing or Artificial Intelligence for its functioning.

Data Transformation and Cleaning Module

提取的上述数据不适合直接使用。它必须通过一些清理模块,以便我们能使用它。字符串操作或正则表达式等方法可用于此目的。请注意,提取和转换也可以一步完成。

The data extracted above is not suitable for ready use. It must pass through some cleaning module so that we can use it. The methods like String manipulation or regular expression can be used for this purpose. Note that extraction and transformation can be performed in a single step also.

Storage Module

在提取数据之后,我们需要根据我们的需求来存储数据。存储模块将以标准格式输出数据,该数据可以存储在数据库或 JSON 或 CSV 格式中。

After extracting the data, we need to store it as per our requirement. The storage module will output the data in a standard format that can be stored in a database or JSON or CSV format.

Working of a Web Scraper

网络抓取可定义为用于下载多个网页的内容并从中提取数据的一个软件或脚本。

Web scraper may be defined as a software or script used to download the contents of multiple web pages and extracting data from it.

web scraper

我们从上图的图表中,可以简单几个步骤了解网络抓取的工作原理。

We can understand the working of a web scraper in simple steps as shown in the diagram given above.

Step 1: Downloading Contents from Web Pages

在此步骤中,网络抓取将从多个网页中下载所请求的内容。

In this step, a web scraper will download the requested contents from multiple web pages.

Step 2: Extracting Data

网站上的数据是 HTML 格式,且大部分是非结构化的。因此,在此步骤中,网络抓取将从下载的内容中解析并提取结构化的数据。

The data on websites is HTML and mostly unstructured. Hence, in this step, web scraper will parse and extract structured data from the downloaded contents.

Step 3: Storing the Data

此处,网络抓取将存储和保存以 CSV、JSON 或者数据库等格式提取的数据。

Here, a web scraper will store and save the extracted data in any of the format like CSV, JSON or in database.

Step 4: Analyzing the Data

在完成所有这些步骤后,网络抓取将分析所获得的数据。

After all these steps are successfully done, the web scraper will analyze the data thus obtained.

Getting Started with Python

在第一章中,我们学习了网络抓取是什么。在本章中,让我们看看如何使用Python来实现网络抓取。

In the first chapter, we have learnt what web scraping is all about. In this chapter, let us see how to implement web scraping using Python.

Why Python for Web Scraping?

Python是实现网络抓取的流行工具。Python编程语言也用于与网络安全、渗透测试以及数字取证应用程序相关的其他有用的项目。使用Python的基本编程,可以在不使用任何其他第三方工具的情况下执行网络抓取。

Python is a popular tool for implementing web scraping. Python programming language is also used for other useful projects related to cyber security, penetration testing as well as digital forensic applications. Using the base programming of Python, web scraping can be performed without using any other third party tool.

Python编程语言正获得巨大的欢迎,使Python非常适合网络抓取项目的理由如下:

Python programming language is gaining huge popularity and the reasons that make Python a good fit for web scraping projects are as below −

Syntax Simplicity

与其他编程语言相比,Python具有最简单的结构。Python的这个特性使其测试更加容易,开发人员可以更多地关注编程。

Python has the simplest structure when compared to other programming languages. This feature of Python makes the testing easier and a developer can focus more on programming.

Inbuilt Modules

使用Python进行网络抓取的另一个原因是它拥有内置的和外部的有用库。我们可以通过使用Python作为编程的基础,执行与网络抓取相关的许多实现。

Another reason for using Python for web scraping is the inbuilt as well as external useful libraries it possesses. We can perform many implementations related to web scraping by using Python as the base for programming.

Open Source Programming Language

Python获得了社区的巨大支持,因为它是一种开源编程语言。

Python has huge support from the community because it is an open source programming language.

Wide range of Applications

Python可用于从小型shell脚本到企业web应用程序的各种编程任务。

Python can be used for various programming tasks ranging from small shell scripts to enterprise web applications.

Installation of Python

Python发行版适用于Windows、MAC和Unix/Linux等平台。我们只需要下载适用于我们平台的二进制代码即可安装Python。但是如果我们平台的二进制代码不可用,则我们必须有一个C编译器,以便可以手工编译源代码。

Python distribution is available for platforms like Windows, MAC and Unix/Linux. We need to download only the binary code applicable for our platform to install Python. But in case if the binary code for our platform is not available, we must have a C compiler so that source code can be compiled manually.

我们可在不同平台上安装 Python,方法如下 −

We can install Python on various platforms as follows −

Installing Python on Unix and Linux

您需要执行以下步骤才能在 Unix/Linux 机器上安装 Python −

You need to followings steps given below to install Python on Unix/Linux machines −

Step 1 − 访问链接 https://www.python.org/downloads/

Step 1 − Go to the link https://www.python.org/downloads/

Step 2 − 下载适用于 Unix/Linux 的压缩源代码,这是在上述链接中提供的。

Step 2 − Download the zipped source code available for Unix/Linux on above link.

Step 3 − 将这些文件解压到您的机器上。

Step 3 − Extract the files onto your computer.

Step 4 − 使用以下命令完成安装 −

Step 4 − Use the following commands to complete the installation −

run ./configure script
make
make install

您可以在标准位置 /usr/local/bin 中找到已安装的 Python,其库位于 /usr/local/lib/pythonXX ,其中 XX 是 Python 的版本。

You can find installed Python at the standard location /usr/local/bin and its libraries at /usr/local/lib/pythonXX, where XX is the version of Python.

Installing Python on Windows

您需要执行以下步骤才能在 Windows 机器上安装 Python −

You need to followings steps given below to install Python on Windows machines −

Step 1 − 访问链接 https://www.python.org/downloads/

Step 1 − Go to the link https://www.python.org/downloads/

Step 2 − 下载 Windows 安装程序 python-XYZ.msi 文件,其中 XYZ 是我们需要安装的版本。

Step 2 − Download the Windows installer python-XYZ.msi file, where XYZ is the version we need to install.

Step 3 − 现在,将安装程序文件保存在您的本地机器中并运行 MSI 文件。

Step 3 − Now, save the installer file to your local machine and run the MSI file.

Step 4 − 最后,运行下载的文件,调出 Python 安装向导。

Step 4 − At last, run the downloaded file to bring up the Python install wizard.

Installing Python on Macintosh

我们必须使用 Homebrew 来在 Mac OS X 上安装 Python 3。Homebrew 易于安装且是一个出色的软件包安装程序。

We must use Homebrew for installing Python 3 on Mac OS X. Homebrew is easy to install and a great package installer.

也可以使用以下命令安装 Homebrew −

Homebrew can also be installed by using the following command −

$ ruby -e "$(curl -fsSL
https://raw.githubusercontent.com/Homebrew/install/master/install)"

为更新软件包管理器,我们可以使用以下命令 −

For updating the package manager, we can use the following command −

$ brew update

借助以下命令,我们可在我们的 MAC 机器上安装 Python3 −

With the help of the following command, we can install Python3 on our MAC machine −

$ brew install python3

Setting Up the PATH

您可以使用以下说明设置不同环境中的路径 −

You can use the following instructions to set up the path on various environments −

Setting Up the Path on Unix/Linux

使用以下命令通过不同的命令 shell 设置路径 −

Use the following commands for setting up paths using various command shells −

For csh shell

setenv PATH "$PATH:/usr/local/bin/python".

For bash shell (Linux)

ATH="$PATH:/usr/local/bin/python".

For sh or ksh shell

PATH="$PATH:/usr/local/bin/python".

Setting Up the Path on Windows

为设置 Windows 上的路径,我们可以在命令提示符处使用以下路径 %path%;C:\Python ,然后按 Enter。

For setting the path on Windows, we can use the path %path%;C:\Python at the command prompt and then press Enter.

Running Python

我们可以通过以下三种方式中的任何一种启动 Python −

We can start Python using any of the following three ways −

Interactive Interpreter

可使用提供命令行解释器或 shell 的操作系统,如 UNIX 和 DOS 来启动 Python。

An operating system such as UNIX and DOS that is providing a command-line interpreter or shell can be used for starting Python.

我们可以按照以下方式在交互解释器中开始编码:

We can start coding in interactive interpreter as follows −

Step 1 − 在命令行中输入 python

Step 1 − Enter python at the command line.

Step 2 - 然后,我们可以在交互解释器中立即开始编码。

Step 2 − Then, we can start coding right away in the interactive interpreter.

$python # Unix/Linux
or
python% # Unix/Linux
or
C:> python # Windows/DOS

Script from the Command-line

我们可以通过调用解释器来在命令行执行 Python 脚本。它可以理解为以下内容:

We can execute a Python script at command line by invoking the interpreter. It can be understood as follows −

$python script.py # Unix/Linux
or
python% script.py # Unix/Linux
or
C: >python script.py # Windows/DOS

Integrated Development Environment

如果系统具有支持 Python 的 GUI 应用程序,我们还可以从 GUI 环境运行 Python。以下列出了一些在各种平台上支持 Python 的集成开发环境:

We can also run Python from GUI environment if the system is having GUI application that is supporting Python. Some IDEs that support Python on various platforms are given below −

IDE for UNIX - UNIX 针对 Python 具有 IDLE IDE。

IDE for UNIX − UNIX, for Python, has IDLE IDE.

IDE for Windows - Windows 具有具有 GUI 的 PythonWin IDE。

IDE for Windows − Windows has PythonWin IDE which has GUI too.

IDE for Macintosh - Macintosh 具有 IDLE IDE,可以从主网站下载为 MacBinary 或 BinHex’d 文件。

IDE for Macintosh − Macintosh has IDLE IDE which is downloadable as either MacBinary or BinHex’d files from the main website.

Python Modules for Web Scraping

在本章中,让我们了解我们可以用于网页抓取的各种 Python 模块。

In this chapter, let us learn various Python modules that we can use for web scraping.

Python Development Environments using virtualenv

Virtualenv 是一个创建隔离的 Python 环境的工具。借助 virtualenv,我们可以创建一个文件夹,其中包含使用 Python 项目所需软件包的所有必需的可执行文件。它还允许我们添加和修改 Python 模块,而无需访问全局安装。

Virtualenv is a tool to create isolated Python environments. With the help of virtualenv, we can create a folder that contains all necessary executables to use the packages that our Python project requires. It also allows us to add and modify Python modules without access to the global installation.

你可以使用以下命令安装 virtualenv

You can use the following command to install virtualenv

(base) D:\ProgramData>pip install virtualenv
Collecting virtualenv
   Downloading
https://files.pythonhosted.org/packages/b6/30/96a02b2287098b23b875bc8c2f58071c3
5d2efe84f747b64d523721dc2b5/virtualenv-16.0.0-py2.py3-none-any.whl
(1.9MB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 1.9MB 86kB/s
Installing collected packages: virtualenv
Successfully installed virtualenv-16.0.0

现在,我们需要创建一个目录,该目录将使用以下命令来表示项目 −

Now, we need to create a directory which will represent the project with the help of following command −

(base) D:\ProgramData>mkdir webscrap

现在,使用以下命令进入该目录 −

Now, enter into that directory with the help of this following command −

(base) D:\ProgramData>cd webscrap

现在,我们需要按以下方式初始化我们选择的虚拟环境文件夹 −

Now, we need to initialize virtual environment folder of our choice as follows −

(base) D:\ProgramData\webscrap>virtualenv websc
Using base prefix 'd:\\programdata'
New python executable in D:\ProgramData\webscrap\websc\Scripts\python.exe
Installing setuptools, pip, wheel...done.

现在,使用下面给出的命令激活虚拟环境。一旦成功激活,你将在左侧的括号中看到它的名称。

Now, activate the virtual environment with the command given below. Once successfully activated, you will see the name of it on the left hand side in brackets.

(base) D:\ProgramData\webscrap>websc\scripts\activate

我们可以在此环境中安装任何模块,如下所示 −

We can install any module in this environment as follows −

(websc) (base) D:\ProgramData\webscrap>pip install requests
Collecting requests
   Downloading
https://files.pythonhosted.org/packages/65/47/7e02164a2a3db50ed6d8a6ab1d6d60b69
c4c3fdf57a284257925dfc12bda/requests-2.19.1-py2.py3-none-any.whl (9
1kB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 92kB 148kB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests)
   Downloading
https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca
55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133
kB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 143kB 369kB/s
Collecting certifi>=2017.4.17 (from requests)
   Downloading
https://files.pythonhosted.org/packages/df/f7/04fee6ac349e915b82171f8e23cee6364
4d83663b34c539f7a09aed18f9e/certifi-2018.8.24-py2.py3-none-any.whl
(147kB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 153kB 527kB/s
Collecting urllib3<1.24,>=1.21.1 (from requests)
   Downloading
https://files.pythonhosted.org/packages/bd/c9/6fdd990019071a4a32a5e7cb78a1d92c5
3851ef4f56f62a3486e6a7d8ffb/urllib3-1.23-py2.py3-none-any.whl (133k
B)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 143kB 517kB/s
Collecting idna<2.8,>=2.5 (from requests)
   Downloading
https://files.pythonhosted.org/packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746
a97fab05a372e4a2c6a6b876165/idna-2.7-py2.py3-none-any.whl (58kB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 61kB 339kB/s
Installing collected packages: chardet, certifi, urllib3, idna, requests
Successfully installed certifi-2018.8.24 chardet-3.0.4 idna-2.7 requests-2.19.1
urllib3-1.23

为了停用虚拟环境,我们可以使用以下命令 −

For deactivating the virtual environment, we can use the following command −

(websc) (base) D:\ProgramData\webscrap>deactivate
(base) D:\ProgramData\webscrap>

你可以看到 (websc) 已停用。

You can see that (websc) has been deactivated.

Python Modules for Web Scraping

网页抓取是构建一个代理的过程,该代理可以自动从网络中提取、解析、下载和整理有用的信息。换句话说,网页抓取软件将根据我们的要求自动加载和提取来自多个网站的数据,而不是手动从网站保存数据。

Web scraping is the process of constructing an agent which can extract, parse, download and organize useful information from the web automatically. In other words, instead of manually saving the data from websites, the web scraping software will automatically load and extract data from multiple websites as per our requirement.

在本节中,我们将讨论用于网页抓取的有用的 Python 库。

In this section, we are going to discuss about useful Python libraries for web scraping.

Requests

这是一个简单的 python 网页抓取库。它是一个用于访问网页的高效 HTTP 库。借助 Requests ,我们可以获取网页的原始 HTML,然后对其进行解析以检索数据。在使用 requests 之前,让我们了解一下它的安装。

It is a simple python web scraping library. It is an efficient HTTP library used for accessing web pages. With the help of Requests, we can get the raw HTML of web pages which can then be parsed for retrieving the data. Before using requests, let us understand its installation.

Installing Requests

我们可以在我们的虚拟环境中或全局安装中安装它。使用 pip 命令,我们可以轻松地将其安装如下 −

We can install it in either on our virtual environment or on the global installation. With the help of pip command, we can easily install it as follows −

(base) D:\ProgramData> pip install requests
Collecting requests
Using cached
https://files.pythonhosted.org/packages/65/47/7e02164a2a3db50ed6d8a6ab1d6d60b69
c4c3fdf57a284257925dfc12bda/requests-2.19.1-py2.py3-none-any.whl
Requirement already satisfied: idna<2.8,>=2.5 in d:\programdata\lib\sitepackages
(from requests) (2.6)
Requirement already satisfied: urllib3<1.24,>=1.21.1 in
d:\programdata\lib\site-packages (from requests) (1.22)
Requirement already satisfied: certifi>=2017.4.17 in d:\programdata\lib\sitepackages
(from requests) (2018.1.18)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in
d:\programdata\lib\site-packages (from requests) (3.0.4)
Installing collected packages: requests
Successfully installed requests-2.19.1

Example

在这个示例中,我们正在对网页发出 GET HTTP 请求。为此,我们需要首先导入 requests 库,如下所示 −

In this example, we are making a GET HTTP request for a web page. For this we need to first import requests library as follows −

In [1]: import requests

在以下代码行中,我们使用 requests 对 url: https://authoraditiagarwal.com/ 发出 GET HTTP 请求,即通过发出 GET 请求。

In this following line of code, we use requests to make a GET HTTP requests for the url: https://authoraditiagarwal.com/ by making a GET request.

In [2]: r = requests.get('https://authoraditiagarwal.com/')

现在我们可以使用 .text 属性检索内容,如下所示 −

Now we can retrieve the content by using .text property as follows −

In [5]: r.text[:200]

请注意,在以下输出中,我们获得了前 200 个字符。

Observe that in the following output, we got the first 200 characters.

Out[5]: '<!DOCTYPE html>\n<html lang="en-US"\n\titemscope
\n\titemtype="http://schema.org/WebSite" \n\tprefix="og: http://ogp.me/ns#"
>\n<head>\n\t<meta charset
="UTF-8" />\n\t<meta http-equiv="X-UA-Compatible" content="IE'

Urllib3

这是另一个 Python 库,可用于从类似于 requests 库的 URL 中检索数据。你可以阅读其技术文档 https://urllib3.readthedocs.io/en/latest/ 了解更多信息。

It is another Python library that can be used for retrieving data from URLs similar to the requests library. You can read more on this at its technical documentation at https://urllib3.readthedocs.io/en/latest/.

Installing Urllib3

使用 pip 命令,我们可以在我们的虚拟环境中或在全局安装中安装 urllib3

Using the pip command, we can install urllib3 either in our virtual environment or in global installation.

(base) D:\ProgramData>pip install urllib3
Collecting urllib3
Using cached
https://files.pythonhosted.org/packages/bd/c9/6fdd990019071a4a32a5e7cb78a1d92c5
3851ef4f56f62a3486e6a7d8ffb/urllib3-1.23-py2.py3-none-any.whl
Installing collected packages: urllib3
Successfully installed urllib3-1.23

Example: Scraping using Urllib3 and BeautifulSoup

在以下示例中,我们正在使用 Urllib3BeautifulSoup 来抓取网页。我们正在使用 Urllib3 来代替 requests 库,从网页中获取原始数据(HTML)。然后我们使用 BeautifulSoup 解析该 HTML 数据。

In the following example, we are scraping the web page by using Urllib3 and BeautifulSoup. We are using Urllib3 at the place of requests library for getting the raw data (HTML) from web page. Then we are using BeautifulSoup for parsing that HTML data.

import urllib3
from bs4 import BeautifulSoup
http = urllib3.PoolManager()
r = http.request('GET', 'https://authoraditiagarwal.com')
soup = BeautifulSoup(r.data, 'lxml')
print (soup.title)
print (soup.title.text)

这是在你运行此代码时会观察到的输出 −

This is the output you will observe when you run this code −

<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal

Selenium

它是一个跨浏览器的开源自动化测试套件,适用于不同的平台和应用程序。它不是一个单一的工具,而是一套软件。我们有面向 Python、Java、C#、Ruby 和 JavaScript 的 selenium 绑定。在这里,我们将使用 selenium 及其 Python 绑定来执行网络抓取。你可以在链接 Selenium 上了解有关使用 Java 进行 Selenium 的更多信息。

It is an open source automated testing suite for web applications across different browsers and platforms. It is not a single tool but a suite of software. We have selenium bindings for Python, Java, C#, Ruby and JavaScript. Here we are going to perform web scraping by using selenium and its Python bindings. You can learn more about Selenium with Java on the link Selenium.

Selenium Python 绑定提供了一个便捷的 API,可以访问 Selenium WebDrivers,如 Firefox、IE、Chrome、Remote 等。当前支持的 Python 版本为 2.7、3.5 及更高版本。

Selenium Python bindings provide a convenient API to access Selenium WebDrivers like Firefox, IE, Chrome, Remote etc. The current supported Python versions are 2.7, 3.5 and above.

Installing Selenium

使用 pip 命令,我们可以在我们的虚拟环境中或在全局安装中安装 urllib3

Using the pip command, we can install urllib3 either in our virtual environment or in global installation.

pip install selenium

由于 selenium 需要一个驱动程序来与所选浏览器进行交互,因此我们需要下载它。下表显示了不同的浏览器及其用于下载同一浏览器的链接。

As selenium requires a driver to interface with the chosen browser, we need to download it. The following table shows different browsers and their links for downloading the same.

Chrome

https://sites.google.com/a/chromium.org/

Edge

https://developer.microsoft.com/

Firefox

https://github.com/

Safari

https://webkit.org/

Example

此示例显示了使用 selenium 的网络抓取。它还可以用于称为 selenium 测试的测试。

This example shows web scraping using selenium. It can also be used for testing which is called selenium testing.

在为指定版本的浏览器下载特定驱动程序后,我们需要用 Python 编程。

After downloading the particular driver for the specified version of browser, we need to do programming in Python.

首先,需要从 selenium 导入 webdriver ,如下所示 −

First, need to import webdriver from selenium as follows −

from selenium import webdriver

现在,提供我们根据要求下载的 web driver 的路径 −

Now, provide the path of web driver which we have downloaded as per our requirement −

path = r'C:\\Users\\gaurav\\Desktop\\Chromedriver'
browser = webdriver.Chrome(executable_path = path)

现在,提供我们想要在现在由我们的 Python 脚本控制的 web 浏览器中打开的 url。

Now, provide the url which we want to open in that web browser now controlled by our Python script.

browser.get('https://authoraditiagarwal.com/leadershipmanagement')

我们还可以通过提供 lxml 中提供的 xpath 来抓取特定元素。

We can also scrape a particular element by providing the xpath as provided in lxml.

browser.find_element_by_xpath('/html/body').click()

你可以在受 Python 脚本控制的浏览器中查看输出。

You can check the browser, controlled by Python script, for output.

Scrapy

Scrapy 是一个快速、开放源代码的网络抓取框架,是用 Python 编写的,用于借助基于 XPath 的选择器从网页中提取数据。Scrapy 最初于 2008 年 6 月 26 日发布,获得 BSD 许可,1.0 里程碑于 2015 年 6 月发布。它为我们提供了从网站提取、处理和构造数据所需的所有工具。

Scrapy is a fast, open-source web crawling framework written in Python, used to extract the data from the web page with the help of selectors based on XPath. Scrapy was first released on June 26, 2008 licensed under BSD, with a milestone 1.0 releasing in June 2015. It provides us all the tools we need to extract, process and structure the data from websites.

Installing Scrapy

使用 pip 命令,我们可以在我们的虚拟环境中或在全局安装中安装 urllib3

Using the pip command, we can install urllib3 either in our virtual environment or in global installation.

pip install scrapy

欲获取Scrapy的更多详细研究,您可以访问链接 Scrapy

For more detail study of Scrapy you can go to the link Scrapy

Legality of Web Scraping

使用 Python,我们可以抓取任何网站或网页上的特定元素,但你知道这是合法的还是非法的吗?在抓取任何网站之前,我们必须了解 Web 抓取的合法性。本章将解释与 Web 抓取的合法性相关的内容。

With Python, we can scrape any website or particular elements of a web page but do you have any idea whether it is legal or not? Before scraping any website we must have to know about the legality of web scraping. This chapter will explain the concepts related to legality of web scraping.

Introduction

一般来说,如果你要将抓取到的数据用于个人用途,那么可能没有任何问题。但如果你要重新发布这些数据,那么在这样做之前,你应该向所有人请求下载,或者对政策以及你将要抓取的数据做一些背景调查。

Generally, if you are going to use the scraped data for personal use, then there may not be any problem. But if you are going to republish that data, then before doing the same you should make download request to the owner or do some background research about policies as well about the data you are going to scrape.

Research Required Prior to Scraping

如果你以一个网站为目标从其抓取数据,我们需要了解它的规模和结构。以下是我们开始抓取之前需要分析的一些文件。

If you are targeting a website for scraping data from it, we need to understand its scale and structure. Following are some of the files which we need to analyze before starting web scraping.

Analyzing robots.txt

实际上,大多数发布商都允许程序员在一定程度上抓取他们的网站。换句话说,发布商希望抓取网站的特定部分。为了定义这一点,网站必须提出一些规则,说明哪些部分可以抓取,哪些部分不能抓取。这些规则在名为 robots.txt 的文件中定义。

Actually most of the publishers allow programmers to crawl their websites at some extent. In other sense, publishers want specific portions of the websites to be crawled. To define this, websites must put some rules for stating which portions can be crawled and which cannot be. Such rules are defined in a file called robots.txt.

robots.txt 是人类可读的文件,用于识别允许抓取和不允许抓取网站部分的部分。robots.txt 文件没有标准格式,网站的发布商可以根据他们的需求进行修改。我们可以通过在网站的 url 后面加上斜杠和 robots.txt 来检查特定网站的 robots.txt 文件。例如,如果我们想检查 Google.com 的文件,那么我们需要键入 https://www.google.com/robots.txt ,我们将得到以下内容:

robots.txt is human readable file used to identify the portions of the website that crawlers are allowed as well as not allowed to scrape. There is no standard format of robots.txt file and the publishers of website can do modifications as per their needs. We can check the robots.txt file for a particular website by providing a slash and robots.txt after url of that website. For example, if we want to check it for Google.com, then we need to type https://www.google.com/robots.txt and we will get something as follows −

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
and so on……..

网站 robots.txt 文件中定义的一些最常见的规则如下:

Some of the most common rules that are defined in a website’s robots.txt file are as follows −

   User-agent: BadCrawler
Disallow: /

以上规则表示 robots.txt 文件要求具有 BadCrawler 用户代理的抓取工具程序不要抓取他们的网站。

The above rule means the robots.txt file asks a crawler with BadCrawler user agent not to crawl their website.

User-agent: *
Crawl-delay: 5
Disallow: /trap

以上规则表示 robots.txt 文件会将带有所有用户代理的抓取工具程序下载请求之间的延迟设定为 5 秒,以避免服务器过载。 /trap 链接将尝试阻止恶意爬虫,这些爬虫会跟随不允许的链接。网站发布者可以根据他们的要求定义更多规则。这里讨论了其中一些规则:

The above rule means the robots.txt file delays a crawler for 5 seconds between download requests for all user-agents for avoiding overloading server. The /trap link will try to block malicious crawlers who follow disallowed links. There are many more rules that can be defined by the publisher of the website as per their requirements. Some of them are discussed here −

Analyzing Sitemap files

如果你想抓取一个网站以获取更新的信息,你该怎么做?你将抓取每个网页以获取更新的信息,但这会增加该特定网站的服务器流量。这就是网站提供站点地图文件帮助抓取工具程序在无需抓取每个网页的情况下找到更新内容的原因。站点地图标准定义在 http://www.sitemaps.org/protocol.html

What you supposed to do if you want to crawl a website for updated information? You will crawl every web page for getting that updated information, but this will increase the server traffic of that particular website. That is why websites provide sitemap files for helping the crawlers to locate updating content without needing to crawl every web page. Sitemap standard is defined at http://www.sitemaps.org/protocol.html.

Content of Sitemap file

以下是 robot.txt 文件中发现的 https://www.microsoft.com/robots.txt 站点地图文件的内容:

The following is the content of sitemap file of https://www.microsoft.com/robots.txt that is discovered in robot.txt file −

Sitemap: https://www.microsoft.com/en-us/explore/msft_sitemap_index.xml
Sitemap: https://www.microsoft.com/learning/sitemap.xml
Sitemap: https://www.microsoft.com/en-us/licensing/sitemap.xml
Sitemap: https://www.microsoft.com/en-us/legal/sitemap.xml
Sitemap: https://www.microsoft.com/filedata/sitemaps/RW5xN8
Sitemap: https://www.microsoft.com/store/collections.xml
Sitemap: https://www.microsoft.com/store/productdetailpages.index.xml
Sitemap: https://www.microsoft.com/en-us/store/locations/store-locationssitemap.xml

以上内容显示,站点地图列出了网站上的 URL,并进一步允许管理员指定一些其他信息,例如,每个 URL 的上次更新日期、内容更改、与其他 URL 的相关性重要性等。

The above content shows that the sitemap lists the URLs on website and further allows a webmaster to specify some additional information like last updated date, change of contents, importance of URL with relation to others etc. about each URL.

What is the Size of Website?

网站的大小是否会影响我们的抓取方式?即网站的网页数量是否会影响我们的抓取方式?当然是的。因为如果我们抓取的网页数量较少,那么效率就不是一个严重的问题,但假设我们的网站有数百万个网页,例如 Microsoft.com,那么按顺序下载每个网页需要几个月的时间,那么效率将成为一个严重的问题。

Is the size of a website, i.e. the number of web pages of a website affects the way we crawl? Certainly yes. Because if we have less number of web pages to crawl, then the efficiency would not be a serious issue, but suppose if our website has millions of web pages, for example Microsoft.com, then downloading each web page sequentially would take several months and then efficiency would be a serious concern.

Checking Website’s Size

通过检查 Google 抓取工具程序结果的大小,我们可以估计一个网站的大小。在进行 Google 搜索时,我们可以使用关键词 site 来过滤我们的结果。例如,估计 https://authoraditiagarwal.com/ 的大小如下:

By checking the size of result of Google’s crawler, we can have an estimate of the size of a website. Our result can be filtered by using the keyword site while doing the Google search. For example, estimating the size of https://authoraditiagarwal.com/ is given below −

checking the size

你可以看到大约有 60 个结果,这意味着它不是一个大网站,抓取不会导致效率问题。

You can see there are around 60 results which mean it is not a big website and crawling would not lead the efficiency issue.

Which technology is used by website?

另一个重要问题是网站使用的技术是否会影响我们的抓取方式?是的,它会影响。但是我们如何检查一个网站使用的技术?有一个名为 builtwith 的 Python 库,借助它,我们可以找出网站使用的技术。

Another important question is whether the technology used by website affects the way we crawl? Yes, it affects. But how we can check about the technology used by a website? There is a Python library named builtwith with the help of which we can find out about the technology used by a website.

Example

在这个示例中,我们将在 Python 库 builtwith 的帮助下检查网站 https://authoraditiagarwal.com 使用的技术。但在使用此库之前,我们需要按照以下步骤进行安装:

In this example we are going to check the technology used by the website https://authoraditiagarwal.com with the help of Python library builtwith. But before using this library, we need to install it as follows −

(base) D:\ProgramData>pip install builtwith
Collecting builtwith
   Downloading
https://files.pythonhosted.org/packages/9b/b8/4a320be83bb3c9c1b3ac3f9469a5d66e0
2918e20d226aa97a3e86bddd130/builtwith-1.3.3.tar.gz
Requirement already satisfied: six in d:\programdata\lib\site-packages (from
builtwith) (1.10.0)
Building wheels for collected packages: builtwith
   Running setup.py bdist_wheel for builtwith ... done
   Stored in directory:
C:\Users\gaurav\AppData\Local\pip\Cache\wheels\2b\00\c2\a96241e7fe520e75093898b
f926764a924873e0304f10b2524
Successfully built builtwith
Installing collected packages: builtwith
Successfully installed builtwith-1.3.3

现在,借助以下简单的代码行,我们可以检查特定网站使用的技术:

Now, with the help of following simple line of codes we can check the technology used by a particular website −

In [1]: import builtwith
In [2]: builtwith.parse('http://authoraditiagarwal.com')
Out[2]:
{'blogs': ['PHP', 'WordPress'],
   'cms': ['WordPress'],
   'ecommerce': ['WooCommerce'],
   'font-scripts': ['Font Awesome'],
   'javascript-frameworks': ['jQuery'],
   'programming-languages': ['PHP'],
   'web-servers': ['Apache']}

Who is the owner of website?

网站的所有者也很重要,因为如果所有人以封锁抓取工具程序出名,那么爬虫在从网站抓取数据时必须小心。有一个名为 Whois 的协议,借助它,我们可以找出网站的所有者。

The owner of the website also matters because if the owner is known for blocking the crawlers, then the crawlers must be careful while scraping the data from website. There is a protocol named Whois with the help of which we can find out about the owner of the website.

Example

在这个示例中,我们将在 Whois 的帮助下检查网站 say microsoft.com 的所有者。但在使用此库之前,我们需要按照以下步骤进行安装:

In this example we are going to check the owner of the website say microsoft.com with the help of Whois. But before using this library, we need to install it as follows −

(base) D:\ProgramData>pip install python-whois
Collecting python-whois
   Downloading
https://files.pythonhosted.org/packages/63/8a/8ed58b8b28b6200ce1cdfe4e4f3bbc8b8
5a79eef2aa615ec2fef511b3d68/python-whois-0.7.0.tar.gz (82kB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 92kB 164kB/s
Requirement already satisfied: future in d:\programdata\lib\site-packages (from
python-whois) (0.16.0)
Building wheels for collected packages: python-whois
   Running setup.py bdist_wheel for python-whois ... done
   Stored in directory:
C:\Users\gaurav\AppData\Local\pip\Cache\wheels\06\cb\7d\33704632b0e1bb64460dc2b
4dcc81ab212a3d5e52ab32dc531
Successfully built python-whois
Installing collected packages: python-whois
Successfully installed python-whois-0.7.0

现在,借助以下简单的代码行,我们可以检查特定网站使用的技术:

Now, with the help of following simple line of codes we can check the technology used by a particular website −

In [1]: import whois
In [2]: print (whois.whois('microsoft.com'))
{
   "domain_name": [
      "MICROSOFT.COM",
      "microsoft.com"
   ],
   -------
   "name_servers": [
      "NS1.MSFT.NET",
      "NS2.MSFT.NET",
      "NS3.MSFT.NET",
      "NS4.MSFT.NET",
      "ns3.msft.net",
      "ns1.msft.net",
      "ns4.msft.net",
      "ns2.msft.net"
   ],
   "emails": [
      "abusecomplaints@markmonitor.com",
      "domains@microsoft.com",
      "msnhst@microsoft.com",
      "whoisrelay@markmonitor.com"
   ],
}

Python Web Scraping - Data Extraction

分析网页意味着理解其结构。现在,问题来了,它对网络抓取来说为何至关重要?在本章中,让我们详细了解一下。

Analyzing a web page means understanding its sructure . Now, the question arises why it is important for web scraping? In this chapter, let us understand this in detail.

Web page Analysis

网页分析很重要,因为如果不进行分析,我们就无法知道在提取之后将以何种形式(结构化或非结构化)从该网页接收数据。以下几种方法可以进行网页分析

Web page analysis is important because without analyzing we are not able to know in which form we are going to receive the data from (structured or unstructured) that web page after extraction. We can do web page analysis in the following ways −

Viewing Page Source

这是一个通过检查源代码来了解网页如何构建的方法。要实现这一点,我们需要右键单击该页面,然后必须选择 View page source 选项。然后,我们将从该网页中以 HTML 形式获取我们感兴趣的数据。但主要问题是空白和格式,我们很难对其进行格式化。

This is a way to understand how a web page is structured by examining its source code. To implement this, we need to right click the page and then must select the View page source option. Then, we will get the data of our interest from that web page in the form of HTML. But the main concern is about whitespaces and formatting which is difficult for us to format.

Inspecting Page Source by Clicking Inspect Element Option

这是另一种分析网页的方法。但不同的是,它将解决网页源代码中的格式和空白问题。可以通过右键单击,然后从菜单中选择 InspectInspect element 选项来实现这一点。它将提供有关该网页的特定区域或元素的信息。

This is another way of analyzing web page. But the difference is that it will resolve the issue of formatting and whitespaces in the source code of web page. You can implement this by right clicking and then selecting the Inspect or Inspect element option from menu. It will provide the information about particular area or element of that web page.

Different Ways to Extract Data from Web Page

以下方法大多用于从网页中提取数据

The following methods are mostly used for extracting data from a web page −

Regular Expression

它们是Python中嵌入的高度专业化编程语言。我们可以通过Python re 模块使用它。它也被称为 RE 或正则表达式或正则表达式模式。在正则表达式的帮助下,我们可以为我们希望从数据中匹配的可能字符串集指定一些规则。

They are highly specialized programming language embedded in Python. We can use it through re module of Python. It is also called RE or regexes or regex patterns. With the help of regular expressions, we can specify some rules for the possible set of strings we want to match from the data.

如果您想进一步了解常规表达式,请转到链接 [role="bare" [role="bare"]https://www.tutorialspoint.com/automata_theory/regular_expressions.htm ,如果您想进一步了解Python中的re模块或正则表达式,您可以关注 link [role="bare" [role="bare"]https://www.tutorialspoint.com/python/python_reg_expressions.htm

If you want to learn more about regular expression in general, go to the link [role="bare"https://www.tutorialspoint.com/automata_theory/regular_expressions.htm] and if you want to know more about re module or regular expression in Python, you can follow the link [role="bare"https://www.tutorialspoint.com/python/python_reg_expressions.htm].

Example

在以下示例中,我们在通过正则表达式匹配 <td> 的内容后,将从 http://example.webscraping.com 抓取有关印度的数据。

In the following example, we are going to scrape data about India from http://example.webscraping.com after matching the contents of <td> with the help of regular expression.

import re
import urllib.request
response =
   urllib.request.urlopen('http://example.webscraping.com/places/default/view/India-102')
html = response.read()
text = html.decode()
re.findall('<td class="w2p_fw">(.*?)</td>',text)

Output

相应的输出将如下所示:

The corresponding output will be as shown here −

[
   '<img src="/places/static/images/flags/in.png" />',
   '3,287,590 square kilometres',
   '1,173,108,018',
   'IN',
   'India',
   'New Delhi',
   '<a href="/places/default/continent/AS">AS</a>',
   '.in',
   'INR',
   'Rupee',
   '91',
   '######',
   '^(\\d{6})$',
   'enIN,hi,bn,te,mr,ta,ur,gu,kn,ml,or,pa,as,bh,sat,ks,ne,sd,kok,doi,mni,sit,sa,fr,lus,inc',
   '<div>
      <a href="/places/default/iso/CN">CN </a>
      <a href="/places/default/iso/NP">NP </a>
      <a href="/places/default/iso/MM">MM </a>
      <a href="/places/default/iso/BT">BT </a>
      <a href="/places/default/iso/PK">PK </a>
      <a href="/places/default/iso/BD">BD </a>
   </div>'
]

请注意,在以上输出中,您可以看到有关印度国家详细信息,方法是使用正则表达式。

Observe that in the above output you can see the details about country India by using regular expression.

Beautiful Soup

假设我们要从网页中收集所有超链接,那么我们可以使用一个名为 BeautifulSoup 的解析器,可以在 https://www.crummy.com/software/BeautifulSoup/bs4/doc/. 中更详细地了解它。简而言之,BeautifulSoup 是一个用于从 HTML 和 XML 文件中提取数据的 Python 库。它可与请求一起使用,因为它需要一个输入(文档或 url)来创建汤对象,因为它本身无法获取网页。您可以使用以下 Python 脚本来收集网页标题和超链接。

Suppose we want to collect all the hyperlinks from a web page, then we can use a parser called BeautifulSoup which can be known in more detail at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. In simple words, BeautifulSoup is a Python library for pulling data out of HTML and XML files. It can be used with requests, because it needs an input (document or url) to create a soup object asit cannot fetch a web page by itself. You can use the following Python script to gather the title of web page and hyperlinks.

Installing Beautiful Soup

使用 pip 命令,我们可以将 beautifulsoup 安装在我们的虚拟环境或全局安装中。

Using the pip command, we can install beautifulsoup either in our virtual environment or in global installation.

(base) D:\ProgramData>pip install bs4
Collecting bs4
   Downloading
https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89
a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Requirement already satisfied: beautifulsoup4 in d:\programdata\lib\sitepackages
(from bs4) (4.6.0)
Building wheels for collected packages: bs4
   Running setup.py bdist_wheel for bs4 ... done
   Stored in directory:
C:\Users\gaurav\AppData\Local\pip\Cache\wheels\a0\b0\b2\4f80b9456b87abedbc0bf2d
52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1

Example

请注意,在此示例中,我们正在扩展使用请求 python 模块实现的以上示例。我们使用 r.text 来创建一个汤对象,该对象将进一步用于获取网页标题等详细信息。

Note that in this example, we are extending the above example implemented with requests python module. we are using r.text for creating a soup object which will further be used to fetch details like title of the webpage.

第一步,我们需要导入必要的 Python 模块

First, we need to import necessary Python modules −

import requests
from bs4 import BeautifulSoup

在以下代码行中,我们使用请求通过发出 GET 请求来对 URL https://authoraditiagarwal.com/ 发出 HTTP GET 请求。

In this following line of code we use requests to make a GET HTTP requests for the url: https://authoraditiagarwal.com/ by making a GET request.

r = requests.get('https://authoraditiagarwal.com/')

现在我们需要创建一个汤对象,如下所示

Now we need to create a Soup object as follows −

soup = BeautifulSoup(r.text, 'lxml')
print (soup.title)
print (soup.title.text)

Output

相应的输出将如下所示:

The corresponding output will be as shown here −

<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal

Lxml

我们将要讨论的用于网络抓取的另一个 Python 库是 lxml。这是一个高性能的 HTML 和 XML 解析库。它相对快速且简单。您可以在 https://lxml.de/. 上阅读更多相关内容

Another Python library we are going to discuss for web scraping is lxml. It is a highperformance HTML and XML parsing library. It is comparatively fast and straightforward. You can read about it more on https://lxml.de/.

Installing lxml

使用 pip 命令,我们可以将 lxml 安装在我们的虚拟环境中或将其全局安装。

Using the pip command, we can install lxml either in our virtual environment or in global installation.

(base) D:\ProgramData>pip install lxml
Collecting lxml
   Downloading
https://files.pythonhosted.org/packages/b9/55/bcc78c70e8ba30f51b5495eb0e
3e949aa06e4a2de55b3de53dc9fa9653fa/lxml-4.2.5-cp36-cp36m-win_amd64.whl
(3.
6MB)
   100% |¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 3.6MB 64kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.5

Example: Data extraction using lxml and requests

在以下示例中,我们使用 lxml 和 requests 从 authoraditiagarwal.com 中抓取网页的特定元素 −

In the following example, we are scraping a particular element of the web page from authoraditiagarwal.com by using lxml and requests −

首先,我们需要按以下方式从 lxml 库导入 requests 和 html −

First, we need to import the requests and html from lxml library as follows −

import requests
from lxml import html

现在,我们需要提供要抓取的网页的 url

Now we need to provide the url of web page to scrap

url = https://authoraditiagarwal.com/leadershipmanagement/

现在,我们需要提供 (Xpath) 特定元素的路径 −

Now we need to provide the path (Xpath) to particular element of that web page −

path = '//*[@id="panel-836-0-0-1"]/div/div/p[1]'
response = requests.get(url)
byte_string = response.content
source_code = html.fromstring(byte_string)
tree = source_code.xpath(path)
print(tree[0].text_content())

Output

相应的输出将如下所示:

The corresponding output will be as shown here −

The Sprint Burndown or the Iteration Burndown chart is a powerful tool to communicate
daily progress to the stakeholders. It tracks the completion of work for a given sprint
or an iteration. The horizontal axis represents the days within a Sprint. The vertical
axis represents the hours remaining to complete the committed work.

Python Web Scraping - Data Processing

在之前的章节中,我们学习了如何通过各种 Python 模块从网页或网络抓取中提取数据。在本章中,我们来了解处理已抓取数据的各种技术。

In earlier chapters, we learned about extracting the data from web pages or web scraping by various Python modules. In this chapter, let us look into various techniques to process the data that has been scraped.

Introduction

要处理已抓取的数据,我们必须将数据存储在本地计算机中,采用特定的格式,例如表格(CSV)、JSON 或有时是像 MySQL 的数据库。

To process the data that has been scraped, we must store the data on our local machine in a particular format like spreadsheet (CSV), JSON or sometimes in databases like MySQL.

CSV and JSON Data Processing

首先,我们将在从网页抓取后将信息写入 CSV 文件或电子表格中。让我们首先通过一个简单的示例来了解,在示例中,我们将首先使用 BeautifulSoup 模块抓取信息,就像之前所做的那样,然后使用 Python CSV 模块将文本信息写入 CSV 文件。

First, we are going to write the information, after grabbing from web page, into a CSV file or a spreadsheet. Let us first understand through a simple example in which we will first grab the information using BeautifulSoup module, as did earlier, and then by using Python CSV module we will write that textual information into CSV file.

首先,我们需要导入必要的 Python 库,如下所示:

First, we need to import the necessary Python libraries as follows −

import requests
from bs4 import BeautifulSoup
import csv

在以下代码行中,我们使用 requests 对 url: https://authoraditiagarwal.com/ 发出 GET HTTP 请求,即通过发出 GET 请求。

In this following line of code, we use requests to make a GET HTTP requests for the url: https://authoraditiagarwal.com/ by making a GET request.

r = requests.get('https://authoraditiagarwal.com/')

现在,我们需要创建一个 Soup 对象,如下所示:

Now, we need to create a Soup object as follows −

soup = BeautifulSoup(r.text, 'lxml')

现在,借助以下代码行,我们会将抓取的数据写入名为 dataprocessing.csv 的 CSV 文件中。

Now, with the help of next lines of code, we will write the grabbed data into a CSV file named dataprocessing.csv.

f = csv.writer(open(' dataprocessing.csv ','w'))
f.writerow(['Title'])
f.writerow([soup.title.text])

运行此脚本后,文本信息或网页标题将保存在上面提到的本地计算机中的 CSV 文件中。

After running this script, the textual information or the title of the webpage will be saved in the above mentioned CSV file on your local machine.

同样,我们可以将收集的信息保存在 JSON 文件中。以下是一个易于理解的 Python 脚本,用于执行相同的操作,在该脚本中,我们会抓取与在上一个 Python 脚本中所做的相同的信息,但是这一次,使用 JSON Python 模块将抓取的信息保存在 JSONfile.txt 中。

Similarly, we can save the collected information in a JSON file. The following is an easy to understand Python script for doing the same in which we are grabbing the same information as we did in last Python script, but this time the grabbed information is saved in JSONfile.txt by using JSON Python module.

import requests
from bs4 import BeautifulSoup
import csv
import json
r = requests.get('https://authoraditiagarwal.com/')
soup = BeautifulSoup(r.text, 'lxml')
y = json.dumps(soup.title.text)
with open('JSONFile.txt', 'wt') as outfile:
   json.dump(y, outfile)

运行此脚本后,抓取的信息(即网页标题)将保存在本地计算机上上面提到的文本文件中。

After running this script, the grabbed information i.e. title of the webpage will be saved in the above mentioned text file on your local machine.

Data Processing using AWS S3

有时候,我们可能想将抓取的数据保存在本地存储中以进行存档。但是,如果我们需要大规模存储和分析此数据怎么办?答案是名为 Amazon S3 或 AWS S3(简单存储服务)的云存储服务。基本上,AWS S3 是一个对象存储,用于存储和检索任意数量的任何地方的数据。

Sometimes we may want to save scraped data in our local storage for archive purpose. But what if the we need to store and analyze this data at a massive scale? The answer is cloud storage service named Amazon S3 or AWS S3 (Simple Storage Service). Basically AWS S3 is an object storage which is built to store and retrieve any amount of data from anywhere.

我们可以按照以下步骤将数据存储在 AWS S3 中:

We can follow the following steps for storing data in AWS S3 −

Step 1 - 首先,我们需要一个 AWS 帐户,该帐户将为我们提供在存储数据时用于 Python 脚本的密钥。它将创建一个 S3 存储桶,我们可以在其中存储数据。

Step 1 − First we need an AWS account which will provide us the secret keys for using in our Python script while storing the data. It will create a S3 bucket in which we can store our data.

Step 2 − 接下来,我们需要安装 boto3 Python 库来访问 S3 存储桶。它可以通过以下命令进行安装−

Step 2 − Next, we need to install boto3 Python library for accessing S3 bucket. It can be installed with the help of the following command −

pip install boto3

Step 3 − 然后,我们可以使用以下 Python 脚本从网站抓取数据并将其保存到 AWS S3 存储桶中。

Step 3 − Next, we can use the following Python script for scraping data from web page and saving it to AWS S3 bucket.

首先,我们需要导入用于抓取的 Python 库,在这里我们正在使用 requests ,并 boto3 将数据保存到 S3 存储桶中。

First, we need to import Python libraries for scraping, here we are working with requests, and boto3 saving data to S3 bucket.

import requests
import boto3

现在我们可以从 URL 中抓取数据。

Now we can scrape the data from our URL.

data = requests.get("Enter the URL").text

现在需要存储数据到 S3 存储桶中,我们需要按如下方式创建 S3 客户端−

Now for storing data to S3 bucket, we need to create S3 client as follows −

s3 = boto3.client('s3')
bucket_name = "our-content"

下面的代码行将按如下方式创建 S3 存储桶−

Next line of code will create S3 bucket as follows −

s3.create_bucket(Bucket = bucket_name, ACL = 'public-read')
s3.put_object(Bucket = bucket_name, Key = '', Body = data, ACL = "public-read")

现在,您可以通过 AWS 帐户中的名称为 our-content 检查该存储桶。

Now you can check the bucket with name our-content from your AWS account.

Data processing using MySQL

接下来,让我们学习如何使用 MySQL 处理数据。如果想了解有关 MySQL 的信息,您可以关注链接 [role="bare" [role="bare"]https://www.tutorialspoint.com/mysql/ 。]

Let us learn how to process data using MySQL. If you want to learn about MySQL, then you can follow the link [role="bare"https://www.tutorialspoint.com/mysql/.]

通过以下步骤,我们可以将数据抓取并处理到 MySQL 表中−

With the help of following steps, we can scrape and process data into MySQL table −

Step 1 − 首先,通过使用 MySQL,我们需要创建一个数据库和表,用以保存我们抓取的数据。例如,我们通过以下查询创建该表−

Step 1 − First, by using MySQL we need to create a database and table in which we want to save our scraped data. For example, we are creating the table with following query −

CREATE TABLE Scrap_pages (id BIGINT(7) NOT NULL AUTO_INCREMENT,
title VARCHAR(200), content VARCHAR(10000),PRIMARY KEY(id));

Step 2 − 接下来,我们需要处理 Unicode。请注意,MySQL 默认不处理 Unicode。我们需要通过以下命令来启用此功能,这将更改数据库的默认字符集、表的默认字符集与两个列的默认字符集−

Step 2 − Next, we need to deal with Unicode. Note that MySQL does not handle Unicode by default. We need to turn on this feature with the help of following commands which will change the default character set for the database, for the table and for both of the columns −

ALTER DATABASE scrap CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CONVERT TO CHARACTER SET utf8mb4 COLLATE
utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE title title VARCHAR(200) CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
ALTER TABLE pages CHANGE content content VARCHAR(10000) CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;

Step 3 − 现在,将 MySQL 与 Python 集成。为此我们需要 PyMySQL,可以通过以下命令进行安装−

Step 3 − Now, integrate MySQL with Python. For this, we will need PyMySQL which can be installed with the help of the following command

pip install PyMySQL

Step 4 − 现在,我们之前创建的名为 Scrap 的数据库已做好准备,用于从 web 抓取数据后保存到名为 Scrap_pages 的表中。在示例中,我们将从 Wikipedia 抓取数据,然后将其保存在我们的数据库中。

Step 4 − Now, our database named Scrap, created earlier, is ready to save the data, after scraped from web, into table named Scrap_pages. Here in our example we are going to scrape data from Wikipedia and it will be saved into our database.

首先,我们需要导入所需的 Python 模块。

First, we need to import the required Python modules.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import pymysql
import re

现在,建立一个连接,即将其与 Python 集成。

Now, make a connection, that is integrate this with Python.

conn = pymysql.connect(host='127.0.0.1',user='root', passwd = None, db = 'mysql',
charset = 'utf8')
cur = conn.cursor()
cur.execute("USE scrap")
random.seed(datetime.datetime.now())
def store(title, content):
   cur.execute('INSERT INTO scrap_pages (title, content) VALUES ''("%s","%s")', (title, content))
   cur.connection.commit()

现在,连接到 Wikipedia 并从此处获取数据。

Now, connect with Wikipedia and get data from it.

def getLinks(articleUrl):
   html = urlopen('http://en.wikipedia.org'+articleUrl)
   bs = BeautifulSoup(html, 'html.parser')
   title = bs.find('h1').get_text()
   content = bs.find('div', {'id':'mw-content-text'}).find('p').get_text()
   store(title, content)
   return bs.find('div', {'id':'bodyContent'}).findAll('a',href=re.compile('^(/wiki/)((?!:).)*$'))
links = getLinks('/wiki/Kevin_Bacon')
try:
   while len(links) > 0:
      newArticle = links[random.randint(0, len(links)-1)].attrs['href']
      print(newArticle)
      links = getLinks(newArticle)

最后,我们需要关闭游标和连接。

Lastly, we need to close both cursor and connection.

finally:
   cur.close()
   conn.close()

这会将从 Wikipedia 收集的数据保存到名为 scrap_pages 的表中。如果您熟悉 MySQL 和 web 抓取,那么上面这段代码将不难理解。

This will save the data gather from Wikipedia into table named scrap_pages. If you are familiar with MySQL and web scraping, then the above code would not be tough to understand.

Data processing using PostgreSQL

PostgreSQL 由全球志愿者团队开发,它是一个开源关系数据库管理系统 (RDMS)。使用 PostgreSQL 处理抓取数据的过程与 MySQL 类似。两处会有所改变:第一,命令将与 MySQL 不同,第二,在这里,我们将使用 psycopg2 Python 库执行其与 Python 的集成。

PostgreSQL, developed by a worldwide team of volunteers, is an open source relational database Management system (RDMS). The process of processing the scraped data using PostgreSQL is similar to that of MySQL. There would be two changes: First, the commands would be different to MySQL and second, here we will use psycopg2 Python library to perform its integration with Python.

如果您不熟悉 PostgreSQL,可以在 [role="bare" [role="bare"]https://www.tutorialspoint.com/postgresql/ 学习。按照以下命令,我们可以安装 psycopg2 Python 库 −

If you are not familiar with PostgreSQL then you can learn it at [role="bare"https://www.tutorialspoint.com/postgresql/.] And with the help of following command we can install psycopg2 Python library −

pip install psycopg2

Processing Images and Videos

网络抓取通常涉及下载、存储和处理网络媒体内容。在本章中,让我们了解如何处理从网络下载的内容。

Web scraping usually involves downloading, storing and processing the web media content. In this chapter, let us understand how to process the content downloaded from the web.

Introduction

抓取过程中获取的网络媒体内容可以是图像、音频和视频文件,既可以为非网页形式,也可以是数据文件。但是,我们能否信任所下载数据,尤其是要下载并存储在计算机内存中的数据的扩展?这导致了解我们要在本地存储的数据类型至关重要。

The web media content that we obtain during scraping can be images, audio and video files, in the form of non-web pages as well as data files. But, can we trust the downloaded data especially on the extension of data we are going to download and store in our computer memory? This makes it essential to know about the type of data we are going to store locally.

Getting Media Content from Web Page

在本节中,我们将学习如何根据来自 Web 服务器的信息正确表示媒体类型下载媒体内容。我们能够借助 Python requests 模块完成这一操作,就像在上一章中所做的那样。

In this section, we are going to learn how we can download media content which correctly represents the media type based on the information from web server. We can do it with the help of Python requests module as we did in previous chapter.

首先,我们需要导入必要的 Python 模块,如下所示:

First, we need to import necessary Python modules as follows −

import requests

现在,提供要下载和本地存储的媒体内容的 URL。

Now, provide the URL of the media content we want to download and store locally.

url = "https://authoraditiagarwal.com/wpcontent/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"

使用以下代码创建 HTTP 响应对象。

Use the following code to create HTTP response object.

r = requests.get(url)

借助以下代码行,我们可以将收到的内容另存为 .png 文件。

With the help of following line of code, we can save the received content as .png file.

with open("ThinkBig.png",'wb') as f:
   f.write(r.content)

运行上述 Python 脚本后,我们将获得名为 ThinkBig.png 的文件,该文件包含已下载的图像。

After running the above Python script, we will get a file named ThinkBig.png, which would have the downloaded image.

Extracting Filename from URL

从网站下载内容后,我们也希望将其保存到文件中,文件名应在 URL 中找到。但我们也可以检查 URL 中是否存在其他片段数字。为此,我们需要从 URL 中找到实际文件名。

After downloading the content from web site, we also want to save it in a file with a file name found in the URL. But we can also check, if numbers of additional fragments exist in URL too. For this, we need to find the actual filename from the URL.

借助以下 Python 脚本,使用 urlparse ,我们可以从 URL 中提取文件名:

With the help of following Python script, using urlparse, we can extract the filename from URL −

import urllib3
import os
url = "https://authoraditiagarwal.com/wpcontent/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"
a = urlparse(url)
a.path

你可以观察到输出,如下所示 −

You can observe the output as shown below −

'/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg'
os.path.basename(a.path)

你可以观察到输出,如下所示 −

You can observe the output as shown below −

'MetaSlider_ThinkBig-1080x180.jpg'

运行上述脚本后,我们将从 URL 获取文件名。

Once you run the above script, we will get the filename from URL.

Information about Type of Content from URL

从 Web 服务器提取内容时,通过 GET 请求,我们还可以检查它所提供的信息。借助以下 Python 脚本,我们可以确定 Web 服务器对于内容类型意味着什么:

While extracting the contents from web server, by GET request, we can also check its information provided by the web server. With the help of following Python script we can determine what web server means with the type of the content −

首先,我们需要导入必要的 Python 模块,如下所示:

First, we need to import necessary Python modules as follows −

import requests

现在,我们需要提供要下载并本地存储的媒体内容的 URL。

Now, we need to provide the URL of the media content we want to download and store locally.

url = "https://authoraditiagarwal.com/wpcontent/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"

以下代码行将创建 HTTP 响应对象。

Following line of code will create HTTP response object.

r = requests.get(url, allow_redirects=True)

现在,我们可以获得 Web 服务器可提供关于内容的哪些类型信息。

Now, we can get what type of information about content can be provided by web server.

for headers in r.headers: print(headers)

你可以观察到输出,如下所示 −

You can observe the output as shown below −

Date
Server
Upgrade
Connection
Last-Modified
Accept-Ranges
Content-Length
Keep-Alive
Content-Type

借助以下代码行,我们可以获取关于内容类型(例如 content-type)的特定信息:

With the help of following line of code we can get the particular information about content type, say content-type −

print (r.headers.get('content-type'))

你可以观察到输出,如下所示 −

You can observe the output as shown below −

image/jpeg

借助以下代码行,我们可以获取关于内容类型(例如 EType)的特定信息:

With the help of following line of code, we can get the particular information about content type, say EType −

print (r.headers.get('ETag'))

你可以观察到输出,如下所示 −

You can observe the output as shown below −

None

观察以下命令:

Observe the following command −

print (r.headers.get('content-length'))

你可以观察到输出,如下所示 −

You can observe the output as shown below −

12636

借助以下代码行,我们可以获取关于内容类型(例如 Server)的特定信息:

With the help of following line of code we can get the particular information about content type, say Server −

print (r.headers.get('Server'))

你可以观察到输出,如下所示 −

You can observe the output as shown below −

Apache

Generating Thumbnail for Images

缩略图是非常小的说明或表现形式。用户可能希望仅保存大图像的缩略图或同时保存图像和缩略图。在本节中,我们将创建名为 ThinkBig.png 的图像缩略图,该图像已下载于上一节“从网页获取媒体内容”。

Thumbnail is a very small description or representation. A user may want to save only thumbnail of a large image or save both the image as well as thumbnail. In this section we are going to create a thumbnail of the image named ThinkBig.png downloaded in the previous section “Getting media content from web page”.

对于这个Python脚本,我们需要安装Pillow,Python图像库的分支,它有用于处理图像的有用函数。可以使用以下命令安装它:

For this Python script, we need to install Python library named Pillow, a fork of the Python Image library having useful functions for manipulating images. It can be installed with the help of following command −

pip install pillow

以下Python脚本将创建图像的缩略图,并将缩略图文件加上 Th_ 前缀保存到当前目录

The following Python script will create a thumbnail of the image and will save it to the current directory by prefixing thumbnail file with Th_

import glob
from PIL import Image
for infile in glob.glob("ThinkBig.png"):
   img = Image.open(infile)
   img.thumbnail((128, 128), Image.ANTIALIAS)
   if infile[0:2] != "Th_":
      img.save("Th_" + infile, "png")

上面的代码非常容易理解,你可以检查当前目录中的缩略图文件。

The above code is very easy to understand and you can check for the thumbnail file in the current directory.

Screenshot from Website

在网络抓取中,一项非常常见的任务是对网站进行屏幕截图。为了实现这一点,我们将使用selenium和webdriver。以下Python脚本将从网站获取屏幕截图,并将其保存到当前目录。

In web scraping, a very common task is to take screenshot of a website. For implementing this, we are going to use selenium and webdriver. The following Python script will take the screenshot from website and will save it to current directory.

From selenium import webdriver
path = r'C:\\Users\\gaurav\\Desktop\\Chromedriver'
browser = webdriver.Chrome(executable_path = path)
browser.get('https://tutorialspoint.com/')
screenshot = browser.save_screenshot('screenshot.png')
browser.quit

你可以观察到输出,如下所示 −

You can observe the output as shown below −

DevTools listening on ws://127.0.0.1:1456/devtools/browser/488ed704-9f1b-44f0-
a571-892dc4c90eb7
<bound method WebDriver.quit of <selenium.webdriver.chrome.webdriver.WebDriver
(session="37e8e440e2f7807ef41ca7aa20ce7c97")>>

运行脚本后,你可以检查当前目录中的 screenshot.png 文件。

After running the script, you can check your current directory for screenshot.png file.

screenshot

Thumbnail Generation for Video

假设我们从网站上下载了视频,并希望为它们生成缩略图,以便可以根据缩略图单击特定的视频。为了生成视频的缩略图,我们需要一个名为 ffmpeg 的简单工具,可以从 www.ffmpeg.org 下载。下载后,我们需要根据操作系统的规范进行安装。

Suppose we have downloaded videos from website and wanted to generate thumbnails for them so that a specific video, based on its thumbnail, can be clicked. For generating thumbnail for videos we need a simple tool called ffmpeg which can be downloaded from www.ffmpeg.org. After downloading, we need to install it as per the specifications of our OS.

以下Python脚本将生成视频的缩略图,并将其保存到我们的本地目录中:

The following Python script will generate thumbnail of the video and will save it to our local directory −

import subprocess
video_MP4_file = “C:\Users\gaurav\desktop\solar.mp4
thumbnail_image_file = 'thumbnail_solar_video.jpg'
subprocess.call(['ffmpeg', '-i', video_MP4_file, '-ss', '00:00:20.000', '-
   vframes', '1', thumbnail_image_file, "-y"])

运行上述脚本后,我们将在本地目录中得到一个名为 thumbnail_solar_video.jpg 的缩略图。

After running the above script, we will get the thumbnail named thumbnail_solar_video.jpg saved in our local directory.

Ripping an MP4 video to an MP3

假设你从网站下载了一些视频文件,但你只需要该文件中的音频就足够了,那么就可以在Python中通过名为 moviepy 的Python库来完成,可以使用以下命令进行安装:

Suppose you have downloaded some video file from a website, but you only need audio from that file to serve your purpose, then it can be done in Python with the help of Python library called moviepy which can be installed with the help of following command −

pip install moviepy

现在,在使用以下脚本成功安装moviepy后,我们可以将MP4转换为MP3。

Now, after successfully installing moviepy with the help of following script we can convert and MP4 to MP3.

import moviepy.editor as mp
clip = mp.VideoFileClip(r"C:\Users\gaurav\Desktop\1234.mp4")
clip.audio.write_audiofile("movie_audio.mp3")

你可以观察到输出,如下所示 −

You can observe the output as shown below −

[MoviePy] Writing audio in movie_audio.mp3
100%|¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦| 674/674 [00:01<00:00,
476.30it/s]
[MoviePy] Done.

上述脚本将MP3音频文件保存在本地目录中。

The above script will save the audio MP3 file in the local directory.

Python Web Scraping - Dealing with Text

在上一章中,我们了解了如何处理作为网络爬取内容一部分获得的视频和图片。在本章中,我们将使用 Python 库来处理文本分析,并详细了解其相关内容。

In the previous chapter, we have seen how to deal with videos and images that we obtain as a part of web scraping content. In this chapter we are going to deal with text analysis by using Python library and will learn about this in detail.

Introduction

您可以使用名为自然语言工具包 (NLTK) 的 Python 库来执行文本分析。在深入了解 NLTK 概念之前,我们先了解文本分析和网络爬取之间的关系。

You can perform text analysis in by using Python library called Natural Language Tool Kit (NLTK). Before proceeding into the concepts of NLTK, let us understand the relation between text analysis and web scraping.

分析文本中的单词可以帮助我们了解哪些单词很重要,哪些单词不常见,单词如何分组。此项分析简化了网络爬取任务。

Analyzing the words in the text can lead us to know about which words are important, which words are unusual, how words are grouped. This analysis eases the task of web scraping.

Getting started with NLTK

自然语言工具包 (NLTK) 是 Python 库的集合,专门为识别和标记自然语言(如英语)文本中发现的词性而设计。

The Natural language toolkit (NLTK) is collection of Python libraries which is designed especially for identifying and tagging parts of speech found in the text of natural language like English.

Installing NLTK

您可以使用以下命令在 Python 中安装 NLTK −

You can use the following command to install NLTK in Python −

pip install nltk

如果您使用的是 Anaconda,则可以通过使用以下命令来构建 NLTK 的 conda 包 −

If you are using Anaconda, then a conda package for NLTK can be built by using the following command −

conda install -c anaconda nltk

Downloading NLTK’s Data

在安装 NLTK 后,我们必须下载预设文本库。但在下载文本预设库之前,我们需要通过 import 命令导入 NLTK,如下所示 −

After installing NLTK, we have to download preset text repositories. But before downloading text preset repositories, we need to import NLTK with the help of import command as follows −

mport nltk

现在,可以通过以下命令下载 NLTK 数据 −

Now, with the help of following command NLTK data can be downloaded −

nltk.download()

安装所有可用的 NLTK 软件包需要一些时间,但始终建议安装所有软件包。

Installation of all available packages of NLTK will take some time, but it is always recommended to install all the packages.

Installing Other Necessary packages

我们还需要其他一些 Python 软件包,如 gensimpattern ,才能使用 NLTK 执行文本分析以及构建自然语言处理应用程序。

We also need some other Python packages like gensim and pattern for doing text analysis as well as building building natural language processing applications by using NLTK.

gensim − 一个强大的语义建模库,适用于许多应用。可以通过以下命令进行安装 −

gensim − A robust semantic modeling library which is useful for many applications. It can be installed by the following command −

pip install gensim

pattern − 用于使 gensim 软件包正常工作。可以通过以下命令进行安装 −

pattern − Used to make gensim package work properly. It can be installed by the following command −

pip install pattern

Tokenization

将给定文本分解为称为标记的较小单位的过程称为标记化。这些标记可以是单词、数字或标点符号。它也称为 word segmentation

The Process of breaking the given text, into the smaller units called tokens, is called tokenization. These tokens can be the words, numbers or punctuation marks. It is also called word segmentation.

Example

tokenization

NLTK 模块为标记化提供了不同的软件包。我们可以根据需要使用这些软件包。此处描述了其中一些软件包 −

NLTK module provides different packages for tokenization. We can use these packages as per our requirement. Some of the packages are described here −

sent_tokenize package − 此软件包将输入文本划分为句子。可以使用以下命令导入此软件包 −

sent_tokenize package − This package will divide the input text into sentences. You can use the following command to import this package −

from nltk.tokenize import sent_tokenize

word_tokenize package − 此软件包将输入文本划分为单词。可以使用以下命令导入此软件包 −

word_tokenize package − This package will divide the input text into words. You can use the following command to import this package −

from nltk.tokenize import word_tokenize

WordPunctTokenizer package − 此软件包将输入文本以及标点符号划分为单词。可以使用以下命令导入此软件包 −

WordPunctTokenizer package − This package will divide the input text as well as the punctuation marks into words. You can use the following command to import this package −

from nltk.tokenize import WordPuncttokenizer

Stemming

在任何语言中,单词都有不同的形式。由于语法原因,语言包含许多变体。例如,考虑以下单词 democracydemocraticdemocratization 。对于机器学习以及网络爬取项目,机器可以理解这些不同的单词具有相同的词干很重要。因此,我们可以说在分析文本时,提取单词的词干可能很有用。

In any language, there are different forms of a words. A language includes lots of variations due to the grammatical reasons. For example, consider the words democracy, democratic, and democratization. For machine learning as well as for web scraping projects, it is important for machines to understand that these different words have the same base form. Hence we can say that it can be useful to extract the base forms of the words while analyzing the text.

这可以通过词干提取来实现,词干提取可以定义为通过切掉单词结尾来提取单词基本形式的启发式过程。

This can be achieved by stemming which may be defined as the heuristic process of extracting the base forms of the words by chopping off the ends of words.

NLTK 模块提供了不同的词干提取包。我们可以根据需要使用这些包。其中一些包此处进行了说明 -

NLTK module provides different packages for stemming. We can use these packages as per our requirement. Some of these packages are described here −

PorterStemmer package - 此 Python 词干提取包使用波特算法来提取基本形式。您可以使用以下命令导入此包 -

PorterStemmer package − Porter’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −

from nltk.stem.porter import PorterStemmer

例如,在此词干提取器中输入单词 ‘writing’ 后,词干提取后的输出将是单词 ‘write’

For example, after giving the word ‘writing’ as the input to this stemmer, the output would be the word ‘write’ after stemming.

LancasterStemmer package - 此 Python 词干提取包使用兰开斯特算法来提取基本形式。您可以使用以下命令导入此包 -

LancasterStemmer package − Lancaster’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −

from nltk.stem.lancaster import LancasterStemmer

例如,在此词干提取器中输入单词 ‘writing’ 后,词干提取后的输出将是单词 ‘writ’

For example, after giving the word ‘writing’ as the input to this stemmer then the output would be the word ‘writ’ after stemming.

SnowballStemmer package - 此 Python 词干提取包使用 Snowball 算法来提取基本形式。您可以使用以下命令导入此包 -

SnowballStemmer package − Snowball’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −

from nltk.stem.snowball import SnowballStemmer

例如,在此词干提取器中输入单词“writing”后,词干提取后的输出将是单词“write”。

For example, after giving the word ‘writing’ as the input to this stemmer then the output would be the word ‘write’ after stemming.

Lemmatization

提取单词基本形式的另一种方法是词形还原,通常通过使用词汇表和形态分析去除屈折词尾。任何单词在词形还原之后的单词基本形式被称为词形。

An other way to extract the base form of words is by lemmatization, normally aiming to remove inflectional endings by using vocabulary and morphological analysis. The base form of any word after lemmatization is called lemma.

NLTK 模块提供了以下词形还原包 -

NLTK module provides following packages for lemmatization −

WordNetLemmatizer package - 它会根据单词用作名词还是动词来提取基本形式。您可以使用以下命令导入此包 -

WordNetLemmatizer package − It will extract the base form of the word depending upon whether it is used as noun as a verb. You can use the following command to import this package −

from nltk.stem import WordNetLemmatizer

Chunking

分块是指将数据分成小块,这是自然语言处理中用于识别词性和小短语(如名词短语)的重要过程。分块是标记令牌。我们可以借助分块过程来获取句子的结构。

Chunking, which means dividing the data into small chunks, is one of the important processes in natural language processing to identify the parts of speech and short phrases like noun phrases. Chunking is to do the labeling of tokens. We can get the structure of the sentence with the help of chunking process.

Example

在本例中,我们将使用 NLTK Python 模块来实现名词短语分块。名词短语分块是分块的一种类型,它将在句子中找到名词短语块。

In this example, we are going to implement Noun-Phrase chunking by using NLTK Python module. NP chunking is a category of chunking which will find the noun phrases chunks in the sentence.

Steps for implementing noun phrase chunking

我们需要按照以下步骤实现名词短语分块 -

We need to follow the steps given below for implementing noun-phrase chunking −

Step 1 − Chunk grammar definition

第一步中,我们将为分块定义语法。它包括我们需要遵循的规则。

In the first step we will define the grammar for chunking. It would consist of the rules which we need to follow.

Step 2 − Chunk parser creation

现在,我们将创建一个分块解析器。它将解析语法并给出输出。

Now, we will create a chunk parser. It would parse the grammar and give the output.

Step 3 − The Output

在最后一步中,输出将以树格式生成。

In this last step, the output would be produced in a tree format.

首先,我们需要按如下方式导入 NLTK 包 -

First, we need to import the NLTK package as follows −

import nltk

接下来,我们需要定义句子。此处 DT:限定词,VBP:动词,JJ:形容词,IN:介词和 NN:名词。

Next, we need to define the sentence. Here DT: the determinant, VBP: the verb, JJ: the adjective, IN: the preposition and NN: the noun.

sentence = [("a", "DT"),("clever","JJ"),("fox","NN"),("was","VBP"),("jumping","VBP"),("over","IN"),("the","DT"),("wall","NN")]

接下来,我们以正则表达式的形式给出了语法。

Next, we are giving the grammar in the form of regular expression.

grammar = "NP:{<DT>?<JJ>*<NN>}"

现在,下一行代码将定义用于解析语法的一个解析器。

Now, next line of code will define a parser for parsing the grammar.

parser_chunking = nltk.RegexpParser(grammar)

现在,解析器将解析这个句子。

Now, the parser will parse the sentence.

parser_chunking.parse(sentence)

然后,我们把输出保存在变量里。

Next, we are giving our output in the variable.

Output = parser_chunking.parse(sentence)

借助以下代码,我们能够图形化输出,如图所示。

With the help of following code, we can draw our output in the form of a tree as shown below.

output.draw()
phrase chunking

Bag of Word (BoW) Model Extracting and converting the Text into Numeric Form

单词袋 (BoW) 是自然语言处理中一个有用的模型,主要用于提取文本中的特征。从文本中提取特征后,它可用于机器学习算法中的建模,因为原始数据不能用于 ML 应用程序。

Bag of Word (BoW), a useful model in natural language processing, is basically used to extract the features from text. After extracting the features from the text, it can be used in modeling in machine learning algorithms because raw data cannot be used in ML applications.

Working of BoW Model

最初,模型从文档中的所有单词中提取词汇表。稍后,使用文档术语矩阵,它将建立一个模型。这样一来,BoW 模型将文档仅表示为单词的集合,顺序或结构将被丢弃。

Initially, model extracts a vocabulary from all the words in the document. Later, using a document term matrix, it would build a model. In this way, BoW model represents the document as a bag of words only and the order or structure is discarded.

Example

假设我们有以下两个句子:

Suppose we have the following two sentences −

Sentence1 – 这是单词袋模型的示例。

Sentence1 − This is an example of Bag of Words model.

Sentence2 – 我们可以使用单词袋模型提取特征。

Sentence2 − We can extract features by using Bag of Words model.

现在,通过考虑这两个句子,我们有以下 14 个不同的单词:

Now, by considering these two sentences, we have the following 14 distinct words −

  1. This

  2. is

  3. an

  4. example

  5. bag

  6. of

  7. words

  8. model

  9. we

  10. can

  11. extract

  12. features

  13. by

  14. using

Building a Bag of Words Model in NLTK

我们来看一看以下 Python 脚本,它将在 NLTK 中构建一个 BoW 模型。

Let us look into the following Python script which will build a BoW model in NLTK.

首先,导入以下包:

First, import the following package −

from sklearn.feature_extraction.text import CountVectorizer

接下来,定义句子集:

Next, define the set of sentences −

Sentences=['This is an example of Bag of Words model.', ' We can extract
   features by using Bag of Words model.']
   vector_count = CountVectorizer()
   features_text = vector_count.fit_transform(Sentences).todense()
   print(vector_count.vocabulary_)

Output

它显示我们在以上两个句子中发现了 14 个不同的单词:

It shows that we have 14 distinct words in the above two sentences −

{
   'this': 10, 'is': 7, 'an': 0, 'example': 4, 'of': 9,
   'bag': 1, 'words': 13, 'model': 8, 'we': 12, 'can': 3,
   'extract': 5, 'features': 6, 'by': 2, 'using':11
}

Topic Modeling: Identifying Patterns in Text Data

通常,文档被分组为主题,主题建模是一种用于识别文本中与特定主题相对应的模式的技术。换句话说,主题建模用于发现给定文档集中的抽象主题或隐藏结构。

Generally documents are grouped into topics and topic modeling is a technique to identify the patterns in a text that corresponds to a particular topic. In other words, topic modeling is used to uncover abstract themes or hidden structure in a given set of documents.

你可以在以下情况下使用主题建模:

You can use topic modeling in following scenarios −

Text Classification

主题建模可以改进分类,因为它将相似的单词组合在一起,而不是将每个单词单独用作一个特征。

Classification can be improved by topic modeling because it groups similar words together rather than using each word separately as a feature.

Recommender Systems

我们可以通过使用相似性度量来构建推荐系统。

We can build recommender systems by using similarity measures.

Topic Modeling Algorithms

我们可以使用以下算法来实现主题建模 −

We can implement topic modeling by using the following algorithms −

Latent Dirichlet Allocation(LDA) − 它是使用概率图形模型来实现主题建模的最流行算法之一。

Latent Dirichlet Allocation(LDA) − It is one of the most popular algorithm that uses the probabilistic graphical models for implementing topic modeling.

Latent Semantic Analysis(LDA) or Latent Semantic Indexing(LSI) − 它基于线性代数,并在文档术语矩阵上使用 SVD(奇异值分解)的概念。

Latent Semantic Analysis(LDA) or Latent Semantic Indexing(LSI) − It is based upon Linear Algebra and uses the concept of SVD (Singular Value Decomposition) on document term matrix.

Non-Negative Matrix Factorization (NMF) − 它也基于线性代数,如 LDA。

Non-Negative Matrix Factorization (NMF) − It is also based upon Linear Algebra as like LDA.

上面提到的算法将具有以下元素 −

The above mentioned algorithms would have the following elements −

  1. Number of topics: Parameter

  2. Document-Word Matrix: Input

  3. WTM (Word Topic Matrix) & TDM (Topic Document Matrix): Output

Python Web Scraping - Dynamic Websites

Introduction

网络抓取是一项复杂的任务,如果网站是动态的,则复杂性会增加。根据联合国全球网络可访问性审计,超过 70% 的网站本质上是动态的,并且它们的功能依赖于 JavaScript。

Web scraping is a complex task and the complexity multiplies if the website is dynamic. According to United Nations Global Audit of Web Accessibility more than 70% of the websites are dynamic in nature and they rely on JavaScript for their functionalities.

Dynamic Website Example

让我们看一个动态网站的示例,并了解为什么很难抓取。在这里,我们将以从名为 http://example.webscraping.com/places/default/search. 的网站中搜索为例。但是我们如何判断该网站的性质是动态的?可以从以下 Python 脚本的输出中进行判断,该脚本将尝试从上述网页中抓取数据 −

Let us look at an example of a dynamic website and know about why it is difficult to scrape. Here we are going to take example of searching from a website named http://example.webscraping.com/places/default/search. But how can we say that this website is of dynamic nature? It can be judged from the output of following Python script which will try to scrape data from above mentioned webpage −

import re
import urllib.request
response = urllib.request.urlopen('http://example.webscraping.com/places/default/search')
html = response.read()
text = html.decode()
re.findall('(.*?)',text)

Output

[ ]

以上输出表明示例抓取器无法提取信息,因为我们试图查找的 <div> 元素为空。

The above output shows that the example scraper failed to extract information because the <div> element we are trying to find is empty.

Approaches for Scraping data from Dynamic Websites

我们已经看到,由于数据通过 JavaScript 动态加载,因此,抓取器无法抓取动态网站中的信息。在这样的情况下,我们可以使用以下两种技术从依赖 JavaScript 的动态网站中抓取数据−

We have seen that the scraper cannot scrape the information from a dynamic website because the data is loaded dynamically with JavaScript. In such cases, we can use the following two techniques for scraping data from dynamic JavaScript dependent websites −

  1. Reverse Engineering JavaScript

  2. Rendering JavaScript

Reverse Engineering JavaScript

名为逆向工程的过程很有用,可以让我们了解网页如何动态加载数据。

The process called reverse engineering would be useful and lets us understand how data is loaded dynamically by web pages.

为此,我们需要为特定 URL 点击 inspect element 标签。接下来,我们将单击 NETWORK 标签来查找为该网页发出的所有请求,包括具有 /ajax 路径的 search.json。我们可以借助以下 Python 脚本访问 AJAX 数据(而不是通过浏览器或通过 NETWORK 访问),也可以使用此脚本:

For doing this, we need to click the inspect element tab for a specified URL. Next, we will click NETWORK tab to find all the requests made for that web page including search.json with a path of /ajax. Instead of accessing AJAX data from browser or via NETWORK tab, we can do it with the help of following Python script too −

import requests
url=requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a')
url.json()

Example

上面的脚本允许我们使用 Python json 方法访问 JSON 响应。同样,我们可以下载原始字符串响应,并使用 Python 中的 json.loads 方法加载响应。我们可以借助以下 Python 脚本执行此操作。它将通过搜索字母“a”,然后迭代 JSON 响应的结果页面,基本上抓取所有国家/地区。

The above script allows us to access JSON response by using Python json method. Similarly we can download the raw string response and by using python’s json.loads method, we can load it too. We are doing this with the help of following Python script. It will basically scrape all of the countries by searching the letter of the alphabet ‘a’ and then iterating the resulting pages of the JSON responses.

import requests
import string
PAGE_SIZE = 15
url = 'http://example.webscraping.com/ajax/' + 'search.json?page={}&page_size={}&search_term=a'
countries = set()
for letter in string.ascii_lowercase:
   print('Searching with %s' % letter)
   page = 0
   while True:
   response = requests.get(url.format(page, PAGE_SIZE, letter))
   data = response.json()
   print('adding %d records from the page %d' %(len(data.get('records')),page))
   for record in data.get('records'):countries.add(record['country'])
   page += 1
   if page >= data['num_pages']:
      break
   with open('countries.txt', 'w') as countries_file:
   countries_file.write('n'.join(sorted(countries)))

在运行上面的脚本后,我们将获得以下输出,并且记录将保存到名为 countries.txt 的文件中。

After running the above script, we will get the following output and the records would be saved in the file named countries.txt.

Output

Searching with a
adding 15 records from the page 0
adding 15 records from the page 1
...

Rendering JavaScript

在上一个部分中,我们对网页执行了逆向工程,了解了 API 的工作原理,以及我们如何使用它来在一单个请求中检索结果。但是,在进行逆向工程时,我们可能会遇到以下困难−

In the previous section, we did reverse engineering on web page that how API worked and how we can use it to retrieve the results in single request. However, we can face following difficulties while doing reverse engineering −

  1. Sometimes websites can be very difficult. For example, if the website is made with advanced browser tool such as Google Web Toolkit (GWT), then the resulting JS code would be machine-generated and difficult to understand and reverse engineer.

  2. Some higher level frameworks like React.js can make reverse engineering difficult by abstracting already complex JavaScript logic.

解决上述困难的方法是使用浏览器渲染引擎,该引擎可以解析 HTML、应用 CSS 格式并执行 JavaScript 以显示网页。

The solution to the above difficulties is to use a browser rendering engine that parses HTML, applies the CSS formatting and executes JavaScript to display a web page.

Example

在此示例中,我们将使用一个知名的 Python 模块 Selenium 来渲染 Java Script。以下 Python 代码将借助 Selenium 渲染一个网页−

In this example, for rendering Java Script we are going to use a familiar Python module Selenium. The following Python code will render a web page with the help of Selenium −

首先,我们需要从 selenium 中导入 webdriver,如下所示 −

First, we need to import webdriver from selenium as follows −

from selenium import webdriver

现在,提供我们根据要求下载的 web driver 的路径 −

Now, provide the path of web driver which we have downloaded as per our requirement −

path = r'C:\\Users\\gaurav\\Desktop\\Chromedriver'
driver = webdriver.Chrome(executable_path = path)

现在,提供我们想要在现在由我们的 Python 脚本控制的 web 浏览器中打开的 url。

Now, provide the url which we want to open in that web browser now controlled by our Python script.

driver.get('http://example.webscraping.com/search')

现在,我们可以使用搜索工具箱的 ID 为要选择的元素进行设置。

Now, we can use ID of the search toolbox for setting the element to select.

driver.find_element_by_id('search_term').send_keys('.')

接下来,我们可以使用 Java 脚本将选择框内容设置为如下 −

Next, we can use java script to set the select box content as follows −

js = "document.getElementById('page_size').options[1].text = '100';"
driver.execute_script(js)

下面一行代码显示,搜索已准备好点击网页 −

The following line of code shows that search is ready to be clicked on the web page −

driver.find_element_by_id('search').click()

下一行代码显示,它将等待 45 秒以完成 AJAX 请求。

Next line of code shows that it will wait for 45 seconds for completing the AJAX request.

driver.implicitly_wait(45)

现在,为了选择国家链接,我们可以使用 CSS 选择器,如下所示 −

Now, for selecting country links, we can use the CSS selector as follows −

links = driver.find_elements_by_css_selector('#results a')

现在可以提取每个链接的文本以创建国家列表 −

Now the text of each link can be extracted for creating the list of countries −

countries = [link.text for link in links]
print(countries)
driver.close()

Python Web Scraping - Form based Websites

在上一章,我们已经看到了抓取动态网站。在本章中,让我们了解对通过用户输入内容构建的网站进行抓取,即基于表单的网站。

In the previous chapter, we have seen scraping dynamic websites. In this chapter, let us understand scraping of websites that work on user based inputs, that is form based websites.

Introduction

如今,WWW(万维网)正在走向社交媒体以及用户生成的内容。那么问题来了,我们如何才能访问超出登录界面之外的此类信息?为此,我们需要处理表单和登录。

These days WWW (World Wide Web) is moving towards social media as well as usergenerated contents. So the question arises how we can access such kind of information that is beyond login screen? For this we need to deal with forms and logins.

在前面的章节中,我们使用 HTTP GET 方法来请求信息,但在本章中,我们将使用 HTTP POST 方法,该方法将信息推送到 Web 服务器进行存储和分析。

In previous chapters, we worked with HTTP GET method to request information but in this chapter we will work with HTTP POST method that pushes information to a web server for storage and analysis.

Interacting with Login forms

在使用互联网时,您肯定已经多次与登录表单进行了交互。它们可能非常简单,例如只包括很少几个 HTML 字段、一个提交按钮和一个操作页面,或者它们可能很复杂并且具有其他一些字段,例如出于安全原因,电子邮件、留言和验证码。

While working on Internet, you must have interacted with login forms many times. They may be very simple like including only a very few HTML fields, a submit button and an action page or they may be complicated and have some additional fields like email, leave a message along with captcha for security reasons.

在本节中,我们将借助 Python 请求库来处理简单的提交表单。

In this section, we are going to deal with a simple submit form with the help of Python requests library.

首先,我们需要导入请求库,如下所示 −

First, we need to import requests library as follows −

import requests

现在,我们需要提供登录表单字段的信息。

Now, we need to provide the information for the fields of login form.

parameters = {‘Name’:’Enter your name’, ‘Email-id’:’Your Emailid’,’Message’:’Type your message here’}

在下一行代码中,我们需要提供执行表单操作的 URL。

In next line of code, we need to provide the URL on which action of the form would happen.

r = requests.post(“enter the URL”, data = parameters)
print(r.text)

运行脚本后,脚本将返回操作发生的页面的内容。

After running the script, it will return the content of the page where action has happened.

假设你想用表单提交一张图片,那么使用 requests.post() 很容易。你可以借助以下 Python 脚本理解它:

Suppose if you want to submit any image with the form, then it is very easy with requests.post(). You can understand it with the help of following Python script −

import requests
file = {‘Uploadfile’: open(’C:\Usres\desktop\123.png’,‘rb’)}
r = requests.post(“enter the URL”, files = file)
print(r.text)

Loading Cookies from the Web Server

cookie 有时又称 Web cookie 或 Internet cookie,是从网站发送的一小段数据,我们的计算机将其存储在 Web 浏览器中的一个文件中。

A cookie, sometimes called web cookie or internet cookie, is a small piece of data sent from a website and our computer stores it in a file located inside our web browser.

关于登录表单处理,cookie 可以分为两种类型。我们已经解决了第一种,它允许我们提交信息给网站;第二种则允许我们在整个网站访问期间保持在一个永久的“已登录”状态。对于第二种表单,网站使用 cookie 来跟踪谁登录了,谁没有登录。

In the context of dealings with login forms, cookies can be of two types. One, we dealt in the previous section, that allows us to submit information to a website and second which lets us to remain in a permanent “logged-in” state throughout our visit to the website. For the second kind of forms, websites use cookies to keep track of who is logged in and who is not.

What do cookies do?

如今,大多数网站都使用 cookie 进行跟踪。我们可以借助以下步骤理解 cookie 的工作原理:

These days most of the websites are using cookies for tracking. We can understand the working of cookies with the help of following steps −

Step 1 - 首先,站点将验证我们的登录凭据并将其存储到浏览器的 cookie 中。此 cookie 通常包含由服务器生成的令牌、超时和跟踪信息。

Step 1 − First, the site will authenticate our login credentials and stores it in our browser’s cookie. This cookie generally contains a server-generated toke, time-out and tracking information.

Step 2 - 其次,网站将 cookie 用作身份验证的证明。在每次访问网站时,都会显示此身份验证。

Step 2 − Next, the website will use the cookie as a proof of authentication. This authentication is always shown whenever we visit the website.

cookie 对 Web 爬虫来说非常麻烦,因为如果 Web 爬虫不跟踪 cookie,提交的表单会发回,并且在下一页中似乎从未登录过。使用 Python requests 库可以非常轻松地跟踪 cookie,如下所示:

Cookies are very problematic for web scrapers because if web scrapers do not keep track of the cookies, the submitted form is sent back and at the next page it seems that they never logged in. It is very easy to track the cookies with the help of Python requests library, as shown below −

import requests
parameters = {‘Name’:’Enter your name’, ‘Email-id’:’Your Emailid’,’Message’:’Type your message here’}
r = requests.post(“enter the URL”, data = parameters)

在上述代码行中,URL 将是充当登录表单处理器的页面。

In the above line of code, the URL would be the page which will act as the processor for the login form.

print(‘The cookie is:’)
print(r.cookies.get_dict())
print(r.text)

在运行上述脚本后,我们将从上一次请求的结果中检索 cookie。

After running the above script, we will retrieve the cookies from the result of last request.

cookie 还有一个问题,即网站有时会在不发出警告的情况下频繁地修改 cookie。我们可以使用以下方法处理此类情况:

There is another issue with cookies that sometimes websites frequently modify cookies without warning. Such kind of situation can be dealt with requests.Session() as follows −

import requests
session = requests.Session()
parameters = {‘Name’:’Enter your name’, ‘Email-id’:’Your Emailid’,’Message’:’Type your message here’}
r = session.post(“enter the URL”, data = parameters)

在上述代码行中,URL 将是充当登录表单处理器的页面。

In the above line of code, the URL would be the page which will act as the processor for the login form.

print(‘The cookie is:’)
print(r.cookies.get_dict())
print(r.text)

请注意,你可以轻松了解带会话和不带会话的脚本之间的区别。

Observe that you can easily understand the difference between script with session and without session.

Automating forms with Python

在本节中,我们将处理一个 Python 模块,名为 Mechanize,此模块将减少我们的工作并自动执行填充表单的过程。

In this section we are going to deal with a Python module named Mechanize that will reduce our work and automate the process of filling up forms.

Mechanize module

Mechanize 模块为我们提供一个高级界面来与表单进行交互。在开始使用它之前,我们需要使用以下命令安装它:

Mechanize module provides us a high-level interface to interact with forms. Before starting using it we need to install it with the following command −

pip install mechanize

请注意,此命令仅适用于 Python 2.x。

Note that it would work only in Python 2.x.

Example

在此示例中,我们将自动化填写一个具有两个字段(即电子邮件和密码)的登录表单的过程:

In this example, we are going to automate the process of filling a login form having two fields namely email and password −

import mechanize
brwsr = mechanize.Browser()
brwsr.open(Enter the URL of login)
brwsr.select_form(nr = 0)
brwsr['email'] = ‘Enter email’
brwsr['password'] = ‘Enter password’
response = brwsr.submit()
brwsr.submit()

你可以非常轻松地理解上述代码。首先,我们导入了 mechanize 模块。然后创建了一个 Mechanize 浏览器对象。然后,我们导航到登录 URL 并选择了表单。之后,直接将名称和值传递给浏览器对象。

The above code is very easy to understand. First, we imported mechanize module. Then a Mechanize browser object has been created. Then, we navigated to the login URL and selected the form. After that, names and values are passed directly to the browser object.

Python Web Scraping - Processing CAPTCHA

在本节中,让我们了解如何执行网络抓取和处理 CAPTCHA,它用于测试用户是人还是机器人。

In this chapter, let us understand how to perform web scraping and processing CAPTCHA that is used for testing a user for human or robot.

What is CAPTCHA?

CAPTCHA 的全称是 Completely Automated Public Turing test to tell Computers and Humans Apart ,这清楚地表明它是一种测试,用于确定用户是否是人。

The full form of CAPTCHA is Completely Automated Public Turing test to tell Computers and Humans Apart, which clearly suggests that it is a test to determine whether the user is human or not.

CAPTCHA 是一幅扭曲的图像,计算机程序通常难以检测,但人类可以设法理解。大多数网站使用 CAPTCHA 来防止机器人交互。

A CAPTCHA is a distorted image which is usually not easy to detect by computer program but a human can somehow manage to understand it. Most of the websites use CAPTCHA to prevent bots from interacting.

Loading CAPTCHA with Python

假设我们想在网站上注册,并且表单带有 CAPTCHA,那么在加载 CAPTCHA 图像之前,我们需要了解表单所需的具体信息。借助下一个 Python 脚本,我们可以了解名为 http://example.webscrapping.com. 的网站上注册表单的表单要求。

Suppose we want to do registration on a website and there is form with CAPTCHA, then before loading the CAPTCHA image we need to know about the specific information required by the form. With the help of next Python script we can understand the form requirements of registration form on website named http://example.webscrapping.com.

import lxml.html
import urllib.request as urllib2
import pprint
import http.cookiejar as cookielib
def form_parsing(html):
   tree = lxml.html.fromstring(html)
   data = {}
   for e in tree.cssselect('form input'):
      if e.get('name'):
         data[e.get('name')] = e.get('value')
   return data
REGISTER_URL = '<a target="_blank" rel="nofollow"
   href="http://example.webscraping.com/user/register">http://example.webscraping.com/user/register'</a>
ckj = cookielib.CookieJar()
browser = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckj))
html = browser.open(
   '<a target="_blank" rel="nofollow"
      href="http://example.webscraping.com/places/default/user/register?_next">
      http://example.webscraping.com/places/default/user/register?_next</a> = /places/default/index'
).read()
form = form_parsing(html)
pprint.pprint(form)

在上述 Python 脚本中,我们首先使用 lxml python 模块定义一个函数来解析表单,然后它将打印如下表单要求:

In the above Python script, first we defined a function that will parse the form by using lxml python module and then it will print the form requirements as follows −

{
   '_formkey': '5e306d73-5774-4146-a94e-3541f22c95ab',
   '_formname': 'register',
   '_next': '/places/default/index',
   'email': '',
   'first_name': '',
   'last_name': '',
   'password': '',
   'password_two': '',
   'recaptcha_response_field': None
}

您可以从上面的输出中检查,除了 recpatcha_response_field 之外的所有信息都是可以理解且直接的。现在问题是如何处理这些复杂的信息并下载 CAPTCHA。借助 pillow Python 库可以执行此操作,如下所示:

You can check from the above output that all the information except recpatcha_response_field are understandable and straightforward. Now the question arises that how we can handle this complex information and download CAPTCHA. It can be done with the help of pillow Python library as follows;

Pillow Python Package

Pillow 是 Python 图像库的一个分支,具有用于处理图像的有用函数。它可以通过以下命令安装:

Pillow is a fork of the Python Image library having useful functions for manipulating images. It can be installed with the help of following command −

pip install pillow

在下一个示例中,我们将使用它来加载 CAPTCHA:

In the next example we will use it for loading the CAPTCHA −

from io import BytesIO
import lxml.html
from PIL import Image
def load_captcha(html):
   tree = lxml.html.fromstring(html)
   img_data = tree.cssselect('div#recaptcha img')[0].get('src')
   img_data = img_data.partition(',')[-1]
   binary_img_data = img_data.decode('base64')
   file_like = BytesIO(binary_img_data)
   img = Image.open(file_like)
   return img

上述 python 脚本正在使用 pillow python 程序包并定义一个用于加载 CAPTCHA 图像的函数。必须将其与上一个脚本中定义的 form_parser() 函数一起使用,以获取有关注册表单的信息。此脚本将以有用的格式保存 CAPTCHA 图像,以后可将其提取为字符串。

The above python script is using pillow python package and defining a function for loading CAPTCHA image. It must be used with the function named form_parser() that is defined in the previous script for getting information about the registration form. This script will save the CAPTCHA image in a useful format which further can be extracted as string.

OCR: Extracting Text from Image using Python

在以有用的格式加载 CAPTCHA 之后,我们借助光学字符识别 (OCR) 来提取它,这是从图像中提取文本的过程。为此,我们将使用开源 Tesseract OCR 引擎。它可以通过以下命令安装:

After loading the CAPTCHA in a useful format, we can extract it with the help of Optical Character Recognition (OCR), a process of extracting text from the images. For this purpose, we are going to use open source Tesseract OCR engine. It can be installed with the help of following command −

pip install pytesseract

Example

在此,我们将扩展上述 Python 脚本,该脚本通过使用 Pillow Python 程序包加载 CAPTCHA,如下所示:

Here we will extend the above Python script, which loaded the CAPTCHA by using Pillow Python Package, as follows −

import pytesseract
img = get_captcha(html)
img.save('captcha_original.png')
gray = img.convert('L')
gray.save('captcha_gray.png')
bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
bw.save('captcha_thresholded.png')

以上 Python 脚本将以黑白模式读取验证码,从而清晰易于传递给 Tesseract 如下所示

The above Python script will read the CAPTCHA in black and white mode which would be clear and easy to pass to tesseract as follows −

pytesseract.image_to_string(bw)

运行以上脚本后,我们将获得注册表单的验证码作为输出

After running the above script we will get the CAPTCHA of registration form as the output.

Python Web Scraping - Testing with Scrapers

本章介绍如何在 Python 中使用网络抓取进行测试。

This chapter explains how to perform testing using web scrapers in Python.

Introduction

在大型网络项目中,网站后端的自动化测试会定期进行,但前端测试往往会被跳过。其主要原因在于,网站的编程就像各种标记和编程语言的网络。我们可以针对一种语言编写单元测试,但如果交互是使用另一种语言完成的,就会变得具有挑战性。这就是说,我们必须拥有一套测试,以确保我们的代码按预期执行。

In large web projects, automated testing of website’s backend is performed regularly but the frontend testing is skipped often. The main reason behind this is that the programming of websites is just like a net of various markup and programming languages. We can write unit test for one language but it becomes challenging if the interaction is being done in another language. That is why we must have suite of tests to make sure that our code is performing as per our expectation.

Testing using Python

当我们谈论测试时,意思是说单元测试。在深入探究使用 Python 进行测试之前,我们必须了解单元测试。以下是单元测试的一些特征:

When we are talking about testing, it means unit testing. Before diving deep into testing with Python, we must know about unit testing. Following are some of the characteristics of unit testing −

  1. At-least one aspect of the functionality of a component would be tested in each unit test.

  2. Each unit test is independent and can also run independently.

  3. Unit test does not interfere with success or failure of any other test.

  4. Unit tests can run in any order and must contain at least one assertion.

Unittest − Python Module

名为 Unittest 的 Python 模块用于单元测试,该模块附带所有标准 Python 安装。我们只需要导入它,其余的就是 unittest.TestCase 类的任务,它将执行以下操作:

Python module named Unittest for unit testing is comes with all the standard Python installation. We just need to import it and rest is the task of unittest.TestCase class which will do the followings −

  1. SetUp and tearDown functions are provided by unittest.TestCase class. These functions can run before and after each unit test.

  2. It also provides assert statements to allow tests to pass or fail.

  3. It runs all the functions that begin with test_ as unit test.

Example

在这个示例中,我们将结合 unittest 使用网络抓取。我们将测试 Wikipedia 页面,以搜索字符串“Python”。它主要会执行两个测试:第一个测试是,标题页面是与搜索字符串(即“Python”)相同或不同;第二个测试是,页面包含一个内容块 div。

In this example we are going to combine web scraping with unittest. We will test Wikipedia page for searching string ‘Python’. It will basically do two tests, first weather the title page is same as the search string i.e.‘Python’ or not and second test makes sure that the page has a content div.

首先,我们将导入必需的 Python 模块。我们使用 BeautifulSoup 进行网络抓取,当然我们也使用 unittest 进行测试。

First, we will import the required Python modules. We are using BeautifulSoup for web scraping and of course unittest for testing.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import unittest

现在,我们需要定义一个类,它将扩展 unittest.TestCase。全局对象 bs 将会在所有测试之间共享。由 unittest 指定的函数 setUpClass 将会完成它。这里我们将定义两个函数,一个用于测试标题页面,另一个用于测试页面内容。

Now we need to define a class which will extend unittest.TestCase. Global object bs would be shared between all tests. A unittest specified function setUpClass will accomplish it. Here we will define two functions, one for testing the title page and other for testing the page content.

class Test(unittest.TestCase):
   bs = None
   def setUpClass():
      url = '<a target="_blank" rel="nofollow" href="https://en.wikipedia.org/wiki/Python">https://en.wikipedia.org/wiki/Python'</a>
      Test.bs = BeautifulSoup(urlopen(url), 'html.parser')
   def test_titleText(self):
      pageTitle = Test.bs.find('h1').get_text()
      self.assertEqual('Python', pageTitle);
   def test_contentExists(self):
      content = Test.bs.find('div',{'id':'mw-content-text'})
      self.assertIsNotNone(content)
if __name__ == '__main__':
   unittest.main()

在运行完上述脚本后,我们将获得以下输出:

After running the above script we will get the following output −

----------------------------------------------------------------------
Ran 2 tests in 2.773s

OK
An exception has occurred, use %tb to see the full traceback.

SystemExit: False

D:\ProgramData\lib\site-packages\IPython\core\interactiveshell.py:2870:
UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
 warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

Testing with Selenium

让我们讨论如何使用 Python Selenium 进行测试。这也被称为 Selenium 测试。Python unittestSelenium 并没有太多共同点。我们知道 Selenium 将标准的 Python 命令发送到不同的浏览器,尽管它们的浏览器设计存在差异性。请记住,我们已经在之前的章节中安装了 Selenium 并使用过它。我们将在 Selenium 中创建测试脚本,并将其用于自动化。

Let us discuss how to use Python Selenium for testing. It is also called Selenium testing. Both Python unittest and Selenium do not have much in common. We know that Selenium sends the standard Python commands to different browsers, despite variation in their browser’s design. Recall that we already installed and worked with Selenium in previous chapters. Here we will create test scripts in Selenium and use it for automation.

Example

在此 Python 脚本的帮助下,我们为 Facebook 登录页面的自动化创建了测试脚本。你可以修改该示例来自动化你所选择的其他表单和登录,但其概念是一样的。

With the help of next Python script, we are creating test script for the automation of Facebook Login page. You can modify the example for automating other forms and logins of your choice, however the concept would be same.

首先要连接到 Web 浏览器,我们要从 selenium 模块导入 webdriver −

First for connecting to web browser, we will import webdriver from selenium module −

from selenium import webdriver

现在,我们需要从 selenium 模块导入 Keys。

Now, we need to import Keys from selenium module.

from selenium.webdriver.common.keys import Keys

接下来我们需要提供用于登录 Facebook 帐户的用户名和密码

Next we need to provide username and password for login into our facebook account

user = "gauravleekha@gmail.com"
pwd = ""

接下来,提供 Chrome 的 Web 驱动程序路径。

Next, provide the path to web driver for Chrome.

path = r'C:\\Users\\gaurav\\Desktop\\Chromedriver'
driver = webdriver.Chrome(executable_path=path)
driver.get("http://www.facebook.com")

现在,我们将使用 assert 关键字来验证条件。

Now we will verify the conditions by using assert keyword.

assert "Facebook" in driver.title

通过下面的代码行,我们正在将值发送到电子邮件部分。我们在这里通过其 id 进行搜索,但我们可以通过按 name driver.find_element_by_name("email") 进行搜索来执行此操作。

With the help of following line of code we are sending values to the email section. Here we are searching it by its id but we can do it by searching it by name as driver.find_element_by_name("email").

element = driver.find_element_by_id("email")
element.send_keys(user)

通过下面的代码行,我们正在将值发送到密码部分。我们在这里通过其 id 进行搜索,但我们可以通过按 name driver.find_element_by_name("pass") 进行搜索来执行此操作。

With the help of following line of code we are sending values to the password section. Here we are searching it by its id but we can do it by searching it by name as driver.find_element_by_name("pass").

element = driver.find_element_by_id("pass")
element.send_keys(pwd)

下一行代码用于在电子邮件和密码字段中插入值后按回车键/登录。

Next line of code is used to press enter/login after inserting the values in email and password field.

element.send_keys(Keys.RETURN)

现在,我们将关闭浏览器。

Now we will close the browser.

driver.close()

运行完上述脚本后,Chrome Web 浏览器将会打开,并且您可以看到电子邮件和密码正在被插入并单击登录按钮。

After running the above script, Chrome web browser will be opened and you can see email and password is being inserted and clicked on login button.

facebook login

Comparison: unittest or Selenium

unittest 和 selenium 的比较很困难,因为如果您想使用大型测试套件,那么需要 unites 的句法灵活性。另一方面,如果您要测试网站灵活性,那么 Selenium 测试将是我们的首选。但是如果我们可以将它们结合起来呢。我们可以将 selenium 导入 Python unittest 并充分利用两者。Selenium 可用于获取有关网站的信息,unittest 可评估该信息是否符合通过测试的标准。

The comparison of unittest and selenium is difficult because if you want to work with large test suites, the syntactical rigidity of unites is required. On the other hand, if you are going to test website flexibility then Selenium test would be our first choice. But what if we can combine both of them. We can import selenium into Python unittest and get the best of both. Selenium can be used to get information about a website and unittest can evaluate whether that information meets the criteria for passing the test or not.

例如,我们对上述 Python 脚本重新编制,以便通过结合两者自动执行 Facebook 登录,如下所示 −

For example, we are rewriting the above Python script for automation of Facebook login by combining both of them as follows −

import unittest
from selenium import webdriver

class InputFormsCheck(unittest.TestCase):
   def setUp(self):
      self.driver = webdriver.Chrome(r'C:\Users\gaurav\Desktop\chromedriver')
      def test_singleInputField(self):
      user = "gauravleekha@gmail.com"
      pwd = ""
      pageUrl = "http://www.facebook.com"
      driver=self.driver
      driver.maximize_window()
      driver.get(pageUrl)
      assert "Facebook" in driver.title
      elem = driver.find_element_by_id("email")
      elem.send_keys(user)
      elem = driver.find_element_by_id("pass")
      elem.send_keys(pwd)
      elem.send_keys(Keys.RETURN)
   def tearDown(self):
      self.driver.close()
if __name__ == "__main__":
   unittest.main()