Python Web Scraping: A Concise Tutorial

Python Web Scraping - Introduction

Web scraping is an automated process of extracting information from the web. This chapter will give you an in-depth idea of web scraping, its comparison with web crawling, and why you should opt for web scraping. You will also learn about the components and working of a web scraper.

What is Web Scraping?

The dictionary meaning of the word 'scraping' implies getting something from the web. Here two questions arise: what can we get from the web, and how can we get it?

The answer to the first question is 'data'. Data is indispensable for any programmer, and the basic requirement of every programming project is a large amount of useful data.

The answer to the second question is a bit trickier, because there are many ways to get data. In general, we may get data from a database, a data file, or other sources. But what if we need a large amount of data that is available online? One way to get such data is to manually search for it (clicking away in a web browser) and save the required data (copy-pasting it into a spreadsheet or file). This method is quite tedious and time-consuming. Another way is to use web scraping.

Web scraping, also called web data mining or web harvesting, is the process of constructing an agent that can extract, parse, download, and organize useful information from the web automatically. In other words, instead of our manually saving data from websites, web scraping software will automatically load and extract data from multiple websites as per our requirements.

Origin of Web Scraping

The origin of web scraping is screen scraping, which was used to integrate non-web-based applications or native Windows applications. Screen scraping was originally used prior to the wide adoption of the World Wide Web (WWW), but it could not scale up as the WWW expanded. This made it necessary to automate the screen-scraping approach, and the technique called 'web scraping' came into existence.

Web Crawling vs Web Scraping

The terms web crawling and web scraping are often used interchangeably, as the basic concept of both is to extract data. However, they are different from each other. We can understand the basic difference from their definitions.

Web crawling is basically used to index the information on a page using bots, also known as crawlers. It is also called indexing. Web scraping, on the other hand, is an automated way of extracting information using bots, also known as scrapers. It is also called data extraction.

To understand the difference between these two terms, let us look at the comparison given below −

Web Crawling: Refers to downloading and storing the contents of a large number of websites.
Web Scraping: Refers to extracting individual data elements from a website by using a site-specific structure.

Web Crawling: Mostly done on a large scale.
Web Scraping: Can be implemented at any scale.

Web Crawling: Yields generic information.
Web Scraping: Yields specific information.

Web Crawling: Used by major search engines like Google, Bing, and Yahoo; Googlebot is an example of a web crawler.
Web Scraping: The extracted information can be used to replicate content on another website or to perform data analysis. For example, the data elements can be names, addresses, prices, etc.

Uses of Web Scraping

The uses of, and reasons for, web scraping are as endless as the uses of the World Wide Web. A web scraper can do anything a human can do, such as ordering food online, scanning an online shopping website for you, or buying match tickets the moment they become available. Some of the important uses of web scraping are discussed here −

  1. E-commerce Websites − Web scrapers can collect data, especially the prices of a specific product, from various e-commerce websites for comparison.

  2. Content Aggregators − Web scraping is widely used by content aggregators, such as news aggregators and job aggregators, to provide updated data to their users.

  3. Marketing and Sales Campaigns − Web scrapers can be used to gather data such as email addresses and phone numbers for sales and marketing campaigns.

  4. Search Engine Optimization (SEO) − Web scraping is widely used by SEO tools such as SEMrush and Majestic to tell businesses how they rank for the search keywords that matter to them.

  5. Data for Machine Learning Projects − The retrieval of data for machine learning projects often depends on web scraping.

  6. Data for Research − Researchers can collect useful data for their research work while saving time through this automated process.

Components of a Web Scraper

A web scraper consists of the following components −

Web Crawler Module

A very necessary component of a web scraper, the web crawler module, is used to navigate the target website by making HTTP or HTTPS requests to its URLs. The crawler downloads the unstructured data (HTML content) and passes it to the extractor, the next module.
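A crawler module along these lines can be sketched with the Python standard library alone. The `fetch` helper, the User-Agent string, and the URLs below are illustrative placeholders, not part of any particular framework; a real crawler would also honor robots.txt and rate-limit its requests.

```python
# Minimal crawler-module sketch using only the standard library.
from urllib.request import Request, urlopen
from urllib.parse import urljoin

def fetch(url, timeout=10):
    """Download the raw HTML of one page over HTTP or HTTPS."""
    req = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)

# Relative links discovered in a page are resolved against the page's
# own URL before being queued for the next fetch.
next_url = urljoin("https://example.com/products/", "item42.html")
print(next_url)  # https://example.com/products/item42.html
```

The HTML string returned by such a fetch is what gets handed on for parsing.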

Extractor

The extractor processes the fetched HTML content and extracts the data into a semi-structured format. It is also called the parser module, and it uses different parsing techniques such as regular expressions, HTML parsing, DOM parsing, or artificial intelligence.
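As an illustration of the HTML-parsing technique, here is a sketch built on the standard library's `html.parser`. The `price` class and the HTML snippet are made-up stand-ins for real crawler output.

```python
# Extractor sketch: collect the text of every element whose
# class attribute is "price", using the stdlib HTMLParser.
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

# Hard-coded snippet standing in for HTML handed over by the crawler.
page = '<ul><li class="price">$9.99</li><li class="price">$14.50</li></ul>'
p = PriceExtractor()
p.feed(page)
print(p.prices)  # ['$9.99', '$14.50']
```

The result is semi-structured: a flat list of raw strings that still needs cleaning.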

Data Transformation and Cleaning Module

The data extracted above is not suitable for ready use. It must pass through a cleaning module so that we can use it. Methods such as string manipulation or regular expressions can be used for this purpose. Note that extraction and transformation can also be performed in a single step.
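A cleaning step of this kind can be sketched with a regular expression. The price format handled here is an assumption about what the extractor produced.

```python
# Cleaning-module sketch: normalise raw scraped price strings
# (currency symbols, whitespace, thousands separators) into floats.
import re

def clean_price(raw):
    """'  $ 1,299.00 ' -> 1299.0; returns None when no number is found."""
    m = re.search(r"[\d,]+(?:\.\d+)?", raw)
    return float(m.group().replace(",", "")) if m else None

print(clean_price("  $ 1,299.00 "))      # 1299.0
print(clean_price("price unavailable"))  # None
```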

Storage Module

After extracting the data, we need to store it as per our requirements. The storage module will output the data in a standard format, which can be stored in a database or in JSON or CSV format.
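A storage module that writes the same records to both JSON and CSV can be sketched with the standard library. The record fields are invented for illustration, and an `io.StringIO` buffer stands in for a file on disk.

```python
# Storage-module sketch: emit cleaned records as JSON and CSV.
import csv
import io
import json

records = [
    {"name": "Widget", "price": 9.99},
    {"name": "Gadget", "price": 14.5},
]

# JSON output (would normally be written to a .json file).
json_text = json.dumps(records, indent=2)

# CSV output (StringIO stands in for an open file handle).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

print(csv_text)
```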

Working of a Web Scraper

A web scraper may be defined as software or a script used to download the contents of multiple web pages and extract data from them.

[Diagram: working of a web scraper]

We can understand the working of a web scraper in simple steps as shown in the diagram given above.

Step 1: Downloading Contents from Web Pages

In this step, a web scraper will download the requested contents from multiple web pages.

Step 2: Extracting Data

The data on websites is HTML and mostly unstructured. Hence, in this step, the web scraper will parse and extract structured data from the downloaded contents.

Step 3: Storing the Data

Here, the web scraper will store and save the extracted data in a format such as CSV or JSON, or in a database.

Step 4: Analyzing the Data

After all these steps have been successfully completed, the web scraper will analyze the data thus obtained.
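The four steps can be sketched end to end as follows. The page content is hard-coded so the example runs without network access; a real scraper would download the HTML in step 1, and the class name and price format are illustrative assumptions.

```python
# End-to-end sketch of the four steps on an in-memory page.
import re
import statistics

# Step 1: Downloading contents (stubbed with a hard-coded page).
page = '<div class="price">$10.00</div><div class="price">$30.00</div>'

# Step 2: Extracting data with a regular expression.
raw_prices = re.findall(r'class="price">([^<]+)<', page)

# Step 3: Storing the data in a clean, structured form.
prices = [float(p.strip().lstrip("$")) for p in raw_prices]

# Step 4: Analyzing the stored data.
print(prices, statistics.mean(prices))  # [10.0, 30.0] 20.0
```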