Python Web Scraping Tutorial

Legality of Web Scraping

With Python, we can scrape any website or particular elements of a web page, but do you have any idea whether doing so is legal or not? Before scraping any website, we must know about the legality of web scraping. This chapter explains the concepts related to the legality of web scraping.

Introduction

Generally, if you are going to use the scraped data for personal use, there may not be any problem. But if you are going to republish that data, then before doing so you should make a download request to the owner of the site, or do some background research on its policies as well as on the data you are going to scrape.

Research Required Prior to Scraping

If you are targeting a website to scrape data from it, you need to understand its scale and structure. Following are some of the files that we need to analyze before starting web scraping.

Analyzing robots.txt

Actually, most publishers allow programmers to crawl their websites to some extent. In other words, publishers want only specific portions of their websites to be crawled. To define this, websites must set rules stating which portions can be crawled and which cannot. Such rules are defined in a file called robots.txt.

robots.txt is a human-readable file that identifies the portions of the website that crawlers are allowed, as well as not allowed, to scrape. There is no standard format for the robots.txt file, and the publishers of a website can modify it as per their needs. We can check the robots.txt file of a particular website by appending a slash and robots.txt to the URL of that website. For example, if we want to check it for Google.com, we need to type https://www.google.com/robots.txt and we will get something as follows −

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
and so on ...

Some of the most common rules that are defined in a website’s robots.txt file are as follows −

User-agent: BadCrawler
Disallow: /

The above rule means that the robots.txt file asks a crawler with the BadCrawler user agent not to crawl the website.

User-agent: *
Crawl-delay: 5
Disallow: /trap

The above rule means that the robots.txt file asks crawlers of every user agent to wait 5 seconds between download requests, to avoid overloading the server. The /trap link will try to block malicious crawlers that follow disallowed links. The publisher of a website can define many more such rules as per their requirements; the two shown above are among the most common.
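
Rather than reading robots.txt by hand, the rules can also be checked programmatically. The following is a minimal sketch using Python's standard-library urllib.robotparser module; the URL and user-agent strings are only examples −

from urllib import robotparser

# Download and parse the robots.txt file of the target site.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")
rp.read()

# Check whether a given user agent may fetch particular paths.
print(rp.can_fetch("*", "https://www.google.com/search"))
print(rp.can_fetch("*", "https://www.google.com/search/about"))

# Read the Crawl-delay for a user agent, if one is declared (None otherwise).
print(rp.crawl_delay("*"))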

Analyzing Sitemap files

What are you supposed to do if you want to crawl a website for updated information? You could crawl every web page to get that updated information, but this would increase the server traffic of that particular website. That is why websites provide sitemap files to help crawlers locate updated content without needing to crawl every web page. The sitemap standard is defined at http://www.sitemaps.org/protocol.html.

Content of Sitemap file

The following sitemap files are discovered in the robots.txt file at https://www.microsoft.com/robots.txt −

Sitemap: https://www.microsoft.com/en-us/explore/msft_sitemap_index.xml
Sitemap: https://www.microsoft.com/learning/sitemap.xml
Sitemap: https://www.microsoft.com/en-us/licensing/sitemap.xml
Sitemap: https://www.microsoft.com/en-us/legal/sitemap.xml
Sitemap: https://www.microsoft.com/filedata/sitemaps/RW5xN8
Sitemap: https://www.microsoft.com/store/collections.xml
Sitemap: https://www.microsoft.com/store/productdetailpages.index.xml
Sitemap: https://www.microsoft.com/en-us/store/locations/store-locationssitemap.xml

The above content shows that the sitemap lists the URLs on the website and further allows a webmaster to specify additional information about each URL, such as its last updated date, how often its contents change, and its importance in relation to other URLs.
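
Once a sitemap URL is known, its entries can be read programmatically. The following is a minimal sketch, assuming the third-party requests library is installed; the chosen sitemap URL is just one of the examples listed above −

import requests
from xml.etree import ElementTree

# One of the sitemap URLs listed in Microsoft's robots.txt (example only).
url = "https://www.microsoft.com/learning/sitemap.xml"
response = requests.get(url)

# Sitemap files use the XML namespace defined by the sitemaps.org protocol.
root = ElementTree.fromstring(response.content)
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Print every <loc> entry, i.e. the URLs the site wants crawlers to find.
for loc in root.findall(".//sm:loc", ns):
    print(loc.text)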

What is the Size of a Website?

Does the size of a website, i.e. the number of web pages it contains, affect the way we crawl it? Certainly yes. If we have only a small number of web pages to crawl, efficiency is not a serious issue, but if the website has millions of web pages, for example Microsoft.com, then downloading each web page sequentially would take several months, and efficiency becomes a serious concern.

Checking Website’s Size

By checking the size of the results of Google’s crawler, we can estimate the size of a website. Our results can be filtered by using the keyword site while doing a Google search. For example, the size of https://authoraditiagarwal.com/ is estimated below −

[Screenshot: checking the size of the website with a Google site: search]

You can see that there are around 60 results, which means it is not a big website and crawling it would not lead to efficiency issues.
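
There is no official API for this check, but the site: query itself can be built in Python and opened in a browser for a quick manual look. A minimal sketch, where the target URL is just the example used above −

import webbrowser
from urllib.parse import quote_plus

# Example target; replace with the website you plan to scrape.
target = "https://authoraditiagarwal.com/"

# Open a Google "site:" search; the approximate result count on the results
# page gives a rough idea of how many pages the site has.
webbrowser.open("https://www.google.com/search?q=" + quote_plus("site:" + target))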

Which technology is used by the website?

Another important question is whether the technology used by a website affects the way we crawl it. Yes, it does. But how can we check which technologies a website uses? There is a Python library named builtwith with the help of which we can find out about the technologies used by a website.

Example

In this example, we are going to check the technologies used by the website https://authoraditiagarwal.com with the help of the Python library builtwith. But before using this library, we need to install it as follows −

(base) D:\ProgramData>pip install builtwith
Collecting builtwith
   Downloading
https://files.pythonhosted.org/packages/9b/b8/4a320be83bb3c9c1b3ac3f9469a5d66e0
2918e20d226aa97a3e86bddd130/builtwith-1.3.3.tar.gz
Requirement already satisfied: six in d:\programdata\lib\site-packages (from
builtwith) (1.10.0)
Building wheels for collected packages: builtwith
   Running setup.py bdist_wheel for builtwith ... done
   Stored in directory:
C:\Users\gaurav\AppData\Local\pip\Cache\wheels\2b\00\c2\a96241e7fe520e75093898b
f926764a924873e0304f10b2524
Successfully built builtwith
Installing collected packages: builtwith
Successfully installed builtwith-1.3.3

Now, with the help of the following simple lines of code, we can check the technologies used by a particular website −

In [1]: import builtwith
In [2]: builtwith.parse('http://authoraditiagarwal.com')
Out[2]:
{'blogs': ['PHP', 'WordPress'],
   'cms': ['WordPress'],
   'ecommerce': ['WooCommerce'],
   'font-scripts': ['Font Awesome'],
   'javascript-frameworks': ['jQuery'],
   'programming-languages': ['PHP'],
   'web-servers': ['Apache']}
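
Since builtwith.parse() returns an ordinary dictionary, as the output above shows, the result can also be inspected programmatically, for example to adjust the scraping logic to the detected CMS −

import builtwith

# Same example site as above.
tech = builtwith.parse('http://authoraditiagarwal.com')

# The keys are technology categories; each value is a list of detected tools.
if 'WordPress' in tech.get('cms', []):
    print('The site runs WordPress.')
print(tech.get('javascript-frameworks', []))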

Who is the owner of the website?

The owner of the website also matters, because if the owner is known for blocking crawlers, then crawlers must be careful while scraping data from the website. There is a protocol named Whois with the help of which we can find out who owns a website.

Example

In this example, we are going to check the owner of a website, say microsoft.com, with the help of Whois. But before using the corresponding library, python-whois, we need to install it as follows −

(base) D:\ProgramData>pip install python-whois
Collecting python-whois
   Downloading
https://files.pythonhosted.org/packages/63/8a/8ed58b8b28b6200ce1cdfe4e4f3bbc8b8
5a79eef2aa615ec2fef511b3d68/python-whois-0.7.0.tar.gz (82kB)
   100% |████████████████████████████████| 92kB 164kB/s
Requirement already satisfied: future in d:\programdata\lib\site-packages (from
python-whois) (0.16.0)
Building wheels for collected packages: python-whois
   Running setup.py bdist_wheel for python-whois ... done
   Stored in directory:
C:\Users\gaurav\AppData\Local\pip\Cache\wheels\06\cb\7d\33704632b0e1bb64460dc2b
4dcc81ab212a3d5e52ab32dc531
Successfully built python-whois
Installing collected packages: python-whois
Successfully installed python-whois-0.7.0

Now, with the help of the following simple lines of code, we can check the owner of a particular website −

In [1]: import whois
In [2]: print (whois.whois('microsoft.com'))
{
   "domain_name": [
      "MICROSOFT.COM",
      "microsoft.com"
   ],
   -------
   "name_servers": [
      "NS1.MSFT.NET",
      "NS2.MSFT.NET",
      "NS3.MSFT.NET",
      "NS4.MSFT.NET",
      "ns3.msft.net",
      "ns1.msft.net",
      "ns4.msft.net",
      "ns2.msft.net"
   ],
   "emails": [
      "abusecomplaints@markmonitor.com",
      "domains@microsoft.com",
      "msnhst@microsoft.com",
      "whoisrelay@markmonitor.com"
   ],
}
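
The record returned by whois.whois() can also be queried field by field. The following is a minimal sketch, assuming the result supports dictionary-style access as in recent python-whois versions; the field names are taken from the output above −

import whois

# Fetch the Whois record for the example domain used above.
record = whois.whois('microsoft.com')

# Read individual fields of the parsed record, e.g. name servers and contact emails.
print(record['name_servers'])
print(record['emails'])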