Scrapy - Quick Guide

Scrapy - Overview

Scrapy is a fast, open-source web crawling framework written in Python, used to extract the data from the web page with the help of selectors based on XPath.

Scrapy was first released on June 26, 2008 under the BSD license, and a milestone 1.0 release followed in June 2015.

Why Use Scrapy?

  1. It makes it easier to build and scale large crawling projects.

  2. It has a built-in mechanism called Selectors for extracting data from websites.

  3. It handles requests asynchronously, which makes it fast.

  4. It automatically adjusts the crawling speed using the auto-throttling mechanism.

  5. It ensures developer accessibility.

Features of Scrapy

  1. Scrapy is an open-source and free-to-use web crawling framework.

  2. Scrapy generates feed exports in formats such as JSON, CSV, and XML.

  3. Scrapy has built-in support for selecting and extracting data from sources using either XPath or CSS expressions.

  4. Scrapy is based on a crawler, which allows extracting data from web pages automatically.

Advantages

  1. Scrapy is easily extensible, fast, and powerful.

  2. It is a cross-platform application framework (Windows, Linux, Mac OS and BSD).

  3. Scrapy requests are scheduled and processed asynchronously.

  4. Scrapy comes with a built-in service called Scrapyd, which allows you to upload projects and control spiders using a JSON web service.

  5. It is possible to scrape any website, even if that website does not have an API for raw data access.

Disadvantages

  1. Scrapy is only for Python 2.7.

  2. Installation is different for different operating systems.

Scrapy - Environment

In this chapter, we will discuss how to install and set up Scrapy. Scrapy must be installed with Python.

Scrapy can be installed by using pip. To install, run the following command −

pip install Scrapy

Windows

Note − Python 3 is not supported on Windows OS.

Step 1 − Install Python 2.7 from python.org.

Set environmental variables by adding the following paths to the PATH −

C:\Python27\;C:\Python27\Scripts\;

You can check the Python version using the following command −

python --version

Step 2 − Install OpenSSL.

Add C:\OpenSSL-Win32\bin in your environmental variables.

Note − OpenSSL comes preinstalled in all operating systems except Windows.

Step 3 − Install Visual C++ 2008 redistributables.

Step 4 − Install pywin32.

Step 5 − Install pip for Python versions older than 2.7.9.

You can check the pip version using the following command −

pip --version

Step 6 − To install scrapy, run the following command −

pip install Scrapy

Anaconda

If you have anaconda or miniconda installed on your machine, run the below command to install Scrapy using conda −

conda install -c scrapinghub scrapy

The Scrapinghub company provides official conda packages for Linux, Windows, and OS X.

Note − It is recommended to install Scrapy using the above command if you have issues installing via pip.

Ubuntu 9.10 or Above

The latest version of Python is pre-installed on Ubuntu OS. Use the apt-gettable Ubuntu packages provided by Scrapinghub. To use the packages −

Step 1 − You need to import the GPG key used to sign Scrapy packages into APT keyring −

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7

Step 2 − Next, use the following command to create /etc/apt/sources.list.d/scrapy.list file −

echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list

Step 3 − Update package list and install scrapy −

sudo apt-get update && sudo apt-get install scrapy

Archlinux

You can install Scrapy from AUR Scrapy package using the following command −

yaourt -S scrapy

Mac OS X

Use the following command to install Xcode command line tools −

xcode-select --install

Instead of using system Python, install a new updated version that doesn’t conflict with the rest of your system.

Step 1 − Install homebrew.

Step 2 − Set environmental PATH variable to specify that homebrew packages should be used before system packages −

echo "export PATH = /usr/local/bin:/usr/local/sbin:$PATH" >> ~/.bashrc

Step 3 − To make sure the changes are done, reload .bashrc using the following command −

source ~/.bashrc

Step 4 − Next, install Python using the following command −

brew install python

Step 5 − Install Scrapy using the following command −

pip install Scrapy

Scrapy - Command Line Tools

Description

The Scrapy command line tool is used for controlling Scrapy, which is often referred to as 'Scrapy tool'. It includes the commands for various objects with a group of arguments and options.

Configuration Settings

Scrapy will find configuration settings in the scrapy.cfg file. Following are a few locations −

  1. C:\scrapy(project folder)\scrapy.cfg in the system

  2. ~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) for global settings

  3. You can find the scrapy.cfg inside the root of the project.

Scrapy can also be configured using the following environment variables −

  1. SCRAPY_SETTINGS_MODULE

  2. SCRAPY_PROJECT

  3. SCRAPY_PYTHON_SHELL

Default Structure Scrapy Project

The following structure shows the default file structure of the Scrapy project.

scrapy.cfg                - The project configuration (deploy) file
project_name/             - Name of the project
   __init__.py
   items.py               - It is the project's items file
   pipelines.py           - It is the project's pipelines file
   settings.py            - It is the project's settings file
   spiders/               - It is the spiders directory
      __init__.py
      spider_name.py
      . . .

The scrapy.cfg file sits in the project root directory and includes the project name along with the project settings. For instance −

[settings]
default = [name of the project].settings

[deploy]
#url = http://localhost:6800/
project = [name of the project]

Using Scrapy Tool

The Scrapy tool displays usage information and the available commands as follows −

Scrapy X.Y  - no active project
Usage:
   scrapy <command> [options] [args]
Available commands:
   crawl      It puts spider (handle the URL) to work for crawling data
   fetch      It fetches the response from the given URL

Creating a Project

You can use the following command to create the project in Scrapy −

scrapy startproject project_name

This will create a project directory called project_name. Next, go to the newly created project, using the following command −

cd  project_name

Controlling Projects

You can control the project and manage it using the Scrapy tool, and also create a new spider using the following command −

scrapy genspider mydomain mydomain.com
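
This generates a spider skeleton inside the project's spiders directory. For a domain called mydomain.com, the generated file looks roughly like the following sketch (the exact template can vary between Scrapy versions):

import scrapy

class MydomainSpider(scrapy.Spider):
   name = "mydomain"
   allowed_domains = ["mydomain.com"]
   start_urls = ["http://www.mydomain.com/"]

   def parse(self, response):
      # Fill in the extraction logic for each downloaded page here.
      pass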

Commands such as crawl must be used inside a Scrapy project. You will come to know which commands must run inside a Scrapy project in the coming section.

Scrapy contains some built-in commands, which can be used for your project. To see the list of available commands, use the following command −

scrapy -h

When you run the above command, Scrapy will display the list of available commands, as listed below −

  1. fetch − It fetches the URL using the Scrapy downloader.

  2. runspider − It is used to run a self-contained spider without creating a project.

  3. settings − It specifies the project setting value.

  4. shell − It is an interactive scraping module for the given URL.

  5. startproject − It creates a new Scrapy project.

  6. version − It displays the Scrapy version.

  7. view − It fetches the URL using the Scrapy downloader and shows the contents in a browser.

There are also some project-related commands, as listed below −

  1. crawl − It is used to crawl data using the spider.

  2. check − It checks the items returned by the crawl command.

  3. list − It displays the list of available spiders present in the project.

  4. edit − You can edit the spiders by using the editor.

  5. parse − It parses the given URL with the spider.

  6. bench − It is used to run a quick benchmark test (the benchmark tells how many pages can be crawled per minute by Scrapy).

Custom Project Commands

You can build a custom project command with the COMMANDS_MODULE setting in a Scrapy project. The setting defaults to an empty string. You can add the following custom command −

COMMANDS_MODULE = 'mycmd.commands'

Scrapy commands can be added using the scrapy.commands section in the setup.py file shown as follows −

from setuptools import setup, find_packages

setup(name = 'scrapy-module_demo',
   entry_points = {
      'scrapy.commands': [
         'cmd_demo = my_module.commands:CmdDemo',
      ],
   },
)

The above code adds the cmd_demo command in the setup.py file.
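
For reference, a custom command itself is a subclass of scrapy.commands.ScrapyCommand. The following is only a minimal sketch of what a module such as mycmd/commands.py (matching the COMMANDS_MODULE value above) might contain; the class name and behaviour are illustrative assumptions, not part of the original example:

from scrapy.commands import ScrapyCommand

class CmdDemo(ScrapyCommand):
   requires_project = True

   def short_desc(self):
      return "Print the name of every spider in the project"

   def run(self, args, opts):
      # self.crawler_process is set by Scrapy before run() is called.
      for spider_name in self.crawler_process.spider_loader.list():
         print(spider_name)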

Scrapy - Spiders

Description

Spider is a class responsible for defining how to follow the links through a website and extract the information from the pages.

The default spiders of Scrapy are as follows −

scrapy.Spider

It is the spider from which every other spider must inherit. It has the following class −

class scrapy.spiders.Spider

The following table shows the fields of scrapy.Spider class −

  1. name − It is the name of your spider.

  2. allowed_domains − It is a list of domains on which the spider crawls.

  3. start_urls − It is a list of URLs, which will be the roots for later crawls, where the spider will begin to crawl from.

  4. custom_settings − These settings, when running the spider, override the project-wide configuration.

  5. crawler − It is an attribute that links to the Crawler object to which the spider instance is bound.

  6. settings − These are the settings for running a spider.

  7. logger − It is a Python logger used to send log messages.

  8. from_crawler(crawler, *args, **kwargs) − It is a class method which creates your spider. The parameters are − crawler − a crawler to which the spider instance will be bound; args (list) − these arguments are passed to the method __init__(); kwargs (dict) − these keyword arguments are passed to the method __init__().

  9. start_requests() − When no particular URLs are specified and the spider is opened for scraping, Scrapy calls the start_requests() method.

  10. make_requests_from_url(url) − It is a method used to convert URLs to requests.

  11. parse(response) − This method processes the response and returns scraped data, following more URLs.

  12. log(message[, level, component]) − It is a method that sends a log message through the spider's logger.

  13. closed(reason) − This method is called when the spider closes.
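
Putting these fields together, a minimal spider might look like the following sketch; the domain and XPath are illustrative placeholders in the style of the other examples in this guide:

import scrapy

class BasicSpider(scrapy.Spider):
   name = "basic"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com"]

   def parse(self, response):
      self.log("Visited %s" % response.url)
      # Yield one record per heading found on the page.
      for title in response.xpath("//h2/text()").extract():
         yield {"title": title}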

Spider Arguments

Spider arguments are used to specify start URLs and are passed using the crawl command with the -a option, shown as follows −

scrapy crawl first_scrapy -a group=accessories

The following code demonstrates how a spider receives arguments −

import scrapy

class FirstSpider(scrapy.Spider):
   name = "first"

   def __init__(self, group = None, *args, **kwargs):
      super(FirstSpider, self).__init__(*args, **kwargs)
      self.start_urls = ["http://www.example.com/group/%s" % group]

Generic Spiders

You can use generic spiders as base classes to subclass your own spiders from. Their aim is to follow all links on the website based on certain rules to extract data from all the pages.

For the examples used in the following spiders, let’s assume we have a project with the following fields −

import scrapy
from scrapy.item import Item, Field

class First_scrapyItem(scrapy.Item):
   product_title = Field()
   product_link = Field()
   product_description = Field()

CrawlSpider

CrawlSpider defines a set of rules to follow the links and scrape more than one page. It has the following class −

class scrapy.spiders.CrawlSpider

Following are the attributes of CrawlSpider class −

rules

It is a list of rule objects that defines how the crawler follows the link.

The following table shows the rules of CrawlSpider class −

  1. LinkExtractor − It specifies how the spider follows the links and extracts the data.

  2. callback − It is to be called after each page is scraped.

  3. follow − It specifies whether to continue following links or not.

parse_start_url(response)

It returns either an item or a request object by allowing the initial responses to be parsed.

Note − Make sure you name your callback function something other than parse when writing the rules, because the parse function is used by CrawlSpider to implement its logic.

Let's take a look at the following example, where the spider starts crawling demoexample.com's home page, collecting all pages and links, and parsing them with the parse_item method −

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from demoproject.items import DemoItem

class DemoSpider(CrawlSpider):
   name = "demo"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com"]

   rules = (
      Rule(LinkExtractor(allow = (), restrict_xpaths = ("//div[@class = 'next']",)),
         callback = "parse_item", follow = True),
   )

   def parse_item(self, response):
      item = DemoItem()
      item["product_title"] = response.xpath("a/text()").extract()
      item["product_link"] = response.xpath("a/@href").extract()
      item["product_description"] = response.xpath("div[@class = 'desc']/text()").extract()
      return item

XMLFeedSpider

It is the base class for spiders that scrape from XML feeds and iterate over the nodes. It has the following class −

class scrapy.spiders.XMLFeedSpider

The following table shows the class attributes used to set an iterator and a tag name −

  1. iterator − It defines the iterator to be used. It can be iternodes, html, or xml. The default is iternodes.

  2. itertag − It is a string with the name of the node to iterate over.

  3. namespaces − It is a list of (prefix, uri) tuples that automatically registers namespaces using the register_namespace() method.

  4. adapt_response(response) − It receives the response and modifies the response body as soon as it arrives from the spider middleware, before the spider starts parsing it.

  5. parse_node(response, selector) − It receives the response and a selector when called for each node matching the provided tag name. Note − Your spider won't work if you don't override this method.

  6. process_results(response, results) − It returns a list of results and the response returned by the spider.
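
There is no XMLFeedSpider example in the original text, so the following is only a sketch of how the attributes above fit together; the feed URL, itertag, and field names are assumptions modeled on the other examples in this guide:

from scrapy.spiders import XMLFeedSpider
from demoproject.items import DemoItem

class DemoXMLSpider(XMLFeedSpider):
   name = "demoxml"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com/feed.xml"]
   iterator = "iternodes"   # the default iterator
   itertag = "product"      # iterate over every <product> node

   def parse_node(self, response, node):
      item = DemoItem()
      item["product_title"] = node.xpath("title/text()").extract_first()
      item["product_link"] = node.xpath("link/text()").extract_first()
      item["product_description"] = node.xpath("description/text()").extract_first()
      return item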

CSVFeedSpider

It iterates through each of its rows, receives a CSV file as a response, and calls the parse_row() method. It has the following class −

class scrapy.spiders.CSVFeedSpider

The following table shows the options that can be set regarding the CSV file −

  1. delimiter − It is a string with the separator character for each field; it defaults to a comma (',').

  2. quotechar − It is a string with the quote character for each field; it defaults to a quotation mark ('"').

  3. headers − It is a list of the column names in the CSV file, from which the fields can be extracted.

  4. parse_row(response, row) − It receives a response and each row along with a key for each header.

CSVFeedSpider Example

from scrapy.spiders import CSVFeedSpider
from demoproject.items import DemoItem

class DemoSpider(CSVFeedSpider):
   name = "demo"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com/feed.csv"]
   delimiter = ";"
   quotechar = "'"
   headers = ["product_title", "product_link", "product_description"]

   def parse_row(self, response, row):
      self.logger.info("This is row: %r", row)
      item = DemoItem()
      item["product_title"] = row["product_title"]
      item["product_link"] = row["product_link"]
      item["product_description"] = row["product_description"]
      return item

SitemapSpider

SitemapSpider, with the help of Sitemaps, crawls a website by locating the URLs from robots.txt. It has the following class −

class scrapy.spiders.SitemapSpider

The following table shows the fields of SitemapSpider −

  1. sitemap_urls − A list of URLs pointing to the sitemaps that you want to crawl.

  2. sitemap_rules − It is a list of tuples (regex, callback), where regex is a regular expression and callback is used to process the URLs matching that regular expression.

  3. sitemap_follow − It is a list of regexes of sitemaps to follow.

  4. sitemap_alternate_links − It specifies the alternate links to be followed for a single URL.

SitemapSpider Example

The following SitemapSpider processes all the URLs −

from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]

   def parse(self, response):
      # You can scrape items here
      pass

The following SitemapSpider processes some URLs with a callback −

from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]

   sitemap_rules = [
      ("/item/", "parse_item"),
      ("/group/", "parse_group"),
   ]

   def parse_item(self, response):
      # you can scrape item here
      pass

   def parse_group(self, response):
      # you can scrape group here
      pass

The following code follows those sitemaps in robots.txt whose URL contains /sitemap_company −

from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/robots.txt"]
   sitemap_rules = [
      ("/company/", "parse_company"),
   ]
   sitemap_follow = ["/sitemap_company"]

   def parse_company(self, response):
      # you can scrape company here
      pass

You can even combine SitemapSpider with other URLs, as shown in the following code.

import scrapy
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/robots.txt"]
   sitemap_rules = [
      ("/company/", "parse_company"),
   ]

   other_urls = ["http://www.demoexample.com/contact-us"]

   def start_requests(self):
      requests = list(super(DemoSpider, self).start_requests())
      requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
      return requests

   def parse_company(self, response):
      # you can scrape company here...
      pass

   def parse_other(self, response):
      # you can scrape other here...
      pass

Scrapy - Selectors

Description

When you are scraping the web pages, you need to extract a certain part of the HTML source by using the mechanism called selectors, achieved by using either XPath or CSS expressions. Selectors are built upon the lxml library, which processes the XML and HTML in Python language.

Use the following code snippet to define different concepts of selectors −

<html>
   <head>
      <title>My Website</title>
   </head>

   <body>
      <span>Hello world!!!</span>
      <div class = 'links'>
         <a href = 'one.html'>Link 1<img src = 'image1.jpg'/></a>
         <a href = 'two.html'>Link 2<img src = 'image2.jpg'/></a>
         <a href = 'three.html'>Link 3<img src = 'image3.jpg'/></a>
      </div>
   </body>
</html>

Constructing Selectors

You can construct the selector class instances by passing the text or TextResponse object. Based on the provided input type, the selector chooses the following rules −

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

Using the above code, you can construct from the text as −

Selector(text = body).xpath('//span/text()').extract()

It will display the result as −

[u'Hello world!!!']

You can construct from the response as −

response = HtmlResponse(url = 'http://mysite.com', body = body)
Selector(response = response).xpath('//span/text()').extract()

It will display the result as −

[u'Hello world!!!']

Using Selectors

Using the above simple code snippet, you can construct the XPath for selecting the text which is defined in the title tag as shown below −

>>response.selector.xpath('//title/text()')

Now, you can extract the textual data using the .extract() method shown as follows −

>>response.xpath('//title/text()').extract()

It will produce the result as −

[u'My Website']

You can display the name of all elements shown as follows −

>>response.xpath('//div[@class = "links"]/a/text()').extract()

It will display the elements as −

Link 1
Link 2
Link 3

If you want to extract the first element, then use the method .extract_first(), shown as follows −

>>response.xpath('//div[@class = "links"]/a/text()').extract_first()

It will display the element as −

Link 1

Nesting Selectors

Using the above code, you can nest the selectors to display the page link and image source using the .xpath() method, shown as follows −

links = response.xpath('//a[contains(@href, "html")]')

for index, link in enumerate(links):
   args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
   print 'The link %d pointing to url %s and image %s' % args

It will display the result as −

The link 0 pointing to url [u'one.html'] and image [u'image1.jpg']
The link 1 pointing to url [u'two.html'] and image [u'image2.jpg']
The link 2 pointing to url [u'three.html'] and image [u'image3.jpg']

Selectors Using Regular Expressions

Scrapy allows you to extract data using regular expressions with the .re() method. From the above HTML code, we will extract the link names as shown below −

>>response.xpath('//a[contains(@href, "html")]/text()').re(r'(Link\s*\d)')

The above line displays the link names as −

[u'Link 1',
u'Link 2',
u'Link 3']

Using Relative XPaths

When you are working with XPaths that start with /, nested selectors and XPaths are relative to the absolute path of the document, and not the relative path of the selector.

If you want to extract the <p> elements, then first get all the div elements −

>>mydiv = response.xpath('//div')

Next, you can extract all the 'p' elements inside, by prefixing the XPath with a dot as .//p as shown below −

>>for p in mydiv.xpath('.//p').extract():
...    print p

Using EXSLT Extensions

The EXSLT is a community that issues the extensions to the XSLT (Extensible Stylesheet Language Transformations) which converts XML documents to XHTML documents. You can use the EXSLT extensions with the registered namespace in the XPath expressions as shown in the following table −

  1. re (regular expressions) − http://exslt.org/regexp/index.html

  2. set (set manipulation) − http://exslt.org/set/index.html
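
For instance, the re namespace is registered by default in Scrapy selectors, so you can filter nodes with a regular expression directly in an XPath expression; the HTML snippet below is only an illustrative assumption:

>>> from scrapy.selector import Selector
>>> doc = '<div><a href = "item1.html">Item 1</a><a href = "about.html">About</a></div>'
>>> Selector(text = doc).xpath('//a[re:test(@href, "item\d+\.html$")]/@href').extract()
[u'item1.html']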

You can check the simple code format for extracting data using regular expressions in the previous section.

There are some XPath tips, which are useful when using XPath with Scrapy selectors. For more information, click this link.

Scrapy - Items

Description

The Scrapy process can be used to extract data from sources such as web pages using spiders. Scrapy uses the Item class to produce the output, whose objects are used to gather the scraped data.

Declaring Items

You can declare the items using the class definition syntax along with the field objects shown as follows −

import scrapy

class MyProducts(scrapy.Item):
   productName = scrapy.Field()
   productLink = scrapy.Field()
   imageURL = scrapy.Field()
   price = scrapy.Field()
   size = scrapy.Field()

Item Fields

The item fields are used to display the metadata for each field. As there is no limitation of values on the field objects, the accessible metadata keys do not contain any reference list of the metadata. The field objects are used to specify all the field metadata, and you can specify any other field key as per your requirement in the project. The field objects can be accessed using the Item.fields attribute.
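
As a quick sketch of this (the Product class and its fields are illustrative assumptions), the declared metadata can be read back through the fields attribute:

>>> import scrapy
>>> class Product(scrapy.Item):
...    name = scrapy.Field(serializer = str)
...    price = scrapy.Field()
...
>>> Product.fields
{'name': {'serializer': <type 'str'>}, 'price': {}}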

Working with Items

There are some common functions which can be defined when you are working with the items. For more information, click this link.

Extending Items

The items can be extended by declaring a subclass of the original item. For instance −

class MyProductDetails(Product):
   original_rate = scrapy.Field(serializer = str)
   discount_rate = scrapy.Field()

You can use the existing field metadata to extend the field metadata by adding more values or changing the existing values as shown in the following code −

class MyProductPackage(Product):
   name = scrapy.Field(Product.fields['name'], serializer = serializer_demo)

Item Objects

The item objects can be specified using the following class which provides the new initialized item from the given argument −

class scrapy.item.Item([arg])

The Item replicates the standard dict API, including its constructor, and provides one extra attribute, fields, which contains all the declared fields of the item.
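
As a brief sketch (reusing the illustrative Product item assumed above), an Item behaves much like a dictionary:

>>> product = Product(name = "Laptop", price = 1200)
>>> product["name"]
'Laptop'
>>> dict(product)
{'price': 1200, 'name': 'Laptop'}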

Field Objects

The field objects can be specified using the following class, in which the Field class does not provide any additional processing or attributes −

class scrapy.item.Field([arg])

Scrapy - Item Loaders

Description

Item loaders provide a convenient way to fill the items that are scraped from the websites.

Declaring Item Loaders

The declaration of Item Loaders is like Items.

For example −

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class DemoLoader(ItemLoader):
   default_output_processor = TakeFirst()
   title_in = MapCompose(unicode.title)
   title_out = Join()
   size_in = MapCompose(unicode.strip)
   # you can continue scraping here

In the above code, you can see that input processors are declared using the _in suffix and output processors are declared using the _out suffix.

The ItemLoader.default_input_processor and ItemLoader.default_output_processor attributes are used to declare default input/output processors.

Using Item Loaders to Populate Items

To use an Item Loader, first instantiate it with a dict-like object, or without one, in which case the item uses the Item class specified in the ItemLoader.default_item_class attribute.

  1. You can use selectors to collect values into the Item Loader.

  2. You can add more values in the same item field, where Item Loader will use an appropriate handler to add these values.

The following code demonstrates how items are populated using Item Loaders −

from scrapy.loader import ItemLoader
from demoproject.items import Product

def parse(self, response):
   l = ItemLoader(item = Product(), response = response)
   l.add_xpath("title", "//div[@class = 'product_title']")
   l.add_xpath("title", "//div[@class = 'product_name']")
   l.add_xpath("desc", "//div[@class = 'desc']")
   l.add_css("size", "div#size")
   l.add_value("last_updated", "yesterday")
   return l.load_item()

As shown above, there are two different XPaths from which the title field is extracted using the add_xpath() method −

1. //div[@class = "product_title"]
2. //div[@class = "product_name"]

Thereafter, a similar request is used for the desc field. The size data is extracted using the add_css() method, and last_updated is filled with the value "yesterday" using the add_value() method.

Once all the data is collected, call ItemLoader.load_item() method which returns the items filled with data extracted using add_xpath(), add_css() and add_value() methods.

Input and Output Processors

Each field of an Item Loader contains one input processor and one output processor.

  1. When data is extracted, input processor processes it and its result is stored in ItemLoader.

  2. Next, after collecting the data, call ItemLoader.load_item() method to get the populated Item object.

  3. Finally, you can assign the result of the output processor to the item.

The following code demonstrates how to call input and output processors for a specific field −

l = ItemLoader(Product(), some_selector)
l.add_xpath("title", xpath1) # [1]
l.add_xpath("title", xpath2) # [2]
l.add_css("title", css)      # [3]
l.add_value("title", "demo") # [4]
return l.load_item()         # [5]

Line 1 − The data of title is extracted from xpath1 and passed through the input processor and its result is collected and stored in ItemLoader.

Line 2 − Similarly, the title is extracted from xpath2 and passed through the same input processor and its result is added to the data collected for [1].

Line 3 − The title is extracted from css selector and passed through the same input processor and the result is added to the data collected for [1] and [2].

Line 4 − Next, the value "demo" is assigned and passed through the input processors.

Line 5 − Finally, the data is collected internally from all the fields and passed to the output processor and the final value is assigned to the Item.

Declaring Input and Output Processors

The input and output processors are declared in the ItemLoader definition. Apart from this, they can also be specified in the Item Field metadata.

For example −

import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_size(value):
   # keep only values that look like a size, e.g. "100 kg"
   if value.split()[0].isdigit():
      return value

class Item(scrapy.Item):
   title = scrapy.Field(
      input_processor = MapCompose(remove_tags),
      output_processor = Join(),
   )
   size = scrapy.Field(
      input_processor = MapCompose(remove_tags, filter_size),
      output_processor = TakeFirst(),
   )

>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item = Item())
>>> il.add_value('title', [u'Hello', u'<strong>world</strong>'])
>>> il.add_value('size', [u'<span>100 kg</span>'])
>>> il.load_item()

It displays an output as −

{'title': u'Hello world', 'size': u'100 kg'}

Item Loader Context

The Item Loader Context is a dict of arbitrary key values shared among input and output processors.

For example, assume you have a function parse_length −

def parse_length(text, loader_context):
   unit = loader_context.get('unit', 'cm')

   # You can write parsing code of length here
   return parsed_length

By receiving a loader_context argument, the function tells the Item Loader that it can receive the Item Loader context. There are several ways to change the value of the Item Loader context −

  1. Modify the currently active Item Loader context −

loader = ItemLoader(product)
loader.context["unit"] = "mm"

  2. On Item Loader instantiation −

loader = ItemLoader(product, unit = "mm")

  3. On Item Loader declaration, for input/output processors that instantiate with the Item Loader context −

class ProductLoader(ItemLoader):
   length_out = MapCompose(parse_length, unit = "mm")

ItemLoader Objects

It is an object which returns a new item loader to populate the given item. It has the following class −

class scrapy.loader.ItemLoader([item, selector, response, ]**kwargs)

The following table shows the parameters of ItemLoader objects −

  1. item − It is the item to populate by calling add_xpath(), add_css() or add_value().

  2. selector − It is used to extract data from websites.

  3. response − It is used to construct the selector using default_selector_class.

Following table shows the methods of ItemLoader objects −

  1. get_value(value, *processors, **kwargs) − The given value is processed by the get_value() method using the given processors and keyword arguments.

>>> from scrapy.loader.processors import TakeFirst
>>> loader.get_value(u'title: demoweb', TakeFirst(), unicode.upper, re = 'title: (.+)')
'DEMOWEB'

  2. add_value(field_name, value, *processors, **kwargs) − It processes the value and adds it to the field, where it is first passed through get_value() with the given processors and keyword arguments before passing through the field input processor.

loader.add_value('title', u'DVD')
loader.add_value('colors', [u'black', u'white'])
loader.add_value('length', u'80')
loader.add_value('price', u'2500')

  3. replace_value(field_name, value, *processors, **kwargs) − It replaces the collected data with a new value.

loader.replace_value('title', u'DVD')
loader.replace_value('colors', [u'black', u'white'])
loader.replace_value('length', u'80')
loader.replace_value('price', u'2500')

  4. get_xpath(xpath, *processors, **kwargs) − It is used to extract unicode strings by applying the given processors and keyword arguments to the received XPath.

# HTML code: <div class = "item-name">DVD</div>
loader.get_xpath("//div[@class = 'item-name']")

# HTML code: <div id = "length">the length is 45cm</div>
loader.get_xpath("//div[@id = 'length']", TakeFirst(), re = "the length is (.*)")

  5. add_xpath(field_name, xpath, *processors, **kwargs) − It receives an XPath for the field, from which unicode strings are extracted.

# HTML code: <div class = "item-name">DVD</div>
loader.add_xpath('name', '//div[@class = "item-name"]')

# HTML code: <div id = "length">the length is 45cm</div>
loader.add_xpath('length', '//div[@id = "length"]', re = 'the length is (.*)')

  6. replace_xpath(field_name, xpath, *processors, **kwargs) − It replaces the collected data using an XPath from the sites.

# HTML code: <div class = "item-name">DVD</div>
loader.replace_xpath('name', '//div[@class = "item-name"]')

# HTML code: <div id = "length">the length is 45cm</div>
loader.replace_xpath('length', '//div[@id = "length"]', re = 'the length is (.*)')

  7. get_css(css, *processors, **kwargs) − It receives a CSS selector used to extract the unicode strings.

loader.get_css("div.item-name")
loader.get_css("div#length", TakeFirst(), re = "the length is (.*)")

  8. add_css(field_name, css, *processors, **kwargs) − It is similar to the add_value() method with one difference: it adds a CSS selector to the field.

loader.add_css('name', 'div.item-name')
loader.add_css('length', 'div#length', re = 'the length is (.*)')

  9. replace_css(field_name, css, *processors, **kwargs) − It replaces the extracted data using a CSS selector.

loader.replace_css('name', 'div.item-name')
loader.replace_css('length', 'div#length', re = 'the length is (.*)')

  10. load_item() − When the data is collected, this method fills the item with the collected data and returns it.

def parse(self, response):
   l = ItemLoader(item = Product(), response = response)
   l.add_xpath('title', '//div[@class = "product_title"]')
   return l.load_item()

  11. nested_xpath(xpath) − It is used to create nested loaders with an XPath selector.

loader = ItemLoader(item = Item())
header_loader = loader.nested_xpath('//header')
header_loader.add_xpath('social', 'a[@class = "social"]/@href')
header_loader.add_xpath('email', 'a[@class = "email"]/@href')

  12. nested_css(css) − It is used to create nested loaders with a CSS selector.

loader = ItemLoader(item = Item())
header_loader = loader.nested_css('header')
header_loader.add_css('social', 'a.social::attr(href)')
header_loader.add_css('email', 'a.email::attr(href)')

Following table shows the attributes of ItemLoader objects −

  1. item − It is an object on which the Item Loader performs parsing.

  2. context − It is the current context of the Item Loader that is active.

  3. default_item_class − It is used to represent the items, if not given in the constructor.

  4. default_input_processor − The fields which don't specify an input processor are the only ones for which default_input_processor is used.

  5. default_output_processor − The fields which don't specify an output processor are the only ones for which default_output_processor is used.

  6. default_selector_class − It is a class used to construct the selector, if it is not given in the constructor.

  7. selector − It is an object that can be used to extract the data from sites.

Nested Loaders

It is used to create nested loaders while parsing the values from the subsection of a document. If you don’t create nested loaders, you need to specify full XPath or CSS for each value that you want to extract.

For instance, assume that the data is being extracted from the header of a page −

<header>
   <a class = "social" href = "http://facebook.com/whatever">facebook</a>
   <a class = "social" href = "http://twitter.com/whatever">twitter</a>
   <a class = "email" href = "mailto:someone@example.com">send mail</a>
</header>

Next, you can create a nested loader with the header selector by adding the related values to the header −

loader = ItemLoader(item = Item())
header_loader = loader.nested_xpath('//header')
header_loader.add_xpath('social', 'a[@class = "social"]/@href')
header_loader.add_xpath('email', 'a[@class = "email"]/@href')
loader.load_item()

Reusing and extending Item Loaders

Item Loaders are designed to relieve the maintenance burden, which becomes a fundamental problem when your project acquires more spiders.

For instance, assume that a site has its product names enclosed in three dashes (e.g. ---DVD---). You can remove those dashes by reusing the default Product Item Loader, if you don't want them in the final product names, as shown in the following code −

from scrapy.loader.processors import MapCompose
from demoproject.ItemLoaders import DemoLoader

def strip_dashes(x):
   return x.strip('-')

class SiteSpecificLoader(DemoLoader):
   title_in = MapCompose(strip_dashes, DemoLoader.title_in)

Available Built-in Processors

Following are some of the commonly used built-in processors −

class scrapy.loader.processors.Identity

It returns the original value without altering it. For example −

>>> from scrapy.loader.processors import Identity
>>> proc = Identity()
>>> proc(['a', 'b', 'c'])
['a', 'b', 'c']

class scrapy.loader.processors.TakeFirst

It returns the first value that is non-null/non-empty from the list of received values. For example −

>>> from scrapy.loader.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'a', 'b', 'c'])
'a'

class scrapy.loader.processors.Join(separator = u' ')

It returns the values joined with the separator. The default separator is u' ' and it is equivalent to the function u' '.join. For example −

>>> from scrapy.loader.processors import Join
>>> proc = Join()
>>> proc(['a', 'b', 'c'])
u'a b c'
>>> proc = Join('<br>')
>>> proc(['a', 'b', 'c'])
u'a<br>b<br>c'

class scrapy.loader.processors.Compose(*functions, **default_loader_context)

It is defined by a processor where each of its input values is passed to the first function, the result of that function is passed to the second function, and so on, till the last function returns the final value as the output.

For example −

>>> from scrapy.loader.processors import Compose
>>> proc = Compose(lambda v: v[0], str.upper)
>>> proc(['python', 'scrapy'])
'PYTHON'

class scrapy.loader.processors.MapCompose(*functions, **default_loader_context)

It is a processor where the input value is iterated and the first function is applied to each element. Next, the results of these function calls are concatenated to build a new iterable, which is then applied to the second function, and so on, till the last function.

For example −

>>> def filter_scrapy(x):
   return None if x == 'scrapy' else x

>>> from scrapy.loader.processors import MapCompose
>>> proc = MapCompose(filter_scrapy, unicode.upper)
>>> proc([u'hi', u'everyone', u'im', u'pythonscrapy'])
[u'HI', u'EVERYONE', u'IM', u'PYTHONSCRAPY']

class scrapy.loader.processors.SelectJmes(json_path)

This class queries the value using the provided json path and returns the output.

For example −

>>> from scrapy.loader.processors import SelectJmes, Compose, MapCompose
>>> proc = SelectJmes("hello")
>>> proc({'hello': 'scrapy'})
'scrapy'
>>> proc({'hello': {'scrapy': 'world'}})
{'scrapy': 'world'}

Following is the code, which queries the value by importing json −

>>> import json
>>> proc_single_json_str = Compose(json.loads, SelectJmes("hello"))
>>> proc_single_json_str('{"hello": "scrapy"}')
u'scrapy'
>>> proc_json_list = Compose(json.loads, MapCompose(SelectJmes('hello')))
>>> proc_json_list('[{"hello":"scrapy"}, {"world":"env"}]')
[u'scrapy']

Scrapy - Shell

Description

The Scrapy shell can be used to scrape the data with error-free code, without the use of a spider. The main purpose of the Scrapy shell is to test the extracted code, XPath, or CSS expressions. It also helps specify the web pages from which you are scraping the data.

Configuring the Shell

The shell can be configured by installing the IPython (used for interactive computing) console, which is a powerful interactive shell that gives the auto completion, colorized output, etc.

If you are working on the Unix platform, then it’s better to install the IPython. You can also use bpython, if IPython is inaccessible.

You can configure the shell by setting the environment variable called SCRAPY_PYTHON_SHELL or by defining the scrapy.cfg file as follows −

[settings]
shell = bpython

Launching the Shell

Scrapy shell can be launched using the following command −

scrapy shell <url>

The url specifies the URL for which the data needs to be scraped.

Using the Shell

The shell provides some additional shortcuts and Scrapy objects as described in the following table −

Available Shortcuts

Shell provides the following available shortcuts in the project −

  1. shelp() − It provides the available objects and shortcuts with the help option.

  2. fetch(request_or_url) − It collects the response from the request or URL and the associated objects will get updated properly.

  3. view(response) − You can view the response for the given request in the local browser for observation. To display the external links correctly, it appends a base tag to the response body.

Available Scrapy Objects

Shell provides the following available Scrapy objects in the project −

  1. crawler − It specifies the current crawler object.

  2. spider − If there is no spider for the present URL, then it will handle the URL or spider object by defining a new spider.

  3. request − It specifies the request object for the last collected page.

  4. response − It specifies the response object for the last collected page.

  5. settings − It provides the current Scrapy settings.

Example of Shell Session

Let us try scraping the scrapy.org site and then begin to scrape data from reddit.com, as described.

Before moving ahead, first we will launch the shell as shown in the following command −

scrapy shell 'http://scrapy.org' --nolog

Scrapy will display the available objects while using the above URL −

[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
[s]   item       {}
[s]   request    <GET http://scrapy.org >
[s]   response   <200 http://scrapy.org >
[s]   settings   <scrapy.settings.Settings object at 0x2bfd650>
[s]   spider     <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
[s]   shelp()           Provides available objects and shortcuts with help option
[s]   fetch(req_or_url) Collects the response from the request or URL and associated
objects will get update
[s]   view(response)    View the response for the given request

Next, begin working with the objects, as shown below −

>> response.xpath('//title/text()').extract_first()
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
>> fetch("http://reddit.com")
[s] Available Scrapy objects:
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 https://www.reddit.com/>
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>> response.xpath('//title/text()').extract()
[u'reddit: the front page of the internet']
>> request = request.replace(method="POST")
>> fetch(request)
[s] Available Scrapy objects:
[s]   crawler
...

Invoking the Shell from Spiders to Inspect Responses

You can inspect the responses that are being processed in the spider, to check that the response you expect is actually getting there.

For instance −

import scrapy

class SpiderDemo(scrapy.Spider):
   name = "spiderdemo"
   start_urls = [
      "http://mysite.com",
      "http://mysite1.org",
      "http://mysite2.net",
   ]

   def parse(self, response):
      # You can inspect one specific response
      if ".net" in response.url:
         from scrapy.shell import inspect_response
         inspect_response(response, self)

As shown in the above code, you can invoke the shell from spiders to inspect the responses using the following function −

scrapy.shell.inspect_response

Now run the spider, and you will get the following screen −

2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200)  (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200)  (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200)  (referer: None)
[s] Available Scrapy objects:
[s]   crawler
...
>> response.url
'http://mysite2.net'

You can examine whether the extracted code is working using the following code −

>> response.xpath('//div[@class = "val"]')

It displays the output as

[]

The above line has displayed only a blank output. Now you can invoke the shell to inspect the response as follows −

>> view(response)

It displays the response as

True

Scrapy - Item Pipeline

Description

Item Pipeline is a method where the scraped items are processed. After an item has been scraped by a spider, it is sent to the Item Pipeline, where it is processed by several components that are executed sequentially.

Whenever an item is received, the pipeline decides on one of the following actions −

  1. Keep processing the item.

  2. Drop it from pipeline.

  3. Stop processing the item.

Item pipelines are generally used for the following purposes −

  1. Storing scraped items in database.

  2. If the received item is repeated, then it will drop the repeated item.

  3. It will check whether the item is with targeted fields.

  4. Cleansing HTML data.

Syntax

You can write the Item Pipeline using the following method −

process_item(self, item, spider)

The above method contains the following parameters −

  1. Item (item object or dictionary) − It specifies the scraped item.

  2. spider (spider object) − The spider which scraped the item.
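
Putting the syntax together, a minimal do-nothing pipeline might look like the following sketch (the class name is illustrative):

class PassThroughPipeline(object):
   def process_item(self, item, spider):
      # Keep processing the item unchanged; raise DropItem here to discard it instead.
      return item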

You can use additional methods given in the following table −

  1. open_spider(self, spider) − It is called when the spider is opened. Parameter: spider (spider object) − it refers to the spider which was opened.

  2. close_spider(self, spider) − It is called when the spider is closed. Parameter: spider (spider object) − it refers to the spider which was closed.

  3. from_crawler(cls, crawler) − With the help of the crawler, the pipeline can access the core components such as the signals and settings of Scrapy. Parameter: crawler (Crawler object) − it refers to the crawler that uses this pipeline.

Example

Following are the examples of item pipeline used in different concepts.

Dropping Items with No Tag

In the following code, the pipeline adjusts the price attribute for those items that do not include VAT (excludes_vat attribute) and drops those items which do not have a price tag −

from scrapy.exceptions import DropItem

class PricePipeline(object):
   vat = 2.25

   def process_item(self, item, spider):
      if item['price']:
         if item['excludes_vat']:
            item['price'] = item['price'] * self.vat
         return item
      else:
         raise DropItem("Missing price in %s" % item)

Writing Items to a JSON File

The following code will store all the scraped items from all spiders into a single items.jl file, which contains one item per line in a serialized form in JSON format. The JsonWriterPipeline class is used in the code to show how to write item pipeline −

import json

class JsonWriterPipeline(object):
   def __init__(self):
      self.file = open('items.jl', 'wb')

   def process_item(self, item, spider):
      line = json.dumps(dict(item)) + "\n"
      self.file.write(line)
      return item

Writing Items to MongoDB

You can specify the MongoDB address and database name in Scrapy settings and MongoDB collection can be named after the item class. The following code describes how to use from_crawler() method to collect the resources properly −

import pymongo

class MongoPipeline(object):
   collection_name = 'Scrapy_list'

   def __init__(self, mongo_uri, mongo_db):
      self.mongo_uri = mongo_uri
      self.mongo_db = mongo_db

   @classmethod
   def from_crawler(cls, crawler):
      return cls(
         mongo_uri = crawler.settings.get('MONGO_URI'),
         mongo_db = crawler.settings.get('MONGO_DB', 'lists')
      )

   def open_spider(self, spider):
      self.client = pymongo.MongoClient(self.mongo_uri)
      self.db = self.client[self.mongo_db]

   def close_spider(self, spider):
      self.client.close()

   def process_item(self, item, spider):
      self.db[self.collection_name].insert(dict(item))
      return item
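
The pipeline above reads its connection details from the project settings. A matching settings.py sketch might look like this (the connection string, database name, and project path are illustrative assumptions):

MONGO_URI = "mongodb://localhost:27017"
MONGO_DB = "scrapy_lists"

ITEM_PIPELINES = {
   "myproject.pipelines.MongoPipeline": 300,
}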

Duplicating Filters

A filter will check for the repeated items and it will drop the already processed items. In the following code, we have used a unique id for our items, but the spider returns many items with the same id −

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
   def __init__(self):
      self.ids_seen = set()

   def process_item(self, item, spider):
      if item['id'] in self.ids_seen:
         raise DropItem("Repeated items found: %s" % item)
      else:
         self.ids_seen.add(item['id'])
         return item

Activating an Item Pipeline

You can activate an Item Pipeline component by adding its class to the ITEM_PIPELINES setting, as shown in the following code. The integer values assigned to the classes determine the order in which they run (items go through lower-valued classes before higher-valued ones), and the values are usually kept in the 0-1000 range.

ITEM_PIPELINES = {
   'myproject.pipelines.PricePipeline': 100,
   'myproject.pipelines.JsonWriterPipeline': 600,
}

Scrapy - Feed exports

Description

Feed exports is a method of storing the data scraped from the sites, that is, generating an "export file".

Serialization Formats

Using multiple serialization formats and storage backends, feed exports use Item exporters and generate a feed with the scraped items.

The following table shows the supported formats −

  1. JSON − FEED_FORMAT is json; the exporter used is the class scrapy.exporters.JsonItemExporter.

  2. JSON lines − FEED_FORMAT is jsonlines; the exporter used is the class scrapy.exporters.JsonLinesItemExporter.

  3. CSV − FEED_FORMAT is csv; the exporter used is the class scrapy.exporters.CsvItemExporter.

  4. XML − FEED_FORMAT is xml; the exporter used is the class scrapy.exporters.XmlItemExporter.

通过使用 FEED_EXPORTERS 设置,受支持的格式还可以得到扩展 −

Using FEED_EXPORTERS settings, the supported formats can also be extended −

Sr.No

Format & Description

1

Pickle FEED_FORMAT is pickle Exporter used is class scrapy.exporters.PickleItemExporter

2

Marshal FEED_FORMAT is marshal Exporter used is class scrapy.exporters.MarshalItemExporter

Storage Backends

存储后端定义了在何处存储使用 URI 的数据提要。

Storage backend defines where to store the feed using URI.

下表展示了受支持的存储后端 −

Following table shows the supported storage backends −

Sr.No

Storage Backend & Description

1

Local filesystem URI scheme is file and it is used to store the feeds.

2

FTP URI scheme is ftp and it is used to store the feeds.

3

S3 URI scheme is s3 and the feeds are stored on Amazon S3. The external libraries botocore or boto are required.

4

Standard output URI scheme is stdout and the feeds are stored to the standard output.

Storage URI Parameters

以下是存储 URI 的参数,它们在创建数据提要时会被替换(列表后附有一个示例)−

Following are the parameters of the storage URI, which get replaced while the feed is being created (an example follows the list) −

  1. %(time)s: This parameter gets replaced by a timestamp.

  2. %(name)s: This parameter gets replaced by spider name.
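
For example, a feed URI such as the following (a hypothetical path, shown only to illustrate the substitution) −

file:///tmp/exports/%(name)s/%(time)s.json

would be expanded at run time to a per-spider, per-run file such as /tmp/exports/demo/2016-08-09T18-13-07.json for a spider named demo.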

Settings

下表显示了用于配置 Feed 导出的设置:

Following table shows the settings using which Feed exports can be configured −

Sr.No

Setting & Description

1

FEED_URI It is the URI of the export feed used to enable feed exports.

2

FEED_FORMAT It is a serialization format used for the feed.

3

FEED_EXPORT_FIELDS It is used for defining fields which needs to be exported.

4

FEED_STORE_EMPTY It defines whether to export feeds with no items.

5

FEED_STORAGES It is a dictionary with additional feed storage backends.

6

FEED_STORAGES_BASE It is a dictionary with built-in feed storage backends.

7

FEED_EXPORTERS It is a dictionary with additional feed exporters.

8

FEED_EXPORTERS_BASE It is a dictionary with built-in feed exporters.
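
Putting a few of these settings together, a minimal sketch of a feed export configuration in settings.py (the file path and field names are assumptions for illustration) −

FEED_URI = 'file:///tmp/export/%(name)s.csv'
FEED_FORMAT = 'csv'
FEED_EXPORT_FIELDS = ['title', 'link', 'desc']
FEED_STORE_EMPTY = False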

Scrapy - Requests and Responses

Description

Scrapy 可使用 Request 和 Response 对象抓取网站。请求对象在系统中传递:spider 生成并发出请求,当请求下载完成后,系统会把一个响应对象返回给发出请求的 spider。

Scrapy can crawl websites using the Request and Response objects. The request objects pass through the system: spiders generate and yield requests, and the system returns a Response object to the spider once the request has been downloaded.

Request Objects

请求对象是一个生成响应的 HTTP 请求。它具有以下类:

The request object is an HTTP request that generates a response. It has the following class −

class scrapy.http.Request(url[, callback, method = 'GET', headers, body, cookies, meta,
   encoding = 'utf-8', priority = 0, dont_filter = False, errback])

下表显示请求对象的各个参数:

Following table shows the parameters of Request objects −

Sr.No

Parameter & Description

1

url It is a string that specifies the URL request.

2

callback It is a callable function which uses the response of the request as first parameter.

3

method It is a string that specifies the HTTP method request.

4

headers It is a dictionary with request headers.

5

body It is a string or unicode that has a request body.

6

cookies It is a list containing request cookies.

7

meta It is a dictionary that contains values for metadata of the request.

8

encoding It is a string containing utf-8 encoding used to encode URL.

9

priority It is an integer where the scheduler uses priority to define the order to process requests.

10

dont_filter It is a boolean specifying that the scheduler should not filter the request.

11

errback It is a callable function to be called when an exception while processing a request is raised.
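
As a minimal sketch, a request combining several of these parameters might be built as follows (the URL, header value, and meta key are placeholders) −

import scrapy

request = scrapy.Request(
   url = "http://www.something.com/some_page.html",
   method = 'GET',
   headers = {'User-Agent': 'my-crawler'},          # placeholder header value
   meta = {'page_type': 'listing'},                 # arbitrary metadata for callbacks
   priority = 1,
   dont_filter = True
)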

Passing Additional Data to Callback Functions

请求的回调函数会在其响应下载完成后被调用,下载得到的响应会作为第一个参数传入。

The callback function of a request is called when its response has been downloaded, with the downloaded Response passed as the first parameter.

例如 -

For example −

def parse_page1(self, response):
   return scrapy.Request("http://www.something.com/some_page.html",
      callback = self.parse_page2)

def parse_page2(self, response):
   self.logger.info("%s page visited", response.url)

如果您想要将参数传递给可调用函数并在第二个回调中接收这些参数,则可以使用 Request.meta 属性,如下例所示:

You can use Request.meta attribute, if you want to pass arguments to callable functions and receive those arguments in the second callback as shown in the following example −

def parse_page1(self, response):
   item = DemoItem()
   item['foremost_link'] = response.url
   request = scrapy.Request("http://www.something.com/some_page.html",
      callback = self.parse_page2)
   request.meta['item'] = item
   return request

def parse_page2(self, response):
   item = response.meta['item']
   item['other_link'] = response.url
   return item

Using errbacks to Catch Exceptions in Request Processing

当在处理请求时引发异常时,调用 errback 函数。

The errback is a callable function to be called when an exception while processing a request is raised.

以下示例演示了这一点−

The following example demonstrates this −

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class DemoSpider(scrapy.Spider):
   name = "demo"
   start_urls = [
      "http://www.httpbin.org/",              # HTTP 200 expected
      "http://www.httpbin.org/status/404",    # Webpage not found
      "http://www.httpbin.org/status/500",    # Internal server error
      "http://www.httpbin.org:12345/",        # timeout expected
      "http://www.httphttpbinbin.org/",       # DNS error expected
   ]

   def start_requests(self):
      for u in self.start_urls:
         yield scrapy.Request(u, callback = self.parse_httpbin,
         errback = self.errback_httpbin,
         dont_filter=True)

   def parse_httpbin(self, response):
      self.logger.info('Received response from {}'.format(response.url))
      # ...

   def errback_httpbin(self, failure):
      # logs failures
      self.logger.error(repr(failure))

      if failure.check(HttpError):
         response = failure.value.response
         self.logger.error("HttpError occurred on %s", response.url)

      elif failure.check(DNSLookupError):
         request = failure.request
         self.logger.error("DNSLookupError occurred on %s", request.url)

      elif failure.check(TimeoutError, TCPTimedOutError):
         request = failure.request
         self.logger.error("TimeoutError occurred on %s", request.url)

Request.meta Special Keys

request.meta 特殊键是 Scrapy 识别的特殊 meta 键列表。

The request.meta special keys is a list of special meta keys identified by Scrapy.

下表显示 Request.meta 的一些键:

Following table shows some of the keys of Request.meta −

Sr.No

Key & Description

1

dont_redirect It is a key when set to true, does not redirect the request based on the status of the response.

2

dont_retry It is a key when set to true, does not retry the failed requests and will be ignored by the middleware.

3

handle_httpstatus_list It is a key that defines which response codes per-request basis can be allowed.

4

handle_httpstatus_all It is a key used to allow any response code for a request by setting it to true.

5

dont_merge_cookies It is a key used to avoid merging with the existing cookies by setting it to true.

6

cookiejar It is a key used to keep multiple cookie sessions per spider.

7

dont_cache It is a key used to avoid caching HTTP requests and response on each policy.

8

redirect_urls It is a key which contains URLs through which the requests pass.

9

bindaddress It is the IP of the outgoing IP address that can be used to perform the request.

10

dont_obey_robotstxt It is a key when set to true, does not filter the requests prohibited by the robots.txt exclusion standard, even if ROBOTSTXT_OBEY is enabled.

11

download_timeout It is used to set timeout (in secs) per spider for which the downloader will wait before it times out.

12

download_maxsize It is used to set maximum size (in bytes) per spider, which the downloader will download.

13

proxy Proxy can be set for Request objects to set HTTP proxy for the use of requests.
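
For instance, a few of these keys can be set on an individual request through its meta dictionary; a small sketch inside a spider callback (the URL and proxy address are placeholders) −

def parse(self, response):
   # route this single request through a proxy and keep the response even on 3xx
   yield scrapy.Request(
      "http://www.something.com/some_page.html",
      callback = self.parse_page,
      meta = {
         'dont_redirect': True,
         'proxy': 'http://10.0.0.1:3128',
      }
   )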

Request Subclasses

你可以通过对 request 类进行子类化来实现自己的自定义功能。内置请求子类如下 −

You can implement your own custom functionality by subclassing the request class. The built-in request subclasses are as follows −

FormRequest Objects

FormRequest 类通过扩展基本请求来处理 HTML 表单。它具有以下类 −

The FormRequest class deals with HTML forms by extending the base request. It has the following class −

class scrapy.http.FormRequest(url[,formdata, callback, method = 'GET', headers, body,
   cookies, meta, encoding = 'utf-8', priority = 0, dont_filter = False, errback])

以下是参数 −

Following is the parameter −

formdata − 这是一个具有 HTML 表单数据的字典,该字典被分配给请求正文。

formdata − It is a dictionary having HTML form data that is assigned to the body of the request.

Note − 其余参数与 request 类相同,并在 Request Objects 部分中进行了说明。

Note − Remaining parameters are the same as request class and is explained in Request Objects section.

除了请求方法之外, FormRequest 对象还支持以下类方法 −

The following class methods are supported by FormRequest objects in addition to request methods −

classmethod from_response(response[, formname = None, formnumber = 0, formdata = None,
   formxpath = None, formcss = None, clickdata = None, dont_click = False, ...])

下表显示了上面类的参数 -

The following table shows the parameters of the above class −

Sr.No

Parameter & Description

1

response It is an object used to pre-populate the form fields using HTML form of response.

2

formname It is a string where the form having name attribute will be used, if specified.

3

formnumber It is an integer of forms to be used when there are multiple forms in the response.

4

formdata It is a dictionary of fields in the form data used to override.

5

formxpath It is a string when specified, the form matching the xpath is used.

6

formcss It is a string when specified, the form matching the css selector is used.

7

clickdata It is a dictionary of attributes used to observe the clicked control.

8

dont_click The data from the form will be submitted without clicking any element, when set to true.

Examples

以下是部分请求用例 -

Following are some of the request usage examples −

Using FormRequest to send data via HTTP POST

下面的代码演示了在蜘蛛中重复 HTML 表单 POST 时,如何返回 FormRequest 对象 -

The following code demonstrates how to return FormRequest object when you want to duplicate HTML form POST in your spider −

return [FormRequest(url = "http://www.something.com/post/action",
   formdata = {'firstname': 'John', 'lastname': 'dave'},
   callback = self.after_post)]

Using FormRequest.from_response() to simulate a user login

通常,网站通过某些 HTML 元素(例如隐藏的 input 字段)提供预先填充的表单字段。

Normally, websites use HTML elements (such as hidden input fields) through which they provide pre-populated form fields.

当你希望这些字段在抓取时被自动填充时,可以使用 FormRequest.from_response() 方法。

The FormRequest.from_response() method can be used when you want these fields to be automatically populated while scraping.

以下示例演示了这一点。

The following example demonstrates this.

import scrapy
class DemoSpider(scrapy.Spider):
   name = 'demo'
   start_urls = ['http://www.something.com/users/login.php']
   def parse(self, response):
      return scrapy.FormRequest.from_response(
         response,
         formdata = {'username': 'admin', 'password': 'confidential'},
         callback = self.after_login
      )

   def after_login(self, response):
      if "authentication failed" in response.body:
         self.logger.error("Login failed")
         return
      # You can continue scraping here

Response Objects

这是一个表征 HTTP 响应的对象,它被输入到需要处理的爬虫中。它有以下类 -

It is an object indicating HTTP response that is fed to the spiders to process. It has the following class −

class scrapy.http.Response(url[, status = 200, headers, body, flags])

下表显示了响应对象的参数 -

The following table shows the parameters of Response objects −

Sr.No

Parameter & Description

1

url It is a string that specifies the URL response.

2

status It is an integer that contains HTTP status response.

3

headers It is a dictionary containing response headers.

4

body It is a string with response body.

5

flags It is a list containing flags of response.

Response Subclasses

你可以通过对响应类进行子类化来实现你自定义的功能。内置的响应子类如下:

You can implement your own custom functionality by subclassing the response class. The built-in response subclasses are as follows −

TextResponse objects

TextResponse 对象为基础的 Response 类增加了编码能力,而基础的 Response 类只用于二进制数据,例如图像、声音等。它包含以下类:

TextResponse objects add encoding capabilities to the base Response class, which is meant to be used only for binary data such as images, sounds, etc. It has the following class −

class scrapy.http.TextResponse(url[, encoding[,status = 200, headers, body, flags]])

以下是参数 −

Following is the parameter −

encoding − 这是一个用于对响应进行编码的字符串。

encoding − It is a string with encoding that is used to encode a response.

Note − 剩余的参数与响应类相同,在 Response Objects 部分中对其进行了说明。

Note − Remaining parameters are same as response class and is explained in Response Objects section.

下表显示了 TextResponse 对象在响应方法之外支持的属性:

The following table shows the attributes supported by TextResponse object in addition to response methods −

Sr.No

Attribute & Description

1

text It is a response body, where response.text can be accessed multiple times.

2

encoding It is a string containing encoding for response.

3

selector It is an attribute instantiated on first access and uses response as target.

下表显示了 TextResponse 对象在响应方法之外支持的方法:

The following table shows the methods supported by TextResponse objects in addition to response methods −

Sr.No

Method & Description

1

xpath (query) It is a shortcut to TextResponse.selector.xpath(query).

2

css (query) It is a shortcut to TextResponse.selector.css(query).

3

body_as_unicode() It is a response body available as a method, where response.text can be accessed multiple times.
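
For example, both of the following lines select the page title; the second uses the xpath() shortcut described above −

response.selector.xpath('//title/text()').extract()
response.xpath('//title/text()').extract()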

HtmlResponse Objects

它是一个支持编码和自动发现的对象,通过查看 HTML 的 meta http-equiv 属性进行此操作。它的参数与响应类相同,在响应对象部分中对此进行了说明。它包含以下类:

It is an object that supports encoding and auto-discovering by looking at the meta http-equiv attribute of HTML. Its parameters are the same as the response class and are explained in the Response Objects section. It has the following class −

class scrapy.http.HtmlResponse(url[,status = 200, headers, body, flags])

XmlResponse Objects

它是一个通过查看 XML 行支持编码和自动发现的对象。它的参数与响应类相同,在响应对象部分中对此进行了说明。它包含以下类:

It is an object that supports encoding and auto-discovering by looking at the XML line. Its parameters are the same as response class and is explained in Response objects section. It has the following class −

class scrapy.http.XmlResponse(url[, status = 200, headers, body, flags])

Scrapy - Link Extractors

Description

顾名思义,链接提取器是用于借助 scrapy.http.Response 对象从网页中提取链接的对象。在 Scrapy 中,有内建的提取器,例如从 scrapy.linkextractors 导入的 LinkExtractor。你可以通过实现一个简单的接口,根据需要自定义自己的链接提取器。

As the name itself indicates, Link Extractors are the objects that are used to extract links from web pages using scrapy.http.Response objects. In Scrapy, there are built-in extractors such as the LinkExtractor imported from scrapy.linkextractors. You can customize your own link extractor according to your needs by implementing a simple interface.

每个链接提取器都有一个名为 extract_links 的公共方法,它接收一个 Response 对象并返回一个 scrapy.link.Link 对象列表。链接提取器只需实例化一次,就可以多次调用 extract_links 方法,用不同的响应来提取链接。CrawlSpider 类使用链接提取器和一组规则,其主要目的就是提取链接。

Every link extractor has a public method called extract_links, which takes a Response object and returns a list of scrapy.link.Link objects. You instantiate a link extractor only once and call the extract_links method multiple times to extract links from different responses. The CrawlSpider class uses link extractors with a set of rules whose main purpose is to extract links.

通常,链接提取器与 Scrapy 一起分组,并在 scrapy.linkextractors 模块中提供。默认情况下,链接提取器将是 LinkExtractor,其功能与 LxmlLinkExtractor 相同 −

Normally link extractors are grouped with Scrapy and are provided in scrapy.linkextractors module. By default, the link extractor will be LinkExtractor which is equal in functionality with LxmlLinkExtractor −

from scrapy.linkextractors import LinkExtractor

LxmlLinkExtractor

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow = (), deny = (),
   allow_domains = (), deny_domains = (), deny_extensions = None, restrict_xpaths = (),
   restrict_css = (), tags = ('a', 'area'), attrs = ('href', ),
   canonicalize = True, unique = True, process_value = None)

LxmlLinkExtractor 是一个强烈推荐的链接提取器,因为它具有方便的过滤选项,并且与 lxml 的强大 HTMLParser 一起使用。

The LxmlLinkExtractor is a highly recommended link extractor, because it has handy filtering options and it is used with lxml’s robust HTMLParser.

Sr.No

Parameter & Description

1

allow (a regular expression (or list of)) It allows a single expression or group of expressions that should match the url which is to be extracted. If it is not mentioned, it will match all the links.

2

deny (a regular expression (or list of)) It blocks or excludes a single expression or group of expressions that should match the url which is not to be extracted. If it is not mentioned or left empty, then it will not eliminate the undesired links.

3

allow_domains (str or list) It allows a single string or list of strings that should match the domains from which the links are to be extracted.

4

deny_domains (str or list) It blocks or excludes a single string or list of strings that should match the domains from which the links are not to be extracted.

5

deny_extensions (list) It blocks the list of strings with the extensions when extracting the links. If it is not set, then by default it will be set to IGNORED_EXTENSIONS which contains predefined list in scrapy.linkextractors package.

6

restrict_xpaths (str or list) It is an XPath list region from where the links are to be extracted from the response. If given, the links will be extracted only from the text, which is selected by XPath.

7

restrict_css (str or list) It behaves similar to restrict_xpaths parameter which will extract the links from the CSS selected regions inside the response.

8

tags (str or list) A single tag or a list of tags that should be considered when extracting the links. By default, it will be (’a’, ’area’).

9

attrs (list) A single attribute or list of attributes should be considered while extracting links. By default, it will be (’href’,).

10

canonicalize (boolean) The extracted url is brought to standard form using scrapy.utils.url.canonicalize_url. By default, it will be True.

11

unique (boolean) It will be used if the extracted links are repeated.

12

process_value (callable) It is a function which receives a value from scanned tags and attributes. The value received may be altered and returned or else nothing will be returned to reject the link. If not used, by default it will be lambda x: x.

Example

以下代码用来提取链接 −

The following code is used to extract the links −

<a href = "javascript:goToPage('../other/page.html'); return false">Link text</a>

以下代码函数可用于 process_value −

The following code function can be used in process_value −

import re

def process_value(val):
   # pull the real URL out of a javascript:goToPage('...') attribute value
   m = re.search(r"javascript:goToPage\('(.*?)'", val)
   if m:
      return m.group(1)
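
A minimal sketch of how such a function plugs into the extractor itself; this assumes the process_value function above lives in the same module, and the spider name and start URL are placeholders −

import scrapy
from scrapy.linkextractors import LinkExtractor

class FollowLinksSpider(scrapy.Spider):
   name = "follow_links"                            # hypothetical spider name
   start_urls = ["http://www.something.com/"]       # placeholder start URL

   # process_value rewrites javascript:goToPage(...) values before link matching
   link_extractor = LinkExtractor(process_value = process_value)

   def parse(self, response):
      # extract_links() returns scrapy.link.Link objects for matching <a>/<area> tags
      for link in self.link_extractor.extract_links(response):
         yield scrapy.Request(link.url, callback = self.parse)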

Scrapy - Settings

Description

Scrapy 组件的行为可以通过 Scrapy 设置进行修改。如果你有多个 Scrapy 项目,设置还可以选择当前激活的 Scrapy 项目。

The behavior of Scrapy components can be modified using Scrapy settings. The settings can also select the Scrapy project that is currently active, in case you have multiple Scrapy projects.

Designating the Settings

当你抓取网站时,你必须通知 Scrapy 你正在使用哪些设置。为此,应该使用环境变量 SCRAPY_SETTINGS_MODULE ,其值应为 Python 路径语法。

You must notify Scrapy which settings you are using when you scrape a website. For this, the environment variable SCRAPY_SETTINGS_MODULE should be used and its value should be in Python path syntax.
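
For instance, on a Unix-like shell the variable can point at your own project's settings module before running Scrapy (the module path is a placeholder) −

export SCRAPY_SETTINGS_MODULE=myproject.settings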

Populating the Settings

下表显示了一些可以填充设置的机制−

The following table shows some of the mechanisms by which you can populate the settings −

Sr.No

Mechanism & Description

1

Command line options Here, the arguments that are passed take the highest precedence by overriding other options. The -s flag is used to override one or more settings, e.g. scrapy crawl myspider -s LOG_FILE=scrapy.log

2

Settings per-spider Spiders can have their own settings that overrides the project ones by using attribute custom_settings. class DemoSpider(scrapy.Spider): name = 'demo' custom_settings = { 'SOME_SETTING': 'some value', }

3

Project settings module Here, you can populate your custom settings such as adding or modifying the settings in the settings.py file.

4

Default settings per-command Each Scrapy tool command defines its own settings in the default_settings attribute, to override the global default settings.

5

Default global settings These settings are found in the scrapy.settings.default_settings module.

Access Settings

它们可通过 self.settings 获得,并在初始化后设置在基础 Spider 中。

They are available through self.settings and set in the base spider after it is initialized.

以下示例演示了这一点。

The following example demonstrates this.

class DemoSpider(scrapy.Spider):
   name = 'demo'
   start_urls = ['http://example.com']
   def parse(self, response):
      print("Existing settings: %s" % self.settings.attributes.keys())

要在初始化 Spider 之前使用设置,你必须在 Spider 中覆盖 from_crawler() 类方法。你可以通过传递给 from_crawler 方法的 scrapy.crawler.Crawler.settings 属性访问设置。

To use settings before initializing the spider, you must override the from_crawler class method in your spider. You can access the settings through the scrapy.crawler.Crawler.settings attribute passed to the from_crawler method.

以下示例演示了这一点。

The following example demonstrates this.

class MyExtension(object):
   def __init__(self, log_is_enabled = False):
      if log_is_enabled:
         print("Enabled log")

   @classmethod
   def from_crawler(cls, crawler):
      settings = crawler.settings
      return cls(settings.getbool('LOG_ENABLED'))

Rationale for Setting Names

设置名称被添加为它们配置的组件的前缀。例如,对于 robots.txt 扩展,设置名称可以是 ROBOTSTXT_ENABLED、 ROBOTSTXT_OBEY、ROBOTSTXT_CACHEDIR 等。

Setting names are added as a prefix to the component they configure. For example, for robots.txt extension, the setting names can be ROBOTSTXT_ENABLED, ROBOTSTXT_OBEY, ROBOTSTXT_CACHEDIR, etc.

Built-in Settings Reference

下表显示了 Scrapy 的内置设置−

The following table shows the built-in settings of Scrapy −

Sr.No

Setting & Description

1

AWS_ACCESS_KEY_ID It is used to access Amazon Web Services. Default value: None

2

AWS_SECRET_ACCESS_KEY It is used to access Amazon Web Services. Default value: None

3

BOT_NAME It is the name of bot that can be used for constructing User-Agent. Default value: 'scrapybot'

4

CONCURRENT_ITEMS Maximum number of existing items in the Item Processor used to process parallely. Default value: 100

5

CONCURRENT_REQUESTS Maximum number of existing requests which Scrapy downloader performs. Default value: 16

6

CONCURRENT_REQUESTS_PER_DOMAIN Maximum number of existing requests that perform simultaneously for any single domain. Default value: 8

7

CONCURRENT_REQUESTS_PER_IP Maximum number of existing requests that performs simultaneously to any single IP. Default value: 0

8

DEFAULT_ITEM_CLASS It is a class used to represent items. Default value: 'scrapy.item.Item'

9

DEFAULT_REQUEST_HEADERS It is a default header used for HTTP requests of Scrapy. Default value − { 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en', }

10

DEPTH_LIMIT The maximum depth for a spider to crawl any site. Default value: 0

11

DEPTH_PRIORITY It is an integer used to alter the priority of request according to the depth. Default value: 0

12

DEPTH_STATS It states whether to collect depth stats or not. Default value: True

13

DEPTH_STATS_VERBOSE This setting when enabled, the number of requests is collected in stats for each verbose depth. Default value: False

14

DNSCACHE_ENABLED It is used to enable DNS in memory cache. Default value: True

15

DNSCACHE_SIZE It defines the size of DNS in memory cache. Default value: 10000

16

DNS_TIMEOUT It is used to set timeout for DNS to process the queries. Default value: 60

17

DOWNLOADER It is a downloader used for the crawling process. Default value: 'scrapy.core.downloader.Downloader'

18

DOWNLOADER_MIDDLEWARES It is a dictionary holding downloader middleware and their orders. Default value: {}

19

DOWNLOADER_MIDDLEWARES_BASE It is a dictionary holding downloader middleware that is enabled by default. Default value − { 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100, }

20

DOWNLOADER_STATS This setting is used to enable the downloader stats. Default value: True

21

DOWNLOAD_DELAY It defines the total time for downloader before it downloads the pages from the site. Default value: 0

22

DOWNLOAD_HANDLERS It is a dictionary with download handlers. Default value: {}

23

DOWNLOAD_HANDLERS_BASE It is a dictionary with download handlers that is enabled by default. Default value − { 'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler', }

24

DOWNLOAD_TIMEOUT It is the total time for downloader to wait before it times out. Default value: 180

25

DOWNLOAD_MAXSIZE It is the maximum size of response for the downloader to download. Default value: 1073741824 (1024MB)

26

DOWNLOAD_WARNSIZE It defines the size of response for downloader to warn. Default value: 33554432 (32MB)

27

DUPEFILTER_CLASS It is a class used for detecting and filtering of requests that are duplicate. Default value: 'scrapy.dupefilters.RFPDupeFilter'

28

DUPEFILTER_DEBUG This setting logs all duplicate filters when set to true. Default value: False

29

EDITOR It is used to edit spiders using the edit command. Default value: Depends on the environment

30

EXTENSIONS It is a dictionary having extensions that are enabled in the project. Default value: {}

31

EXTENSIONS_BASE It is a dictionary having built-in extensions. Default value: { 'scrapy.extensions.corestats.CoreStats': 0, }

32

FEED_TEMPDIR It is a directory used to set the custom folder where crawler temporary files can be stored.

33

ITEM_PIPELINES It is a dictionary having pipelines. Default value: {}

34

LOG_ENABLED It defines if the logging is to be enabled. Default value: True

35

LOG_ENCODING It defines the type of encoding to be used for logging. Default value: 'utf-8'

36

LOG_FILE It is the name of the file to be used for the output of logging. Default value: None

37

LOG_FORMAT It is a string using which the log messages can be formatted. Default value: '%(asctime)s [%(name)s] %(levelname)s: %(message)s'

38

LOG_DATEFORMAT It is a string using which date/time can be formatted. Default value: '%Y-%m-%d %H:%M:%S'

39

LOG_LEVEL It defines minimum log level. Default value: 'DEBUG'

40

LOG_STDOUT This setting if set to true, all your process output will appear in the log. Default value: False

41

MEMDEBUG_ENABLED It defines if the memory debugging is to be enabled. Default Value: False

42

MEMDEBUG_NOTIFY It defines the memory report that is sent to a particular address when memory debugging is enabled. Default value: []

43

MEMUSAGE_ENABLED It defines if the memory usage is to be enabled when a Scrapy process exceeds a memory limit. Default value: False

44

MEMUSAGE_LIMIT_MB It defines the maximum limit for the memory (in megabytes) to be allowed. Default value: 0

45

MEMUSAGE_CHECK_INTERVAL_SECONDS It is used to check the present memory usage by setting the length of the intervals. Default value: 60.0

46

MEMUSAGE_NOTIFY_MAIL It is used to notify with a list of emails when the memory reaches the limit. Default value: False

47

MEMUSAGE_REPORT It defines if the memory usage report is to be sent on closing each spider. Default value: False

48

MEMUSAGE_WARNING_MB It defines a total memory to be allowed before a warning is sent. Default value: 0

49

NEWSPIDER_MODULE It is a module where a new spider is created using genspider command. Default value: ''

50

RANDOMIZE_DOWNLOAD_DELAY It defines a random amount of time for a Scrapy to wait while downloading the requests from the site. Default value: True

51

REACTOR_THREADPOOL_MAXSIZE It defines a maximum size for the reactor threadpool. Default value: 10

52

REDIRECT_MAX_TIMES It defines how many times a request can be redirected. Default value: 20

53

REDIRECT_PRIORITY_ADJUST This setting when set, adjusts the redirect priority of a request. Default value: +2

54

RETRY_PRIORITY_ADJUST This setting when set, adjusts the retry priority of a request. Default value: -1

55

ROBOTSTXT_OBEY Scrapy obeys robots.txt policies when set to true. Default value: False

56

SCHEDULER It defines the scheduler to be used for crawl purpose. Default value: 'scrapy.core.scheduler.Scheduler'

57

SPIDER_CONTRACTS It is a dictionary in the project having spider contracts to test the spiders. Default value: {}

58

SPIDER_CONTRACTS_BASE It is a dictionary holding Scrapy contracts which is enabled in Scrapy by default. Default value − { 'scrapy.contracts.default.UrlContract' : 1, 'scrapy.contracts.default.ReturnsContract': 2, }

59

SPIDER_LOADER_CLASS It defines a class which implements SpiderLoader API to load spiders. Default value: 'scrapy.spiderloader.SpiderLoader'

60

SPIDER_MIDDLEWARES It is a dictionary holding spider middlewares. Default value: {}

61

SPIDER_MIDDLEWARES_BASE It is a dictionary holding spider middlewares that is enabled in Scrapy by default. Default value − { 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50, }

62

SPIDER_MODULES It is a list of modules containing spiders which Scrapy will look for. Default value: []

63

STATS_CLASS It is a class which implements Stats Collector API to collect stats. Default value: 'scrapy.statscollectors.MemoryStatsCollector'

64

STATS_DUMP This setting when set to true, dumps the stats to the log. Default value: True

65

STATSMAILER_RCPTS Once the spiders finish scraping, Scrapy uses this setting to send the stats. Default value: []

66

TELNETCONSOLE_ENABLED It defines whether to enable the telnetconsole. Default value: True

67

TELNETCONSOLE_PORT It defines a port for telnet console. Default value: [6023, 6073]

68

TEMPLATES_DIR It is a directory containing templates that can be used while creating new projects. Default value: templates directory inside scrapy module

69

URLLENGTH_LIMIT It defines the maximum limit of the length for URL to be allowed for crawled URLs. Default value: 2083

70

USER_AGENT It defines the user agent to be used while crawling a site. Default value: "Scrapy/VERSION (+http://scrapy.org)"

有关其他 Scrapy 设置,请访问此 link

For other Scrapy settings, go to this link.

Scrapy - Exceptions

Description

不规则的事件称为异常。在 Scrapy 中,异常是由配置缺失、从项目管道删除项目等原因引发的。以下是 Scrapy 中提到的异常及其应用程序列表。

The irregular events are referred to as exceptions. In Scrapy, exceptions are raised due to reasons such as missing configuration, dropping item from the item pipeline, etc. Following is the list of exceptions mentioned in Scrapy and their application.

DropItem

项目管道利用此异常在任何阶段停止处理项目。它可以写成−

Item Pipeline utilizes this exception to stop processing of the item at any stage. It can be written as −

exception (scrapy.exceptions.DropItem)

CloseSpider

此异常可以在蜘蛛的回调中引发,用于请求停止蜘蛛。它可以写成−

This exception can be raised from a spider callback to request that the spider be stopped. It can be written as −

exception (scrapy.exceptions.CloseSpider)(reason = 'cancelled')

它包含一个名为原因 (str) 的参数,该参数指定关闭的原因。

It contains parameter called reason (str) which specifies the reason for closing.

例如,以下代码显示了此异常用法−

For instance, the following code shows this exception usage −

def parse_page(self, response):
   if 'Bandwidth exceeded' in response.body:
      raise CloseSpider('bandwidth_exceeded')

IgnoreRequest

此异常由调度程序或下载器中间件用于忽略请求。它可以写成−

This exception is used by scheduler or downloader middleware to ignore a request. It can be written as −

exception (scrapy.exceptions.IgnoreRequest)
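
A minimal sketch of a downloader middleware that uses this exception to drop requests for a particular host (the host name and middleware name are placeholders) −

from scrapy.exceptions import IgnoreRequest

class BlockedHostMiddleware(object):
   def process_request(self, request, spider):
      # returning None lets the request continue; raising IgnoreRequest drops it
      if "blocked.example.com" in request.url:
         raise IgnoreRequest("blocked host: %s" % request.url)
      return None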

NotConfigured

它表示缺少配置的情况,并且应该在组件构造函数中引发。

It indicates a missing configuration situation and should be raised in a component constructor.

exception (scrapy.exceptions.NotConfigured)

如果以下任何组件停用,都可能会引发此异常。

This exception can be raised if any of the following components are disabled (a short sketch follows the list).

  1. Extensions

  2. Item pipelines

  3. Downloader middlewares

  4. Spider middlewares
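
As referenced above, a minimal sketch of an item pipeline that disables itself when a required setting is missing (the API_KEY setting name and pipeline name are placeholders) −

from scrapy.exceptions import NotConfigured

class ApiPipeline(object):
   def __init__(self, api_key):
      self.api_key = api_key

   @classmethod
   def from_crawler(cls, crawler):
      api_key = crawler.settings.get('API_KEY')
      if not api_key:
         # raising NotConfigured makes Scrapy skip this component
         raise NotConfigured("API_KEY setting is missing")
      return cls(api_key)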

NotSupported

如果某个特性或方法不受支持,将会引发此异常。可将其写为:

This exception is raised when any feature or method is not supported. It can be written as −

exception (scrapy.exceptions.NotSupported)

Scrapy - Create a Project

Description

要从网页中抓取数据,首先需要创建用于存放代码的 Scrapy 项目。要创建一个新项目,运行以下命令:

To scrape data from web pages, first you need to create the Scrapy project where you will store the code. To create a new project, run the following command −

scrapy startproject first_scrapy

上面的代码将创建一个名为 first_scrapy 的目录,它将包含以下结构:

The above code will create a directory with name first_scrapy and it will contain the following structure −

first_scrapy/
   scrapy.cfg            # deploy configuration file
   first_scrapy/         # project's Python module, you'll import your code from here
      __init__.py
      items.py           # project items file
      pipelines.py       # project pipelines file
      settings.py        # project settings file
      spiders/           # a directory where you'll later put your spiders
         __init__.py

Scrapy - Define an Item

Description

项目是用于收集从网站获取的数据的容器。你必须通过定义你的项目来启动你的爬虫。要定义项目,编辑在目录 first_scrapy 下找到的 items.py 文件(自定义目录)。items.py 看起来如下:

Items are the containers used to collect the data that is scraped from the websites. You must start your spider by defining your Item. To define items, edit the items.py file found under the first_scrapy directory (your project directory). The items.py file looks like the following −

import scrapy

class First_scrapyItem(scrapy.Item):
   # define the fields for your item here like:
   # name = scrapy.Field()
   pass

First_scrapyItem 类继承自 Item,其中包含许多 Scrapy 已经为我们构建好的预定义对象。例如,如果你想从站点中提取名称、URL 和描述,则需要为这三个属性分别定义字段。

The First_scrapyItem class inherits from Item, which contains a number of pre-defined objects that Scrapy has already built for us. For instance, if you want to extract the name, URL, and description from the sites, you need to define a field for each of these three attributes.

因此,让我们添加我们要收集的那些项目:

Hence, let’s add those items that we want to collect −

import scrapy

class First_scrapyItem(scrapy.Item):
   name = scrapy.Field()
   url = scrapy.Field()
   desc = scrapy.Field()

Scrapy - First Spider

Description

Spider 是一个类,它定义了从哪里提取数据的初始 URL,以及如何遵循分页链接以及如何提取和解析 items.py 中定义的字段。Scrapy 提供不同类型的蜘蛛,每种蜘蛛都给出了一个特定目的。

Spider is a class that defines the initial URLs to extract the data from, how to follow pagination links, and how to extract and parse the fields defined in items.py. Scrapy provides different types of spiders, each of which serves a specific purpose.

在 first_scrapy/spiders 目录下创建一个名为 "first_spider.py" 的文件,在那里我们可以告诉 Scrapy 如何找到我们正在寻找的确切数据。为此,你必须定义一些属性:

Create a file called "first_spider.py" under the first_scrapy/spiders directory, where we can tell Scrapy how to find the exact data we’re looking for. For this, you must define some attributes −

  1. name − It defines the unique name for the spider.

  2. allowed_domains − It contains the base URLs for the spider to crawl.

  3. start_urls − A list of URLs from where the spider starts crawling.

  4. parse() − It is a method that extracts and parses the scraped data.

以下代码演示了一个爬虫代码的样例 −

The following code demonstrates how a spider code looks like −

import scrapy

class firstSpider(scrapy.Spider):
   name = "first"
   allowed_domains = ["dmoz.org"]

   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
   ]
   def parse(self, response):
      filename = response.url.split("/")[-2] + '.html'
      with open(filename, 'wb') as f:
         f.write(response.body)

Scrapy - Crawling

Description

要执行你的爬虫程序,在 first_scrapy 目录中运行以下命令 −

To execute your spider, run the following command within your first_scrapy directory −

scrapy crawl first

其中, first 是创建爬虫程序时指定的爬虫程序名称。

Where, first is the name of the spider specified while creating the spider.

爬虫程序爬取后,可以看到以下输出 −

Once the spider crawls, you can see the following output −

2016-08-09 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2016-08-09 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Spider opened
2016-08-09 18:13:08-0400 [scrapy] DEBUG: Crawled (200)
<GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-08-09 18:13:09-0400 [scrapy] DEBUG: Crawled (200)
<GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-08-09 18:13:09-0400 [scrapy] INFO: Closing spider (finished)

正如你可以在输出中看到的,对于每个 URL 都有一个日志行,该日志行(引用:无)指出 URL 是开始 URL,并且它们没有引用。接下来,你应该看到在 first_scrapy 目录中创建了两个新文件,命名为 Books.html 和 Resources.html。

As you can see in the output, there is a log line for each URL; the (referer: None) part states that the URLs are start URLs and have no referrers. Next, you should see two new files named Books.html and Resources.html created in your first_scrapy directory.

Scrapy - Extracting Items

Description

为了从网页中提取数据,Scrapy 使用了一种基于 XPathCSS 表达式的选择器技术。以下是 XPath 表达式的一些示例 −

For extracting data from web pages, Scrapy uses a technique called selectors based on XPath and CSS expressions. Following are some examples of XPath expressions −

  1. /html/head/title − This will select the <title> element, inside the <head> element of an HTML document.

  2. /html/head/title/text() − This will select the text within the same <title> element.

  3. //td − This will select all the elements from <td>.

  4. //div[@class = "slice"] − This will select all elements from div which contain an attribute class = "slice"

选择器有四个基本的方法,如下表所示 −

Selectors have four basic methods as shown in the following table −

Sr.No

Method & Description

1

extract() It returns a unicode string along with the selected data.

2

re() It returns a list of unicode strings, extracted when the regular expression was given as argument.

3

xpath() It returns a list of selectors, which represents the nodes selected by the xpath expression given as an argument.

4

css() It returns a list of selectors, which represents the nodes selected by the CSS expression given as an argument.

Using Selectors in the Shell

要使用内置的 Scrapy 外壳演示选择器,你需要在你的系统中安装 IPython 。这里的重要事项是,在运行 Scrapy 时,URL 应该包含在引号内;否则带有“&”字符的 URL 将不起作用。你可以使用以下命令在项目的顶级目录中启动外壳 −

To demonstrate the selectors with the built-in Scrapy shell, you need to have IPython installed in your system. The important thing here is, the URLs should be included within the quotes while running Scrapy; otherwise the URLs with '&' characters won’t work. You can start a shell by using the following command in the project’s top level directory −

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

一个外壳将如下所示 −

A shell will look like the following −

[ ... Scrapy log here ... ]

2014-01-23 17:11:42-0400 [scrapy] DEBUG: Crawled (200)
<GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>(referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   settings   <scrapy.settings.Settings object at 0x3fadc50>
[s]   spider     <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]:

当外壳加载后,你可以分别使用 response.body 和 response.headers 访问响应正文或响应头。类似地,你可以使用 response.selector.xpath() 或 response.selector.css() 对响应运行查询。

When the shell loads, you can access the body or headers by using response.body and response.headers respectively. Similarly, you can run queries on the response using response.selector.xpath() or response.selector.css().

例如 −

For instance −

In [1]: response.xpath('//title')
Out[1]: [<Selector xpath = '//title' data = u'<title>My Book - Scrapy'>]

In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>My Book - Scrapy: Index: Chapters</title>']

In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath = '//title/text()' data = u'My Book - Scrapy: Index:'>]

In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'My Book - Scrapy: Index: Chapters']

In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Scrapy', u'Index', u'Chapters']

Extracting the Data

要从一个普通 HTML 站点中提取数据,我们必须检查该站点的源代码来获取 XPath。检查后,你会看到数据将位于 ul 标签中。选择 li 标签内的元素。

To extract data from a normal HTML site, we have to inspect the source code of the site to get XPaths. After inspecting, you can see that the data will be in the ul tag. Select the elements within li tag.

以下代码行显示了不同类型数据的提取 −

The following lines of code shows extraction of different types of data −

对于 li 标签中选择数据 −

For selecting data within li tag −

response.xpath('//ul/li')

对于选择描述 −

For selecting descriptions −

response.xpath('//ul/li/text()').extract()

对于选择网站标题 −

For selecting site titles −

response.xpath('//ul/li/a/text()').extract()

对于选择网站链接 −

For selecting site links −

response.xpath('//ul/li/a/@href').extract()

以下代码证明了上述提取器的使用 −

The following code demonstrates the use of above extractors −

import scrapy

class MyprojectSpider(scrapy.Spider):
   name = "project"
   allowed_domains = ["dmoz.org"]

   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
   ]
   def parse(self, response):
      for sel in response.xpath('//ul/li'):
         title = sel.xpath('a/text()').extract()
         link = sel.xpath('a/@href').extract()
         desc = sel.xpath('text()').extract()
         print(title, link, desc)

Scrapy - Using an Item

Description

Item 对象是 Python 的正规字典。我们可以使用以下语法来访问类的属性 −

Item objects are the regular dicts of Python. We can use the following syntax to access the attributes of the class −

>>> item = DmozItem()
>>> item['title'] = 'sample title'
>>> item['title']
'sample title'

将以上代码添加到以下示例 −

Add the above code to the following example −

import scrapy

from tutorial.items import DmozItem

class MyprojectSpider(scrapy.Spider):
   name = "project"
   allowed_domains = ["dmoz.org"]

   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
   ]
   def parse(self, response):
      for sel in response.xpath('//ul/li'):
         item = DmozItem()
         item['title'] = sel.xpath('a/text()').extract()
         item['link'] = sel.xpath('a/@href').extract()
         item['desc'] = sel.xpath('text()').extract()
         yield item

以上爬虫的输出将是 −

The output of the above spider will be −

[scrapy] DEBUG: Scraped from <200
http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
   {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text,
      ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.\n],
   'link': [u'http://gnosis.cx/TPiP/'],
   'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200
http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
   {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192,
      has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and
      SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
   'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
   'title': [u'XML Processing with Python']}

Scrapy - Following Links

Description

在本章中,我们将学习如何提取我们感兴趣的页面的链接,跟踪它们并从该页面提取数据。为此,我们需要对我们的 previous code 作出如下更改:

In this chapter, we’ll study how to extract the links of the pages of our interest, follow them and extract data from that page. For this, we need to make the following changes in our previous code shown as follows −

import scrapy
from tutorial.items import DmozItem

class MyprojectSpider(scrapy.Spider):
   name = "project"
   allowed_domains = ["dmoz.org"]

   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/",
   ]
   def parse(self, response):
      for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
         url = response.urljoin(href.extract())
         yield scrapy.Request(url, callback = self.parse_dir_contents)

   def parse_dir_contents(self, response):
      for sel in response.xpath('//ul/li'):
         item = DmozItem()
         item['title'] = sel.xpath('a/text()').extract()
         item['link'] = sel.xpath('a/@href').extract()
         item['desc'] = sel.xpath('text()').extract()
         yield item

上面的代码包含以下方法-

The above code contains the following methods −

  1. parse() − It will extract the links of our interest.

  2. response.urljoin − The parse() method will use this method to build a new url and provide a new request, which will be sent later to callback.

  3. parse_dir_contents() − This is a callback which will actually scrape the data of interest.

在此处,Scrapy 使用回调机制来跟踪链接。利用该机制,可以设计更大的爬虫,跟踪感兴趣的链接,从不同的页面中抓取所需的数据。常见的做法是编写一个回调方法,它提取数据项,查找指向下一页的链接,然后用同一个回调发出新的请求。

Here, Scrapy uses a callback mechanism to follow links. Using this mechanism, bigger crawlers can be designed that follow links of interest and scrape the desired data from different pages. The usual pattern is a callback method that extracts the items, looks for a link to the next page, and then yields a request with the same callback.

以下示例生成一个循环,该循环将跟踪链接到下一页。

The following example produces a loop, which will follow the links to the next page.

def parse_articles_follow_next_page(self, response):
   for article in response.xpath("//article"):
      item = ArticleItem()

      ... extract article data here

      yield item

   next_page = response.css("ul.navigation > li.next-page > a::attr('href')")
   if next_page:
      url = response.urljoin(next_page[0].extract())
      yield scrapy.Request(url, self.parse_articles_follow_next_page)

Scrapy - Scraped Data

Description

存储抓取数据的最佳方法是使用 Feed 导出,这确保了使用多种序列化格式正确地存储数据。JSON、JSON 行、CSV、XML 是序列化格式中现有的格式。可以通过以下命令存储数据 −

The best way to store scraped data is by using Feed exports, which make sure that data is being stored properly using multiple serialization formats. JSON, JSON lines, CSV, and XML are the readily supported serialization formats. The data can be stored with the following command −

scrapy crawl dmoz -o data.json

此命令将创建一个 data.json 文件,其中包含 JSON 格式的抓取数据。此方法适用于少量数据。如果必须处理大量数据,那么我们可以使用 Item Pipeline。就像 data.json 文件一样,项目创建时会在 tutorial/pipelines.py 中生成一个预留的管道文件。

This command will create a data.json file containing the scraped data in JSON. This technique holds good for a small amount of data. If a large amount of data has to be handled, then we can use an Item Pipeline. Just like the data.json file, a placeholder pipelines file is set up at tutorial/pipelines.py when the project is created.
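
The same command works for the other readily supported formats, with the feed format normally inferred from the file extension (the file names are placeholders) −

scrapy crawl dmoz -o data.csv
scrapy crawl dmoz -o data.xml
scrapy crawl dmoz -o data.jl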

Scrapy - Logging

Description

Logging 表示使用内置日志系统进行的事件跟踪,并定义用于实现应用程序和库的功能和类。日志记录是一种可随时使用的材料,它可以与记录设置中列出的 Scrapy 设置配合使用。

Logging means the tracking of events, which uses a built-in logging system and defines functions and classes to implement applications and libraries. Logging is ready to use out of the box and can be configured with the Scrapy settings listed in Logging settings.

在运行命令时,Scrapy 会设置一些默认设置,并通过 scrapy.utils.log.configure_logging() 处理这些设置。

Scrapy will set some default settings and handle those settings with the help of scrapy.utils.log.configure_logging() when running commands.

Log levels

在 Python 中,一个日志消息具有五种不同的严重程度级别。以下列表以升序列出了标准日志消息 -

In Python, there are five different levels of severity on a log message. The following list shows the standard log messages in an ascending order −

  1. logging.DEBUG − for debugging messages (lowest severity)

  2. logging.INFO − for informational messages

  3. logging.WARNING − for warning messages

  4. logging.ERROR − for regular errors

  5. logging.CRITICAL − for critical errors (highest severity)

How to Log Messages

下面的代码显示使用 logging.info 级别记录消息。

The following code shows logging a message using logging.info level.

import logging
logging.info("This is an information")

上述记录消息可以用 logging.log 作为参数,显示如下:

The above logging message can be passed as an argument using logging.log shown as follows −

import logging
logging.log(logging.INFO, "This is an information")

现在,你还可以使用记录器(logger)来发送消息,从而让日志消息的来源更加清晰,如下所示:

Now, you can also use a logger to emit the message, which keeps log messages clearly attributed, as shown in the following code −

import logging
logger = logging.getLogger()
logger.info("This is an information")

可以有多个记录器,可以通过使用 logging.getLogger 函数获取其名称来访问它们,显示如下。

There can be multiple loggers and those can be accessed by getting their names with the use of logging.getLogger function shown as follows.

import logging
logger = logging.getLogger('mycustomlogger')
logger.info("This is an information")

对于任何模块,可以使用 __name__ 变量来获取自定义记录器,它包含了模块路径,如下所示:

A customized logger can be used for any module using the __name__ variable, which contains the module path, shown as follows −

import logging
logger = logging.getLogger(__name__)
logger.info("This is an information")

Logging from Spiders

每个爬取器实例都拥有一个 logger ,并可以使用,如下所示:

Every spider instance has a logger within it, which can be used as follows −

import scrapy

class LogSpider(scrapy.Spider):
   name = 'logspider'
   start_urls = ['http://dmoz.com']
   def parse(self, response):
      self.logger.info('Parse function called on %s', response.url)

在上面的代码中,记录器是使用爬取器的名称创建的,但是你可以使用 Python 提供的任何自定义记录器,如下所示:

In the above code, the logger is created using the Spider’s name, but you can use any customized logger provided by Python as shown in the following code −

import logging
import scrapy

logger = logging.getLogger('customizedlogger')
class LogSpider(scrapy.Spider):
   name = 'logspider'
   start_urls = ['http://dmoz.com']

   def parse(self, response):
      logger.info('Parse function called on %s', response.url)

Logging Configuration

记录器无法自行显示它们发送的消息。因此,它们需要“处理器”来显示这些消息,而处理器会将这些消息重定向到各自的目的地,如文件、电子邮件和标准输出。

Loggers are not able to display messages sent by them on their own. So they require "handlers" for displaying those messages and handlers will be redirecting these messages to their respective destinations such as files, emails, and standard output.

根据下列设置,Scrapy 会为记录器配置处理器。

Depending on the following settings, Scrapy will configure the handler for logger.

Logging Settings

下列设置用于配置日志 −

The following settings are used to configure the logging −

  1. The LOG_ENABLED setting decides whether logging is enabled, and LOG_FILE decides the destination for log messages.

  2. The LOG_ENCODING defines the encoding to be used for logging.

  3. The LOG_LEVEL will determine the severity order of the message; those messages with less severity will be filtered out.

  4. The LOG_FORMAT and LOG_DATEFORMAT are used to specify the layouts for all messages.

  5. When you set the LOG_STDOUT to true, all the standard output and error messages of your process will be redirected to log.

Command-line Options

可以通过传递命令行参数来覆盖 Scrapy 设置,如下表所示:

Scrapy settings can be overridden by passing command-line arguments as shown in the following table −

Sr.No

Command & Description

1

--logfile FILE Overrides LOG_FILE

2

--loglevel/-L LEVEL Overrides LOG_LEVEL

3

--nolog Sets LOG_ENABLED to False

scrapy.utils.log module

此函数可用于初始化 Scrapy 的默认日志记录。

This function can be used to initialize logging defaults for Scrapy.

scrapy.utils.log.configure_logging(settings = None, install_root_handler = True)

Sr.No

Parameter & Description

1

settings (dict, None) It creates and configures the handler for root logger. By default, it is None.

2

install_root_handler (bool) It specifies to install root logging handler. By default, it is True.

以上函数 −

The above function −

  1. Routes warnings and twisted loggings through Python standard logging.

  2. Assigns DEBUG to Scrapy and ERROR level to Twisted loggers.

  3. Routes stdout to log, if LOG_STDOUT setting is true.

可以使用 settings 参数覆盖默认选项。当未指定设置时,则使用默认值。当 install_root_handler 设为 true 时,可以为根日志记录器创建处理程序。如果将其设为 false,则不会设置任何日志输出。在使用 Scrapy 命令时,configure_logging 将自动调用,并且在运行自定义脚本时可以显式运行。

Default options can be overridden using the settings argument. When settings are not specified, then defaults are used. The handler can be created for root logger, when install_root_handler is set to true. If it is set to false, then there will not be any log output set. When using Scrapy commands, the configure_logging will be called automatically and it can run explicitly, while running the custom scripts.

若要手动配置日志记录输出,可以使用 logging.basicConfig() ,如下所示 −

To configure logging’s output manually, you can use logging.basicConfig() shown as follows −

import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler = False)
logging.basicConfig(
   filename = 'logging.txt',
   format = '%(levelname)s: %(message)s',
   level = logging.INFO
)

Scrapy - Stats Collection

Description

状态收集器是 Scrapy 提供的一个工具,用于收集以键/值形式表示的状态,并且是通过爬虫 API(爬虫提供对所有 Scrapy 核心组件的访问)进行访问的。状态收集器为每个爬虫提供一个状态表,状态收集器在爬虫打开时自动打开该表并在爬虫关闭时关闭该表。

Stats Collector is a facility provided by Scrapy to collect stats in the form of key/values, and it is accessed using the Crawler API (the Crawler provides access to all Scrapy core components). The stats collector provides one stats table per spider; it is opened automatically when the spider opens and closed when the spider is closed.

Common Stats Collector Uses

以下代码使用 stats 属性访问该统计收集器。

The following code accesses the stats collector using stats attribute.

class ExtensionThatAccessStats(object):
   def __init__(self, stats):
      self.stats = stats

   @classmethod
   def from_crawler(cls, crawler):
      return cls(crawler.stats)

下表显示了可以使用各种选项和统计收集器 -

The following table shows various options can be used with stats collector −

Sr.No

Parameters

Description

1

stats.set_value('hostname', socket.gethostname())

It is used to set the stats value.

2

stats.inc_value('customized_count')

It increments the stat value.

3

stats.max_value('max_items_scraped', value)

You can set the stat value, only if greater than previous value.

4

stats.min_value('min_free_memory_percent', value)

You can set the stat value, only if lower than previous value.

5

stats.get_value('customized_count')

It fetches the stat value.

6

stats.get_stats() {'custom_count': 1, 'start_time': datetime.datetime(2009, 7, 14, 21, 47, 28, 977139)}

It fetches all the stats
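
Inside a spider the same calls are available through self.crawler.stats; a minimal sketch (the spider name, start URL, and stat key are placeholders) −

import scrapy

class StatsSpider(scrapy.Spider):
   name = 'statsspider'
   start_urls = ['http://www.something.com/']

   def parse(self, response):
      # count every page that reaches this callback
      self.crawler.stats.inc_value('customized_count')
      self.logger.info("pages so far: %s",
         self.crawler.stats.get_value('customized_count'))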

Available Stats Collectors

Scrapy 提供不同类型的统计收集器,可以使用 STATS_CLASS 设置访问它。

Scrapy provides different types of stats collector which can be accessed using the STATS_CLASS setting.

MemoryStatsCollector

这是默认统计集合器,它维护用于抓取内容的每个蜘蛛的统计信息,而数据将存储在内存中。

It is the default Stats collector that maintains the stats of every spider which was used for scraping and the data will be stored in the memory.

class scrapy.statscollectors.MemoryStatsCollector

DummyStatsCollector

此统计集合器非常有效,不做任何事。可以使用 STATS_CLASS 设置对它进行设置,并用于禁用统计集合以提高性能。

This stats collector is very efficient because it does nothing. It can be set using the STATS_CLASS setting and can be used to disable stats collection in order to improve performance.

class scrapy.statscollectors.DummyStatsCollector
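
For example, switching to it is a one-line change in settings.py −

STATS_CLASS = 'scrapy.statscollectors.DummyStatsCollector'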

Scrapy - Sending an E-mail

Description

Scrapy 可以使用自己基于 Twisted non-blocking IO 的设施发送电子邮件,它不会干扰爬虫的非阻塞 IO。你可以通过少量设置来配置电子邮件发送,并且它提供了用于发送附件的简单 API。

Scrapy can send e-mails using its own facility, based on Twisted non-blocking IO, which stays out of the way of the crawler's non-blocking IO. You can configure a few settings for sending e-mails, and it provides a simple API for sending attachments.

共有两种实例化 MailSender 的方法,如下表所示 −

There are two ways to instantiate the MailSender as shown in the following table −

Sr.No

Parameters

Method

1

from scrapy.mail import MailSender mailer = MailSender()

By using a standard constructor.

2

mailer = MailSender.from_settings(settings)

By using Scrapy settings object.

以下代码行发送不带附件的电子邮件 −

The following line sends an e-mail without attachments −

mailer.send(to = ["receiver@example.com"], subject = "subject data", body = "body data",
   cc = ["list@example.com"])

MailSender Class Reference

MailSender 类使用 Twisted non-blocking IO 从 Scrapy 发送电子邮件。

The MailSender class uses Twisted non-blocking IO for sending e-mails from Scrapy.

class scrapy.mail.MailSender(smtphost = None, mailfrom = None, smtpuser = None,
   smtppass = None, smtpport = None, smtptls = False, smtpssl = False)

以下表格显示了 MailSender 类中使用的参数 −

The following parameters are used in the MailSender class −

1. smtphost (str) − The SMTP host used for sending the e-mails. If it is not given, the MAIL_HOST setting is used.

2. mailfrom (str) − The address of the sender, used in the From field of the e-mails. If it is not given, the MAIL_FROM setting is used.

3. smtpuser (str) − It specifies the SMTP user. If it is not given, the MAIL_USER setting is used, and there is no SMTP validation if neither is set.

4. smtppass (str) − It specifies the SMTP password for validation.

5. smtpport (int) − It specifies the SMTP port for the connection.

6. smtptls (boolean) − It enforces using SMTP STARTTLS.

7. smtpssl (boolean) − It enforces using a secure SSL connection.

参考中指定了 MailSender 类中具有以下两个方法。第一个方法,

The following two methods are specified in the MailSender class reference. The first method,

classmethod from_settings(settings)

它通过使用 Scrapy 设置对象合并。它包含以下参数 −

It instantiates the mailer by using the Scrapy settings object. It takes the following parameter −

settings (scrapy.settings.Settings object) − 它被视为电子邮件接收者。

settings (scrapy.settings.Settings object) − The Scrapy settings object from which the e-mail settings are read.

另一种方法,

Another method,

send(to, subject, body, cc = None, attachs = (), mimetype = 'text/plain', charset = None)

下表包含了上述方法的参数 −

The above method takes the following parameters −

1. to (list) − It refers to the e-mail receivers.

2. subject (str) − It specifies the subject of the e-mail.

3. cc (list) − It refers to the list of receivers to be sent a copy (CC).

4. body (str) − It refers to the e-mail body data.

5. attachs (iterable) − It refers to the e-mail attachments: the name of the attachment, the mimetype of the attachment, and a file object.

6. mimetype (str) − It represents the MIME type of the e-mail.

7. charset (str) − It specifies the character encoding used for the e-mail contents.
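The attachs parameter expects an iterable of (attach_name, mimetype, file_object) tuples. A minimal sketch, assuming a local report.csv file exists −

from scrapy.mail import MailSender

mailer = MailSender()

# Each attachment is described by its name, its MIME type and an open file object
attachment = ("report.csv", "text/csv", open("report.csv", "rb"))

mailer.send(
   to = ["receiver@example.com"],
   subject = "Scraped report",
   body = "The report is attached.",
   attachs = [attachment]
)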

Mail Settings

通过以下设置确保无需编写任何代码,即可使用项目中的 MailSender 类配置电子邮件。

The following settings ensure that, without writing any code, we can configure e-mail sending with the MailSender class in the project.

1. MAIL_FROM − It refers to the sender e-mail address used for sending e-mails. Default value: 'scrapy@localhost'

2. MAIL_HOST − It refers to the SMTP host used for sending e-mails. Default value: 'localhost'

3. MAIL_PORT − It specifies the SMTP port to be used for sending e-mails. Default value: 25

4. MAIL_USER − It refers to the SMTP user. If this setting is not set, there will be no SMTP validation. Default value: None

5. MAIL_PASS − It provides the password used for SMTP validation. Default value: None

6. MAIL_TLS − It upgrades an insecure connection to a secure connection using SSL/TLS (STARTTLS). Default value: False

7. MAIL_SSL − It enforces the connection using an SSL encrypted connection. Default value: False
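For instance, the settings below (the host and credentials are placeholders, not real values) would configure MailSender to use an external SMTP server over STARTTLS −

# settings.py
MAIL_FROM = 'alerts@example.com'
MAIL_HOST = 'smtp.example.com'
MAIL_PORT = 587
MAIL_USER = 'alerts@example.com'
MAIL_PASS = 'secret'
MAIL_TLS = True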

Scrapy - Telnet Console

Description

Telnet 控制台是一个 Python 外壳,该外壳在 Scrapy 流程内部运行,用于检查和控制要运行的 Scrapy 流程。

The telnet console is a Python shell that runs inside the Scrapy process and is used for inspecting and controlling a running Scrapy process.

Access Telnet Console

可以使用以下命令访问 telnet 控制台 −

The telnet console can be accessed using the following command −

telnet localhost 6023

基本上,telnet 控制台在 TELNETCONSOLE_PORT 中所述的 TCP 端口中列出。

Basically, the telnet console listens on a TCP port in the range described by the TELNETCONSOLE_PORT setting.

Variables

下表中所述的某些默认变量用作快捷方式 −

Some default variables are available as shortcuts −

1. crawler − This refers to the Scrapy Crawler (scrapy.crawler.Crawler) object.

2. engine − This refers to the Crawler.engine attribute.

3. spider − This refers to the spider which is active.

4. slot − This refers to the engine slot.

5. extensions − This refers to the Extension Manager (Crawler.extensions) attribute.

6. stats − This refers to the Stats Collector (Crawler.stats) attribute.

7. settings − This refers to the Scrapy settings object (Crawler.settings) attribute.

8. est − This prints a report of the engine status.

9. prefs − This is used for memory debugging.

10. p − This is a shortcut to the pprint.pprint function.

11. hpy − This is used for memory debugging.

Examples

以下是使用 Telnet 控制台说明的一些示例。

Following are some examples illustrating the use of the Telnet Console.

Pause, Resume and Stop the Scrapy Engine

要暂停 Scrapy 引擎,请使用以下命令 -

To pause Scrapy engine, use the following command −

telnet localhost 6023
>>> engine.pause()
>>>

要恢复 Scrapy 引擎,请使用以下命令 -

To resume Scrapy engine, use the following command −

telnet localhost 6023
>>> engine.unpause()
>>>

要停止 Scrapy 引擎,请使用以下命令 -

To stop Scrapy engine, use the following command −

telnet localhost 6023
>>> engine.stop()
Connection closed by foreign host.

View Engine Status

Telnet 控制台使用 est() 方法检查 Scrapy 引擎状态,如下面的代码中所示 -

The telnet console provides the est() method to check the status of the Scrapy engine, as shown in the following code −

telnet localhost 6023
>>> est()
Execution engine status

time()-engine.start_time                        : 8.62972998619
engine.has_capacity()                           : False
len(engine.downloader.active)                   : 16
engine.scraper.is_idle()                        : False
engine.spider.name                              : followall
engine.spider_is_idle(engine.spider)            : False
engine.slot.closing                             : False
len(engine.slot.inprogress)                     : 16
len(engine.slot.scheduler.dqs or [])            : 0
len(engine.slot.scheduler.mqs)                  : 92
len(engine.scraper.slot.queue)                  : 0
len(engine.scraper.slot.active)                 : 0
engine.scraper.slot.active_size                 : 0
engine.scraper.slot.itemproc_size               : 0
engine.scraper.slot.needs_backout()             : False

Telnet Console Signals

你可以使用 telnet 控制台信号在 telnet 本地命名空间添加、更新或删除变量。要执行此操作,你需要在处理程序中添加 telnet_vars 字典。

You can use the telnet console signals to add, update, or delete the variables in the telnet local namespace. To perform this action, you need to update the telnet_vars dict in your handler.

scrapy.extensions.telnet.update_telnet_vars(telnet_vars)

参数 -

Parameters −

telnet_vars (dict)

其中,dict 是一个包含 telnet 变量的字典。

Here, telnet_vars is a dictionary containing the telnet variables.
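A minimal sketch of such a handler, wired up from an illustrative extension (the extension name and the added variable are assumptions, not part of Scrapy) −

from scrapy.extensions import telnet

class TelnetVarsExtension(object):
   @classmethod
   def from_crawler(cls, crawler):
      ext = cls()
      # Connect to the update_telnet_vars signal so the namespace can be extended
      crawler.signals.connect(ext.add_vars, signal = telnet.update_telnet_vars)
      return ext

   def add_vars(self, telnet_vars):
      # Any key added here becomes a variable available inside the telnet console
      telnet_vars['answer'] = 42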

Telnet Settings

下表显示控制 Telnet 控制台行为的设置 -

The following settings control the behavior of the Telnet Console −

1. TELNETCONSOLE_PORT − This refers to the port range for the telnet console. If it is set to None, the port will be dynamically assigned. Default value: [6023, 6073]

2. TELNETCONSOLE_HOST − This refers to the interface on which the telnet console should listen. Default value: '127.0.0.1'
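A sketch of how these could be overridden in settings.py (the port range shown is just the default written out explicitly) −

# settings.py
TELNETCONSOLE_PORT = [6023, 6073]
TELNETCONSOLE_HOST = '127.0.0.1'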

Scrapy - Web Services

Description

一个正在运行的 Scrapy 网络爬虫可以通过 JSON-RPC 控制。它通过 JSONRPC_ENABLED 设置启用。此服务通过 JSON-RPC 2.0 协议提供对主爬虫对象的访问。用于访问爬虫程序对象的端点为 −

A running Scrapy web crawler can be controlled via JSON-RPC. It is enabled by the JSONRPC_ENABLED setting. This service provides access to the main crawler object via the JSON-RPC 2.0 protocol. The endpoint for accessing the crawler object is −

http://localhost:6080/crawler

下表包含显示 Web 服务行为的一些设置 −

The following settings control the behavior of the web service −

1. JSONRPC_ENABLED − This refers to the boolean which decides whether the web service (along with its extension) will be enabled or not. Default value: True

2. JSONRPC_LOGFILE − This refers to the file used for logging HTTP requests made to the web service. If it is not set, the standard Scrapy log will be used. Default value: None

3. JSONRPC_PORT − This refers to the port range for the web service. If it is set to None, the port will be dynamically assigned. Default value: [6080, 7030]

4. JSONRPC_HOST − This refers to the interface the web service should listen on. Default value: '127.0.0.1'
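A sketch of how these settings could appear in settings.py, keeping the defaults but logging the JSON-RPC requests to a hypothetical file −

# settings.py
JSONRPC_ENABLED = True
JSONRPC_LOGFILE = 'jsonrpc.log'
JSONRPC_PORT = [6080, 7030]
JSONRPC_HOST = '127.0.0.1'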