Scrapy Concise Tutorial

Scrapy - Spiders

Description

A spider is a class that defines how to follow links through a website and how to extract information from its pages.

Scrapy provides the following default spiders:

scrapy.Spider

It is the base spider from which every other spider must inherit. It has the following class:

class scrapy.spiders.Spider

The following table shows the fields and methods of the scrapy.Spider class:

Sr.No	Field & Description
1	name It is the name of your spider.
2	allowed_domains It is a list of domains on which the spider crawls.
3	start_urls It is a list of URLs from which the spider begins to crawl; they serve as the roots for later crawls.
4	custom_settings These are settings that override the project-wide configuration when the spider runs.
5	crawler It is an attribute that links to the Crawler object to which the spider instance is bound.
6	settings These are the settings for running a spider.
7	logger It is a Python logger used to send log messages.
8	from_crawler(crawler, *args, **kwargs) It is a class method that creates your spider. The parameters are: crawler, the crawler to which the spider instance will be bound; args (list), the arguments passed to the __init__() method; kwargs (dict), the keyword arguments passed to the __init__() method.
9	start_requests() Scrapy calls this method when the spider is opened for scraping. It returns the first requests to crawl; by default it generates one request for each URL in start_urls.
10	make_requests_from_url(url) It is a method used to convert a URL into a request.
11	parse(response) This method processes the response and returns scraped data, more URLs to follow, or both.
12	log(message[, level, component]) It is a method that sends a log message through the spider's logger.
13	closed(reason) This method is called when the spider closes.

Spider Arguments

Spider arguments customize how a spider runs, most commonly by specifying its start URLs. They are passed with the -a option of the crawl command, as follows:

scrapy crawl first -a group=accessories

The following code demonstrates how a spider receives arguments:

import scrapy

class FirstSpider(scrapy.Spider):
   name = "first"

   def __init__(self, group = None, *args, **kwargs):
      super(FirstSpider, self).__init__(*args, **kwargs)
      self.start_urls = ["http://www.example.com/group/%s" % group]
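
As a rough illustration of why the -a option has to be written as NAME=VALUE without spaces, the sketch below mimics how such pairs could be turned into the keyword arguments that reach __init__. The helper parse_spider_args is hypothetical and exists only for this example; it is not Scrapy's own code.

```python
def parse_spider_args(pairs):
    """Turn ["name=value", ...] into a dict of keyword arguments.

    Hypothetical helper for illustration only; not part of Scrapy.
    """
    kwargs = {}
    for pair in pairs:
        if "=" not in pair:
            raise ValueError("expected NAME=VALUE, got %r" % pair)
        # Split on the first "=" only, so a value may itself contain "=".
        name, value = pair.split("=", 1)
        kwargs[name] = value
    return kwargs


# "-a group=accessories" reaches the program as the single token below.
spider_kwargs = parse_spider_args(["group=accessories"])
start_url = "http://www.example.com/group/%s" % spider_kwargs["group"]
print(start_url)
```

With spaces around the equals sign, the shell would pass three separate tokens and the pair could no longer be recognized. Values passed this way arrive as strings, so a spider that needs a number or a list has to convert the value itself.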

Generic Spiders

You can subclass your spiders from the generic spiders. Their aim is to follow all links on a website according to certain rules and to extract data from every page.

For the examples used in the following spiders, let's assume we have a project whose item declares the following fields:

import scrapy

# Item shared by the spider examples that follow
class DemoItem(scrapy.Item):
   product_title = scrapy.Field()
   product_link = scrapy.Field()
   product_description = scrapy.Field()

CrawlSpider

CrawlSpider defines a set of rules for following links and scraping more than one page. It has the following class:

class scrapy.spiders.CrawlSpider

The following are the attributes of the CrawlSpider class:

rules

It is a list of Rule objects that define how the crawler follows links.

The following table shows the parameters of a CrawlSpider rule:

Sr.No	Rule & Description
1	LinkExtractor It specifies how the spider extracts the links to follow from each crawled page.
2	callback It is the method to be called for each page that is scraped.
3	follow It specifies whether to continue following links or not.

parse_start_url(response)

It is called for the responses to the start URLs so that these initial responses can be parsed. It returns either an item or a Request object.

Note: When writing rules, give your callback a name other than parse, because CrawlSpider uses the parse method to implement its own logic.

Let's take a look at the following example, in which the spider starts crawling demoexample.com's home page, follows the links it collects, and parses each page with the parse_item method:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from demoproject.items import DemoItem

class DemoSpider(CrawlSpider):
   name = "demo"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com"]

   rules = (
      # Follow the links found inside <div class="next"> and parse each page reached
      Rule(LinkExtractor(allow = (), restrict_xpaths = ("//div[@class = 'next']",)),
         callback = "parse_item", follow = True),
   )

   def parse_item(self, response):
      item = DemoItem()
      # "//" searches the whole page rather than only the children of the root node
      item["product_title"] = response.xpath("//a/text()").extract()
      item["product_link"] = response.xpath("//a/@href").extract()
      item["product_description"] = response.xpath("//div[@class = 'desc']/text()").extract()
      return item
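
To make the effect of restricting link extraction to one region of the page more concrete, here is a plain-Python sketch of the idea: only links inside a div whose class is "next" are collected for following. The LinkCollector class and the sample HTML are assumptions made for this illustration; Scrapy's LinkExtractor is implemented differently and does considerably more.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> found inside <div class="next">.

    Illustrative stand-in for a restricted link extractor; not Scrapy code.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.depth_in_next = 0  # > 0 while inside a <div class="next">
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and (self.depth_in_next or attrs.get("class") == "next"):
            # Track nesting so inner <div> tags do not end the region early.
            self.depth_in_next += 1
        elif tag == "a" and self.depth_in_next and "href" in attrs:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "div" and self.depth_in_next:
            self.depth_in_next -= 1


# Made-up page: one link in a menu, one in the "next" region.
html = """
<div class="menu"><a href="/about">About</a></div>
<div class="next"><a href="/page/2">Next page</a></div>
"""

collector = LinkCollector("http://www.demoexample.com")
collector.feed(html)
print(collector.links)
```

Only the link inside the "next" region is kept; the menu link is ignored, which is the behaviour the restriction is meant to achieve.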

XMLFeedSpider

It is the base class for spiders that scrape XML feeds by iterating over their nodes. It has the following class:

class scrapy.spiders.XMLFeedSpider

The following table shows the class attributes used to set an iterator and a tag name, along with the methods you can override:

Sr.No	Attribute & Description
1	iterator It defines the iterator to be used. It can be iternodes, html, or xml. The default is iternodes.
2	itertag It is a string with the name of the node to iterate over.
3	namespaces It is a list of (prefix, uri) tuples; the namespaces are registered automatically using the register_namespace() method.
4	adapt_response(response) It receives the response as soon as it arrives from the spider middleware, before the spider starts parsing it, and can modify the response body.
5	parse_node(response, selector) It is called for each node matching the provided tag name and receives the response and a selector for that node. Note: Your spider won't work if you don't override this method.
6	process_results(response, results) It receives the response and the list of results returned by the spider, and returns a list of results.
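
The role of itertag and parse_node can be sketched with the standard library alone: walk the feed, stop at every node with the chosen tag name, and build one record per node. The feed content and the field mapping below are made up for illustration, and xml.etree.ElementTree stands in for Scrapy's own iterators.

```python
import xml.etree.ElementTree as ET

# Made-up feed used only for this example.
FEED = """<rss><channel>
  <item><title>Red lamp</title><link>http://www.demoexample.com/1</link></item>
  <item><title>Blue lamp</title><link>http://www.demoexample.com/2</link></item>
</channel></rss>"""

ITERTAG = "item"  # name of the node to iterate over


def parse_node(node):
    """Build one record from a single node matching ITERTAG."""
    return {
        "product_title": node.findtext("title"),
        "product_link": node.findtext("link"),
    }


root = ET.fromstring(FEED)
# Visit every <item> node in document order, one record per node.
records = [parse_node(node) for node in root.iter(ITERTAG)]

for record in records:
    print(record)
```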

CSVFeedSpider

It receives a CSV file as a response, iterates over its rows, and calls the parse_row() method for each row. It has the following class:

class scrapy.spiders.CSVFeedSpider

The following table shows the options that can be set regarding the CSV file:

Sr.No	Option & Description
1	delimiter It is a string with the separator character for each field. The default is a comma (',').
2	quotechar It is a string with the enclosure character for each field. The default is a quotation mark ('"').
3	headers It is a list of the column names in the CSV file, from which the fields can be extracted.
4	parse_row(response, row) It receives the response and a dict for each row, with a key for each header.

CSVFeedSpider Example

from scrapy.spiders import CSVFeedSpider
from demoproject.items import DemoItem

class DemoSpider(CSVFeedSpider):
   name = "demo"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com/feed.csv"]
   delimiter = ";"
   quotechar = "'"
   headers = ["product_title", "product_link", "product_description"]

   def parse_row(self, response, row):
      self.logger.info("This is row: %r", row)
      item = DemoItem()
      item["product_title"] = row["product_title"]
      item["product_link"] = row["product_link"]
      item["product_description"] = row["product_description"]
      return item
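
To see what kind of row dictionary these options produce, the following sketch parses a small made-up CSV body with Python's csv module, using the same delimiter, quotechar and headers as the spider above. It only approximates what CSVFeedSpider passes to parse_row; the sample data is an assumption for illustration.

```python
import csv
import io

# Same options as in the spider example.
DELIMITER = ";"
QUOTECHAR = "'"
HEADERS = ["product_title", "product_link", "product_description"]

# Made-up feed body; the first description contains the delimiter inside quotes.
CSV_BODY = (
    "'Red lamp';'http://www.demoexample.com/1';'Small; fits a desk'\n"
    "'Blue lamp';'http://www.demoexample.com/2';'Large'\n"
)

reader = csv.DictReader(
    io.StringIO(CSV_BODY),
    fieldnames=HEADERS,   # no header line in the body, so names are supplied
    delimiter=DELIMITER,
    quotechar=QUOTECHAR,
)
rows = [dict(row) for row in reader]

for row in rows:
    print(row["product_title"], "->", row["product_description"])
```

Because the description is enclosed in the quote character, the semicolon inside it is kept as part of the value instead of starting a new field.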

SitemapSpider

SitemapSpider crawls a website by discovering its URLs through Sitemaps, and it can also locate the sitemap URLs in robots.txt. It has the following class:

class scrapy.spiders.SitemapSpider

The following table shows the fields of SitemapSpider:

Sr.No	Field & Description
1	sitemap_urls It is a list of URLs pointing to the sitemaps you want to crawl.
2	sitemap_rules It is a list of tuples (regex, callback), where regex is a regular expression and callback is used to process the URLs that match it.
3	sitemap_follow It is a list of regexes of the sitemaps to follow.
4	sitemap_alternate_links It specifies whether the alternate links for a single URL should be followed.

SitemapSpider Example

The following SitemapSpider processes all the URLs:

from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]

   def parse(self, response):
      # You can scrape items here
      pass

The following SitemapSpider processes some URLs with callbacks:

from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]

   # (regex, callback) pairs: URLs matching the regex go to the named callback
   sitemap_rules = [
      ("/item/", "parse_item"),
      ("/group/", "parse_group"),
   ]

   def parse_item(self, response):
      # You can scrape an item here
      pass

   def parse_group(self, response):
      # You can scrape a group here
      pass
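
The way (regex, callback) rules select a callback can be sketched in a few lines: the rules are tried in order, and the first regex found anywhere in the URL decides which callback handles it. The function pick_callback and the sample URLs are hypothetical and only illustrate the matching; they are not Scrapy's implementation.

```python
import re

# Same (regex, callback) pairs as in the example.
SITEMAP_RULES = [
    ("/item/", "parse_item"),
    ("/group/", "parse_group"),
]

# Compile once so each URL check is a plain regex search.
COMPILED_RULES = [(re.compile(pattern), callback) for pattern, callback in SITEMAP_RULES]


def pick_callback(url):
    """Return the callback name of the first rule whose regex matches url."""
    for regex, callback in COMPILED_RULES:
        if regex.search(url):
            return callback
    return None  # no rule matched, so the URL would not be processed


for url in (
    "http://www.demoexample.com/item/42",
    "http://www.demoexample.com/group/lamps",
    "http://www.demoexample.com/contact-us",
):
    print(url, "->", pick_callback(url))
```

Since only the first matching rule is used, more specific patterns belong before more general ones.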

The following code follows the sitemaps defined in robots.txt and only follows the sitemaps whose URL contains /sitemap_company:

from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/robots.txt"]
   sitemap_rules = [
      ("/company/", "parse_company"),
   ]
   sitemap_follow = ["/sitemap_company"]

   def parse_company(self, response):
      # You can scrape a company here
      pass
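
As a rough sketch of the two ideas involved, the code below reads the Sitemap entries from a made-up robots.txt with the standard library and then keeps only the URLs that match a follow pattern. It assumes Python 3.8 or later for RobotFileParser.site_maps(), and it illustrates the regex filtering only; exactly which sitemap URLs Scrapy applies sitemap_follow to should be checked against the Scrapy documentation.

```python
import re
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content used only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: http://www.demoexample.com/sitemap_company.xml
Sitemap: http://www.demoexample.com/sitemap_blog.xml
"""

FOLLOW_PATTERNS = ["/sitemap_company"]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when robots.txt lists no sitemaps.
listed = parser.site_maps() or []

# Keep a sitemap URL if any follow pattern is found anywhere in it.
followed = [
    url for url in listed
    if any(re.search(pattern, url) for pattern in FOLLOW_PATTERNS)
]

print(listed)
print(followed)
```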

You can even combine SitemapSpider with other sources of URLs, as shown in the following code:

import scrapy
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/robots.txt"]
   sitemap_rules = [
      ("/company/", "parse_company"),
   ]

   other_urls = ["http://www.demoexample.com/contact-us"]

   def start_requests(self):
      # Requests from the sitemaps plus one request per extra URL
      requests = list(super(DemoSpider, self).start_requests())
      requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
      return requests

   def parse_company(self, response):
      # You can scrape a company here
      pass

   def parse_other(self, response):
      # You can scrape the other pages here
      pass