Scrapy Tutorial

Scrapy - Item Pipeline

Description

The Item Pipeline is where scraped items are processed. After an item has been scraped by a spider, it is sent to the Item Pipeline, where it is processed by several components that are executed sequentially.

Whenever an item is received, a pipeline component decides on one of the following actions −

  1. Keep processing the item and pass it on to the next pipeline component.

  2. Drop it from the pipeline, which stops any further processing of that item.

Item pipelines are generally used for the following purposes −

  1. Storing scraped items in a database.

  2. Dropping an item if it is a duplicate of one that has already been processed.

  3. Checking whether an item contains the targeted (expected) fields.

  4. Cleansing HTML data.

Syntax

You can write an Item Pipeline by implementing the following method −

process_item(self, item, spider)

The above method takes the following parameters −

  1. item (item object or dictionary) − Specifies the scraped item.

  2. spider (spider object) − The spider which scraped the item.
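
To illustrate this contract, here is a minimal sketch of a pipeline; the class name CleanupPipeline and the field name 'name' are hypothetical and only serve as an example. Returning the item keeps it moving through the pipeline, while raising DropItem discards it −

from scrapy.exceptions import DropItem

class CleanupPipeline(object):
   def process_item(self, item, spider):
      # Hypothetical rule: keep items that have a 'name' field, drop the rest.
      if item.get('name'):
         item['name'] = item['name'].strip()
         return item
      raise DropItem("Missing name in %s" % item)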

You can also implement the additional methods described below −

  1. open_spider(self, spider) − Called when the spider is opened. Parameter: spider (spider object) − the spider which was opened.

  2. close_spider(self, spider) − Called when the spider is closed. Parameter: spider (spider object) − the spider which was closed.

  3. from_crawler(cls, crawler) − With the help of the crawler, the pipeline can access core Scrapy components such as signals and settings. Parameter: crawler (Crawler object) − the crawler that uses this pipeline.
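
As a rough sketch of how these methods work together (the setting name EXPORT_FILE and the class name ExportPipeline are assumptions made for this illustration, not part of Scrapy), a pipeline might read its configuration in from_crawler(), acquire a resource in open_spider() and release it in close_spider() −

class ExportPipeline(object):
   def __init__(self, export_file):
      self.export_file = export_file

   @classmethod
   def from_crawler(cls, crawler):
      # Pull a (hypothetical) value out of the Scrapy settings.
      return cls(export_file=crawler.settings.get('EXPORT_FILE', 'export.txt'))

   def open_spider(self, spider):
      # Acquire resources once, when the spider starts.
      self.file = open(self.export_file, 'w')

   def close_spider(self, spider):
      # Release them once, when the spider finishes.
      self.file.close()

   def process_item(self, item, spider):
      self.file.write("%s\n" % dict(item))
      return item

The MongoDB example further below uses this same pattern with a real database connection.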

Example

以下是用于不同概念的 Item 管道的范例。

Following are the examples of item pipeline used in different concepts.

Dropping Items with No Price

In the following code, the pipeline adjusts the price attribute for those items that do not include VAT (excludes_vat attribute) and drops those items which do not have a price −

from scrapy.exceptions import DropItem

class PricePipeline(object):
   vat = 2.25

   def process_item(self, item, spider):
      if item['price']:
         # Apply VAT only to prices that were scraped without it.
         if item['excludes_vat']:
            item['price'] = item['price'] * self.vat
         return item
      else:
         raise DropItem("Missing price in %s" % item)

Writing Items to a JSON File

The following code stores all items scraped from all spiders in a single items.jl file, with one item per line serialized in JSON format. The JsonWriterPipeline class is used in the code to show how to write an item pipeline −

import json

class JsonWriterPipeline(object):
   def __init__(self):
      # Open the file in text mode so the JSON string can be written directly.
      self.file = open('items.jl', 'w')

   def process_item(self, item, spider):
      line = json.dumps(dict(item)) + "\n"
      self.file.write(line)
      return item

Writing Items to MongoDB

You can specify the MongoDB address and database name in the Scrapy settings, and the MongoDB collection can be named after the item class. The following code describes how to use the from_crawler() method to collect the resources properly −

import pymongo

class MongoPipeline(object):
   collection_name = 'Scrapy_list'

   def __init__(self, mongo_uri, mongo_db):
      self.mongo_uri = mongo_uri
      self.mongo_db = mongo_db

   @classmethod
   def from_crawler(cls, crawler):
      return cls(
         mongo_uri = crawler.settings.get('MONGO_URI'),
         mongo_db = crawler.settings.get('MONGO_DB', 'lists')
      )

   def open_spider(self, spider):
      self.client = pymongo.MongoClient(self.mongo_uri)
      self.db = self.client[self.mongo_db]

   def close_spider(self, spider):
      self.client.close()

   def process_item(self, item, spider):
      # insert_one() is the modern PyMongo API; insert() is deprecated.
      self.db[self.collection_name].insert_one(dict(item))
      return item
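
For the pipeline above to work, the corresponding settings must be defined in the project's settings.py. A minimal sketch might look like this (the connection URI and database name are placeholders chosen for illustration, not values required by Scrapy) −

MONGO_URI = 'mongodb://localhost:27017'
MONGO_DB = 'scrapy_items'

The pipeline itself also has to be enabled through the ITEM_PIPELINES setting, which is described in the last section of this page.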

Duplicates Filter

A filter checks for repeated items and drops those that have already been processed. In the following code, we assume that our items have a unique id, but the spider returns multiple items with the same id −

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
   def __init__(self):
      self.ids_seen = set()

   def process_item(self, item, spider):
      if item['id'] in self.ids_seen:
         raise DropItem("Repeated items found: %s" % item)
      else:
         self.ids_seen.add(item['id'])
         return item

Activating an Item Pipeline

You can activate an Item Pipeline component by adding its class to the ITEM_PIPELINES setting, as shown in the following code. The integer value you assign to each class determines the order in which they run: items pass through the pipelines from lower valued to higher valued classes. Values are conventionally kept in the 0-1000 range.

ITEM_PIPELINES = {
   'myproject.pipelines.PricePipeline': 100,
   'myproject.pipelines.JsonWriterPipeline': 600,
}