Scrapy Tutorial

Scrapy - Item Loaders

Description

Item loaders provide a convenient way to fill the items that are scraped from the websites.

Declaring Item Loaders

Item Loaders are declared like Items.

For example −

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class DemoLoader(ItemLoader):
   default_output_processor = TakeFirst()
   title_in = MapCompose(unicode.title)
   title_out = Join()
   size_in = MapCompose(unicode.strip)
   # you can continue scraping here

In the above code, you can see that input processors are declared using the _in suffix and output processors are declared using the _out suffix.

The ItemLoader.default_input_processor and ItemLoader.default_output_processor attributes are used to declare default input/output processors.
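The lookup rule can be sketched in plain Python, without Scrapy: a field-specific `<field>_in` / `<field>_out` processor wins, and the loader falls back to the defaults otherwise. `MiniLoader` and `DemoMini` below are hypothetical stand-ins, not part of the Scrapy API.

```python
# Minimal pure-Python sketch of the processor lookup rule.
class MiniLoader:
    # defaults: keep values as-is on input, take the first one on output
    default_input_processor = staticmethod(lambda values: values)
    default_output_processor = staticmethod(lambda values: values[0] if values else None)

    def get_input_processor(self, field):
        # a field-specific "<field>_in" processor wins over the default
        return getattr(self, field + "_in", self.default_input_processor)

    def get_output_processor(self, field):
        return getattr(self, field + "_out", self.default_output_processor)

class DemoMini(MiniLoader):
    # only "title" declares its own input processor
    title_in = staticmethod(lambda values: [v.title() for v in values])

loader = DemoMini()
print(loader.get_input_processor("title")(["hello world"]))  # ['Hello World']
print(loader.get_input_processor("size")(["45 cm"]))         # ['45 cm'] (default applies)
```

The same fallback idea is what lets a loader subclass override processing for one field while leaving every other field on the defaults.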

Using Item Loaders to Populate Items

To use an Item Loader, first instantiate it with a dict-like object, or without one, in which case an item is automatically created using the Item class specified in the ItemLoader.default_item_class attribute.

  1. You can use selectors to collect values into the Item Loader.

  2. You can add more values in the same item field, where Item Loader will use an appropriate handler to add these values.

The following code demonstrates how items are populated using Item Loaders −

from scrapy.loader import ItemLoader
from demoproject.items import Product

def parse(self, response):
   l = ItemLoader(item = Product(), response = response)
   l.add_xpath("title", "//div[@class = 'product_title']")
   l.add_xpath("title", "//div[@class = 'product_name']")
   l.add_xpath("desc", "//div[@class = 'desc']")
   l.add_css("size", "div#size")
   l.add_value("last_updated", "yesterday")
   return l.load_item()

As shown above, there are two different XPaths from which the title field is extracted using the add_xpath() method −

1. //div[@class = "product_title"]
2. //div[@class = "product_name"]

Thereafter, a similar request is used for the desc field. The size data is extracted using the add_css() method, and last_updated is filled with the value "yesterday" using the add_value() method.

Once all the data is collected, call the ItemLoader.load_item() method, which returns the item filled with the data extracted using the add_xpath(), add_css() and add_value() methods.

Input and Output Processors

Each field of an Item Loader contains one input processor and one output processor.

  1. When data is extracted, the input processor processes it, and its result is stored in the ItemLoader.

  2. Next, after collecting the data, call the ItemLoader.load_item() method to get the populated Item object.

  3. Finally, the result of the output processor is assigned to the item.

The following code demonstrates how to call input and output processors for a specific field −

l = ItemLoader(Product(), some_selector)
l.add_xpath("title", xpath1) # [1]
l.add_xpath("title", xpath2) # [2]
l.add_css("title", css)      # [3]
l.add_value("title", "demo") # [4]
return l.load_item()         # [5]

Line 1 − The title data is extracted from xpath1, passed through the input processor, and its result is collected and stored in the ItemLoader.

Line 2 − Similarly, the title is extracted from xpath2 and passed through the same input processor, and its result is added to the data collected for [1].

Line 3 − The title is extracted from the CSS selector and passed through the same input processor, and the result is added to the data collected for [1] and [2].

Line 4 − Next, the value "demo" is assigned and passed through the input processor.

Line 5 − Finally, the data collected internally from all the fields is passed to the output processors, and the final values are assigned to the Item.

Declaring Input and Output Processors

The input and output processors are declared in the ItemLoader definition. Apart from this, they can also be specified in the Item Field metadata.

For example −

import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_size(value):
   if value.isdigit():
      return value

class Product(scrapy.Item):
   title = scrapy.Field(
      input_processor = MapCompose(remove_tags),
      output_processor = Join(),
   )
   size = scrapy.Field(
      input_processor = MapCompose(remove_tags, filter_size),
      output_processor = TakeFirst(),
   )

>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item = Product())
>>> il.add_value('title', [u'Hello', u'<strong>world</strong>'])
>>> il.add_value('size', [u'<span>100 kg</span>'])
>>> il.load_item()

It displays an output as −

{'title': u'Hello world', 'size': u'100 kg'}
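The {'title': u'Hello world', 'size': u'100 kg'} result can be reproduced with a small pure-Python sketch of the two processors involved; remove_tags here is a naive stand-in for w3lib.html.remove_tags, and map_compose/join mimic the behavior of MapCompose and Join.

```python
import re

def remove_tags(value):
    # naive stand-in for w3lib.html.remove_tags: drop anything tag-shaped
    return re.sub(r"<[^>]+>", "", value)

def map_compose(*functions):
    # apply each function to every element; drop None results
    def processor(values):
        for fn in functions:
            values = [r for r in (fn(v) for v in values) if r is not None]
        return values
    return processor

def join(separator=" "):
    return lambda values: separator.join(values)

title_in = map_compose(remove_tags)  # input processor for 'title'
title_out = join()                   # output processor for 'title'

collected = title_in(['Hello', '<strong>world</strong>'])  # ['Hello', 'world']
print(title_out(collected))  # Hello world
```

The input processor runs once per added value and the output processor runs once over everything collected, which is why the tags disappear first and the join happens last.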

Item Loader Context

The Item Loader Context is a dict of arbitrary key values shared among input and output processors.

For example, assume you have a function parse_length −

def parse_length(text, loader_context):
   unit = loader_context.get('unit', 'cm')

   # You can write the parsing code of length here; a naive example −
   parsed_length = float(text.split()[0])
   return parsed_length

By receiving a loader_context argument, the function tells the Item Loader that it can receive the Item Loader context. There are several ways to change the value of the Item Loader context −

  1. Modify the currently active Item Loader context −

loader = ItemLoader(product)
loader.context["unit"] = "mm"

  2. On Item Loader instantiation −

loader = ItemLoader(product, unit = "mm")

  3. On Item Loader declaration, for input/output processors that instantiate with the Item Loader context −

class ProductLoader(ItemLoader):
   length_out = MapCompose(parse_length, unit = "mm")
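How the context reaches the function can be sketched in plain Python; with_context below is a hypothetical stand-in for the way MapCompose forwards declared keyword arguments as loader context, not a Scrapy API.

```python
def parse_length(text, loader_context):
    # the unit comes from the loader context; 'cm' is the fallback
    unit = loader_context.get('unit', 'cm')
    value = float(text.split()[0])
    # normalize everything to millimetres for the sake of the example
    return value * 10 if unit == 'cm' else value

def with_context(fn, **context):
    # hypothetical stand-in: forward the declared context to the function
    return lambda values: [fn(v, loader_context=context) for v in values]

processor = with_context(parse_length, unit='cm')
print(processor(['45 cm']))  # [450.0]

processor_mm = with_context(parse_length, unit='mm')
print(processor_mm(['45 mm']))  # [45.0]
```

The key point is that parse_length never hardcodes a unit: the same function behaves differently depending on the context the loader was declared or instantiated with.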

ItemLoader Objects

It is an object that returns a new Item Loader to populate the given item. It has the following class −

class scrapy.loader.ItemLoader([item, selector, response, ]**kwargs)

The following are the parameters of ItemLoader objects −

  1. item − The item to populate by calling add_xpath(), add_css() or add_value().

  2. selector − The selector used to extract data from websites.

  3. response − The response used to construct the selector using default_selector_class.

The following are the methods of ItemLoader objects −

  1. get_value(value, *processors, **kwargs) − Processes the given value by the given processors and keyword arguments.

>>> from scrapy.loader.processors import TakeFirst
>>> loader.get_value(u'title: demoweb', TakeFirst(), unicode.upper, re = 'title: (.+)')
'DEMOWEB'

  2. add_value(field_name, value, *processors, **kwargs) − Processes the value and adds it to the given field, where it is first passed through get_value() with the given processors and keyword arguments before passing through the field input processor.

loader.add_value('title', u'DVD')
loader.add_value('colors', [u'black', u'white'])
loader.add_value('length', u'80')
loader.add_value('price', u'2500')

  3. replace_value(field_name, value, *processors, **kwargs) − Replaces the collected data with a new value.

loader.replace_value('title', u'DVD')
loader.replace_value('colors', [u'black', u'white'])
loader.replace_value('length', u'80')
loader.replace_value('price', u'2500')

  4. get_xpath(xpath, *processors, **kwargs) − Extracts unicode strings from the given XPath, processed by the given processors and keyword arguments.

# HTML code: <div class = "item-name">DVD</div>
loader.get_xpath("//div[@class = 'item-name']")

# HTML code: <div id = "length">the length is 45cm</div>
loader.get_xpath("//div[@id = 'length']", TakeFirst(), re = "the length is (.*)")

  5. add_xpath(field_name, xpath, *processors, **kwargs) − Receives an XPath for the field, from which unicode strings are extracted.

# HTML code: <div class = "item-name">DVD</div>
loader.add_xpath('name', '//div[@class = "item-name"]')

# HTML code: <div id = "length">the length is 45cm</div>
loader.add_xpath('length', '//div[@id = "length"]', re = 'the length is (.*)')

  6. replace_xpath(field_name, xpath, *processors, **kwargs) − Replaces the collected data using an XPath.

# HTML code: <div class = "item-name">DVD</div>
loader.replace_xpath('name', '//div[@class = "item-name"]')

# HTML code: <div id = "length">the length is 45cm</div>
loader.replace_xpath('length', '//div[@id = "length"]', re = 'the length is (.*)')

  7. get_css(css, *processors, **kwargs) − Receives a CSS selector used to extract the unicode strings.

loader.get_css("div.item-name")
loader.get_css("div#length", TakeFirst(), re = "the length is (.*)")

  8. add_css(field_name, css, *processors, **kwargs) − Similar to the add_value() method, with the one difference that it adds a CSS selector to the field.

loader.add_css('name', 'div.item-name')
loader.add_css('length', 'div#length', re = 'the length is (.*)')

  9. replace_css(field_name, css, *processors, **kwargs) − Replaces the extracted data using a CSS selector.

loader.replace_css('name', 'div.item-name')
loader.replace_css('length', 'div#length', re = 'the length is (.*)')

  10. load_item() − When the data is collected, this method fills the item with the collected data and returns it.

def parse(self, response):
   l = ItemLoader(item = Product(), response = response)
   l.add_xpath('title', '//div[@class = "product_title"]')
   return l.load_item()

  11. nested_xpath(xpath) − Used to create nested loaders with an XPath selector.

loader = ItemLoader(item = Item())
header_loader = loader.nested_xpath('//header')
header_loader.add_xpath('social', 'a[@class = "social"]/@href')
header_loader.add_xpath('email', 'a[@class = "email"]/@href')

  12. nested_css(css) − Used to create nested loaders with a CSS selector.

loader = ItemLoader(item = Item())
header_loader = loader.nested_css('header')
header_loader.add_css('social', 'a.social::attr(href)')
header_loader.add_css('email', 'a.email::attr(href)')

The following are the attributes of ItemLoader objects −

  1. item − The item object on which the Item Loader performs parsing.

  2. context − The current, active context of the Item Loader.

  3. default_item_class − The item class used to instantiate items, if not given in the constructor.

  4. default_input_processor − The default input processor, used only for the fields that don't specify an input processor.

  5. default_output_processor − The default output processor, used only for the fields that don't specify an output processor.

  6. default_selector_class − The class used to construct the selector, if it is not given in the constructor.

  7. selector − The object used to extract the data from sites.

Nested Loaders

Nested loaders are used while parsing values from a subsection of a document. If you don't create nested loaders, you need to specify the full XPath or CSS for each value that you want to extract.

For instance, assume that data is being extracted from the header of a page −

<header>
   <a class = "social" href = "http://facebook.com/whatever">facebook</a>
   <a class = "social" href = "http://twitter.com/whatever">twitter</a>
   <a class = "email" href = "mailto:someone@example.com">send mail</a>
</header>

Next, you can create a nested loader with the header selector by adding the related values to the header −

loader = ItemLoader(item = Item())
header_loader = loader.nested_xpath('//header')
header_loader.add_xpath('social', 'a[@class = "social"]/@href')
header_loader.add_xpath('email', 'a[@class = "email"]/@href')
loader.load_item()

Reusing and extending Item Loaders

Item Loaders are designed to relieve the maintenance burden, which becomes a fundamental problem when your project acquires more spiders.

For instance, assume that a site has its product names enclosed in dashes (e.g. --DVD---). You can remove those dashes by reusing the default Product Item Loader, if you don't want them in the final product names, as shown in the following code −

from scrapy.loader.processors import MapCompose
from demoproject.ItemLoaders import DemoLoader

def strip_dashes(x):
   return x.strip('-')

class SiteSpecificLoader(DemoLoader):
   title_in = MapCompose(strip_dashes, DemoLoader.title_in)
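The chaining in MapCompose(strip_dashes, DemoLoader.title_in) can be sketched in plain Python; title_case below is a hypothetical stand-in for whatever the parent loader's title_in already does.

```python
def strip_dashes(x):
    return x.strip('-')

def title_case(x):
    # hypothetical stand-in for the parent loader's existing title_in step
    return x.title()

def chain(*functions):
    # run each function on the value in order, per element, like MapCompose
    def processor(value):
        for fn in functions:
            value = fn(value)
        return value
    return processor

site_title_in = chain(strip_dashes, title_case)
print(site_title_in('--dvd player---'))  # Dvd Player
```

Because strip_dashes is prepended to the existing processor rather than replacing it, the site-specific loader keeps all of the parent's behavior and only adds the extra cleanup step.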

Available Built-in Processors

Following are some of the commonly used built-in processors −

class scrapy.loader.processors.Identity

It returns the original value without altering it. For example −

>>> from scrapy.loader.processors import Identity
>>> proc = Identity()
>>> proc(['a', 'b', 'c'])
['a', 'b', 'c']

class scrapy.loader.processors.TakeFirst

It returns the first value that is non-null/non-empty from the list of received values. For example −

>>> from scrapy.loader.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'a', 'b', 'c'])
'a'

class scrapy.loader.processors.Join(separator = u' ')

It returns the values joined by the separator. The default separator is u' ' and it is equivalent to the function u' '.join. For example −

>>> from scrapy.loader.processors import Join
>>> proc = Join()
>>> proc(['a', 'b', 'c'])
u'a b c'
>>> proc = Join('<br>')
>>> proc(['a', 'b', 'c'])
u'a<br>b<br>c'

class scrapy.loader.processors.Compose(*functions, **default_loader_context)

It is defined by a processor in which each of its input values is passed to the first function, the result of that function is passed to the second function, and so on, until the last function returns the final value as output.

For example −

>>> from scrapy.loader.processors import Compose
>>> proc = Compose(lambda v: v[0], str.upper)
>>> proc(['python', 'scrapy'])
'PYTHON'

class scrapy.loader.processors.MapCompose(*functions, **default_loader_context)

It is a processor in which the input value is iterated over and the first function is applied to each element. Next, the results of these function calls are concatenated to build a new iterable, which is then passed to the second function, and so on, until the last function.

For example −

>>> def filter_scrapy(x):
...    return None if x == 'scrapy' else x

>>> from scrapy.loader.processors import MapCompose
>>> proc = MapCompose(filter_scrapy, unicode.upper)
>>> proc([u'hi', u'scrapy', u'im', u'pythonscrapy'])
[u'HI', u'IM', u'PYTHONSCRAPY']

class scrapy.loader.processors.SelectJmes(json_path)

This class queries the value using the provided json path and returns the output.

For example −

>>> from scrapy.loader.processors import SelectJmes, Compose, MapCompose
>>> proc = SelectJmes("hello")
>>> proc({'hello': 'scrapy'})
'scrapy'
>>> proc({'hello': {'scrapy': 'world'}})
{'scrapy': 'world'}

The following code queries the value by importing json −

>>> import json
>>> proc_single_json_str = Compose(json.loads, SelectJmes("hello"))
>>> proc_single_json_str('{"hello": "scrapy"}')
u'scrapy'
>>> proc_json_list = Compose(json.loads, MapCompose(SelectJmes('hello')))
>>> proc_json_list('[{"hello":"scrapy"}, {"world":"env"}]')
[u'scrapy']