Scrapy: A Concise Tutorial

Scrapy - Extracting Items

Description

To extract data from web pages, Scrapy uses selectors, a mechanism based on XPath and CSS expressions. The following are some examples of XPath expressions −

/html/head/title − This will select the <title> element, inside the <head> element of an HTML document.
/html/head/title/text() − This will select the text within the same <title> element.
//td − This selects all <td> elements in the document.
//div[@class = "slice"] − This selects all div elements whose class attribute is exactly "slice".

Selectors have four basic methods, as shown in the following table −

Sr.No	Method	Description
1	extract()	Returns the selected data as a unicode string; when called on a list of selectors, it returns a list of unicode strings.
2	re()	Returns a list of unicode strings extracted by applying the regular expression given as an argument.
3	xpath()	Returns a list of selectors representing the nodes selected by the XPath expression given as an argument.
4	css()	Returns a list of selectors representing the nodes selected by the CSS expression given as an argument.

Using Selectors in the Shell

To try out selectors, you can use the built-in Scrapy shell. It runs on the standard Python console, but uses IPython when that is installed, as in the session shown here. When launching the shell, always enclose the URL in quotes; otherwise, URLs containing the '&' character will not work, because the command-line shell treats '&' as a special character. You can start the shell by running the following command in the project's top-level directory −

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

The shell will look like the following −

[ ... Scrapy log here ... ]

2014-01-23 17:11:42-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   settings   <scrapy.settings.Settings object at 0x3fadc50>
[s]   spider     <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]:

Once the shell has loaded, you can access the response body and headers through response.body and response.headers, respectively. Similarly, you can run queries on the response using response.selector.xpath() or response.selector.css(), or their shortcuts response.xpath() and response.css().

For instance −

In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>My Book - Scrapy'>]

In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>My Book - Scrapy: Index: Chapters</title>']

In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'My Book - Scrapy: Index:'>]

In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'My Book - Scrapy: Index: Chapters']

In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Scrapy', u'Index']

Extracting the Data

To extract data from a regular HTML site, we first inspect the page's source code to work out the XPath expressions. On inspection, you can see that the data sits inside ul tags, with each entry in an li element, so we select the elements within the li tags.

The following lines of code show how to extract different types of data −

To select the data within the li tags −

response.xpath('//ul/li')

To select the descriptions −

response.xpath('//ul/li/text()').extract()

To select the site titles −

response.xpath('//ul/li/a/text()').extract()

To select the site links −

response.xpath('//ul/li/a/@href').extract()

The following code demonstrates the use of the above extractors −

import scrapy


class MyprojectSpider(scrapy.Spider):
    # Name used to run the spider: scrapy crawl project
    name = "project"
    allowed_domains = ["dmoz.org"]

    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Each <li> inside a <ul> holds one listed site
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()  # link text
            link = sel.xpath('a/@href').extract()    # link target
            desc = sel.xpath('text()').extract()     # text directly inside <li>
            print(title, link, desc)