Scrapy: A Concise Tutorial

Scrapy - Selectors

Description

When you are scraping web pages, you need to extract specific parts of the HTML source using a mechanism called selectors, which work with either XPath or CSS expressions. Selectors are built on top of the lxml library, which processes XML and HTML in Python.

The following HTML snippet is used to illustrate the different selector concepts −

<html>
   <head>
      <title>My Website</title>
   </head>

   <body>
      <span>Hello world!!!</span>
      <div class = 'links'>
         <a href = 'one.html'>Link 1<img src = 'image1.jpg'/></a>
         <a href = 'two.html'>Link 2<img src = 'image2.jpg'/></a>
         <a href = 'three.html'>Link 3<img src = 'image3.jpg'/></a>
      </div>
   </body>
</html>

Constructing Selectors

You can construct Selector class instances by passing either text or a TextResponse object. The selector chooses its parsing rules based on the type of input provided −

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

Using the above imports, you can construct a selector from text as follows −

body = '<html><body><span>Hello world!!!</span></body></html>'
Selector(text = body).xpath('//span/text()').extract()

It will display the result as −

['Hello world!!!']

You can construct from the response as −

response = HtmlResponse(url = 'http://mysite.com', body = body, encoding = 'utf-8')
Selector(response = response).xpath('//span/text()').extract()

It will display the result as −

['Hello world!!!']

Using Selectors

Using the above simple code snippet, you can construct an XPath to select the text defined inside the title tag, as shown below −

>>> response.selector.xpath('//title/text()')

Now, you can extract the textual data using the .extract() method shown as follows −

>>> response.xpath('//title/text()').extract()

It will produce the result as −

['My Website']

You can display the text of all the links as follows −

>>> response.xpath('//div[@class = "links"]/a/text()').extract()

It will display the elements as −

['Link 1', 'Link 2', 'Link 3']

If you want to extract the first element, then use the method .extract_first(), shown as follows −

>>> response.xpath('//div[@class = "links"]/a/text()').extract_first()

It will display the element as −

Link 1

Nesting Selectors

Using the above code, you can nest the selectors to display the page link and image source using the .xpath() method, shown as follows −

links = response.xpath('//a[contains(@href, "html")]')

for index, link in enumerate(links):
   args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
   print('The link %d pointing to url %s and image %s' % args)

It will display the result as −

The link 0 pointing to url ['one.html'] and image ['image1.jpg']
The link 1 pointing to url ['two.html'] and image ['image2.jpg']
The link 2 pointing to url ['three.html'] and image ['image3.jpg']

Selectors Using Regular Expressions

Scrapy allows you to extract data using regular expressions via the .re() method. From the above HTML code, we will extract the link names as follows −

>>> response.xpath('//div[@class = "links"]/a/text()').re(r'(Link\s*\d+)')

The above line displays the link names as −

['Link 1',
'Link 2',
'Link 3']

Using Relative XPaths

When you are working with XPaths that start with /, nested selectors and XPaths operate on the absolute path of the document, not on a path relative to the selector.

If you want to extract the <p> elements, then first get all the div elements −

>>> mydiv = response.xpath('//div')

Next, you can extract all the 'p' elements inside by prefixing the XPath with a dot, as in .//p, shown below −

>>> for p in mydiv.xpath('.//p').extract():
...    print(p)

Using EXSLT Extensions

EXSLT is a community initiative that provides extensions to XSLT (Extensible Stylesheet Language Transformations), which transforms XML documents into documents such as XHTML. You can use EXSLT extensions with their registered namespaces in XPath expressions, as shown in the following table −

Sr.No   Prefix & Usage              Namespace
1       re (regular expressions)    http://exslt.org/regexp/index.html
2       set (set manipulation)      http://exslt.org/set/index.html

You can check the simple code format for extracting data using regular expressions in the previous section.

There are some XPath tips that are useful when using XPath with Scrapy selectors. For more information, click this link.