Scrapy Quick Guide

Description

In this chapter, we'll study how to extract the links of the pages of our interest, follow them, and extract data from those pages. For this, we need to make the following changes in our previous code −

import scrapy
from tutorial.items import DmozItem

class MyprojectSpider(scrapy.Spider):
   name = "project"
   allowed_domains = ["dmoz.org"]

   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/",
   ]
   def parse(self, response):
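      # Follow each link of interest found in the directory listing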
      for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
         url = response.urljoin(href.extract())
         yield scrapy.Request(url, callback=self.parse_dir_contents)

   def parse_dir_contents(self, response):
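      # Callback that scrapes the title, link and description of each listed site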
      for sel in response.xpath('//ul/li'):
         item = DmozItem()
         item['title'] = sel.xpath('a/text()').extract()
         item['link'] = sel.xpath('a/@href').extract()
         item['desc'] = sel.xpath('text()').extract()
         yield item
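
To try the spider out, invoke Scrapy's crawl command from the project's top-level directory, using the name defined above −

scrapy crawl project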

The above code contains the following methods −

  1. parse() − It extracts the links of our interest.

  2. response.urljoin() − The parse() method uses this method to build an absolute URL from each extracted href and yield a new request, whose response is later handled by the callback (see the sketch after this list).

  3. parse_dir_contents() − This is the callback which actually scrapes the data of interest.
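
A minimal illustration of what response.urljoin() does here, assuming the start URL above and a hypothetical relative href "Books/" − it resolves the relative link against the URL of the page being parsed:

>>> response.url
'http://www.dmoz.org/Computers/Programming/Languages/Python/'
>>> response.urljoin('Books/')
'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'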

Here, Scrapy uses a callback mechanism to follow links. With this mechanism, a bigger crawler can be designed to follow links of interest and scrape the desired data from different pages. The usual pattern is a callback method that extracts the items, looks for a link to follow to the next page, and then yields a request for the same callback.

The following example produces a loop, which follows the links to the next page −

def parse_articles_follow_next_page(self, response):
   for article in response.xpath("//article"):
      item = ArticleItem()

      # ... extract article data here

      yield item

   next_page = response.css("ul.navigation > li.next-page > a::attr('href')")
   if next_page:
      url = response.urljoin(next_page[0].extract())
      yield scrapy.Request(url, callback=self.parse_articles_follow_next_page)
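
In newer versions of Scrapy (1.4 and later), response.follow offers a shortcut for this pattern: it accepts a relative URL directly and joins it against response.url itself, so the explicit urljoin() call can be dropped. A minimal sketch of the same pagination loop under that assumption −

def parse_articles_follow_next_page(self, response):
   for article in response.xpath("//article"):
      item = ArticleItem()
      # ... extract article data here
      yield item

   # extract_first() returns the first match as a string, or None if there is no next page
   next_page = response.css("ul.navigation > li.next-page > a::attr('href')").extract_first()
   if next_page is not None:
      # response.follow() builds the absolute URL internally
      yield response.follow(next_page, callback=self.parse_articles_follow_next_page)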