A Concise Scrapy Tutorial

Scrapy - Requests and Responses

Description

Scrapy can crawl websites using Request and Response objects. Request objects are generated in the spiders and pass through the system until they reach the downloader, which executes them and returns a Response object that travels back to the spider that issued the request.

Request Objects

A Request object represents an HTTP request, which eventually produces a Response. It has the following class −

class scrapy.http.Request(url[, callback, method = 'GET', headers, body, cookies, meta,
   encoding = 'utf-8', priority = 0, dont_filter = False, errback])

Request objects accept the following parameters; a short illustrative example follows the list −

1. url − A string that specifies the URL of the request.

2. callback − A callable that will be invoked with the response of this request as its first parameter.

3. method − A string that specifies the HTTP method of the request.

4. headers − A dictionary with the request headers.

5. body − A string or unicode object holding the request body.

6. cookies − A list (or dictionary) containing the request cookies.

7. meta − A dictionary that contains values for the metadata of the request.

8. encoding − A string with the encoding (default 'utf-8') used to encode the URL.

9. priority − An integer used by the scheduler to define the order in which requests are processed.

10. dont_filter − A boolean specifying that the scheduler should not filter out the request.

11. errback − A callable to be called if an exception is raised while processing the request.
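
As a rough sketch (the URL, header, cookie, and meta values below are placeholders, not part of the original examples), a spider can combine several of these parameters when building a request −

import scrapy

class ExampleSpider(scrapy.Spider):
   name = "example"

   def start_requests(self):
      # Placeholder URL and values, used only to illustrate the parameters above
      yield scrapy.Request(
         url = "http://www.something.com/some_page.html",
         method = "GET",
         headers = {"User-Agent": "demo-bot"},
         cookies = {"session_id": "abc123"},
         meta = {"page_type": "listing"},
         priority = 10,
         dont_filter = True,
         callback = self.parse_page,
         errback = self.handle_error,
      )

   def parse_page(self, response):
      self.logger.info("Got %s with status %d", response.url, response.status)

   def handle_error(self, failure):
      self.logger.error(repr(failure))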

Passing Additional Data to Callback Functions

The callback function of a request is invoked when its response has been downloaded; the response object is passed to the callback as its first parameter.

For example −

def parse_page1(self, response):
   return scrapy.Request("http://www.something.com/some_page.html",
      callback = self.parse_page2)

def parse_page2(self, response):
   self.logger.info("%s page visited", response.url)

If you want to pass arguments to the callable function and receive those arguments in the second callback, you can use the Request.meta attribute, as shown in the following example −

def parse_page1(self, response):
   item = DemoItem()
   item['foremost_link'] = response.url
   request = scrapy.Request("http://www.something.com/some_page.html",
      callback = self.parse_page2)
   request.meta['item'] = item
   return request

def parse_page2(self, response):
   item = response.meta['item']
   item['other_link'] = response.url
   return item

Using errbacks to Catch Exceptions in Request Processing

The errback is a callable function that is called when an exception is raised while processing a request.

The following example demonstrates this −

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class DemoSpider(scrapy.Spider):
   name = "demo"
   start_urls = [
      "http://www.httpbin.org/",              # HTTP 200 expected
      "http://www.httpbin.org/status/404",    # Webpage not found
      "http://www.httpbin.org/status/500",    # Internal server error
      "http://www.httpbin.org:12345/",        # timeout expected
      "http://www.httphttpbinbin.org/",       # DNS error expected
   ]

   def start_requests(self):
      for u in self.start_urls:
         yield scrapy.Request(u, callback = self.parse_httpbin,
            errback = self.errback_httpbin,
            dont_filter = True)

   def parse_httpbin(self, response):
      self.logger.info('Received response from {}'.format(response.url))
      # ...

   def errback_httpbin(self, failure):
      # logs failures
      self.logger.error(repr(failure))

      if failure.check(HttpError):
         response = failure.value.response
         self.logger.error("HttpError occurred on %s", response.url)

      elif failure.check(DNSLookupError):
         request = failure.request
         self.logger.error("DNSLookupError occurred on %s", request.url)

      elif failure.check(TimeoutError, TCPTimedOutError):
         request = failure.request
         self.logger.error("TimeoutError occurred on %s", request.url)

Request.meta Special Keys

Request.meta supports a number of special keys that are recognized by Scrapy. Some of these keys are described below; a short example follows the list −

1. dont_redirect − When set to True, the request will not be redirected based on the status of the response.

2. dont_retry − When set to True, failed requests will not be retried and will be ignored by the retry middleware.

3. handle_httpstatus_list − A list of response status codes that are allowed for this request, on a per-request basis.

4. handle_httpstatus_all − Allows any response status code for the request when set to True.

5. dont_merge_cookies − Avoids merging the request cookies with the existing session cookies when set to True.

6. cookiejar − Used to keep multiple cookie sessions per spider.

7. dont_cache − Avoids caching the HTTP request and its response with the active cache policy when set to True.

8. redirect_urls − A list containing the URLs that the request passed through while being redirected.

9. bindaddress − The IP address of the outgoing network interface used to perform the request.

10. dont_obey_robotstxt − When set to True, the request is not filtered out by the robots.txt exclusion standard, even if ROBOTSTXT_OBEY is enabled.

11. download_timeout − The amount of time (in seconds) the downloader will wait for this request before it times out.

12. download_maxsize − The maximum response size (in bytes) the downloader will download for this request.

13. proxy − Sets the HTTP proxy to use for this request.
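
As an illustrative sketch inside a spider (the URL and proxy address are placeholders), several of these special keys can be combined in the meta dictionary of one request −

def start_requests(self):
   # Placeholder URL and proxy; each meta key below is described in the list above
   yield scrapy.Request(
      "http://www.something.com/some_page.html",
      callback = self.parse_page,
      meta = {
         "proxy": "http://127.0.0.1:8080",        # route the request through an HTTP proxy
         "dont_redirect": True,                    # do not follow 3xx responses
         "handle_httpstatus_list": [301, 302],     # let the spider receive these status codes
         "download_timeout": 15,                   # per-request download timeout in seconds
      },
   )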

Request Subclasses

You can implement your own custom functionality by subclassing the Request class. The built-in Request subclasses are as follows −

FormRequest Objects

The FormRequest class extends the base Request to deal with HTML forms. It has the following class −

class scrapy.http.FormRequest(url[, formdata, callback, method = 'GET', headers, body,
   cookies, meta, encoding = 'utf-8', priority = 0, dont_filter = False, errback])

Following is the parameter −

formdata − A dictionary containing HTML form data, which is assigned to the body of the request.

Note − The remaining parameters are the same as for the Request class and are explained in the Request Objects section.

In addition to the request methods, FormRequest objects support the following class method −

classmethod from_response(response[, formname = None, formnumber = 0, formdata = None,
   formxpath = None, formcss = None, clickdata = None, dont_click = False, ...])

The parameters of the above class method are described below; a short sketch combining some of them follows the list −

1. response − The Response object whose HTML form is used to pre-populate the form fields.

2. formname − A string; if specified, the form whose name attribute matches this value will be used.

3. formnumber − An integer index of the form to use when the response contains multiple forms.

4. formdata − A dictionary of fields used to override values in the form data.

5. formxpath − A string; if specified, the form matching the XPath is used.

6. formcss − A string; if specified, the form matching the CSS selector is used.

7. clickdata − A dictionary of attributes used to look up the clicked control.

8. dont_click − When set to True, the form data will be submitted without clicking any element.
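
As a hedged sketch inside a spider (the form XPath, field names, and button name are hypothetical), the form-selection parameters can be combined when a page contains several forms −

def parse(self, response):
   # Hypothetical form id, field name and button name, shown only to
   # illustrate formxpath, formdata and clickdata
   return scrapy.FormRequest.from_response(
      response,
      formxpath = "//form[@id='search']",
      formdata = {"q": "scrapy"},
      clickdata = {"name": "submit_button"},
      callback = self.after_search,
   )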

Examples

Following are some request usage examples −

Using FormRequest to send data via HTTP POST

The following code demonstrates how to return a FormRequest object when you want to duplicate an HTML form POST in your spider −

return [FormRequest(url = "http://www.something.com/post/action",
   formdata = {'firstname': 'John', 'lastname': 'dave'},
   callback = self.after_post)]

Using FormRequest.from_response() to simulate a user login

Normally, websites provide pre-populated form fields through elements such as <input type="hidden">, for example session-related data or authentication tokens.

The FormRequest.from_response() method can be used when you want these fields to be automatically populated while scraping.

The following example demonstrates this.

import scrapy
class DemoSpider(scrapy.Spider):
   name = 'demo'
   start_urls = ['http://www.something.com/users/login.php']
   def parse(self, response):
      return scrapy.FormRequest.from_response(
         response,
         formdata = {'username': 'admin', 'password': 'confidential'},
         callback = self.after_login
      )

   def after_login(self, response):
      if "authentication failed" in response.body:
         self.logger.error("Login failed")
         return
      # You can continue scraping here

Response Objects

A Response object represents an HTTP response, which is fed to the spiders for processing. It has the following class −

class scrapy.http.Response(url[, status = 200, headers, body, flags])

The parameters of Response objects are described below; a short usage sketch follows the list −

1. url − A string that specifies the URL of the response.

2. status − An integer containing the HTTP status of the response.

3. headers − A dictionary containing the response headers.

4. body − A string with the response body.

5. flags − A list containing the flags of the response.
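
A minimal sketch, inside a spider callback, of how these attributes are typically read (the header name logged here is just an example) −

def parse(self, response):
   # Each attribute below corresponds to a constructor parameter listed above
   self.logger.info("URL: %s", response.url)
   self.logger.info("Status: %d", response.status)
   self.logger.info("Content-Type: %s", response.headers.get("Content-Type"))
   self.logger.info("Body size: %d bytes", len(response.body))
   self.logger.info("Flags: %r", response.flags)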

Response Subclasses

You can implement your own custom functionality by subclassing the Response class. The built-in Response subclasses are as follows −

TextResponse Objects

TextResponse objects add encoding capabilities to the base Response class, which is meant to be used only for binary data such as images, sounds, etc. It has the following class −

class scrapy.http.TextResponse(url[, encoding[, status = 200, headers, body, flags]])

Following is the parameter −

encoding − A string with the encoding that is used to encode the response.

Note − The remaining parameters are the same as for the Response class and are explained in the Response Objects section.

TextResponse objects support the following attributes in addition to the standard Response ones −

1. text − The response body as unicode; response.text can be accessed multiple times because the result is cached.

2. encoding − A string containing the encoding of the response.

3. selector − A Selector instance, instantiated on first access, that uses the response as its target.

TextResponse objects support the following methods in addition to the standard Response ones; a short usage sketch follows the list −

1. xpath(query) − A shortcut to TextResponse.selector.xpath(query).

2. css(query) − A shortcut to TextResponse.selector.css(query).

3. body_as_unicode() − Returns the response body as unicode; it is equivalent to response.text but available as a method.
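
The following sketch, inside a spider callback, combines these attributes and shortcut methods; the XPath and CSS expressions are placeholders, and .get() assumes a recent Scrapy version (extract_first() is the older equivalent) −

def parse(self, response):
   # response is a TextResponse (or HtmlResponse) instance here
   self.logger.info("Detected encoding: %s", response.encoding)

   # xpath() and css() are shortcuts to response.selector
   page_title = response.xpath("//title/text()").get()
   first_heading = response.css("h1::text").get()
   self.logger.info("Title: %s, first heading: %s", page_title, first_heading)

   # response.text is the decoded body and can be read multiple times
   if "login" in response.text:
      self.logger.info("Page %s appears to contain a login form", response.url)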

HtmlResponse Objects

An HtmlResponse object supports encoding auto-discovery by looking at the http-equiv attribute of the HTML meta tag. Its parameters are the same as for the Response class and are explained in the Response Objects section. It has the following class −

class scrapy.http.HtmlResponse(url[, status = 200, headers, body, flags])

XmlResponse Objects

An XmlResponse object supports encoding auto-discovery by looking at the XML declaration line. Its parameters are the same as for the Response class and are explained in the Response Objects section. It has the following class −

class scrapy.http.XmlResponse(url[, status = 200, headers, body, flags])
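
As a closing sketch (the feed URL and XPath are placeholders), a spider fetching an XML feed receives an XmlResponse automatically, so the selector shortcuts work on it directly; .getall() assumes a recent Scrapy version −

import scrapy

class XmlDemoSpider(scrapy.Spider):
   name = "xml_demo"
   start_urls = ["http://www.something.com/feed.xml"]   # hypothetical XML feed

   def parse(self, response):
      # Scrapy builds an XmlResponse for XML content types, so xpath() is available
      for title in response.xpath("//item/title/text()").getall():
         self.logger.info("Item title: %s", title)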