Scrapy Tutorial

Description

As the name indicates, link extractors are objects used to extract links from web pages using scrapy.http.Response objects. Scrapy ships with built-in extractors, such as LinkExtractor from the scrapy.linkextractors module. You can also customize your own link extractor according to your needs by implementing a simple interface.

Every link extractor has a public method called extract_links, which takes a Response object and returns a list of scrapy.link.Link objects. You can instantiate a link extractor once and call its extract_links method many times to extract links from different responses. The CrawlSpider class uses link extractors with a set of rules whose main purpose is to extract links.
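To make the extract_links contract concrete, here is a minimal, dependency-free sketch using only the Python standard library. This is a hypothetical stand-in, not Scrapy's implementation: plain HTML strings play the role of Response objects, and plain strings play the role of Link objects. It shows the same usage pattern, one extractor instance fed many different pages:

```python
from html.parser import HTMLParser

# Hypothetical sketch of the extract_links contract (not Scrapy's code):
# one extractor instance is created once, then called repeatedly
# with different pages, returning the links found in each.
class SimpleLinkExtractor(HTMLParser):
    def __init__(self, tags=("a", "area"), attr="href"):
        super().__init__()
        self.tags = tags          # same default tags as Scrapy's extractor
        self.attr = attr
        self._links = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            for name, value in attrs:
                if name == self.attr and value:
                    self._links.append(value)

    def extract_links(self, html):
        self.reset()              # clear parser state between pages
        self._links = []
        self.feed(html)
        return list(self._links)

extractor = SimpleLinkExtractor()
page = '<a href="http://example.com/page1">One</a> <area href="/maps/2">'
print(extractor.extract_links(page))
# ['http://example.com/page1', '/maps/2']
```

The same extractor instance can then be called again with a second page, mirroring how one Scrapy link extractor serves many responses.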

Normally, link extractors are bundled with Scrapy and provided in the scrapy.linkextractors module. By default, the link extractor is LinkExtractor, which is functionally identical to LxmlLinkExtractor:

from scrapy.linkextractors import LinkExtractor

LxmlLinkExtractor

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow = (), deny = (),
   allow_domains = (), deny_domains = (), deny_extensions = None, restrict_xpaths = (),
   restrict_css = (), tags = ('a', 'area'), attrs = ('href', ),
   canonicalize = True, unique = True, process_value = None)

LxmlLinkExtractor is the recommended link extractor, because it offers handy filtering options and is built on lxml's robust HTMLParser.

Sr.No   Parameter & Description

1. allow (a regular expression, or list of them)
   A single expression or group of expressions that the URLs to be extracted must match. If not given, all links match.

2. deny (a regular expression, or list of them)
   A single expression or group of expressions that a URL must match to be excluded. If not given or left empty, no links are eliminated.

3. allow_domains (str or list)
   A single string or list of strings matching the domains from which links are to be extracted.

4. deny_domains (str or list)
   A single string or list of strings matching the domains from which links are not to be extracted.

5. deny_extensions (list)
   A list of file extensions to ignore when extracting links. If not set, it defaults to IGNORED_EXTENSIONS, a predefined list in the scrapy.linkextractors package.

6. restrict_xpaths (str or list)
   An XPath (or list of XPaths) selecting the regions of the response from which links are to be extracted. If given, links are extracted only from the text selected by those XPaths.

7. restrict_css (str or list)
   Behaves like restrict_xpaths, but extracts links from the regions of the response selected by CSS.

8. tags (str or list)
   A single tag or list of tags to consider when extracting links. Defaults to ('a', 'area').

9. attrs (list)
   A single attribute or list of attributes to consider when extracting links. Defaults to ('href',).

10. canonicalize (boolean)
    Whether each extracted URL is brought to a standard form using scrapy.utils.url.canonicalize_url. Defaults to True.

11. unique (boolean)
    Whether duplicate extracted links are filtered out.

12. process_value (callable)
    A function that receives each value from the scanned tags and attributes. It may alter the value and return it, or return None to reject the link. If not given, it defaults to lambda x: x.
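The allow and deny parameters above boil down to regular-expression filters over candidate URLs. A small self-contained sketch of that filtering logic (an illustration of the semantics described in the table, not Scrapy's actual implementation) could look like this:

```python
import re

def filter_urls(urls, allow=(), deny=()):
    # Mimics the allow/deny semantics described above: a URL is kept
    # if it matches at least one allow pattern (or allow is empty)
    # and matches no deny pattern.
    def matches(patterns, url):
        return any(re.search(p, url) for p in patterns)

    kept = []
    for url in urls:
        if allow and not matches(allow, url):
            continue
        if deny and matches(deny, url):
            continue
        kept.append(url)
    return kept

urls = [
    "http://example.com/articles/1",
    "http://example.com/login",
    "http://example.com/articles/2",
]
print(filter_urls(urls, allow=(r"/articles/",), deny=(r"/articles/2",)))
# ['http://example.com/articles/1']
```

With no patterns given, every URL passes through, matching the table's note that an empty allow matches all links and an empty deny eliminates nothing.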

Example

Suppose a page contains the following link, whose href value is a JavaScript call rather than a plain URL:

<a href = "javascript:goToPage('../other/page.html'); return false">Link text</a>

The following function can be passed as process_value to recover the real URL:

import re

def process_value(val):
   # Pull the real page URL out of the javascript:goToPage('...') call.
   m = re.search(r"javascript:goToPage\('(.*?)'", val)
   if m:
      return m.group(1)
   return None  # reject values that are not goToPage calls
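Exercised on a mix of attribute values, such a function keeps goToPage targets and rejects everything else, since returning None drops the link. The snippet below repeats the function so it runs on its own:

```python
import re

def process_value(val):
    # Return the page behind a javascript:goToPage('...') call,
    # or None to reject any other value (None drops the link).
    m = re.search(r"javascript:goToPage\('(.*?)'", val)
    if m:
        return m.group(1)
    return None

hrefs = [
    "javascript:goToPage('../other/page.html'); return false",
    "javascript:doSomethingElse()",
]
print([process_value(h) for h in hrefs])
# ['../other/page.html', None]
```

In an actual spider this function would be supplied as the process_value argument to the link extractor, so each scanned href passes through it before a link is built.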