Scrapy Tutorial

Scrapy - Settings

Description

The behavior of Scrapy components can be modified using Scrapy settings. The settings can also select the Scrapy project that is currently active, in case you have multiple Scrapy projects.

Designating the Settings

When you scrape a website, you must tell Scrapy which settings you are using. To do so, set the SCRAPY_SETTINGS_MODULE environment variable; its value should use Python path syntax (for example, myproject.settings).
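A minimal sketch of designating the settings module, assuming a hypothetical project module named myproject.settings; the variable is normally exported in the shell, but it can also be set from Python before Scrapy reads its configuration:

import os

# Point Scrapy at the settings module of a hypothetical project "myproject".
# Equivalent to running in the shell: export SCRAPY_SETTINGS_MODULE=myproject.settings
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings'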

Populating the Settings

The following list shows some of the mechanisms by which you can populate the settings, in decreasing order of precedence −

1. Command line options − Arguments passed on the command line take the highest precedence, overriding all other options. The -s flag is used to override one or more settings, for example: scrapy crawl myspider -s LOG_FILE=scrapy.log

2. Settings per-spider − Spiders can have their own settings that override the project settings, defined in the custom_settings attribute (a fuller sketch follows this list):

   class DemoSpider(scrapy.Spider):
      name = 'demo'
      custom_settings = {
         'SOME_SETTING': 'some value',
      }

3. Project settings module − Here you can populate your custom settings, for example by adding or modifying settings in the project's settings.py file.

4. Default settings per-command − Each Scrapy tool command defines its own settings in the default_settings attribute, to override the global default settings.

5. Default global settings − These settings are found in the scrapy.settings.default_settings module.
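A brief sketch of how this precedence plays out in practice. The setting name SOME_SETTING and the spider below are illustrative, not part of Scrapy itself:

import scrapy

class DemoSpider(scrapy.Spider):
   name = 'demo'
   start_urls = ['http://example.com']

   # Per-spider settings (precedence 2): they override the project's
   # settings.py (precedence 3), but are themselves overridden by the
   # command line (precedence 1), e.g.:
   #    scrapy crawl demo -s SOME_SETTING='value from the command line'
   custom_settings = {
      'SOME_SETTING': 'value from custom_settings',
   }

   def parse(self, response):
      # Whichever source has the highest precedence wins
      self.logger.info("SOME_SETTING = %s", self.settings.get('SOME_SETTING'))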

Access Settings

Settings are available through the self.settings attribute, which is set in the base Spider class after the spider has been initialized.

The following example demonstrates this.

import scrapy

class DemoSpider(scrapy.Spider):
   name = 'demo'
   start_urls = ['http://example.com']

   def parse(self, response):
      # self.settings is available once the spider has been initialized
      print("Existing settings: %s" % self.settings.attributes.keys())

To use settings before the spider is initialized (for example, in extensions, middlewares, or item pipelines), you must override the from_crawler class method. The settings can then be accessed through the scrapy.crawler.Crawler.settings attribute of the Crawler object that is passed to from_crawler.

The following example demonstrates this.

class MyExtension(object):
   def __init__(self, log_is_enabled=False):
      if log_is_enabled:
         print("Enabled log")

   @classmethod
   def from_crawler(cls, crawler):
      # Read settings from the crawler before the extension is instantiated
      settings = crawler.settings
      return cls(settings.getbool('LOG_ENABLED'))
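The Settings object also provides typed getters such as get(), getbool(), getint(), getfloat(), getlist() and getdict(). A brief sketch of how they might be used inside a spider callback; the spider name below is purely illustrative:

import scrapy

class SettingsDemoSpider(scrapy.Spider):
   name = 'settings_demo'
   start_urls = ['http://example.com']

   def parse(self, response):
      # Typed getters convert the raw setting value and accept a default
      bot_name = self.settings.get('BOT_NAME')
      obey_robots = self.settings.getbool('ROBOTSTXT_OBEY', default=False)
      concurrency = self.settings.getint('CONCURRENT_REQUESTS', default=16)
      self.logger.info("bot=%s robots=%s concurrency=%d",
                       bot_name, obey_robots, concurrency)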

Rationale for Setting Names

Setting names are prefixed with the name of the component they configure. For example, for the robots.txt extension, the setting names are ROBOTSTXT_ENABLED, ROBOTSTXT_OBEY, ROBOTSTXT_CACHEDIR, and so on.

Built-in Settings Reference

The following list shows the built-in settings of Scrapy −

1. AWS_ACCESS_KEY_ID − Used to access Amazon Web Services. Default value: None

2. AWS_SECRET_ACCESS_KEY − Used to access Amazon Web Services. Default value: None

3. BOT_NAME − The name of the bot, which can be used for constructing the User-Agent. Default value: 'scrapybot'

4. CONCURRENT_ITEMS − Maximum number of concurrent items to process in parallel in the item processor. Default value: 100

5. CONCURRENT_REQUESTS − Maximum number of concurrent requests that the Scrapy downloader performs. Default value: 16

6. CONCURRENT_REQUESTS_PER_DOMAIN − Maximum number of concurrent requests performed simultaneously for any single domain. Default value: 8

7. CONCURRENT_REQUESTS_PER_IP − Maximum number of concurrent requests performed simultaneously to any single IP. Default value: 0

8. DEFAULT_ITEM_CLASS − The class used to represent items. Default value: 'scrapy.item.Item'

9. DEFAULT_REQUEST_HEADERS − The default headers used for Scrapy HTTP requests. Default value − { 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en', }

10. DEPTH_LIMIT − The maximum depth a spider is allowed to crawl for any site. Default value: 0

11. DEPTH_PRIORITY − An integer used to alter the priority of a request according to its depth. Default value: 0

12. DEPTH_STATS − Whether to collect depth stats. Default value: True

13. DEPTH_STATS_VERBOSE − When enabled, the number of requests is collected in the stats for each depth. Default value: False

14. DNSCACHE_ENABLED − Enables the in-memory DNS cache. Default value: True

15. DNSCACHE_SIZE − Defines the size of the in-memory DNS cache. Default value: 10000

16. DNS_TIMEOUT − The timeout, in seconds, for DNS to process queries. Default value: 60

17. DOWNLOADER − The downloader used for the crawling process. Default value: 'scrapy.core.downloader.Downloader'

18. DOWNLOADER_MIDDLEWARES − A dictionary holding downloader middlewares and their orders. Default value: {}

19. DOWNLOADER_MIDDLEWARES_BASE − A dictionary holding the downloader middlewares that are enabled by default. Default value − { 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100, }

20. DOWNLOADER_STATS − Enables downloader stats collection. Default value: True

21. DOWNLOAD_DELAY − The time (in seconds) the downloader waits before downloading consecutive pages from the same site. Default value: 0

22. DOWNLOAD_HANDLERS − A dictionary with download handlers. Default value: {}

23. DOWNLOAD_HANDLERS_BASE − A dictionary with the download handlers that are enabled by default. Default value − { 'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler', }

24. DOWNLOAD_TIMEOUT − The total time (in seconds) the downloader waits before it times out. Default value: 180

25. DOWNLOAD_MAXSIZE − The maximum response size the downloader will download. Default value: 1073741824 (1024 MB)

26. DOWNLOAD_WARNSIZE − The response size above which the downloader issues a warning. Default value: 33554432 (32 MB)

27. DUPEFILTER_CLASS − The class used for detecting and filtering duplicate requests. Default value: 'scrapy.dupefilters.RFPDupeFilter'

28. DUPEFILTER_DEBUG − Logs all duplicate requests when set to true. Default value: False

29. EDITOR − The editor used to edit spiders with the edit command. Default value: depends on the environment

30. EXTENSIONS − A dictionary of extensions that are enabled in the project. Default value: {}

31. EXTENSIONS_BASE − A dictionary of the built-in extensions. Default value: { 'scrapy.extensions.corestats.CoreStats': 0, }

32. FEED_TEMPDIR − A directory used to set the custom folder where crawler temporary files can be stored.

33. ITEM_PIPELINES − A dictionary of the item pipelines to use. Default value: {}

34. LOG_ENABLED − Defines whether logging is enabled. Default value: True

35. LOG_ENCODING − The encoding used for logging. Default value: 'utf-8'

36. LOG_FILE − The name of the file used for logging output. Default value: None

37. LOG_FORMAT − The string used to format log messages. Default value: '%(asctime)s [%(name)s] %(levelname)s: %(message)s'

38. LOG_DATEFORMAT − The string used to format the date/time in log messages. Default value: '%Y-%m-%d %H:%M:%S'

39. LOG_LEVEL − The minimum log level. Default value: 'DEBUG'

40. LOG_STDOUT − If set to true, all standard output of the process will appear in the log. Default value: False

41. MEMDEBUG_ENABLED − Defines whether memory debugging is enabled. Default value: False

42. MEMDEBUG_NOTIFY − When memory debugging is enabled, the memory report is sent to the addresses in this list. Default value: []

43. MEMUSAGE_ENABLED − Defines whether memory usage monitoring is enabled, so that a Scrapy process can be shut down when it exceeds a memory limit. Default value: False

44. MEMUSAGE_LIMIT_MB − The maximum amount of memory (in megabytes) allowed. Default value: 0

45. MEMUSAGE_CHECK_INTERVAL_SECONDS − The interval, in seconds, at which the current memory usage is checked. Default value: 60.0

46. MEMUSAGE_NOTIFY_MAIL − A list of e-mail addresses to notify when the memory limit is reached. Default value: False

47. MEMUSAGE_REPORT − Defines whether a memory usage report is to be sent when each spider is closed. Default value: False

48. MEMUSAGE_WARNING_MB − The total amount of memory allowed before a warning is sent. Default value: 0

49. NEWSPIDER_MODULE − The module where new spiders are created using the genspider command. Default value: ''

50. RANDOMIZE_DOWNLOAD_DELAY − Defines whether Scrapy should wait a random amount of time (based on DOWNLOAD_DELAY) while downloading requests from the same site. Default value: True

51. REACTOR_THREADPOOL_MAXSIZE − The maximum size of the reactor thread pool. Default value: 10

52. REDIRECT_MAX_TIMES − Defines how many times a request can be redirected. Default value: 20

53. REDIRECT_PRIORITY_ADJUST − Adjusts the priority of a redirected request. Default value: +2

54. RETRY_PRIORITY_ADJUST − Adjusts the priority of a retried request. Default value: -1

55. ROBOTSTXT_OBEY − Scrapy obeys robots.txt policies when set to true. Default value: False

56. SCHEDULER − The scheduler used for the crawl. Default value: 'scrapy.core.scheduler.Scheduler'

57. SPIDER_CONTRACTS − A dictionary of spider contracts in the project, used to test the spiders. Default value: {}

58. SPIDER_CONTRACTS_BASE − A dictionary holding the Scrapy contracts that are enabled by default. Default value − { 'scrapy.contracts.default.UrlContract': 1, 'scrapy.contracts.default.ReturnsContract': 2, }

59. SPIDER_LOADER_CLASS − A class implementing the SpiderLoader API, used to load spiders. Default value: 'scrapy.spiderloader.SpiderLoader'

60. SPIDER_MIDDLEWARES − A dictionary holding spider middlewares. Default value: {}

61. SPIDER_MIDDLEWARES_BASE − A dictionary holding the spider middlewares that are enabled by default. Default value − { 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50, }

62. SPIDER_MODULES − A list of modules containing the spiders Scrapy will look for. Default value: []

63. STATS_CLASS − A class implementing the Stats Collector API, used to collect stats. Default value: 'scrapy.statscollectors.MemoryStatsCollector'

64. STATS_DUMP − When set to true, dumps the stats to the log. Default value: True

65. STATSMAILER_RCPTS − A list of e-mail addresses to which Scrapy sends the stats once the spiders finish scraping. Default value: []

66. TELNETCONSOLE_ENABLED − Defines whether to enable the telnet console. Default value: True

67. TELNETCONSOLE_PORT − The port range for the telnet console. Default value: [6023, 6073]

68. TEMPLATES_DIR − The directory containing templates used when creating new projects. Default value: templates directory inside the scrapy module

69. URLLENGTH_LIMIT − The maximum URL length allowed for crawled URLs. Default value: 2083

70. USER_AGENT − The user agent used while crawling a site. Default value: "Scrapy/VERSION (+http://scrapy.org)"
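As a brief, non-exhaustive sketch of how a few of these settings might be combined in a project's settings.py; all values below are illustrative assumptions, not recommendations:

# settings.py of a hypothetical Scrapy project

BOT_NAME = 'demo_bot'

# Be polite: respect robots.txt and slow down requests to the same site
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Logging
LOG_LEVEL = 'INFO'
LOG_FILE = 'scrapy.log'

# Headers sent with every request unless overridden per request
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
}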

For other Scrapy settings, refer to the official Scrapy documentation.