A Concise Scrapy Tutorial

Scrapy - Command Line Tools

Description

The Scrapy command line tool is used for controlling Scrapy and is often referred to as the 'Scrapy tool'. It provides commands for various objects, each with its own set of arguments and options.

Configuration Settings

Scrapy looks for configuration settings in the scrapy.cfg file, in the following locations −

  1. /etc/scrapy.cfg or c:\scrapy\scrapy.cfg for system-wide settings

  2. ~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) for global settings

  3. You can find the scrapy.cfg inside the root of the project.

Scrapy can also be configured using the following environment variables −

  1. SCRAPY_SETTINGS_MODULE

  2. SCRAPY_PROJECT

  3. SCRAPY_PYTHON_SHELL

Default Structure of a Scrapy Project

The following structure shows the default file structure of a Scrapy project.

scrapy.cfg                - deploy configuration file
project_name/             - project's Python module
   __init__.py
   items.py               - project items file
   pipelines.py           - project pipelines file
   settings.py            - project settings file
   spiders/               - directory for spiders
      __init__.py
      spider_name.py
      . . .

The scrapy.cfg file resides in the project root directory and includes the project name along with the project settings. For instance −

[settings]
default = [name of the project].settings

[deploy]
#url = http://localhost:6800/
project = [name of the project]

Using Scrapy Tool

The Scrapy tool prints its usage and the available commands as follows −

Scrapy X.Y  - no active project
Usage:
   scrapy <command> [options] [args]
Available commands:
   crawl      Run a spider to crawl data from the given URLs
   fetch      Fetch the response from the given URL

Creating a Project

You can use the following command to create a project in Scrapy −

scrapy startproject project_name

This creates a project directory called project_name. Next, change into the newly created project directory using the following command −

cd project_name

Controlling Projects

You can use the Scrapy tool to control and manage your projects, and also to create a new spider, using the following command −

scrapy genspider mydomain mydomain.com

Commands such as crawl must be used inside a Scrapy project. You will come to know which commands must run inside a Scrapy project in the coming section.

Scrapy contains some built-in commands, which can be used in your project. To see the list of available commands, use the following command −

scrapy -h

When you run this command, Scrapy displays the list of available commands as listed −

  1. fetch − It fetches the given URL using the Scrapy downloader.

  2. runspider − It runs a self-contained spider without creating a project.

  3. settings − It gets the value of a project setting.

  4. shell − It opens an interactive scraping console for the given URL.

  5. startproject − It creates a new Scrapy project.

  6. version − It displays the Scrapy version.

  7. view − It fetches the URL using the Scrapy downloader and shows the contents in a browser.

The following project-related commands are also available −

  1. crawl − It is used to crawl data using a spider.

  2. check − It checks the output of spider callbacks against their contracts.

  3. list − It displays the list of spiders available in the project.

  4. edit − It opens a spider for editing using the configured editor.

  5. parse − It parses the given URL with the spider.

  6. bench − It runs a quick benchmark test (the benchmark reports how many pages per minute Scrapy can crawl).

Custom Project Commands

You can build custom project commands with the COMMANDS_MODULE setting in a Scrapy project. The setting defaults to an empty string. You can add a custom commands module as follows −

COMMANDS_MODULE = 'mycmd.commands'

Scrapy commands can also be added from an external library using the scrapy.commands section in the setup.py file, as shown below −

from setuptools import setup, find_packages

setup(
   name='scrapy-module_demo',
   packages=find_packages(),
   entry_points={
      'scrapy.commands': [
         'cmd_demo = my_module.commands:CmdDemo',
      ],
   },
)

The above setup.py registers the cmd_demo command; once the package is installed, it becomes available as scrapy cmd_demo.