Scrapy - Crawling

Description

To execute your spider, run the following command within the first_scrapy directory −

scrapy crawl first

Where, first is the name of the spider specified while creating the spider.
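
The name comes from the name attribute of the spider class defined in the earlier chapter. For reference, a minimal spider consistent with the output below might look like the following (the class name FirstSpider and the exact parse logic are a sketch inferred from the files this crawl creates, not a verbatim copy of the earlier chapter) −

import scrapy

class FirstSpider(scrapy.Spider):
    # "first" is what you pass to "scrapy crawl"
    name = "first"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # The last non-empty path segment of the URL becomes the file name,
        # e.g. .../Python/Books/ -> Books.html
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)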

Once the spider has finished crawling, you can see output similar to the following −

2016-08-09 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2016-08-09 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Spider opened
2016-08-09 18:13:08-0400 [scrapy] DEBUG: Crawled (200)
<GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-08-09 18:13:09-0400 [scrapy] DEBUG: Crawled (200)
<GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-08-09 18:13:09-0400 [scrapy] INFO: Closing spider (finished)

As you can see in the output, each URL has a log line containing (referer: None), which indicates that these URLs are start URLs and therefore have no referrer. Next, you should see two new files named Books.html and Resources.html created in your first_scrapy directory.
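
If the DEBUG-level output is more verbose than you need, the crawl command accepts Scrapy's global logging options. For example, to show only INFO-level messages and above −

scrapy crawl first --loglevel=INFO

You can also pass --nolog to suppress the log output entirely; the spider still runs and the HTML files are still written.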