Scrapy - Crawling

Description

To execute your spider, run the following command within the first_scrapy directory −

scrapy crawl first

Where, first is the name of the spider specified while creating the spider.
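
The name comes from the name attribute of the spider class defined in the earlier chapter. For reference, a minimal spider consistent with the output below might look like the following (the class name FirstSpider and the exact parse logic are a sketch inferred from the files this crawl creates, not a verbatim copy of the earlier chapter) −

import scrapy

class FirstSpider(scrapy.Spider):
    # "first" is what you pass to "scrapy crawl"
    name = "first"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # The last non-empty path segment of the URL becomes the file name,
        # e.g. .../Python/Books/ -> Books.html
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)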

Once the spider has finished crawling, you can see output similar to the following −

2016-08-09 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2016-08-09 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2016-08-09 18:13:07-0400 [scrapy] INFO: Spider opened
2016-08-09 18:13:08-0400 [scrapy] DEBUG: Crawled (200)
<GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-08-09 18:13:09-0400 [scrapy] DEBUG: Crawled (200)
<GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-08-09 18:13:09-0400 [scrapy] INFO: Closing spider (finished)

As you can see in the output, each URL has a log line containing (referer: None), which indicates that these URLs are start URLs and therefore have no referrer. Next, you should see two new files named Books.html and Resources.html created in your first_scrapy directory.
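
If the DEBUG-level output is more verbose than you need, the crawl command accepts Scrapy's global logging options. For example, to show only INFO-level messages and above −

scrapy crawl first --loglevel=INFO

You can also pass --nolog to suppress the log output entirely; the spider still runs and the HTML files are still written.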