Scrapy Tutorial
Scrapy - Shell
Description
The Scrapy shell can be used to scrape data with error-free code, without using a spider. The main purpose of the Scrapy shell is to test extraction code, that is, XPath or CSS expressions. It also helps you inspect the web pages from which you are scraping the data.
Configuring the Shell
The shell can be configured by installing the IPython (used for interactive computing) console, a powerful interactive shell that provides auto-completion, colorized output, and more.
If you are working on a Unix platform, it is better to install IPython. You can also use bpython if IPython is unavailable.
You can configure the shell by setting the environment variable called SCRAPY_PYTHON_SHELL or by defining the scrapy.cfg file as follows −
[settings]
shell = bpython
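Alternatively, the same choice can be made per session through the environment variable mentioned above; a minimal sketch for a Unix-like shell (assuming IPython is installed):

```shell
# Select the shell implementation for this session only
export SCRAPY_PYTHON_SHELL=ipython
scrapy shell 'https://scrapy.org'
```

The scrapy.cfg setting is permanent for the project, while the environment variable wins for the current session.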
Launching the Shell
The Scrapy shell can be launched using the following command −
scrapy shell <url>
The url specifies the URL of the page from which the data needs to be scraped.
Using the Shell
The shell provides some additional shortcuts and Scrapy objects, as described in the following tables −
Available Shortcuts
The shell provides the following shortcuts in the project −
Sr.No | Shortcut & Description
1 | shelp() − Prints the available objects and shortcuts, along with help for them.
2 | fetch(request_or_url) − Fetches the response for the given request or URL and updates the associated objects accordingly.
3 | view(response) − Opens the response for the given request in the local browser for inspection; to display external links correctly, it appends a base tag to the response body.
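As a rough illustration of what view(response) does, the following stdlib-only sketch (not Scrapy's actual implementation; the function name and sample HTML are invented for this example) saves a page body to a temporary file, injecting a base tag so relative links resolve against the original URL:

```python
import tempfile
import webbrowser

def save_for_viewing(body: str, base_url: str) -> str:
    """Inject a <base> tag after <head> and write the page to a temp file."""
    patched = body.replace("<head>", '<head><base href="%s">' % base_url, 1)
    with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as f:
        f.write(patched)
        return f.name

path = save_for_viewing("<html><head></head><body>hi</body></html>",
                        "http://scrapy.org/")
# webbrowser.open("file://" + path)  # uncomment to open it, like view(response)
```

Scrapy's real helper does essentially this before handing the file to your default browser.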
Available Scrapy Objects
The shell provides the following Scrapy objects in the project −
Sr.No | Object & Description
1 | crawler − The current Crawler object.
2 | spider − The spider handling the current URL. If no existing spider handles the URL, a new default Spider object is created for it.
3 | request − The Request object of the last fetched page.
4 | response − The Response object of the last fetched page.
5 | settings − The current Scrapy settings.
Example of Shell Session
Let us try scraping the scrapy.org site and then begin scraping data from reddit.com, as described below.
Before moving ahead, we will first launch the shell with the following command −
scrapy shell 'http://scrapy.org' --nolog
Scrapy will display the available objects while loading the above URL −
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x1e16b50>
[s] item {}
[s] request <GET http://scrapy.org>
[s] response <200 http://scrapy.org>
[s] settings <scrapy.settings.Settings object at 0x2bfd650>
[s] spider <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
[s] shelp() Provides available objects and shortcuts with help option
[s] fetch(req_or_url) Collects the response from the request or URL; associated objects will get updated
[s] view(response) View the response for the given request
Next, begin working with the objects, as shown below −
>> response.xpath('//title/text()').extract_first()
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
>> fetch("http://reddit.com")
[s] Available Scrapy objects:
[s] crawler
[s] item {}
[s] request
[s] response <200 https://www.reddit.com/>
[s] settings
[s] spider
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>> response.xpath('//title/text()').extract()
[u'reddit: the front page of the internet']
>> request = request.replace(method="POST")
>> fetch(request)
[s] Available Scrapy objects:
[s] crawler
...
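Scrapy's response.xpath() is backed by full XPath support (via parsel/lxml), so it cannot be reproduced exactly with the standard library; still, the idea of testing an expression against a known document can be sketched with xml.etree, which supports a limited XPath subset (the sample markup here is invented):

```python
import xml.etree.ElementTree as ET

html = ("<html><head><title>Scrapy | A Fast and Powerful Scraping "
        "and Web Crawling Framework</title></head><body /></html>")
root = ET.fromstring(html)

# Rough analogue of response.xpath('//title/text()').extract_first()
title = root.find(".//title").text
print(title)
```

The Scrapy shell plays the same role interactively: you fetch a page once, then iterate on the expression until it returns what you expect.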
Invoking the Shell from Spiders to Inspect Responses
You can inspect the responses being processed by your spider, for example to check that a response you expect is actually arriving at a given point.
For instance −
import scrapy

class SpiderDemo(scrapy.Spider):
   name = "spiderdemo"
   start_urls = [
      "http://mysite.com",
      "http://mysite1.org",
      "http://mysite2.net",
   ]

   def parse(self, response):
      # You can inspect one specific response
      if ".net" in response.url:
         from scrapy.shell import inspect_response
         inspect_response(response, self)
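The guard in parse() drops into the shell for only one of the three start URLs; the selection logic on its own is just a substring test, as this small sketch shows:

```python
start_urls = [
    "http://mysite.com",
    "http://mysite1.org",
    "http://mysite2.net",
]

# Only responses whose URL contains ".net" trigger inspect_response()
to_inspect = [url for url in start_urls if ".net" in url]
print(to_inspect)   # ['http://mysite2.net']
```

Any condition on the response (status code, a header, a missing element) can serve as the guard instead.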
As shown in the above code, you can invoke the shell from a spider to inspect responses using the following function −
scrapy.shell.inspect_response
Now run the spider and you will get the following screen −
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) (referer: None)
[s] Available Scrapy objects:
[s] crawler
...
>> response.url
'http://mysite2.net'
You can examine whether the extraction code is working using the following command −
>> response.xpath('//div[@class = "val"]')
It displays the output as −
[]
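An empty list simply means the expression matched nothing in the fetched page. The behaviour is easy to reproduce offline with the stdlib's limited XPath support in xml.etree (the sample markup is invented for this sketch):

```python
import xml.etree.ElementTree as ET

html = '<html><body><div class="other">text</div></body></html>'
root = ET.fromstring(html)

# No <div class="val"> exists, so the query finds nothing,
# analogous to response.xpath('//div[@class = "val"]') returning []
matches = root.findall(".//div[@class='val']")
print(matches)   # []
```

When this happens in the Scrapy shell, adjust the expression and re-run it against the same response rather than re-crawling.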
The above line displays only a blank output. Now you can invoke the shell to inspect the response, as follows −
>> view(response)
It displays the response as −
True