Scrapy Tutorial

Scrapy - Scraped Data

Description

The best way to store scraped data is by using Feed exports, which ensure that the data is serialized correctly in one of several supported formats: JSON, JSON Lines, CSV, or XML. The data can be stored with the following command −

scrapy crawl dmoz -o data.json
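Scrapy infers the serialization format from the output file's extension, so the other supported formats can be selected the same way (a sketch reusing the dmoz spider name from the command above):

```shell
scrapy crawl dmoz -o data.jl    # JSON Lines: one JSON object per line
scrapy crawl dmoz -o data.csv   # CSV
scrapy crawl dmoz -o data.xml   # XML
```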

This command will create a data.json file containing the scraped data in JSON format. This technique works well for small amounts of data. If a large amount of data has to be handled, we can use an Item Pipeline instead. A placeholder file for pipelines, tutorial/pipelines.py, is set up automatically when the project is created.
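As a sketch of what such a pipeline might look like (the class name JsonWriterPipeline and the output path items.jl are illustrative assumptions, not part of the generated project), a minimal pipeline that writes each scraped item as one line of JSON could be:

```python
import json

class JsonWriterPipeline:
    """A minimal item pipeline sketch that writes items in JSON Lines format.

    Assumes items behave like dicts; in a real project this class would live
    in tutorial/pipelines.py.
    """

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open("items.jl", "w")

    def close_spider(self, spider):
        # Called once when the spider finishes: close the output file.
        self.file.close()

    def process_item(self, item, spider):
        # Serialize each item as one line of JSON and pass it on.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

To enable a pipeline like this, its class would be listed under the ITEM_PIPELINES setting in the project's settings.py.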