Scrapy: A Concise Tutorial

Scrapy - Overview

Scrapy is a fast, open-source web crawling framework written in Python. It extracts data from web pages using selectors based on XPath or CSS expressions.

Scrapy was first released on June 26, 2008, under the BSD license. The milestone 1.0 release followed in June 2015.

Why Use Scrapy?

  1. It makes large crawling projects easier to build and scale.

  2. It has a built-in mechanism, called selectors, for extracting data from websites.

  3. It handles requests asynchronously, so one slow response does not block the others.

  4. It can automatically adjust crawling speed using its AutoThrottle extension.

  5. It is approachable for developers: spiders are ordinary Python classes.
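
The asynchronous request handling and auto-throttling listed above are controlled through project settings. The fragment below is a minimal sketch of a settings.py file. The setting names are Scrapy's own, but the values are illustrative assumptions rather than recommendations.

```python
# settings.py (fragment): values are illustrative assumptions, not tuned advice.

# Upper bound on the number of requests Scrapy keeps in flight at once.
CONCURRENT_REQUESTS = 16

# Turn on the AutoThrottle extension (it is off by default).
AUTOTHROTTLE_ENABLED = True
# Initial download delay, in seconds.
AUTOTHROTTLE_START_DELAY = 5.0
# Ceiling for the delay when the server responds slowly, in seconds.
AUTOTHROTTLE_MAX_DELAY = 60.0
# Average number of parallel requests to aim for per remote site.
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Log throttling statistics for every response while experimenting.
AUTOTHROTTLE_DEBUG = True
```

With AutoThrottle enabled, Scrapy adjusts the download delay between the start and maximum values according to how quickly the site responds.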

Features of Scrapy

  1. Scrapy is an open-source, free-to-use web crawling framework.

  2. Scrapy generates feed exports in formats such as JSON, CSV, and XML.

  3. Scrapy has built-in support for selecting and extracting data with either XPath or CSS expressions.

  4. Scrapy is built around crawlers (spiders), which lets it extract data from web pages automatically.
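
The feed exports mentioned above can be configured declaratively. The fragment below is a sketch that assumes a recent Scrapy release that provides the FEEDS setting. The output file names are hypothetical.

```python
# settings.py (fragment): write the same scraped items in three formats.
# File names are hypothetical; "format" selects the serializer.
FEEDS = {
    "items.json": {"format": "json", "encoding": "utf8"},
    "items.csv": {"format": "csv"},
    "items.xml": {"format": "xml"},
}
```

For a one-off run, a single feed can also be requested on the command line, for example `scrapy crawl <spider> -o items.json`.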

Advantages

  1. Scrapy is easily extensible, fast, and powerful.

  2. It is a cross-platform application framework (Windows, Linux, macOS, and BSD).

  3. Scrapy requests are scheduled and processed asynchronously.

  4. Scrapy projects can be deployed with Scrapyd, a companion service (installed separately) that lets you upload projects and control spiders through a JSON web service.

  5. It is possible to scrape a website even if that website does not offer an API for raw data access.
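
To make the Scrapyd JSON web service from the list above more concrete, here is a small sketch that builds a request asking Scrapyd to schedule a spider run. It uses only the Python standard library. The helper name `build_schedule_request`, the host and port, and the project and spider names are assumptions made for illustration. The request is constructed but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(base_url, project, spider):
    """Build (but do not send) a POST request for Scrapyd's schedule.json endpoint."""
    # Scrapyd expects the parameters as a form-encoded request body.
    body = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(f"{base_url.rstrip('/')}/schedule.json", data=body, method="POST")


# Hypothetical values: a local Scrapyd instance and made-up project/spider names.
request = build_schedule_request("http://localhost:6800", "myproject", "myspider")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually schedule the run, pass the request to urllib.request.urlopen();
# Scrapyd answers with a JSON document describing the outcome.
```

Keeping the request construction separate from the network call makes the helper easy to inspect and test without a running Scrapyd instance.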

Disadvantages

  1. Early Scrapy releases supported only Python 2.7. Python 3 support arrived with Scrapy 1.1, and current releases require Python 3.

  2. Installation steps differ between operating systems.