A Concise Scrapy Tutorial

Scrapy - First Spider

Description

A Spider is a class that defines the initial URLs to extract data from, how to follow pagination links, and how to extract and parse the fields defined in items.py. Scrapy provides different types of spiders, each serving a specific purpose.

Create a file called "first_spider.py" under the first_scrapy/spiders directory, where we can tell Scrapy how to find the exact data we're looking for. For this, you must define some attributes −

  1. name − It defines the unique name for the spider.

  2. allowed_domains − It contains the domains that the spider is allowed to crawl.

  3. start_urls − A list of URLs from where the spider starts crawling.

  4. parse() − It is a method that extracts and parses the scraped data.

The following code demonstrates what a spider's code looks like −

import scrapy

class firstSpider(scrapy.Spider):
    # Unique name used to run the spider (scrapy crawl first)
    name = "first"

    # Only links within these domains will be followed
    allowed_domains = ["dmoz.org"]

    # The spider starts crawling from these URLs
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Save each response body to a file named after the
        # second-to-last URL segment, e.g. 'Books.html'
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
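The filename logic in parse() can be checked outside Scrapy. This is a minimal sketch using only the Python standard library; it mirrors the split performed above on one of the start URLs:

```python
# Derive a filename from a URL the same way parse() does:
# take the second-to-last path segment and append '.html'.
# (The last segment is empty because the URL ends with '/'.)
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
filename = url.split("/")[-2] + '.html'
print(filename)  # → Books.html
```

Once the spider is defined, it can be run from the project root with the command scrapy crawl first; the two start URLs are saved as Books.html and Resources.html respectively.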