Python Web Scraping - Data Processing
In the earlier chapters, we learned how to extract data from web pages, i.e. web scraping, using various Python modules. In this chapter, let us look at various techniques to process the data that has been scraped.
Introduction
To process the data that has been scraped, we must store it on our local machine in a particular format, such as a spreadsheet (CSV), JSON, or sometimes in a database like MySQL.
CSV and JSON Data Processing
First, we are going to write the information grabbed from a web page into a CSV file or a spreadsheet. Let us start with a simple example in which we first grab the information using the BeautifulSoup module, as we did earlier, and then write that textual information into a CSV file using the Python csv module.
First, we need to import the necessary Python libraries as follows −
import requests
from bs4 import BeautifulSoup
import csv
In the following line of code, we use requests to make a GET HTTP request to the URL https://authoraditiagarwal.com/.
r = requests.get('https://authoraditiagarwal.com/')
Now, we need to create a Soup object as follows −
soup = BeautifulSoup(r.text, 'lxml')
Now, with the help of the next lines of code, we will write the grabbed data into a CSV file named dataprocessing.csv.
f = csv.writer(open('dataprocessing.csv', 'w'))
f.writerow(['Title'])
f.writerow([soup.title.text])
After running this script, the textual information, i.e. the title of the web page, will be saved in the above mentioned CSV file on your local machine.
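The same approach extends to more than one row. The following is a minimal sketch, not part of the original example, that writes every link found on the same page into a hypothetical file named links.csv, one row per anchor tag −

import requests
from bs4 import BeautifulSoup
import csv

r = requests.get('https://authoraditiagarwal.com/')
soup = BeautifulSoup(r.text, 'lxml')

# 'links.csv' is a hypothetical file name used only for this sketch.
with open('links.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Link text', 'URL'])
    for a in soup.find_all('a', href=True):
        writer.writerow([a.get_text(strip=True), a['href']])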
Similarly, we can save the collected information in a JSON file. The following is an easy to understand Python script for doing the same, in which we grab the same information as in the last Python script, but this time the grabbed information is saved in JSONFile.txt by using the json Python module.
import requests
from bs4 import BeautifulSoup
import csv
import json
r = requests.get('https://authoraditiagarwal.com/')
soup = BeautifulSoup(r.text, 'lxml')
y = soup.title.text
with open('JSONFile.txt', 'wt') as outfile:
    json.dump(y, outfile)
After running this script, the grabbed information, i.e. the title of the web page, will be saved in the above mentioned text file on your local machine.
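To confirm that the data round-trips correctly, a minimal sketch like the following can read the saved title back with json.load −

import json

# Load the title that was written to JSONFile.txt and print it.
with open('JSONFile.txt', 'rt') as infile:
    saved_title = json.load(infile)
print(saved_title)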
Data Processing using AWS S3
Sometimes we may want to save scraped data in our local storage for archival purposes. But what if we need to store and analyze this data at a massive scale? The answer is a cloud storage service named Amazon S3, or AWS S3 (Simple Storage Service). Basically, AWS S3 is an object storage built to store and retrieve any amount of data from anywhere.
We can follow the steps given below for storing data in AWS S3 −
Step 1 − First, we need an AWS account, which will provide us the secret keys to use in our Python script while storing the data. With this account, we can create an S3 bucket in which we will store our data.
Step 2 − Next, we need to install the boto3 Python library for accessing the S3 bucket. It can be installed with the help of the following command −
pip install boto3
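boto3 looks for the secret keys from Step 1 in the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY or in the ~/.aws/credentials file. As an alternative, the following sketch shows how the keys can be passed explicitly when creating the client; the key values and region here are placeholders to be replaced with your own −

import boto3

# Placeholder credentials and region - replace them with your own values.
s3 = boto3.client(
    's3',
    aws_access_key_id='YOUR_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
    region_name='us-east-1'
)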
Step 3 − Next, we can use the following Python script for scraping data from a web page and saving it to the AWS S3 bucket.
First, we need to import the Python libraries; here we are working with requests for scraping and boto3 for saving the data to the S3 bucket.
import requests
import boto3
Now we can scrape the data from our URL.
data = requests.get("Enter the URL").text
Now, for storing this data to the S3 bucket, we need to create an S3 client as follows −
s3 = boto3.client('s3')
bucket_name = "our-content"
The next lines of code will create the S3 bucket and upload the scraped data into it as follows −
s3.create_bucket(Bucket = bucket_name, ACL = 'public-read')
s3.put_object(Bucket = bucket_name, Key = '', Body = data, ACL = "public-read")
Now you can check the bucket named our-content in your AWS account.
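To verify the upload from Python itself, a minimal sketch like the following lists the objects stored in the bucket −

import boto3

s3 = boto3.client('s3')

# Print the key and size of every object currently stored in the bucket.
response = s3.list_objects_v2(Bucket='our-content')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])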
Data Processing using MySQL
Next, let us learn how to process data using MySQL. If you want to learn about MySQL, you can follow the link https://www.tutorialspoint.com/mysql/.
With the help of the following steps, we can scrape and process data into a MySQL table −
Step 1 − First, by using MySQL, we need to create the database and table in which we want to save our scraped data. For example, we create the table with the following query −
CREATE TABLE Scrap_pages (id BIGINT(7) NOT NULL AUTO_INCREMENT,
title VARCHAR(200), content VARCHAR(10000), PRIMARY KEY(id));
Step 2 − Next, we need to deal with Unicode. Note that MySQL does not handle Unicode by default. We need to turn on this feature with the help of the following commands, which change the default character set for the database, for the table and for both of the columns −
ALTER DATABASE scrap CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CONVERT TO CHARACTER SET utf8mb4 COLLATE
utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE title title VARCHAR(200) CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE content content VARCHAR(10000) CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
Step 3 − Now, integrate MySQL with Python. For this, we will need PyMySQL, which can be installed with the help of the following command −
pip install PyMySQL
Step 4 − Now, our database named Scrap, created earlier, is ready to save the data, once it has been scraped from the web, into the table named Scrap_pages. In our example we are going to scrape data from Wikipedia and save it into our database.
First, we need to import the required Python modules.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import pymysql
import re
Now, make a connection, that is, integrate this with Python.
conn = pymysql.connect(host='127.0.0.1', user='root', passwd=None, db='mysql',
    charset='utf8')
cur = conn.cursor()
cur.execute("USE scrap")
random.seed(datetime.datetime.now().timestamp())
def store(title, content):
    cur.execute('INSERT INTO scrap_pages (title, content) VALUES (%s, %s)', (title, content))
    cur.connection.commit()
Now, connect to Wikipedia and get data from it.
def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org' + articleUrl)
    bs = BeautifulSoup(html, 'html.parser')
    title = bs.find('h1').get_text()
    content = bs.find('div', {'id':'mw-content-text'}).find('p').get_text()
    store(title, content)
    return bs.find('div', {'id':'bodyContent'}).findAll('a', href=re.compile('^(/wiki/)((?!:).)*$'))
links = getLinks('/wiki/Kevin_Bacon')
try:
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links)-1)].attrs['href']
        print(newArticle)
        links = getLinks(newArticle)
Lastly, we need to close both the cursor and the connection.
finally:
    cur.close()
    conn.close()
This will save the data gathered from Wikipedia into the table named scrap_pages. If you are familiar with MySQL and web scraping, the above code will not be hard to understand.
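To check the result, a minimal sketch like the following, assuming the same local MySQL credentials used above, queries the most recently stored rows −

import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', passwd=None, db='scrap',
    charset='utf8mb4')
cur = conn.cursor()

# Fetch the five most recently inserted pages and print their id and title.
cur.execute('SELECT id, title FROM scrap_pages ORDER BY id DESC LIMIT 5')
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()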
Data Processing using PostgreSQL
PostgreSQL, developed by a worldwide team of volunteers, is an open source relational database management system (RDBMS). The process of handling the scraped data using PostgreSQL is similar to that of MySQL. There are two changes: first, the commands are different from MySQL, and second, here we will use the psycopg2 Python library to perform the integration with Python.
If you are not familiar with PostgreSQL, you can learn it at https://www.tutorialspoint.com/postgresql/. And with the help of the following command, we can install the psycopg2 Python library −
pip install psycopg2
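For example, the following is a minimal sketch of the PostgreSQL equivalent of the MySQL workflow above, assuming a local PostgreSQL server with a database named scrap; the connection values and the sample row are placeholders −

import psycopg2

# Placeholder connection values - adjust them to your own PostgreSQL setup.
conn = psycopg2.connect(host='127.0.0.1', dbname='scrap', user='postgres', password='')
cur = conn.cursor()

# PostgreSQL uses SERIAL instead of AUTO_INCREMENT for the id column.
cur.execute('CREATE TABLE IF NOT EXISTS scrap_pages (id SERIAL PRIMARY KEY, '
            'title VARCHAR(200), content VARCHAR(10000))')

# psycopg2 uses the same %s parameter placeholders as PyMySQL.
cur.execute('INSERT INTO scrap_pages (title, content) VALUES (%s, %s)',
            ('Sample title', 'Sample content'))

conn.commit()
cur.close()
conn.close()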