Python Web Scraping: A Concise Tutorial

Processing Images and Videos

Web scraping usually involves downloading, storing and processing media content from the web. This chapter explains how to process content once it has been downloaded.

Introduction

The media content obtained during scraping can be images, audio and video files, as well as other non-HTML resources such as data files. But can we trust the downloaded data, and in particular its file extension, before storing it on our computer? To answer that, we need to know what type of data we are about to store locally.

Getting Media Content from Web Page

In this section, we learn how to download media content and identify its media type from the information the web server provides. We can do this with the Python requests module, as in the previous chapter.

First, import the necessary Python module:

import requests

Now, provide the URL of the media content we want to download and store locally.

url = "https://authoraditiagarwal.com/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"

Use the following code to send a GET request and obtain the HTTP response object.

r = requests.get(url)

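With the following lines of code, we can save the received content to a file named ThinkBig.png. The extension is our own choice here: the bytes written are exactly what the server sent, which for this URL is a JPEG image.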

with open("ThinkBig.png",'wb') as f:
   f.write(r.content)

After running the above Python script, we get a file named ThinkBig.png containing the downloaded image.
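Because the extension in a file name is only a label, one way to check what was actually downloaded is to look at the first few bytes of the content. The following is a minimal sketch and not part of the original script: the helper name sniff_image_type is hypothetical, and it only recognises the well-known JPEG, PNG and GIF signatures.

```python
# Hypothetical helper: guess an image format from its leading "magic" bytes.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data):
    """Return 'jpeg', 'png' or 'gif' if data starts with a known signature, else None."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None


# Example with hand-made byte strings (not real image files).
print(sniff_image_type(b"\xff\xd8\xff\xe0" + b"\x00" * 16))   # jpeg
print(sniff_image_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))  # png
```

Applied to the download above, sniff_image_type(r.content) would be expected to report a JPEG, assuming the server returned the image, even though the file was saved with a .png extension.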

Extracting Filename from URL

After downloading content from a website, we usually want to save it under the file name found in the URL. The URL may contain additional components, such as a query string or a fragment, so we need to isolate the actual file name from the URL's path.

The following Python script uses urlparse to extract the file name from the URL:

from urllib.parse import urlparse  # urlparse lives in urllib.parse, not urllib3
import os
url = "https://authoraditiagarwal.com/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"
a = urlparse(url)
a.path

The output is as follows:

'/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg'

Next, take the last component of the path:

os.path.basename(a.path)

The output is as follows:

'MetaSlider_ThinkBig-1080x180.jpg'

Running the above script gives us the file name from the URL.
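The two steps above can be wrapped in a small function that also copes with query strings, fragments and percent-encoded characters. This is a sketch under stated assumptions: the function name filename_from_url, the fallback name download.bin and the example.com URLs are all hypothetical and used only for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download.bin"):
    """Return the last path component of url, ignoring query string and fragment."""
    # urlparse separates the path from "?query" and "#fragment".
    path = urlparse(url).path
    # URL paths always use forward slashes, so posixpath is used on every OS.
    name = posixpath.basename(unquote(path))
    # A URL ending in "/" has no file name; fall back to the default.
    return name or default


print(filename_from_url("https://example.com/media/photo%201.jpg?size=large#top"))
# photo 1.jpg
print(filename_from_url("https://example.com/"))
# download.bin
```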

Information about Type of Content from URL

When we fetch content from a web server with a GET request, we can also inspect the metadata the server sends along with it. The following Python script shows what the web server reports about the type of the content:

First, import the necessary Python module:

import requests

Now, provide the URL of the media content we want to download and store locally.

url = "https://authoraditiagarwal.com/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"

The following line of code creates the HTTP response object.

r = requests.get(url, allow_redirects=True)

Now we can list the names of the header fields the web server provides about the content.

for header in r.headers:
   print(header)

The output is as follows:

Date
Server
Upgrade
Connection
Last-Modified
Accept-Ranges
Content-Length
Keep-Alive
Content-Type

With the following line of code, we can read one particular header, for example Content-Type. Header names are looked up case-insensitively by requests:

print (r.headers.get('content-type'))

The output is as follows:

image/jpeg

Similarly, we can read the ETag header:

print (r.headers.get('ETag'))

The output is shown below; None means the server did not send this header:

None

The Content-Length header gives the size of the response body in bytes:

print (r.headers.get('content-length'))

The output is as follows:

12636
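The Content-Length value can serve as a rough check that a download is complete. The following is a minimal sketch, not part of the original tutorial: the function name body_matches_content_length is hypothetical. Note the assumption that the body was not compressed in transit; when the server uses a Content-Encoding such as gzip, requests decodes the body, so its length can legitimately differ from the header.

```python
def body_matches_content_length(content_length, body):
    """Compare a Content-Length header value (str or None) with the body size.

    Returns True or False when a comparison is possible, and None when the
    header is missing or is not a number.
    """
    if content_length is None:
        return None
    try:
        expected = int(content_length)
    except ValueError:
        return None
    return expected == len(body)


# Example with made-up bodies of a given size.
print(body_matches_content_length("12636", b"\x00" * 12636))  # True
print(body_matches_content_length("12636", b"\x00" * 100))    # False
print(body_matches_content_length(None, b""))                 # None
```

With the response object from this section, the call would be body_matches_content_length(r.headers.get('content-length'), r.content).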

Finally, the Server header identifies the web server software:

print (r.headers.get('Server'))

The output is as follows:

Apache
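Once the Content-Type is known, it can be used to choose a file extension instead of trusting the one in the URL. The following is a sketch only: the function name extension_for, the small lookup table and the .bin fallback are assumptions made for illustration, with the standard mimetypes module handling any type not in the table.

```python
import mimetypes

# Small explicit table for common media types; anything else falls back to mimetypes.
PREFERRED_EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "audio/mpeg": ".mp3",
    "video/mp4": ".mp4",
}


def extension_for(content_type, default=".bin"):
    """Map a Content-Type header value to a file extension."""
    if not content_type:
        return default
    # Drop parameters such as "; charset=..." and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in PREFERRED_EXTENSIONS:
        return PREFERRED_EXTENSIONS[media_type]
    return mimetypes.guess_extension(media_type) or default


print(extension_for("image/jpeg"))                 # .jpg
print(extension_for("Image/PNG; charset=binary"))  # .png
print(extension_for(None))                         # .bin
```

For the response in this section, extension_for(r.headers.get('content-type')) would give .jpg, since the server reported image/jpeg.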

Generating Thumbnail for Images

A thumbnail is a small version of an image. A user may want to save only the thumbnail of a large image, or both the image and its thumbnail. In this section, we create a thumbnail of the image ThinkBig.png downloaded in the earlier section "Getting Media Content from Web Page".

This script requires Pillow, a fork of the Python Imaging Library (PIL) that provides useful functions for manipulating images. It can be installed with the following command:

pip install pillow

The following Python script creates a thumbnail of the image and saves it to the current directory, prefixing the thumbnail's file name with Th_:

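import glob
from PIL import Image
for infile in glob.glob("ThinkBig.png"):
   # Skip files that are already thumbnails
   if infile.startswith("Th_"):
      continue
   img = Image.open(infile)
   # Resize in place, keeping the aspect ratio; LANCZOS replaces the removed ANTIALIAS name
   img.thumbnail((128, 128), Image.LANCZOS)
   img.save("Th_" + infile, "png")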

After running the script, you can find the thumbnail file Th_ThinkBig.png in the current directory.
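The thumbnail() method keeps the aspect ratio and never enlarges an image, so the result is usually not exactly 128 by 128 pixels. The arithmetic can be sketched as follows. This is an illustration of the idea rather than Pillow's own code, so its rounding may differ from Pillow's by a pixel; the function name thumbnail_size is hypothetical, and the 1080 by 180 dimensions are an assumption taken from the image's file name.

```python
def thumbnail_size(width, height, max_width, max_height):
    """Largest size that fits in the box while keeping the aspect ratio.

    The scale factor is capped at 1.0 so that small images are not enlarged.
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(thumbnail_size(1080, 180, 128, 128))  # (128, 21)
print(thumbnail_size(100, 50, 128, 128))    # (100, 50)
```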

Screenshot from Website

A very common task in web scraping is taking a screenshot of a website. To do this, we use selenium and its webdriver. The following Python script takes a screenshot of a website and saves it to the current directory.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Path to the ChromeDriver executable
path = r'C:\Users\gaurav\Desktop\Chromedriver'
# Selenium 4 takes the driver path through a Service object
browser = webdriver.Chrome(service=Service(path))
browser.get('https://tutorialspoint.com/')
browser.save_screenshot('screenshot.png')
browser.quit()

The output is similar to the following:

DevTools listening on ws://127.0.0.1:1456/devtools/browser/488ed704-9f1b-44f0-a571-892dc4c90eb7

After running the script, you can find the file screenshot.png in the current directory.


Thumbnail Generation for Video

Suppose we have downloaded videos from a website and want to generate thumbnails for them, so that a specific video can be picked by clicking its thumbnail. To generate thumbnails for videos, we need a tool called ffmpeg, which can be downloaded from www.ffmpeg.org. After downloading, install it according to the instructions for your operating system.

The following Python script generates a thumbnail of the video and saves it to the local directory:

import subprocess
video_MP4_file = r"C:\Users\gaurav\desktop\solar.mp4"
thumbnail_image_file = 'thumbnail_solar_video.jpg'
# Grab a single frame at the 20-second mark; -y overwrites an existing output file
subprocess.call(['ffmpeg', '-y', '-i', video_MP4_file, '-ss', '00:00:20.000',
   '-vframes', '1', thumbnail_image_file])

After running the above script, we get a thumbnail named thumbnail_solar_video.jpg saved in the local directory.
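When thumbnails are needed for several videos or timestamps, it can help to separate building the ffmpeg command from running it. The following is a sketch under stated assumptions: the function names build_thumbnail_command and make_thumbnail are hypothetical, only the ffmpeg options already used in the script above appear, and ffmpeg is assumed to be installed and on the PATH.

```python
import subprocess


def build_thumbnail_command(video_file, image_file, timestamp="00:00:20.000"):
    """Return the ffmpeg argument list that grabs one frame at the given timestamp."""
    return ["ffmpeg", "-y", "-i", video_file, "-ss", timestamp,
            "-vframes", "1", image_file]


def make_thumbnail(video_file, image_file, timestamp="00:00:20.000"):
    """Run ffmpeg and return True if it exited successfully.

    Raises FileNotFoundError if the ffmpeg executable cannot be found.
    """
    result = subprocess.run(build_thumbnail_command(video_file, image_file, timestamp))
    return result.returncode == 0


# Only the command is printed here, so this example runs without ffmpeg or a video file.
print(build_thumbnail_command("solar.mp4", "thumbnail_solar_video.jpg"))
```

Keeping the command as a list, rather than a single string, avoids shell quoting problems with file names that contain spaces.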

Ripping an MP4 video to an MP3

Suppose you have downloaded a video file from a website but only need its audio. This can be done in Python with a library called moviepy, which can be installed with the following command:

pip install moviepy

After installing moviepy, we can convert an MP4 file to MP3 with the following script.

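# Import path for moviepy 1.x; moviepy 2.x instead uses: from moviepy import VideoFileClip
import moviepy.editor as mp
clip = mp.VideoFileClip(r"C:\Users\gaurav\Desktop\1234.mp4")
# Write only the audio track of the clip to an MP3 file
clip.audio.write_audiofile("movie_audio.mp3")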

The output is as follows:

[MoviePy] Writing audio in movie_audio.mp3
100%|██████████| 674/674 [00:01<00:00, 476.30it/s]
[MoviePy] Done.

The above script saves the MP3 audio file in the local directory.