Python Web Scraping: A Concise Tutorial

Getting Started with Python

In the first chapter, we learned what web scraping is. In this chapter, let us see how to implement web scraping using Python.

Why Python for Web Scraping?

Python is a popular tool for implementing web scraping. The language is also used for other useful projects related to cyber security, penetration testing, and digital forensics. Using only core Python and its standard library, web scraping can be performed without any third-party tool.
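As a minimal sketch of scraping with nothing but the standard library, the example below uses the html.parser module to pull link targets out of an HTML string. The class name LinkCollector, the helper extract_links, and the SAMPLE_HTML page are hypothetical names chosen for illustration; they do not come from this tutorial.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample page, used only for illustration
SAMPLE_HTML = """
<html><body>
  <a href="/docs">Docs</a>
  <p>No link here</p>
  <a href="https://www.python.org/downloads/">Downloads</a>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

In a real scraper the HTML would first be fetched from a server, for example with urllib.request.urlopen, which is also part of the standard library. The sketch parses a local string so that it runs without network access.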

Python is gaining huge popularity, and the reasons that make it a good fit for web scraping projects are as follows −

Syntax Simplicity

Python has a simple, readable syntax compared with many other programming languages. This makes testing easier, and a developer can focus more on programming.

Inbuilt Modules

Another reason for using Python for web scraping is its useful built-in and external libraries. With Python as the base, we can implement many tasks related to web scraping.
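To illustrate what a built-in module can offer, here is a small sketch that uses urllib.parse to turn relative links, which scraped pages often contain, into absolute URLs. The helper name absolutize and the sample link list are assumptions made for this example.

```python
from urllib.parse import urljoin, urlparse


def absolutize(base_url, links):
    """Resolve each link against base_url and return absolute URLs."""
    return [urljoin(base_url, link) for link in links]


# Base page and links are illustrative values only
BASE_URL = "https://www.python.org/downloads/"
SAMPLE_LINKS = ["/about/", "release/", "https://docs.python.org/3/"]

resolved = absolutize(BASE_URL, SAMPLE_LINKS)

if __name__ == "__main__":
    for url in resolved:
        # urlparse splits a URL into scheme, host, path and so on
        print(urlparse(url).netloc, url)
```

A link that is already absolute is returned unchanged, so the same helper can be applied to every link found on a page.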

Open Source Programming Language

Python has huge support from the community because it is an open source programming language.

Wide range of Applications

Python can be used for various programming tasks ranging from small shell scripts to enterprise web applications.

Installation of Python

Python distributions are available for platforms such as Windows, Mac, and Unix/Linux. To install Python, we only need to download the binary applicable to our platform. If no binary is available for our platform, we need a C compiler so that the source code can be compiled manually.

We can install Python on various platforms as follows −

Installing Python on Unix and Linux

Follow the steps given below to install Python on Unix/Linux machines −

Step 1 − Go to the link https://www.python.org/downloads/

Step 2 − Download the compressed source code for Unix/Linux available at the above link.

Step 3 − Extract the files onto your computer.

Step 4 − Use the following commands to complete the installation −

./configure
make
make install

You can find the installed Python at the standard location /usr/local/bin and its libraries at /usr/local/lib/pythonXX, where XX is the version of Python.
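One way to confirm where an installation ended up is to ask the interpreter itself. The sketch below reports the running version, the path of the executable, and the standard library directory; the function name describe_installation is hypothetical, and the values printed depend entirely on the machine it runs on.

```python
import sys
import sysconfig


def describe_installation():
    """Return basic facts about the Python interpreter running this code."""
    return {
        # Major.minor version, for example "3.12"
        "version": "{}.{}".format(*sys.version_info[:2]),
        # Full path of the interpreter binary
        "executable": sys.executable,
        # Directory that holds the standard library modules
        "stdlib": sysconfig.get_paths()["stdlib"],
    }


if __name__ == "__main__":
    for key, value in describe_installation().items():
        print(f"{key}: {value}")
```

On a source install like the one described above, the executable would be expected under /usr/local/bin, but package managers and installers may place it elsewhere.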

Installing Python on Windows

Follow the steps given below to install Python on Windows machines −

Step 1 − Go to the link https://www.python.org/downloads/

Step 2 − Download the Windows installer python-XYZ.msi file, where XYZ is the version we need to install.

Step 3 − Save the installer file to your local machine.

Step 4 − Finally, run the downloaded MSI file to bring up the Python install wizard.

Installing Python on Macintosh

We can use Homebrew to install Python 3 on Mac OS X. Homebrew is easy to install and is a great package manager.

Homebrew can be installed by using the following command −

$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

To update the package manager, we can use the following command −

$ brew update

With the following command, we can install Python 3 on our Mac machine −

$ brew install python3

Setting Up the PATH

You can use the following instructions to set up the path in various environments −

Setting Up the Path on Unix/Linux

Use the following commands to set the path in various command shells −

For csh shell

setenv PATH "$PATH:/usr/local/bin/python"

For bash shell (Linux)

export PATH="$PATH:/usr/local/bin/python"

For sh or ksh shell

PATH="$PATH:/usr/local/bin/python"

Setting Up the Path on Windows

To set the path on Windows, we can type path %path%;C:\Python at the command prompt and then press Enter.
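After changing the path, it can help to check what the shell will actually find. The following sketch lists the directories on PATH and looks up an interpreter with shutil.which; the function names path_entries and find_interpreter, and the candidate command names python3 and python, are assumptions for illustration.

```python
import os
import shutil


def path_entries():
    """Return the non-empty directories listed in the PATH variable."""
    # os.pathsep is ":" on Unix/Linux and ";" on Windows
    return [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]


def find_interpreter(names=("python3", "python")):
    """Return the location of the first command in names found on PATH."""
    for name in names:
        location = shutil.which(name)
        if location:
            return location
    return None


if __name__ == "__main__":
    for entry in path_entries():
        print(entry)
    print("Interpreter found at:", find_interpreter())
```

If find_interpreter returns None, none of the candidate names is reachable through the current PATH, which usually means the path setting has not taken effect in this shell.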

Running Python

We can start Python in any of the following three ways −

Interactive Interpreter

Python can be started from any operating system that provides a command-line interpreter or shell, such as UNIX or DOS.

We can start coding in the interactive interpreter as follows −

Step 1 − Enter python at the command line.

Step 2 − Then, we can start coding right away in the interactive interpreter.

$ python # Unix/Linux
or
% python # Unix/Linux
or
C:\> python # Windows/DOS

Script from the Command-line

We can execute a Python script at the command line by invoking the interpreter on the script file, as follows −

$ python script.py # Unix/Linux
or
% python script.py # Unix/Linux
or
C:\> python script.py # Windows/DOS
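For something concrete to run with the commands above, here is a minimal sketch of what script.py might contain. The greeting text and the function names build_greeting and main are hypothetical; any valid Python file can be run the same way.

```python
# script.py: an illustrative script, not part of the original tutorial
import sys


def build_greeting(args):
    """Build a greeting from the command-line arguments, if any."""
    if args:
        return "Hello, " + " ".join(args) + "!"
    return "Hello from a Python script!"


def main(argv=None):
    # sys.argv[0] is the script name, so the real arguments start at index 1
    args = sys.argv[1:] if argv is None else argv
    print(build_greeting(args))


if __name__ == "__main__":
    main()
```

Words typed after the file name are passed to the script through sys.argv, so running python script.py web scraper would greet "web scraper".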

Integrated Development Environment

We can also run Python from a GUI environment if the system has a GUI application that supports Python. Some IDEs that support Python on various platforms are given below −

IDE for UNIX − IDLE is the Python IDE for UNIX.

IDE for Windows − Windows has the PythonWin IDE, which also provides a GUI.

IDE for Macintosh − Macintosh has the IDLE IDE, which is downloadable as either MacBinary or BinHex'd files from the main website.