Beautiful Soup: A Concise Tutorial

Beautiful Soup - Installation

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser and provides Pythonic idioms for iterating over, searching, and modifying the parse tree.

The Beautiful Soup package is not part of Python’s standard library, so it must be installed separately. Before installing the latest version, let us create a virtual environment, as Python recommends.

A virtual environment gives a specific project its own isolated working copy of Python, so packages installed for that project do not affect the system-wide setup.

We shall use the venv module from Python’s standard library to create the virtual environment. pip is included by default in Python 3.4 and later.
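The venv module can also be called from Python code, which helps show what creating an environment actually produces. The following is a minimal sketch, not the recommended workflow: it builds the environment in a temporary directory, reuses the name `myenv` from this tutorial, and passes `with_pip=False` purely to keep the example quick (an assumption of this sketch; a normal environment includes pip).

```python
import tempfile
import venv
from pathlib import Path


def create_env(parent: Path, name: str = "myenv") -> Path:
    """Create a virtual environment under `parent` and return its path."""
    env_dir = parent / name
    # with_pip=False skips bootstrapping pip so the sketch runs quickly.
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_env(Path(tmp))
    # Every environment contains a pyvenv.cfg file describing its base Python.
    has_cfg = (env_dir / "pyvenv.cfg").is_file()
    print(env_dir.name, "has pyvenv.cfg:", has_cfg)
```

The temporary directory is deleted when the `with` block ends, so nothing is left behind on disk.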

Use the following command to create a virtual environment on Windows:

C:\Users\user>python -m venv myenv

On Ubuntu Linux, update the APT package index and, if required, install venv before creating the virtual environment:

mvl@GNVBGL3:~ $ sudo apt update && sudo apt upgrade -y
mvl@GNVBGL3:~ $ sudo apt install python3-venv

Then use the following command to create a virtual environment:

mvl@GNVBGL3:~ $ python3 -m venv myenv

You need to activate the virtual environment. On Windows, use the commands:

C:\Users\user>cd myenv
C:\Users\user\myenv>Scripts\activate
(myenv) C:\Users\user\myenv>

On Ubuntu Linux, use the following commands to activate the virtual environment:

mvl@GNVBGL3:~$ cd myenv
mvl@GNVBGL3:~/myenv$ source bin/activate
(myenv) mvl@GNVBGL3:~/myenv$

The name of the virtual environment appears in parentheses at the start of the prompt. Now that the environment is active, we can install Beautiful Soup in it.

(myenv) mvl@GNVBGL3:~/myenv$ pip3 install beautifulsoup4
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
143.0/143.0 KB 325.2 kB/s eta 0:00:00
Collecting soupsieve>1.2
  Downloading soupsieve-2.4.1-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.2 soupsieve-2.4.1
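To confirm that the package went into the virtual environment rather than the system Python, it can help to ask the interpreter where it is running from. The sketch below relies only on the standard library; the helper name `in_virtualenv` is made up for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

If this reports that no virtual environment is active, the activation step was probably skipped in the current shell.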

Note that 4.12.2 was the latest release of beautifulsoup4 at the time of writing. That release declares support for Python 3.6 and later, so any currently maintained Python 3 version will work.

If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball, unpack it, and install it with setup.py from inside the unpacked source directory.

(myenv) mvl@GNVBGL3:~/myenv$ python setup.py install

To check whether Beautiful Soup is properly installed, enter the following commands at the Python interactive prompt:

>>> import bs4
>>> bs4.__version__
'4.12.2'

If the installation was not successful, the import statement raises ModuleNotFoundError.
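The same check can be done without triggering an exception at import time. The sketch below assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names are the ones passed to pip, not the import names.
for name in ("beautifulsoup4", "soupsieve"):
    found = installed_version(name)
    print(name, found if found is not None else "not installed")
```

Note that the lookup uses the distribution name `beautifulsoup4`, whereas the module imported in code is `bs4`.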

You will also need to install the requests library, an HTTP library for Python that is used to download the pages Beautiful Soup will parse.

pip3 install requests
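As a rough illustration of how the two libraries fit together, requests downloads the page and Beautiful Soup parses the returned text. The function names `title_from_html` and `fetch_title`, the 10-second timeout, and the sample markup are assumptions made for this sketch, not part of either library.

```python
import requests
from bs4 import BeautifulSoup


def title_from_html(html: str) -> str:
    """Return the text of the <title> element, or "" if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return ""
    return soup.title.get_text(strip=True)


def fetch_title(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its title (requires network access)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return title_from_html(response.text)


# Parsing is demonstrated on inline markup so no network is needed here.
sample = "<html><head><title> Example page </title></head><body></body></html>"
print(title_from_html(sample))  # prints: Example page
```

Calling `fetch_title` with a real URL performs the same parsing on a downloaded page. Keeping the parsing step in its own function makes it easy to try out on saved HTML.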

Installing a Parser

By default, Beautiful Soup uses the HTML parser included in Python’s standard library. It also supports several third-party Python parsers, such as lxml and html5lib.

To install the lxml or html5lib parser, use the corresponding command:

pip3 install lxml
pip3 install html5lib
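Because both parsers are optional, a script may want to know which of them are present before asking Beautiful Soup to use one. A minimal sketch using only the standard library follows; the helper name `available_parsers` is hypothetical.

```python
from importlib.util import find_spec

# Import names of the optional third-party parsers mentioned above.
OPTIONAL_PARSERS = ("lxml", "html5lib")


def available_parsers() -> dict:
    """Map each optional parser name to True if it can be imported."""
    # find_spec returns None when a top-level module cannot be located.
    return {name: find_spec(name) is not None for name in OPTIONAL_PARSERS}


for name, present in available_parsers().items():
    print(name, "installed" if present else "missing")
```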

Each parser has its own advantages and disadvantages, as shown below:

Parser: Python’s html.parser

Usage − BeautifulSoup(markup, "html.parser")

Advantages

  1. Batteries included

  2. Decent speed

  3. Lenient (as of Python 3.2)

Disadvantages

  1. Not as fast as lxml, less lenient than html5lib

Parser: lxml’s HTML parser

Usage − BeautifulSoup(markup, "lxml")

Advantages

  1. Very fast

  2. Lenient

Disadvantages

  1. External C dependency

Parser: lxml’s XML parser

Usage − BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")

Advantages

  1. Very fast

  2. The only currently supported XML parser

Disadvantages

  1. External C dependency

Parser: html5lib

Usage − BeautifulSoup(markup, "html5lib")

Advantages

  1. Extremely lenient

  2. Parses pages the same way a web browser does

  3. Creates valid HTML5

Disadvantages

  1. Very slow

  2. External Python dependency
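One way to act on these trade-offs is to prefer a faster or more lenient parser when it is installed and fall back to the built-in one otherwise. The sketch below is one possible approach, not a recommendation from the Beautiful Soup project; the helper name `make_soup` and the preference order are assumptions. It relies on Beautiful Soup raising `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical preference order: fastest first, built-in parser last.
PREFERRED = ("lxml", "html5lib", "html.parser")


def make_soup(markup: str, preferred=PREFERRED):
    """Parse markup with the first available parser; return (soup, name)."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("none of the requested parsers is installed")


soup, parser_used = make_soup("<p>Hello, <b>world</b></p>")
print("parser:", parser_used)
print(soup.p.get_text())  # prints: Hello, world
```

Different parsers can build different trees from the same invalid markup, so code that depends on exact tree structure is usually better off naming one parser explicitly.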