Beautiful Soup: A Concise Tutorial

Beautiful Soup - Inspect Data Source

The first step in any web scraping project with BeautifulSoup and Python is to explore the website you want to scrape. Visit the site and understand its structure before you start extracting the information that is relevant to you.

Let us visit the home page of TutorialsPoint's Python tutorial. Open https://www.tutorialspoint.com/python3/index.htm in your browser.

Developer tools can help you understand the structure of a website. All modern browsers ship with developer tools built in.

In Chrome, open the top-right menu button (⋮) and select More Tools → Developer Tools.

[Figure: developer tools]

With the developer tools, you can explore the site's document object model (DOM) to better understand your source. Select the Elements tab; it shows the page as a tree of clickable HTML elements.

The tutorial page shows the table of contents (TOC) in the left sidebar. Right-click any chapter and choose the Inspect option.

[Figure: tutorial page]

In the Elements tab, locate the tag that corresponds to the TOC list, as shown in the figure below −

[Figure: TOC list]

Right-click the HTML element, copy it (in Chrome: Copy → Copy element), and paste it into any editor.

[Figure: html element]

You now have the HTML markup of the <ul>..</ul> element −

<ul class="toc chapters">
   <li class="heading">Python 3 Basic Tutorial</li>
   <li class="current-chapter"><a href="/python3/index.htm">Python 3 - Home</a></li>
   <li><a href="/python3/python3_whatisnew.htm">What is New in Python 3</a></li>
   <li><a href="/python3/python_overview.htm">Python 3 - Overview</a></li>
   <li><a href="/python3/python_environment.htm">Python 3 - Environment Setup</a></li>
   <li><a href="/python3/python_basic_syntax.htm">Python 3 - Basic Syntax</a></li>
   <li><a href="/python3/python_variable_types.htm">Python 3 - Variable Types</a></li>
   <li><a href="/python3/python_basic_operators.htm">Python 3 - Basic Operators</a></li>
   <li><a href="/python3/python_decision_making.htm">Python 3 - Decision Making</a></li>
   <li><a href="/python3/python_loops.htm">Python 3 - Loops</a></li>
   <li><a href="/python3/python_numbers.htm">Python 3 - Numbers</a></li>
   <li><a href="/python3/python_strings.htm">Python 3 - Strings</a></li>
   <li><a href="/python3/python_lists.htm">Python 3 - Lists</a></li>
   <li><a href="/python3/python_tuples.htm">Python 3 - Tuples</a></li>
   <li><a href="/python3/python_dictionary.htm">Python 3 - Dictionary</a></li>
   <li><a href="/python3/python_date_time.htm">Python 3 - Date & Time</a></li>
   <li><a href="/python3/python_functions.htm">Python 3 - Functions</a></li>
   <li><a href="/python3/python_modules.htm">Python 3 - Modules</a></li>
   <li><a href="/python3/python_files_io.htm">Python 3 - Files I/O</a></li>
   <li><a href="/python3/python_exceptions.htm">Python 3 - Exceptions</a></li>
</ul>

We can now load this markup into a BeautifulSoup object to parse the document tree.
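
As a minimal sketch of that step, the snippet below parses an abbreviated copy of the list and collects each chapter title with its absolute URL. It assumes the bs4 package is installed and uses Python's built-in "html.parser" backend. The names html, BASE_URL, toc and chapters are illustrative choices, not something the page prescribes, and only the first few list items are reproduced to keep the example short.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Abbreviated copy of the <ul> element copied from the browser
# (only the first few items are kept for brevity).
html = """
<ul class="toc chapters">
   <li class="heading">Python 3 Basic Tutorial</li>
   <li class="current-chapter"><a href="/python3/index.htm">Python 3 - Home</a></li>
   <li><a href="/python3/python3_whatisnew.htm">What is New in Python 3</a></li>
   <li><a href="/python3/python_overview.htm">Python 3 - Overview</a></li>
</ul>
"""

# Assumed site root, used only to turn the relative hrefs into absolute URLs.
BASE_URL = "https://www.tutorialspoint.com"

# Parse the markup with the parser bundled with the standard library.
soup = BeautifulSoup(html, "html.parser")

# "toc" and "chapters" are two separate classes on the same tag,
# so the CSS selector names both.
toc = soup.select_one("ul.toc.chapters")

# The heading <li> has no <a> child, so it is skipped automatically.
chapters = [
    (a.get_text(strip=True), urljoin(BASE_URL, a["href"]))
    for a in toc.find_all("a")
]

for title, url in chapters:
    print(title, "->", url)
```

Selecting by class rather than by position keeps the lookup tied to the same markup you located in the Elements tab, although the class names could change if the site is redesigned.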