Requests Tutorial
Requests - Web Scraping using Requests
We have already seen how to get data from a given URL using the Python Requests library. We will now try to scrape data from the Tutorialspoint site, available at https://www.tutorialspoint.com/tutorialslibrary.htm, using the following −
- Requests library
- Beautiful Soup library from Python
We have already installed the Requests library; let us now install the Beautiful Soup package. In case you want to explore more of its functionality, the official Beautiful Soup documentation is available at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
Installing Beautifulsoup
We shall see how to install Beautiful Soup below −
E:\prequests>pip install beautifulsoup4
Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/3b/c8/a55eb6ea11cd7e5ac4bacdf92bac4693b90d3ba79268be16527555e186f0/beautifulsoup4-4.8.1-py3-none-any.whl (101kB)
     |████████████████████████████████| 102kB 22kB/s
Collecting soupsieve>=1.2 (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/81/94/03c0f04471fc245d08d0a99f7946ac228ca98da4fa75796c507f61e688c2/soupsieve-1.9.5-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.8.1 soupsieve-1.9.5
We now have the Python Requests library and Beautiful Soup installed.
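Before writing the scraper, a quick sanity check (this snippet is our own addition, not part of the original tutorial) confirms that both packages import correctly:

```python
# Sanity check: import both freshly installed packages and parse a tiny
# HTML fragment, to confirm the installation worked.
import requests
from bs4 import BeautifulSoup

print(requests.__version__)

snippet = BeautifulSoup("<p>ok</p>", "html.parser")
print(snippet.p.text)   # → ok
```

If either import fails, re-run the corresponding `pip install` command shown above.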
Let us now write the code that will scrape the data from the given URL.
Web scraping
import requests
from bs4 import BeautifulSoup

# Fetch the page and report the HTTP status code
res = requests.get('https://www.tutorialspoint.com/tutorialslibrary.htm')
print("The status code is ", res.status_code)
print("\n")

# Parse the HTML, then print the page title and every h4 tag on the page
soup_data = BeautifulSoup(res.text, 'html.parser')
print(soup_data.title)
print("\n")
print(soup_data.find_all('h4'))
Using the Requests library, we can fetch the content from the given URL, and the Beautiful Soup library helps to parse it and fetch the details the way we want.
You can use the Beautiful Soup library to fetch data by HTML tag, class, id, CSS selector and in many more ways. Following is the output we get, wherein we have printed the title of the page and also all the h4 tags on the page.
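Besides `find_all` on a tag name, the same lookups can be done by class, id and CSS selector. Here is a minimal standalone sketch (the HTML fragment below is invented for illustration; it is not taken from the Tutorialspoint page):

```python
from bs4 import BeautifulSoup

# A made-up fragment, just to demonstrate the different lookup styles.
html = """
<div id="library">
  <h4 class="category">Academic</h4>
  <h4 class="category">Databases</h4>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

by_tag = soup.find_all('h4')                       # by HTML tag
by_class = soup.find_all('h4', class_='category')  # by class
by_id = soup.find('div', id='library')             # by id
by_css = soup.select('div#library h4.category')    # by CSS selector

print([h.get_text() for h in by_tag])   # → ['Academic', 'Databases']
print(len(by_class), len(by_css))       # → 2 2
```

`select` accepts any CSS selector supported by the soupsieve package, which pip installed alongside beautifulsoup4 above.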
Output
E:\prequests>python makeRequest.py
The status code is 200
<title>Free Online Tutorials and Courses</title>
[<h4>Academic</h4>, <h4>Computer Science</h4>, <h4>Digital Marketing</h4>,
<h4>Monuments</h4>, <h4>Machine Learning</h4>, <h4>Mathematics</h4>,
<h4>Mobile Development</h4>, <h4>SAP</h4>, <h4>Software Quality</h4>,
<h4>Big Data & Analytics</h4>, <h4>Databases</h4>, <h4>Engineering Tutorials</h4>,
<h4>Mainframe Development</h4>, <h4>Microsoft Technologies</h4>,
<h4>Java Technologies</h4>, <h4>XML Technologies</h4>, <h4>Python Technologies</h4>,
<h4>Sports</h4>, <h4>Computer Programming</h4>, <h4>DevOps</h4>,
<h4>Latest Technologies</h4>, <h4>Telecom</h4>, <h4>Exams Syllabus</h4>,
<h4>UPSC IAS Exams</h4>, <h4>Web Development</h4>, <h4>Scripts</h4>,
<h4>Management</h4>, <h4>Soft Skills</h4>, <h4>Selected Reading</h4>, <h4>Misc</h4>]