Python Web Scraping - Dynamic Websites
In this chapter, let us learn how to perform web scraping on dynamic websites and understand the concepts involved in detail.
Introduction
Web scraping is a complex task, and the complexity multiplies if the website is dynamic. According to the United Nations Global Audit of Web Accessibility, more than 70% of websites are dynamic in nature and rely on JavaScript for their functionality.
Dynamic Website Example
Let us look at an example of a dynamic website and understand why it is difficult to scrape. Here we are going to take the example of searching on a website named http://example.webscraping.com/places/default/search. But how can we tell that this website is dynamic in nature? It can be judged from the output of the following Python script, which tries to scrape data from the above-mentioned web page −
import re
import urllib.request

# Download the search page and look for country data in the raw HTML.
response = urllib.request.urlopen('http://example.webscraping.com/places/default/search')
html = response.read()
text = html.decode()
# The pattern below (illustrative markup for this demo site) matches nothing,
# because the results are filled in by JavaScript after the page loads.
print(re.findall('<td class="w2p_fw">(.*?)</td>', text))
Approaches for Scraping Data from Dynamic Websites
We have seen that the scraper cannot scrape the information from a dynamic website because the data is loaded dynamically with JavaScript. In such cases, we can use the following two techniques for scraping data from dynamic, JavaScript-dependent websites −
- Reverse Engineering JavaScript
- Rendering JavaScript
Reverse Engineering JavaScript
The process called reverse engineering is useful here, as it lets us understand how data is loaded dynamically by web pages.
For doing this, we need to click the inspect element tab for the specified URL. Next, we will click the NETWORK tab to find all the requests made for that web page, including search.json with a path of /ajax. Instead of accessing the AJAX data from the browser or via the NETWORK tab, we can also do it with the help of the following Python script −
import requests

# Call the AJAX endpoint directly and parse the JSON response.
response = requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a')
print(response.json())
Example
The above script allows us to access the JSON response by using the response's json() method. Similarly, we can download the raw string response and load it with Python's json.loads method, as shown in the short sketch below. The full script that follows will then scrape all of the countries by searching with each letter of the alphabet and iterating over the resulting pages of the JSON responses.
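A minimal sketch of the json.loads variant, assuming the same demo endpoint and query parameters used above:
import json
import requests

# Download the raw string response and parse it with json.loads, instead of
# relying on the response.json() helper used earlier.
raw = requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a').text
data = json.loads(raw)
print(data['records'])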
import requests
import string

PAGE_SIZE = 15
url = 'http://example.webscraping.com/ajax/' + 'search.json?page={}&page_size={}&search_term={}'
countries = set()

for letter in string.ascii_lowercase:
    print('Searching with %s' % letter)
    page = 0
    while True:
        response = requests.get(url.format(page, PAGE_SIZE, letter))
        data = response.json()
        print('adding %d records from the page %d' % (len(data.get('records')), page))
        for record in data.get('records'):
            countries.add(record['country'])
        page += 1
        if page >= data['num_pages']:
            break

with open('countries.txt', 'w') as countries_file:
    countries_file.write('\n'.join(sorted(countries)))
After running the above script, the scraped records will be saved in a file named countries.txt.
Rendering JavaScript
In the previous section, we reverse engineered the web page to see how its API worked and how we could use it to retrieve the results in a single request. However, we can face the following difficulties while doing reverse engineering −
- Sometimes websites can be very difficult to work with. For example, if the website is built with an advanced browser tool such as Google Web Toolkit (GWT), the resulting JS code is machine-generated and difficult to understand and reverse engineer.
- Some higher-level frameworks like React.js can make reverse engineering difficult by abstracting already complex JavaScript logic.
The solution to the above difficulties is to use a browser rendering engine that parses HTML, applies the CSS formatting and executes JavaScript to display a web page.
Example
In this example, we are going to use the familiar Python module Selenium to render JavaScript. The following Python code will render a web page with the help of Selenium −
First, we need to import webdriver from selenium as follows −
from selenium import webdriver
Now, provide the path of the web driver which we have downloaded as per our requirement −
path = r'C:\Users\gaurav\Desktop\Chromedriver'
driver = webdriver.Chrome(executable_path = path)
Now, provide the URL which we want to open in that web browser, now controlled by our Python script.
driver.get('http://example.webscraping.com/search')
Now, we can use the ID of the search textbox to set the search term −
driver.find_element_by_id('search_term').send_keys('.')
Next, we can use JavaScript to set the select box content as follows −
js = "document.getElementById('page_size').options[1].text = '100';"
driver.execute_script(js)
The following line of code clicks the search button on the web page −
driver.find_element_by_id('search').click()
The next line of code shows that it will wait up to 45 seconds for the AJAX request to complete −
driver.implicitly_wait(45)
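As an alternative to the implicit wait, we could also wait explicitly until the results appear. A minimal sketch using Selenium's WebDriverWait, assuming the same #results container that we select links from below:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block until at least one country link has been rendered into the results area.
WebDriverWait(driver, 45).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#results a')))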
Now, for selecting country links, we can use the CSS selector as follows −
links = driver.find_elements_by_css_selector('#results a')
Now the text of each link can be extracted to create the list of countries −
countries = [link.text for link in links]
print(countries)
driver.close()
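Putting the above steps together, a minimal end-to-end sketch of the whole Selenium session, assuming the same chromedriver path and the Selenium 3-style API used in this example, would look like this −
from selenium import webdriver

# Start the browser and open the search page.
path = r'C:\Users\gaurav\Desktop\Chromedriver'
driver = webdriver.Chrome(executable_path = path)
driver.get('http://example.webscraping.com/search')

# Enter the search term, adjust the page size select box via JavaScript and run the search.
driver.find_element_by_id('search_term').send_keys('.')
driver.execute_script("document.getElementById('page_size').options[1].text = '100';")
driver.find_element_by_id('search').click()

# Allow time for the AJAX request to populate the results, then collect the links.
driver.implicitly_wait(45)
links = driver.find_elements_by_css_selector('#results a')
countries = [link.text for link in links]
print(countries)
driver.close()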