Python Web Scraping Tutorial

Python Web Scraping - Form-based Websites

In the previous chapter, we saw how to scrape dynamic websites. In this chapter, let us understand how to scrape websites that work on user-based inputs, that is, form-based websites.

Introduction

These days the WWW (World Wide Web) is moving towards social media as well as user-generated content. So the question arises: how can we access the kind of information that lies beyond the login screen? For this we need to deal with forms and logins.

In previous chapters, we worked with the HTTP GET method to request information, but in this chapter we will work with the HTTP POST method, which pushes information to a web server for storage and analysis.
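As a minimal illustrative sketch of the difference (httpbin.org is a public echo service used here purely for demonstration), a GET request encodes its parameters in the URL, while a POST request carries them in the request body −

import requests

payload = {'q': 'web scraping'}

# GET: the parameters travel in the query string of the URL
r_get = requests.get('https://httpbin.org/get', params=payload)
print(r_get.url)   # https://httpbin.org/get?q=web+scraping

# POST: the same data travels in the request body instead
r_post = requests.post('https://httpbin.org/post', data=payload)
print(r_post.json()['form'])   # {'q': 'web scraping'}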

Interacting with Login forms

While working on the Internet, you must have interacted with login forms many times. They may be very simple, including only a few HTML fields, a submit button and an action page, or they may be complicated and have additional fields such as email and a message, along with a captcha for security reasons.

In this section, we are going to deal with a simple submit form with the help of the Python requests library.

First, we need to import the requests library as follows −

import requests

Now, we need to provide the information for the fields of the login form.

parameters = {'Name': 'Enter your name', 'Email-id': 'Your Emailid', 'Message': 'Type your message here'}

In the next line of code, we need to provide the URL on which the action of the form would happen.

r = requests.post("enter the URL", data=parameters)
print(r.text)

After running the script, it will return the content of the page where the action has happened.
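Before trusting r.text, it is also worth verifying that the POST actually succeeded. A short sketch, using the same placeholder URL −

import requests

parameters = {'Name': 'Enter your name', 'Email-id': 'Your Emailid', 'Message': 'Type your message here'}
r = requests.post("enter the URL", data=parameters)

# raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
# which makes silent form-submission failures easier to catch
r.raise_for_status()
print(r.status_code)
print(r.text)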

Suppose you want to submit an image with the form; this is very easy with requests.post(). You can understand it with the help of the following Python script −

import requests
file = {'Uploadfile': open(r'C:\Users\Desktop\123.png', 'rb')}
r = requests.post("enter the URL", files=file)
print(r.text)
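Note that open() in the script above leaves the file handle open. A slightly more careful sketch (the field name 'Uploadfile' and the local path are placeholders; adjust them to your form) uses a with block so the file is closed once the request has been sent −

import requests

# 'Uploadfile' must match the name attribute of the form's
# <input type="file"> element
with open(r'C:\Users\Desktop\123.png', 'rb') as f:
    r = requests.post("enter the URL", files={'Uploadfile': f})
print(r.status_code)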

Loading Cookies from the Web Server

A cookie, sometimes called a web cookie or Internet cookie, is a small piece of data sent from a website; our computer stores it in a file located inside our web browser.

In the context of dealing with login forms, cookies can be of two types. One, which we dealt with in the previous section, allows us to submit information to a website; the second lets us remain in a permanent "logged-in" state throughout our visit to the website. For the second kind of form, websites use cookies to keep track of who is logged in and who is not.

What do cookies do?

These days most websites use cookies for tracking. We can understand the working of cookies with the help of the following steps −

Step 1 − First, the site will authenticate our login credentials and store them in our browser's cookie. This cookie generally contains a server-generated token along with time-out and tracking information.

Step 2 − Next, the website will use the cookie as proof of authentication. This proof is presented whenever we visit the website.

Cookies can be very problematic for web scrapers, because if a scraper does not keep track of them, the submitted form is sent back and on the next page it appears that the scraper never logged in. It is very easy to track cookies with the help of the Python requests library, as shown below −

import requests
parameters = {'Name': 'Enter your name', 'Email-id': 'Your Emailid', 'Message': 'Type your message here'}
r = requests.post("enter the URL", data=parameters)

In the above line of code, the URL would be the page which acts as the processor for the login form.

print('The cookie is:')
print(r.cookies.get_dict())
print(r.text)

After running the above script, we will retrieve the cookies from the result of the last request.
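To actually stay logged in, those cookies must be sent back with every later request. A minimal sketch (the second URL stands for a hypothetical page that requires login) −

import requests

parameters = {'Name': 'Enter your name', 'Email-id': 'Your Emailid', 'Message': 'Type your message here'}
r = requests.post("enter the URL", data=parameters)

# Pass the cookies set by the login response along with the next request,
# so that the server recognizes us as the same logged-in user
r2 = requests.get("enter the URL of a protected page", cookies=r.cookies)
print(r2.text)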

There is another issue with cookies: sometimes websites modify cookies frequently and without warning. This kind of situation can be dealt with using requests.Session(), as follows −

import requests
session = requests.Session()
parameters = {'Name': 'Enter your name', 'Email-id': 'Your Emailid', 'Message': 'Type your message here'}
r = session.post("enter the URL", data=parameters)

In the above line of code, the URL would again be the page which acts as the processor for the login form.

print('The cookie is:')
print(r.cookies.get_dict())
print(r.text)

Run both scripts and observe the output; you can easily see the difference between the version with a session and the one without.
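The advantage of a Session object is that it stores and re-sends cookies automatically across requests, even when the server updates them. A minimal sketch (both URLs are placeholders) −

import requests

session = requests.Session()
parameters = {'Name': 'Enter your name', 'Email-id': 'Your Emailid', 'Message': 'Type your message here'}

# The session records any cookies set by the login response...
session.post("enter the URL", data=parameters)

# ...and sends them back automatically on every subsequent request,
# so there is no need to pass cookies= by hand
r = session.get("enter the URL of a protected page")
print(r.text)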

Automating forms with Python

In this section, we are going to deal with a Python module named Mechanize that will reduce our work and automate the process of filling in forms.

Mechanize module

The Mechanize module provides us with a high-level interface to interact with forms. Before starting to use it, we need to install it with the following command −

pip install mechanize

Note that older releases of Mechanize worked only in Python 2.x; recent releases also support Python 3.

Example

In this example, we are going to automate the process of filling in a login form having two fields, namely email and password −

import mechanize
brwsr = mechanize.Browser()
brwsr.open("Enter the URL of login")
brwsr.select_form(nr=0)
brwsr['email'] = 'Enter email'
brwsr['password'] = 'Enter password'
response = brwsr.submit()

The above code is very easy to understand. First, we imported the mechanize module. Then a Mechanize browser object was created. Then we navigated to the login URL and selected the form. After that, names and values were passed directly to the browser object.
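As a follow-up sketch, the response returned by submit() behaves like a file object, so we can inspect the page the server sent back after processing the login (the URL and field names are placeholders) −

import mechanize

brwsr = mechanize.Browser()
brwsr.open("Enter the URL of login")
brwsr.select_form(nr=0)   # select the first form on the page
brwsr['email'] = 'Enter email'
brwsr['password'] = 'Enter password'
response = brwsr.submit()

# read() returns the HTML of the landing page, and geturl() tells us
# which URL we ended up on after the login was processed
print(response.read())
print(response.geturl())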