Python Web Scraping: A Concise Tutorial

Python Web Scraping - Processing CAPTCHA

In this chapter, we look at how to handle a CAPTCHA while web scraping. A CAPTCHA is a test a website uses to check whether a user is a human or a robot.

What is CAPTCHA?

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. As the name suggests, it is a test to determine whether the user is human.

A CAPTCHA is typically a distorted image that is hard for a computer program to read but that a human can usually decipher. Many websites use CAPTCHAs to prevent bots from interacting with them.

Loading CAPTCHA with Python

Suppose we want to register on a website whose registration form includes a CAPTCHA. Before loading the CAPTCHA image, we need to know which fields the form requires. The following Python script lists the fields of the registration form on the website http://example.webscraping.com.

import pprint
import http.cookiejar as cookielib
import urllib.request as urllib2

import lxml.html  # form_parsing() also needs the cssselect package


def form_parsing(html):
    """Return a dict mapping each named <input> in a form to its value."""
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data


REGISTER_URL = 'http://example.webscraping.com/places/default/user/register?_next=/places/default/index'

# Keep cookies between requests so the session persists.
ckj = cookielib.CookieJar()
browser = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckj))
html = browser.open(REGISTER_URL).read()
form = form_parsing(html)
pprint.pprint(form)

The script above first defines a function that parses the form with the lxml module, then prints the form fields as follows −

{
   '_formkey': '5e306d73-5774-4146-a94e-3541f22c95ab',
   '_formname': 'register',
   '_next': '/places/default/index',
   'email': '',
   'first_name': '',
   'last_name': '',
   'password': '',
   'password_two': '',
   'recaptcha_response_field': None
}
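To see why recaptcha_response_field comes back as None while the other empty fields come back as '', the same parsing idea can be tried offline. The following is a minimal sketch that swaps lxml for the standard library's html.parser so that it runs without extra packages. The HTML snippet and the names FormInputCollector and parse_form_stdlib are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser

# Hypothetical, trimmed-down registration form used only for illustration.
SAMPLE_HTML = """
<form action="#" method="post">
  <input name="first_name" type="text" value="" />
  <input name="_formname" type="hidden" value="register" />
  <input name="recaptcha_response_field" type="text" />
  <input type="submit" value="Register" />
</form>
"""


class FormInputCollector(HTMLParser):
    """Collect name -> value for every named <input> inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.data = {}

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            attr_map = dict(attrs)
            name = attr_map.get("name")
            if name:
                # A missing value attribute yields None, an empty one yields ''.
                self.data[name] = attr_map.get("value")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def parse_form_stdlib(html):
    """Return the named input fields found in the given HTML string."""
    collector = FormInputCollector()
    collector.feed(html)
    return collector.data


fields = parse_form_stdlib(SAMPLE_HTML)
print(fields)
```

The field with value="" is reported as an empty string, while the CAPTCHA field, which has no value attribute at all, is reported as None. The unnamed submit button is skipped.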

In the printed form fields, everything except recaptcha_response_field is self-explanatory. That field is expected to hold the text shown in the CAPTCHA image, so the next step is to download the image. The Pillow library can be used for this, as follows.

Pillow Python Package

Pillow is a fork of the Python Imaging Library (PIL) that provides useful functions for manipulating images. It can be installed with the following command −

pip install pillow

In the next example we use it to load the CAPTCHA −

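import base64
from io import BytesIO

import lxml.html
from PIL import Image


def load_captcha(html):
    """Return the CAPTCHA embedded in the page as a PIL Image."""
    tree = lxml.html.fromstring(html)
    # The src attribute holds a data URI: "data:image/...;base64,<payload>".
    img_data = tree.cssselect('div#recaptcha img')[0].get('src')
    # Keep only the base64 payload after the comma.
    img_data = img_data.partition(',')[-1]
    # str.decode('base64') is Python 2 only; use the base64 module in Python 3.
    binary_img_data = base64.b64decode(img_data)
    file_like = BytesIO(binary_img_data)
    img = Image.open(file_like)
    return img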

The script above uses the Pillow package to define a function that loads the CAPTCHA image. It is meant to be used together with the form_parsing() function from the previous script, which collects the registration form's fields. The function returns the CAPTCHA as a Pillow Image object, from which the text can later be extracted as a string.
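The decoding step inside the loading function can be tried on its own, without a web page or Pillow. The following is a minimal sketch using only the standard library. The sample bytes (just the 8-byte PNG signature, not a complete image) and the helper name decode_data_uri are assumptions made for illustration.

```python
import base64
from io import BytesIO

# Stand-in bytes for illustration only: the 8-byte PNG signature, not a full image.
raw_bytes = b"\x89PNG\r\n\x1a\n"

# Build a data URI shaped like the src attribute of an inline image.
src = "data:image/png;base64," + base64.b64encode(raw_bytes).decode("ascii")


def decode_data_uri(src):
    """Return the raw bytes carried by a base64 data URI."""
    payload = src.partition(",")[-1]  # drop the header before the comma
    return base64.b64decode(payload)


decoded = decode_data_uri(src)
file_like = BytesIO(decoded)  # a file-like object of the kind Image.open() accepts
print(decoded == raw_bytes)
```

In Python 3 a str has no decode() method, so base64.b64decode() is the usual way to turn the payload back into bytes. Wrapping the bytes in BytesIO lets an image library read them as if they were a file on disk.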

OCR: Extracting Text from Image using Python

Once the CAPTCHA is loaded as an image, its text can be extracted with Optical Character Recognition (OCR), the process of extracting text from images. For this purpose we will use the open source Tesseract OCR engine through its Python wrapper, pytesseract. The Tesseract engine itself must be installed separately; the wrapper can be installed with the following command −

pip install pytesseract
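The pip package is only a wrapper: by default it calls an executable named tesseract, which has to be installed on the system and be reachable on PATH. A quick way to check this before running any OCR is sketched below. The helper name tesseract_available is hypothetical, and the check assumes the default executable name is in use.

```python
import shutil


def tesseract_available(binary_name="tesseract"):
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(binary_name) is not None


if tesseract_available():
    print("tesseract found at", shutil.which("tesseract"))
else:
    print("tesseract not found; install the engine separately from pytesseract")
```

If the check fails, installing the engine through the operating system's package manager or installer is the usual fix. Reinstalling the Python package alone will not help.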

Example

Here we extend the earlier Python script, which loaded the CAPTCHA with the Pillow package, as follows −

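import pytesseract

img = load_captcha(html)
img.save('captcha_original.png')
# Convert to grayscale ('L' mode: one 0-255 value per pixel).
gray = img.convert('L')
gray.save('captcha_gray.png')
# Threshold: keep only completely black pixels (value 0), turn the rest white.
bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
bw.save('captcha_thresholded.png')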

The Pillow script converts the CAPTCHA to pure black and white, which gives a clean image that can be passed to Tesseract as follows −

pytesseract.image_to_string(bw)

The call to image_to_string() returns the text that Tesseract recognizes in the registration form's CAPTCHA.
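To see what the threshold in the black and white conversion does, the same rule can be applied to a plain list of numbers instead of an image. This is a minimal sketch: the function name threshold, its cutoff parameter, and the sample pixel values are all made up for illustration. Only the rule "0 if x < 1 else 255" comes from the script in this chapter.

```python
def threshold(pixel, cutoff=1):
    """Map a 0-255 grayscale value to black (0) or white (255)."""
    return 0 if pixel < cutoff else 255


# Made-up grayscale row: pure black text pixels (0) among lighter noise.
row = [0, 12, 0, 200, 255, 3, 0]

strict = [threshold(p) for p in row]               # cutoff=1: only pure black stays black
relaxed = [threshold(p, cutoff=128) for p in row]  # a mid-gray cutoff keeps dark noise too

print(strict)
print(relaxed)
```

With a cutoff of 1, only pixels that are exactly 0 remain black, which suits a CAPTCHA whose characters are drawn in pure black over lighter noise. For images where the text is merely dark rather than pure black, a higher cutoff would likely be needed.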