Python Web Scraping: A Concise Tutorial

Python Web Scraping - Testing with Scrapers

This chapter explains how to test websites using web scrapers in Python.

Introduction

In large web projects, automated testing of the website's backend is performed regularly, but frontend testing is often skipped. The main reason is that a website is built from a web of different markup and programming languages. We can write unit tests for one language, but testing becomes challenging when the interaction happens in another language. That is why we need a suite of tests to make sure that our code performs as expected.

Testing using Python

When we talk about testing here, we mean unit testing. Before diving into testing with Python, we must understand unit testing. Following are some of the characteristics of unit testing −

  1. Each unit test exercises at least one aspect of a component's functionality.

  2. Each unit test is independent and can run on its own.

  3. A unit test does not interfere with the success or failure of any other test.

  4. Unit tests can run in any order and must contain at least one assertion.
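
The characteristics above can be sketched with a tiny test case. The function slugify, the class SlugifyTest and the helper run_suite are hypothetical names made up only for illustration; each test checks one aspect, builds its own input and contains one assertion.

```python
import unittest


def slugify(text):
    """Hypothetical function under test: lower-case the text and join words with hyphens."""
    return "-".join(text.lower().split())


class SlugifyTest(unittest.TestCase):
    def test_lowercases_text(self):
        # One aspect per test: case handling only.
        self.assertEqual(slugify("Python"), "python")

    def test_joins_words_with_hyphens(self):
        # Independent of the test above: it builds its own input.
        self.assertEqual(slugify("web scraping"), "web-scraping")


def run_suite():
    """Run SlugifyTest programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    result = run_suite()
    print(result.testsRun, result.wasSuccessful())
```

Because neither test relies on state left behind by the other, the pair should pass in either order.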

Unittest − Python Module

The Python module named unittest, used for unit testing, ships with every standard Python installation. We only need to import it; the unittest.TestCase class does the rest, namely −

  1. The unittest.TestCase class provides the setUp and tearDown methods, which run before and after each unit test.

  2. It also provides assert methods that make a test pass or fail.

  3. It runs every method whose name begins with test (by convention test_) as a unit test.
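
The three points above can be observed in a small sketch. The list calls, the class LifecycleTest and the function run_lifecycle are hypothetical names introduced for illustration; the sketch records the order in which unittest invokes each method.

```python
import unittest

calls = []  # records the order in which unittest invokes our methods


class LifecycleTest(unittest.TestCase):
    def setUp(self):
        calls.append("setUp")

    def tearDown(self):
        calls.append("tearDown")

    def test_first(self):
        calls.append("test_first")
        self.assertTrue(True)

    def test_second(self):
        calls.append("test_second")
        self.assertIn("a", "abc")

    def helper(self):
        # Not collected as a test: the name does not start with "test".
        calls.append("helper")


def run_lifecycle():
    """Run LifecycleTest and return a copy of the recorded call order."""
    calls.clear()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LifecycleTest)
    unittest.TextTestRunner(verbosity=0).run(suite)
    return list(calls)


if __name__ == "__main__":
    print(run_lifecycle())
```

The recorded order shows setUp and tearDown wrapping each test method, while helper is never called.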

Example

In this example we combine web scraping with unittest. We test the Wikipedia page for the search string 'Python' with two tests: the first checks whether the page title matches the search string 'Python', and the second makes sure that the page has a content div.

First, we import the required Python modules. We use BeautifulSoup for web scraping and, of course, unittest for testing.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import unittest

Now we need to define a class that extends unittest.TestCase. The class attribute bs is shared between all tests; the unittest-specified method setUpClass, which runs once before the tests in the class, fills it in. We then define two test methods, one for the page title and one for the page content.

class Test(unittest.TestCase):
   bs = None  # parsed page, shared by all tests in this class

   @classmethod
   def setUpClass(cls):
      # Runs once before any test: download and parse the page a single time.
      url = 'https://en.wikipedia.org/wiki/Python'
      cls.bs = BeautifulSoup(urlopen(url), 'html.parser')

   def test_titleText(self):
      # The page heading (h1) should match the search string.
      pageTitle = Test.bs.find('h1').get_text()
      self.assertEqual('Python', pageTitle)

   def test_contentExists(self):
      # The page should contain the main content div.
      content = Test.bs.find('div', {'id': 'mw-content-text'})
      self.assertIsNotNone(content)

if __name__ == '__main__':
   unittest.main()

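After running the above script, we get output similar to the following. The lines after OK appear only when the script runs inside IPython or Jupyter, where the SystemExit raised by unittest.main() is reported as an exception −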

----------------------------------------------------------------------
Ran 2 tests in 2.773s

OK
An exception has occurred, use %tb to see the full traceback.

SystemExit: False

D:\ProgramData\lib\site-packages\IPython\core\interactiveshell.py:2870:
UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
 warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
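
The role of setUpClass in the example above can be seen in isolation with a sketch that needs no network access. The class SharedResourceTest, its loads counter, the dictionary standing in for the parsed page, and the function run_shared are all hypothetical and used only for illustration.

```python
import unittest


class SharedResourceTest(unittest.TestCase):
    page = None   # shared by every test in the class
    loads = 0     # how many times the stand-in "page" was loaded

    @classmethod
    def setUpClass(cls):
        # Stand-in for the expensive urlopen + BeautifulSoup step.
        cls.loads += 1
        cls.page = {"title": "Python", "has_content": True}

    def test_title(self):
        self.assertEqual(self.page["title"], "Python")

    def test_content(self):
        self.assertTrue(self.page["has_content"])


def run_shared():
    """Run SharedResourceTest and return (result, number of loads)."""
    SharedResourceTest.loads = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SharedResourceTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result, SharedResourceTest.loads


if __name__ == "__main__":
    result, loads = run_shared()
    print(result.testsRun, loads)
```

Two tests run but the page is loaded only once, which is why the Wikipedia example fetches the page in setUpClass rather than in setUp.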

Testing with Selenium

Let us discuss how to use Selenium with Python for testing, which is also called Selenium testing. Python unittest and Selenium do not have much in common. Selenium sends standard Python commands to different browsers, despite the differences in how those browsers are designed. Recall that we already installed and worked with Selenium in previous chapters. Here we will create test scripts in Selenium and use them for automation.

Example

The following Python script is a test script that automates the Facebook login page. You can modify the example to automate other forms and logins of your choice; the concept stays the same.

First, to connect to the web browser, we import webdriver from the selenium module −

from selenium import webdriver

Next, we import Keys from the selenium module −

from selenium.webdriver.common.keys import Keys

Next, we provide the username and password for logging in to our Facebook account −

user = "gauravleekha@gmail.com"
pwd = ""

Next, provide the path to the Chrome web driver, start the browser, and open the Facebook page −

path = r'C:\Users\gaurav\Desktop\Chromedriver'
driver = webdriver.Chrome(executable_path=path)
driver.get("http://www.facebook.com")

Now we verify that the expected page has loaded by using the assert keyword −

assert "Facebook" in driver.title
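
A bare assert like the one above fails without saying what the title actually was. As a minimal sketch, assuming a hypothetical helper named check_title that is not part of Selenium, a message can be attached to the assertion. Note that Python skips assert statements when it is started with the -O option.

```python
def check_title(title, expected="Facebook"):
    """Hypothetical helper: fail with a readable message if the title is unexpected."""
    assert expected in title, f"expected {expected!r} in page title, got {title!r}"
    return True


if __name__ == "__main__":
    # With a real browser session this would be check_title(driver.title).
    print(check_title("Facebook - log in or sign up"))
```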

The following lines send the value to the email field. Here we locate the field by its id, but we could also locate it by name with driver.find_element_by_name("email"). These find_element_by_* helpers belong to Selenium 3; Selenium 4 replaced them with calls such as driver.find_element(By.ID, "email"), with By imported from selenium.webdriver.common.by.

element = driver.find_element_by_id("email")
element.send_keys(user)

The following lines send the value to the password field. Again we locate the field by its id, but we could also locate it by name with driver.find_element_by_name("pass").

element = driver.find_element_by_id("pass")
element.send_keys(pwd)

The next line presses Enter to submit the login form once the email and password fields are filled in −

element.send_keys(Keys.RETURN)

Finally, we close the browser −

driver.close()

After running the above script, the Chrome web browser opens, and you can see the email and password being entered and the login form being submitted.


Comparison: unittest or Selenium

Comparing unittest and Selenium is difficult: if you want to work with large test suites, the syntactical rigidity of unittest is required. On the other hand, if you are going to test website flexibility, Selenium would be the first choice. But we can also combine them: we can import selenium into a Python unittest and get the best of both. Selenium gets information about a website, and unittest evaluates whether that information meets the criteria for passing the test.

For example, we can rewrite the above Python script for automating the Facebook login by combining the two as follows −

import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class InputFormsCheck(unittest.TestCase):
   def setUp(self):
      # Start a fresh browser before each test.
      self.driver = webdriver.Chrome(r'C:\Users\gaurav\Desktop\chromedriver')

   def test_singleInputField(self):
      user = "gauravleekha@gmail.com"
      pwd = ""
      pageUrl = "http://www.facebook.com"
      driver = self.driver
      driver.maximize_window()
      driver.get(pageUrl)
      # Use a unittest assertion so a failure reports the actual title.
      self.assertIn("Facebook", driver.title)
      elem = driver.find_element_by_id("email")
      elem.send_keys(user)
      elem = driver.find_element_by_id("pass")
      elem.send_keys(pwd)
      elem.send_keys(Keys.RETURN)

   def tearDown(self):
      # Close the browser after each test.
      self.driver.close()

if __name__ == "__main__":
   unittest.main()
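
The division of labor described in this section, where the driver gathers information and unittest judges it, can be tried without a browser. The sketch below is only an illustration: FakeDriver is a hypothetical stand-in that mimics the get, title and close members of a Selenium driver, and TitleCheck and run_title_check are likewise made-up names.

```python
import unittest


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver; it only mimics get, title and close."""

    def __init__(self):
        self.title = ""
        self.closed = False

    def get(self, url):
        # A real driver would load the page; here we fake the resulting title.
        self.title = "Facebook - log in or sign up" if "facebook" in url else "Unknown"

    def close(self):
        self.closed = True


class TitleCheck(unittest.TestCase):
    def setUp(self):
        # For a real run, a Selenium driver such as webdriver.Chrome() would go here.
        self.driver = FakeDriver()

    def tearDown(self):
        self.driver.close()

    def test_title_mentions_facebook(self):
        self.driver.get("http://www.facebook.com")
        self.assertIn("Facebook", self.driver.title)


def run_title_check():
    """Run TitleCheck and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TitleCheck)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    print(run_title_check().wasSuccessful())
```

Keeping the driver behind setUp and tearDown means the test method itself does not change when the stand-in is replaced by a real browser session.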