A Concise Python Tutorial

Python - URL Processing

On the Internet, resources are identified by URLs (Uniform Resource Locators). Python's standard library includes the urllib package, whose modules help you parse URLs, fetch web content, and handle errors.

This tutorial introduces the basics of urllib to help you get started with web scraping, fetching data, and managing URLs in Python.

The urllib package contains the following modules for processing URLs −

The urllib.parse module parses a URL into its parts.
The urllib.request module contains functions for opening and reading URLs.
The urllib.error module defines the exceptions raised by urllib.request.
The urllib.robotparser module parses robots.txt files.

The urllib.parse Module

This module serves as a standard interface for breaking a URL string into its components. It contains the following functions −

urlparse(urlstring)

Parses a URL into six components and returns a 6-item named tuple. Each item is a string and is available through one of the following attributes: scheme, netloc, path, params, query, and fragment.

Example

from urllib.parse import urlparse

url = "https://example.com/employees/name/?salary>=25000"
parsed_url = urlparse(url)  # returns a ParseResult named tuple
print(type(parsed_url))
print("Scheme:", parsed_url.scheme)
print("netloc:", parsed_url.netloc)
print("path:", parsed_url.path)
print("params:", parsed_url.params)
print("Query string:", parsed_url.query)
print("Fragment:", parsed_url.fragment)

It will produce the following output −

<class 'urllib.parse.ParseResult'>
Scheme: https
netloc: example.com
path: /employees/name/
params:
Query string: salary>=25000
Fragment:
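
As a further sketch, the tuple returned by urlparse() also exposes convenience attributes derived from netloc, such as hostname, port, and username. The URL below is made up for illustration so that all six components are populated.

```python
from urllib.parse import urlparse

# Hypothetical URL chosen so that every component is non-empty
parts = urlparse("https://user@example.com:8080/docs/page;type=a?lang=en#intro")

print(parts.scheme)    # https
print(parts.netloc)    # user@example.com:8080
print(parts.path)      # /docs/page
print(parts.params)    # type=a
print(parts.query)     # lang=en
print(parts.fragment)  # intro

# Attributes derived from netloc
print(parts.hostname)  # example.com
print(parts.port)      # 8080 (an int)
print(parts.username)  # user
```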

parse_qs(qs)

This function parses a query string given as a string argument and returns the data as a dictionary. The dictionary keys are the unique query variable names, and the values are lists of values for each name.

To extract the query parameters of a parsed URL into a dictionary, pass the query attribute of the ParseResult object to parse_qs() as follows −

Example

from urllib.parse import urlparse, parse_qs
url = "https://example.com/employees?name=Anand&salary=25000"
parsed_url = urlparse(url)
dct = parse_qs(parsed_url.query)
print ("Query parameters:", dct)

It will produce the following output −

Query parameters: {'name': ['Anand'], 'salary': ['25000']}
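
The values are lists because a name may appear more than once in a query string. The sketch below uses a made-up query string to show this, along with two related details of urllib.parse: blank values are dropped unless keep_blank_values=True is passed, and parse_qsl() returns ordered pairs instead of a dictionary.

```python
from urllib.parse import parse_qs, parse_qsl

# Hypothetical query string with a repeated name and a blank value
query = "tag=python&tag=urllib&page=2&debug="

# Repeated names are collected into one list; blank values are dropped by default
print(parse_qs(query))
# {'tag': ['python', 'urllib'], 'page': ['2']}

# keep_blank_values=True retains parameters with empty values
print(parse_qs(query, keep_blank_values=True))
# {'tag': ['python', 'urllib'], 'page': ['2'], 'debug': ['']}

# parse_qsl() returns (name, value) pairs in their original order
print(parse_qsl(query))
# [('tag', 'python'), ('tag', 'urllib'), ('page', '2')]
```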

urlsplit(urlstring)

This is similar to urlparse(), but it does not split the params from the URL, so it returns a 5-item named tuple. It should generally be used instead of urlparse() when you want the more recent URL syntax, which allows parameters to be applied to each segment of the path portion of the URL.
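
The difference is easiest to see side by side. In this sketch the URL is made up and carries a ";v=1" parameter on its last path segment: urlparse() moves it into params, while urlsplit() leaves it in the path.

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL with a parameter on the last path segment
url = "https://example.com/employees/name;v=1?salary=25000#top"

split_result = urlsplit(url)
parse_result = urlparse(url)

print(len(split_result), len(parse_result))  # 5 6
print(split_result.path)    # /employees/name;v=1
print(parse_result.path)    # /employees/name
print(parse_result.params)  # v=1
```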

urlunparse(parts)

This function is the inverse of urlparse(). It constructs a URL from a tuple as returned by urlparse(). The parts argument can be any six-item iterable. The result is an equivalent URL, which may differ slightly from the original if that URL contained unnecessary delimiters (for example, a ? followed by an empty query).

Example

from urllib.parse import urlunparse

lst = ['https', 'example.com', '/employees/name/', '', 'salary>=25000', '']
new_url = urlunparse(lst)
print ("URL:", new_url)

It will produce the following output −

URL: https://example.com/employees/name/?salary>=25000
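
A common use of urlunparse() is to change one component of an existing URL. The following is a minimal sketch with illustrative values: it parses a URL, swaps in a new query string with the named tuple's _replace() method, and reassembles the result.

```python
from urllib.parse import urlencode, urlparse, urlunparse

url = "https://example.com/employees?name=Anand&salary=25000"
parts = urlparse(url)

# Named tuples are immutable, so _replace() returns a modified copy
new_query = urlencode({"name": "Anand", "salary": 30000})
new_parts = parts._replace(query=new_query)

new_url = urlunparse(new_parts)
print(new_url)  # https://example.com/employees?name=Anand&salary=30000

# An unmodified tuple round-trips to the original URL
print(urlunparse(parts) == url)  # True
```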

urlunsplit(parts)

Combines the elements of a tuple as returned by urlsplit() into a complete URL string. The parts argument can be any five-item iterable.
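
Example

A short sketch of urlunsplit(), assuming a plain five-item tuple in the order scheme, netloc, path, query, fragment. The values are illustrative.

```python
from urllib.parse import urlsplit, urlunsplit

# scheme, netloc, path, query, fragment
parts = ("https", "example.com", "/employees/name/", "salary=25000", "summary")
url = urlunsplit(parts)
print(url)  # https://example.com/employees/name/?salary=25000#summary

# Splitting the result gives back the same five components
print(tuple(urlsplit(url)) == parts)  # True
```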

The urllib.request Module

This module provides functions and classes for opening and reading URLs. Its main entry point is the urlopen() function.

urlopen() function

This function opens the given URL, which can be either a string or a Request object. The optional timeout parameter specifies a timeout in seconds for blocking operations such as the connection attempt. It only works for HTTP, HTTPS, and FTP connections.

This function always returns an object that can be used as a context manager and has the properties url, headers, and status. For HTTP and HTTPS URLs, it returns a slightly modified http.client.HTTPResponse object.
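
The context-manager behaviour can be tried without network access. The sketch below makes one assumption that is not part of the text above: it uses a data: URL, which urlopen() also accepts and which embeds the content in the URL itself. The same with pattern applies to HTTP and HTTPS URLs.

```python
from urllib.request import urlopen

# A data: URL (an assumption made for this sketch) carries its own content
url = "data:text/plain;charset=utf-8,hello%20urllib"

# The with block closes the response object when it exits
with urlopen(url, timeout=5) as resp:
    final_url = resp.url
    content_type = resp.headers["Content-Type"]
    body = resp.read()  # always bytes

print(final_url == url)      # True
print(content_type)          # text/plain;charset=utf-8
print(body.decode("utf-8"))  # hello urllib
```

Because read() returns bytes, text content has to be decoded explicitly, as in the last line.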

Example

The following code uses the urlopen() function to read the binary data of an image file and write it to a local file. You can then open the image file on your computer with any image viewer.

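from urllib.request import urlopen

# Download the image; the with block closes the connection afterwards
with urlopen("https://www.tutorialspoint.com/images/logo.png") as obj:
   data = obj.read()

# The source is a PNG file, so save it with a matching extension
with open("img.png", "wb") as img:
   img.write(data)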

Running this code prints nothing; it creates the image file in the current working directory.

The Request Object

The urllib.request module includes the Request class, which is an abstraction of a URL request. Its constructor requires one mandatory argument: a string containing a valid URL.

Syntax

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Parameters

url − A string that is a valid URL.
data − An object specifying additional data to send to the server, or None if no data is needed. Currently only HTTP requests use this parameter. The data may be bytes, a file-like object, or an iterable of bytes-like objects.
headers − A dictionary of headers and their associated values.
origin_req_host − The request-host of the origin transaction.
unverifiable − Indicates whether the request is unverifiable, as defined by RFC 2965. Defaults to False.
method − A string that indicates the HTTP request method, such as GET, POST, PUT, or DELETE. If it is omitted, GET is used when data is None and POST otherwise.
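
To see how these parameters end up on the object, the sketch below builds a Request and inspects it without sending anything. The endpoint and payload are hypothetical.

```python
from urllib.request import Request

# Hypothetical endpoint and payload, used only to inspect the object
req = Request(
    "https://example.com/api/items/1",
    data=b'{"name": "widget"}',
    headers={"Content-Type": "application/json"},
    method="PUT",
)

print(req.full_url)      # https://example.com/api/items/1
print(req.host)          # example.com
print(req.get_method())  # PUT
print(req.data)          # b'{"name": "widget"}'
```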

Example

from urllib.request import Request
obj = Request("https://www.tutorialspoint.com/")

This Request object can now be passed as an argument to the urlopen() function.

from urllib.request import Request, urlopen
obj = Request("https://www.tutorialspoint.com/")
resp = urlopen(obj)

For HTTP and HTTPS URLs, the urlopen() function returns an http.client.HTTPResponse object. Calling its read() method fetches the resource at the given URL as bytes.

from urllib.request import Request, urlopen
obj = Request("https://www.tutorialspoint.com/")
resp = urlopen(obj)
data = resp.read()
print (data)

Sending Data

If you pass the data argument to the Request constructor, a POST request is sent to the server instead of a GET request. The data must be a bytes object, so encode form fields with urlencode() and then convert the resulting string to bytes.

from urllib.request import Request, urlopen
from urllib.parse import urlencode

values = {'name': 'Madhu',
   'location': 'India',
   'language': 'Hindi' }
data = urlencode(values).encode('utf-8')
obj = Request("https://example.com", data)
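
The switch from GET to POST can be checked without contacting a server. This sketch reuses the form values from the example above and calls get_method() on two Request objects, one built without data and one with it.

```python
from urllib.parse import urlencode
from urllib.request import Request

values = {"name": "Madhu", "location": "India", "language": "Hindi"}
encoded = urlencode(values)
print(encoded)  # name=Madhu&location=India&language=Hindi

get_req = Request("https://example.com")
post_req = Request("https://example.com", data=encoded.encode("utf-8"))

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```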

Sending Headers

The Request constructor also accepts a headers argument for adding header information to the request. It should be a dictionary.

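from urllib.request import Request
user_agent = "Mozilla/5.0 (compatible; ExampleClient/1.0)"  # illustrative value
headers = {'User-Agent': user_agent}
obj = Request("https://example.com", data, headers)  # data: the bytes object from the previous example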
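
Headers can also be added after construction with add_header() and read back for inspection. In the sketch below the user-agent string is a hypothetical value. Note that Request stores header names capitalized, so "User-Agent" is looked up as "User-agent".

```python
from urllib.request import Request

user_agent = "ExampleClient/1.0"  # hypothetical value
req = Request("https://example.com", headers={"User-Agent": user_agent})
req.add_header("Accept", "text/html")

# Header names are stored capitalized: "User-Agent" becomes "User-agent"
print(req.has_header("User-agent"))  # True
print(req.get_header("User-agent"))  # ExampleClient/1.0
print(req.header_items())            # list of (name, value) pairs
```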

The urllib.error Module

The following exceptions are defined in the urllib.error module −

URLError

URLError is typically raised when there is no network connection (no route to the specified server) or the specified server doesn't exist. In this case, the exception has a reason attribute that describes the cause.

Example

from urllib.request import Request, urlopen
import urllib.error as err

obj = Request("http://www.nosuchserver.com")
try:
   urlopen(obj)
except err.URLError as e:
   print(e)

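It will produce the following output −

HTTP Error 403: Forbidden

The message shown here is an HTTP error rather than a connection failure. HTTPError is a subclass of URLError, so the except clause above also catches the error raised when a server responds but refuses the request. The exact output depends on your network.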
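
The reason attribute can be inspected without a network connection. A minimal sketch, assuming a made-up URL scheme: urlopen() raises URLError for it because no handler supports that scheme.

```python
from urllib.error import URLError
from urllib.request import urlopen

reason = None
try:
    # "nosuchscheme" is a made-up scheme that no handler supports
    urlopen("nosuchscheme://example.com/resource")
except URLError as e:
    reason = e.reason
    print("Failed:", reason)  # Failed: unknown url type: nosuchscheme
```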

HTTPError

Every HTTP response from a server carries a numeric status code. Sometimes that code indicates that the server is unable to fulfil the request. The default handlers take care of some of these responses for you (for example, redirects). For those they can't handle, the urlopen() function raises an HTTPError. Typical examples are 404 (page not found), 403 (request forbidden), and 401 (authentication required).
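
Because HTTPError is a subclass of URLError, code that handles both should test for HTTPError first. The sketch below illustrates that ordering. The describe() helper is hypothetical, and the exceptions are constructed by hand so that the code runs without network access; in real code they would come from urlopen().

```python
from urllib.error import HTTPError, URLError

def describe(exc):
    """Return a short description of a urllib exception (hypothetical helper)."""
    # Check HTTPError first: it is a subclass of URLError
    if isinstance(exc, HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, URLError):
        return f"URL error: {exc.reason}"
    return f"other error: {exc}"

# Built by hand for illustration; urlopen() would normally raise these
http_exc = HTTPError("https://example.com/missing", 404, "Not Found", None, None)
url_exc = URLError("no route to host")

print(issubclass(HTTPError, URLError))  # True
print(describe(http_exc))               # HTTP error 404: Not Found
print(describe(url_exc))                # URL error: no route to host
```

The same ordering applies to except clauses: place except HTTPError before except URLError, otherwise the more general clause handles both.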

Example

from urllib.request import Request, urlopen
import urllib.error as err

obj = Request("http://www.python.org/fish.html")
try:
   urlopen(obj)
except err.HTTPError as e:
   print(e.code)

It will produce the following output −