Python Concise Tutorial
Python - Unicode System
What is the Unicode System?
Software applications often need to display output messages in a variety of languages, such as English, French, Japanese, Hebrew, or Hindi. Python's string type uses the Unicode Standard to represent characters, which lets a program work with all of these different characters.
A character is the smallest possible component of a text. 'A', 'B', 'C', etc., are all different characters. So are 'È' and 'Í'. A Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence of code points needs to be represented in memory as a set of code units, and the code units are then mapped to 8-bit bytes.
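The relationship between characters and code points can be inspected with the built-in ord() and chr() functions. The following is a minimal sketch; the variable names are illustrative only.

```python
# Each character in a str corresponds to one code point (an integer).
text = "AÈÍ"

# ord() maps a one-character string to its code point; chr() does the reverse.
code_points = [ord(ch) for ch in text]
print(code_points)                      # [65, 200, 205]
print([hex(cp) for cp in code_points])  # ['0x41', '0xc8', '0xcd']

# Round trip: rebuilding the string from its code points gives the original.
rebuilt = "".join(chr(cp) for cp in code_points)
print(rebuilt == text)                  # True

# 0x10FFFF is the largest valid code point; anything above it is rejected.
highest = chr(0x10FFFF)
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
print(out_of_range_rejected)            # True
```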
Character Encoding
A sequence of code points is represented in memory as a set of code units, which are mapped to 8-bit bytes. The rules for translating a Unicode string into a sequence of bytes are called a character encoding.
Three encodings are in common use: UTF-8, UTF-16 and UTF-32. UTF stands for Unicode Transformation Format.
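To see how the three encodings differ, one can compare how many bytes each needs for the same character. This sketch assumes the little-endian codec names ("utf-16-le", "utf-32-le") so that no byte order mark is added to the result; the sample characters are arbitrary choices.

```python
# Sample characters: ASCII, Latin-1 range, a currency sign, and an emoji
# that lies outside the Basic Multilingual Plane.
samples = ["A", "È", "₹", "😀"]

# The "-le" variants are used so that no byte order mark is prepended.
codecs = ["utf-8", "utf-16-le", "utf-32-le"]

# Map each character to the number of bytes it needs in each encoding.
sizes = {ch: [len(ch.encode(codec)) for codec in codecs] for ch in samples}

for ch, counts in sizes.items():
    print(ch, counts)
# A [1, 2, 4]
# È [2, 2, 4]
# ₹ [3, 2, 4]
# 😀 [4, 4, 4]
```

UTF-8 uses between one and four bytes per code point, UTF-16 uses two or four, and UTF-32 always uses four.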
Python’s Unicode Support
Python 3.0 and later have built-in support for Unicode. The str type holds Unicode characters, so any string created with single-quoted, double-quoted or triple-quoted syntax is stored as Unicode. The default encoding for Python source code is UTF-8.
Hence, a string may contain the literal representation of a Unicode character (¾) or its Unicode escape (\u00BE).
Example
var = "¾"
print (var)
var = "\u00BE"
print (var)
The above code will produce the following output −
¾
¾
Example
In the following example, the string '10' is stored using the Unicode escapes for 1 and 0, which are \u0031 and \u0030 respectively.
var = "\u0031\u0030"
print (var)
It will produce the following output −
10
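Besides the four-digit \u escape, Python string literals accept a few other notations for the same character. The sketch below shows them side by side for the fraction character ¾ (U+00BE); the standard unicodedata module is used only to look up the official character name.

```python
import unicodedata

literal = "¾"                                    # typed directly in UTF-8 source
by_hex = "\u00BE"                                # 4-digit hexadecimal escape
by_long_hex = "\U000000BE"                       # 8-digit hexadecimal escape
by_name = "\N{VULGAR FRACTION THREE QUARTERS}"   # escape by Unicode name
by_chr = chr(0xBE)                               # built from the code point

# All five notations produce the same one-character string.
all_equal = literal == by_hex == by_long_hex == by_name == by_chr
print(all_equal)                   # True

# unicodedata.name() returns the official name of a character.
char_name = unicodedata.name(literal)
print(char_name)                   # VULGAR FRACTION THREE QUARTERS
```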
Strings hold text in a human-readable form, while bytes store the characters as binary data. Encoding converts a character string into a series of bytes. Decoding translates the bytes back into human-readable characters and symbols. It is important not to confuse these two methods: encode is a string method, while decode is a method of the Python bytes object.
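A quick way to confirm which method belongs to which type is to check for the attributes directly. This is a small illustrative sketch; the variable names are not part of the tutorial.

```python
text = "È"                      # a str object
data = text.encode("utf-8")     # a bytes object

# str offers encode() but has no decode() in Python 3.
str_methods = (hasattr(text, "encode"), hasattr(text, "decode"))
print(type(text).__name__, str_methods)    # str (True, False)

# bytes offers decode() but has no encode().
bytes_methods = (hasattr(data, "decode"), hasattr(data, "encode"))
print(type(data).__name__, bytes_methods)  # bytes (True, False)
```

Calling decode() on a str, or encode() on a bytes object, therefore raises AttributeError.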
Example
In the following example, we have a string variable that consists of ASCII characters. ASCII is a subset of the Unicode character set. The encode() method is used to convert the string into a bytes object.
string = "Hello"
tobytes = string.encode('utf-8')
print (tobytes)
string = tobytes.decode('utf-8')
print (string)
The decode() method converts the bytes object back to a str object. The encoding used is UTF-8. The above code will produce the following output −
b'Hello'
Hello
Example
In the following example, the Rupee symbol (₹) is stored in the variable using its Unicode value. We convert the string to bytes and back to str.
string = "\u20B9"
print (string)
tobytes = string.encode('utf-8')
print (tobytes)
string = tobytes.decode('utf-8')
print (string)
When you execute the above code, it will produce the following output −
₹
b'\xe2\x82\xb9'
₹
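The round trip above works because the same encoding is used in both directions. The following sketch shows what can happen when the bytes are decoded with a different codec; the choice of "ascii" and "latin-1" as the mismatched codecs is only for illustration.

```python
data = "\u20B9".encode("utf-8")    # b'\xe2\x82\xb9'

# ASCII cannot interpret bytes above 0x7F, so decoding fails outright.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)                # True

# Latin-1 maps every byte to a character, so decoding "succeeds" but
# yields three unrelated characters instead of the single Rupee sign.
garbled = data.decode("latin-1")
print(len(garbled))                # 3

# Decoding with the original encoding restores the text.
restored = data.decode("utf-8")
print(restored == "\u20B9")        # True
```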