HTML5 Concise Tutorial

HTML - Character Encodings

Character encoding is a method of converting bytes into characters. To validate or display an HTML document properly, a program must choose the correct character encoding.
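To make the idea concrete, here is a minimal Python sketch (the variable names are illustrative, not from the tutorial) that decodes the same two bytes under two different encodings. It shows why picking the wrong encoding produces garbled text.

```python
# The same two bytes, interpreted under two different encodings.
raw = bytes([0xC3, 0xA9])

as_utf8 = raw.decode("utf-8")         # one character: 'é'
as_latin1 = raw.decode("iso-8859-1")  # two characters: 'Ã©'

# The byte sequence is identical; only the chosen encoding differs.
print(len(as_utf8), len(as_latin1))
```

A page saved as UTF-8 but read as ISO-8859-1 would show "Ã©" wherever the author typed "é".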

HTML Charset Attribute

The charset attribute of the meta tag declares the character encoding of a web page.

<meta charset="UTF-8">
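As a rough illustration of how a program might read this declaration, the sketch below uses Python's standard `html.parser` module. The class and function names (`CharsetFinder`, `declared_charset`) are hypothetical, and real browsers follow a more involved procedure that also considers HTTP headers and byte order marks.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Records the value of the first <meta charset="..."> tag seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased.
        if tag == "meta" and self.charset is None:
            for name, value in attrs:
                if name == "charset" and value:
                    self.charset = value.strip().lower()


def declared_charset(html_text):
    """Return the declared charset in lowercase, or None if absent."""
    finder = CharsetFinder()
    finder.feed(html_text)
    return finder.charset


sample = '<!DOCTYPE html><html><head><meta charset="UTF-8"><title>Demo</title></head></html>'
print(declared_charset(sample))
```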

The ASCII Character Set

ASCII (American Standard Code for Information Interchange) is one of the most common character sets on computers, and probably the most widely supported way of encoding text electronically. ASCII consists of 128 characters (codes 0-127), including:

  1. English letters (A-Z and a-z)

  2. Digits (0-9)

  3. Special characters (@, #, $, %, etc.)

  4. Control characters (codes 0-31 and 127)

For the full list, see the complete set of printable ASCII characters.
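The ASCII ranges can be checked with Python's built-in `ord()` and `chr()`. This is only a small sketch; the variable names are illustrative.

```python
import string

# ord() gives the code of a character; chr() goes the other way.
print(ord("A"), ord("Z"))   # 65 90
print(ord("a"), ord("z"))   # 97 122
print(ord("0"), ord("9"))   # 48 57
print(ord("@"), ord("#"))   # 64 35

# Codes 32-126 are the printable ASCII characters (space through '~').
printable = "".join(chr(code) for code in range(32, 127))
print(len(printable))       # 95

# Every letter, digit, and punctuation mark in ASCII has a code below 128.
ascii_chars = string.ascii_letters + string.digits + string.punctuation
is_ascii = all(ord(ch) < 128 for ch in ascii_chars)
```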

The ANSI Character Set

The ANSI character set is generally used on Windows systems; it is also known as windows-1252. It is laid out as follows:

  1. From 0 to 127, ANSI is identical to ASCII.

  2. From 128 to 159, some extra special characters are added.

  3. From 160 to 255, it is identical to ISO-8859-1 (and to the Unicode code points in the same range).
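The ranges above can be probed with Python's `cp1252` codec, which implements windows-1252. This is a sketch with illustrative variable names; a few codes in the 128-159 range are unassigned in this codec, so only selected bytes are decoded here.

```python
# Codes 0-127: same as ASCII.
ascii_part = bytes(range(128))
same_as_ascii = ascii_part.decode("cp1252") == ascii_part.decode("ascii")

# Codes 128-159: extra printable characters.
euro = bytes([0x80]).decode("cp1252")        # euro sign, U+20AC
left_quote = bytes([0x93]).decode("cp1252")  # left double quote, U+201C
trademark = bytes([0x99]).decode("cp1252")   # trade mark sign, U+2122

print(same_as_ascii)
print(hex(ord(euro)), hex(ord(left_quote)), hex(ord(trademark)))
```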

The ISO-8859-1 Character Set

ISO-8859-1 was the default character set for HTML 4. It supports 256 different character codes:

  1. Same as ASCII for the first 128 characters

  2. Does not assign printable characters to codes 128 to 159 (they are reserved for control codes)

  3. Same as ANSI (windows-1252) and the Unicode code points from 160 to 255
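A short Python check of where ISO-8859-1 and windows-1252 agree and where they differ. The variable names are illustrative, and the behavior for 128-159 reflects how Python's `iso-8859-1` codec treats those bytes (as control characters).

```python
# Codes 160-255 decode to the same characters in both encodings.
matches_cp1252 = all(
    bytes([code]).decode("iso-8859-1") == bytes([code]).decode("cp1252")
    for code in range(160, 256)
)

# Codes 128-159 differ: Python's iso-8859-1 codec yields a control
# character, while cp1252 yields a printable one.
byte_80_latin1 = bytes([0x80]).decode("iso-8859-1")  # U+0080 (control)
byte_80_cp1252 = bytes([0x80]).decode("cp1252")      # U+20AC (euro sign)

e_acute = bytes([0xE9]).decode("iso-8859-1")         # 'é', code 233

print(matches_cp1252, byte_80_latin1 != byte_80_cp1252)
```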

The UTF-8 Character Set

The HTML5 specification recommends that developers use the UTF-8 encoding in web pages, because UTF-8 can represent every character in Unicode, which covers nearly all of the world's characters and symbols. In terms of Unicode code points:

  1. Code points 0 to 127 are identical to ASCII

  2. Code points 128 to 159 are control characters, not printable ones

  3. Code points 160 to 255 are the same characters as in ANSI and ISO-8859-1

  4. Characters from other languages and scripts use code points from 256 upward
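The numbers in the list above are character codes (code points), not byte values: UTF-8 stores code points above 127 as two or more bytes. A brief Python sketch, with arbitrarily chosen sample characters:

```python
samples = ["A", "é", "€", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the code point, the UTF-8 bytes in hex, and the byte count.
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} byte(s))")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Only the ASCII character is stored as a single byte that matches its code. For example, "é" has code 233, which ISO-8859-1 stores as the single byte 0xE9 but UTF-8 stores as the two bytes 0xC3 0xA9.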

The International Organization for Standardization (ISO) created a range of character sets to deal with different national characters. For documents in English and most other Western European languages, the widely supported ISO-8859-1 encoding has traditionally been used.

ISO Character Sets

Here is a list of character sets used around the world, along with their descriptions.

| Character Set | Description |
| --- | --- |
| ISO-8859-1 | Latin alphabet part 1, covering North America, Western Europe, Latin America, the Caribbean, Canada, and Africa |
| ISO-8859-2 | Latin alphabet part 2, covering Eastern Europe |
| ISO-8859-3 | Latin alphabet part 3, covering SE Europe, Esperanto, and miscellaneous others |
| ISO-8859-4 | Latin alphabet part 4, covering Scandinavia/Baltics (and others not in ISO-8859-1) |
| ISO-8859-5 | Latin/Cyrillic alphabet part 5 |
| ISO-8859-6 | Latin/Arabic alphabet part 6 |
| ISO-8859-7 | Latin/Greek alphabet part 7 |
| ISO-8859-8 | Latin/Hebrew alphabet part 8 |
| ISO-8859-9 | Latin 5 alphabet part 9; same as ISO-8859-1 except that Turkish characters replace Icelandic ones |
| ISO-8859-10 | Latin 6: Lappish, Nordic, and Eskimo |
| ISO-8859-15 | Similar to ISO-8859-1, with a few symbols replaced by the euro sign and some additional letters |
| ISO-2022-JP | Latin/Japanese alphabet part 1 |
| ISO-2022-JP-2 | Latin/Japanese alphabet part 2 |
| ISO-2022-KR | Latin/Korean alphabet part 1 |

The Unicode Consortium was then set up to devise a way to represent all the characters of different languages, rather than relying on these different, incompatible character codes for each language.

Therefore, if you want to create documents that use characters from multiple character sets, you can do so using a single Unicode character encoding.
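As a small illustration, the Python sketch below (sample text and variable names are arbitrary) encodes a string mixing Latin, Cyrillic, Japanese, and Greek characters with UTF-8, then shows that the same string cannot be encoded with ISO-8859-1.

```python
text = "Hello, Привет, こんにちは, γειά"

# One Unicode encoding handles all four scripts and round-trips cleanly.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# A single-language character set has no codes for the other scripts.
try:
    text.encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(round_trip == text, fits_in_latin1)
```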

Unicode therefore specifies encodings that handle strings in special ways to make enough room for the huge character set it encompasses. These are known as UTF-8, UTF-16, and UTF-32.

UTF Character Sets

| Character Set | Description |
| --- | --- |
| UTF-8 | A Unicode Transformation Format that uses 8-bit units (bytes). A character in UTF-8 is 1 to 4 bytes long, which makes UTF-8 a variable-width encoding. |
| UTF-16 | A Unicode Transformation Format that uses 16-bit units. A character is 1 or 2 units long (2 or 4 bytes), which makes UTF-16 a variable-width encoding. |
| UTF-32 | A Unicode Transformation Format that uses 32-bit units. It is a fixed-width format: every character is exactly one unit (4 bytes) long. |
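The unit counts can be compared directly in Python. In this sketch the helper name `code_units` is hypothetical, and the little-endian codec variants are used so that no byte order mark is added to the output.

```python
def code_units(ch):
    """Return (UTF-8 bytes, UTF-16 units, UTF-32 units) needed for ch."""
    utf8 = len(ch.encode("utf-8"))            # 8-bit units
    utf16 = len(ch.encode("utf-16-le")) // 2  # 16-bit units
    utf32 = len(ch.encode("utf-32-le")) // 4  # 32-bit units
    return utf8, utf16, utf32


for ch in ["A", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

UTF-32 always needs exactly one unit, UTF-16 needs two units only for characters outside the first 65,536 code points, and UTF-8 ranges from one to four bytes.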

The first 256 code points of Unicode correspond to the 256 characters of ISO-8859-1. By default, HTML 4 processors should support UTF-8, and XML processors are expected to support both UTF-8 and UTF-16; therefore, all XHTML-compliant processors should also support UTF-16.
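The correspondence between ISO-8859-1 and the first 256 Unicode code points can be verified with a one-line check in Python (a sketch; the variable name is illustrative):

```python
# Decoding byte value N as ISO-8859-1 yields the character whose
# Unicode code point is also N, for every N from 0 to 255.
latin1_matches_unicode = all(
    bytes([code]).decode("iso-8859-1") == chr(code) for code in range(256)
)
print(latin1_matches_unicode)
```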