The Web Design Group

Using Character Encodings

A character encoding is a method of converting bytes into characters. To validate or display an HTML document, a program must choose a character encoding. For documents in English and most other Western European languages, the widely supported encoding ISO-8859-1 is typically used.
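As a small illustration of why a program must choose an encoding, the following Python sketch decodes one byte under two encodings. The byte value 0xE9 and the pairing of ISO-8859-1 with ISO-8859-7 are picked only for the example.

```python
import unicodedata

# One byte, value 0xE9. On its own it is not yet a character.
raw = b"\xe9"

# The same byte becomes a different character under each encoding.
as_latin1 = raw.decode("iso-8859-1")  # Western European
as_greek = raw.decode("iso-8859-7")   # Greek

print(unicodedata.name(as_latin1))  # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.name(as_greek))   # GREEK SMALL LETTER IOTA
```

A program that guesses the wrong encoding therefore displays the wrong text without any error being raised.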

Versions of HTML prior to HTML 4.0 supported a limited character set, only allowing those characters that could be encoded using ISO-8859-1. Documents that use Chinese, Cyrillic, Greek, or other non-Latin characters must therefore use HTML 4.0 or later (or a custom version of HTML). Attempting to validate non-Latin documents against HTML 3.2 or earlier versions will result in an error for each non-Latin character.
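One rough way to see which characters fall outside ISO-8859-1 is to try encoding each one. The Python sketch below does that; the helper name `chars_outside_latin1` is hypothetical, and the check only approximates the limitation described above rather than reproducing what a validator does.

```python
def chars_outside_latin1(text):
    """Return the characters in text that ISO-8859-1 cannot encode,
    each listed once, in order of first appearance."""
    found = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            if ch not in found:
                found.append(ch)
    return found

# "cafe" with an acute accent fits in ISO-8859-1; Greek letters do not.
print(chars_outside_latin1("caf\u00e9"))
print(chars_outside_latin1("\u039a\u03b1\u03bb\u03b7"))
```

An empty result suggests the text could be carried by a pre-4.0 document; any listed character would need HTML 4.0 or later.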

An HTML document must specify its character encoding. The preferred method is to use the charset parameter of the Content-Type HTTP header. For example, to specify that an HTML document uses ISO-8859-1, a server would send the following header:

Content-Type: text/html; charset=ISO-8859-1
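On the receiving side, a program can read the charset parameter back out of this header. The sketch below uses the header-parameter parsing in Python's standard `email.message` module; the function name `charset_from_content_type` is hypothetical.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value in lower case,
    or None if the header carries no charset parameter."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # None
```

A result of `None` means the server gave no encoding, so the program would have to fall back on another source of information.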

How to make a server send this header depends on the particular server; check your server documentation if necessary. With the widely used Apache server, one can add a file named ".htaccess" to any directory to set the Content-Type of files in that directory and its subdirectories. To send the Content-Type from the previous example for all files with the extension "html", one would put the following line in the ".htaccess" file:

AddType 'text/html; charset=ISO-8859-1' html
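Where pages are generated by a program rather than served from files, the program sets the header itself. As an illustration outside Apache, here is a minimal sketch of a Python WSGI application that does so; the name `app` and the page content are assumptions made for the example.

```python
def app(environ, start_response):
    """Minimal WSGI application that declares its encoding in Content-Type."""
    # Encode the page with the same encoding that the header announces.
    body = "<html><body>caf\u00e9</body></html>".encode("iso-8859-1")
    headers = [
        ("Content-Type", "text/html; charset=ISO-8859-1"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]

# To try it locally, one option is the standard library's reference server:
#   from wsgiref.simple_server import make_server
#   make_server("localhost", 8000, app).serve_forever()
```

The point of the sketch is that the bytes sent and the charset announced must agree; the header only describes the body and does not convert it.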

A less preferred method of setting the character encoding is to place the following tag in the HEAD of an HTML document:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">

This method works only if ASCII characters stand for themselves at least until after the <META> tag, since a program must read the tag before it knows which encoding to use. It also often causes an annoying redraw with Netscape. The META HTTP-EQUIV method should only be used if one cannot set the charset parameter using the server.
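The two-pass reading that this method forces on a program can be sketched in Python as follows. The names `MetaCharsetFinder` and `decode_html` are hypothetical, and the fallback to ISO-8859-1 is an assumption of the sketch, not a rule from any specification.

```python
from email.message import Message
from html.parser import HTMLParser

class MetaCharsetFinder(HTMLParser):
    """Record the charset from the first META HTTP-EQUIV="Content-Type" tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)  # the parser lowercases tag and attribute names
        content = attributes.get("content")
        if content and (attributes.get("http-equiv") or "").lower() == "content-type":
            msg = Message()
            msg["Content-Type"] = content
            self.charset = msg.get_content_charset()

def decode_html(raw, fallback="iso-8859-1"):
    """Decode raw bytes using the charset named in a META tag, if any."""
    # First pass: ISO-8859-1 maps every byte to some character, so this
    # cannot fail, and it leaves ASCII markup such as <META> readable.
    finder = MetaCharsetFinder()
    finder.feed(raw.decode("iso-8859-1"))
    charset = finder.charset or fallback
    # Second pass: decode again with the declared encoding.
    try:
        return raw.decode(charset), charset
    except LookupError:  # encoding name unknown to Python
        return raw.decode(fallback), fallback

page = (b'<HTML><HEAD><META HTTP-EQUIV="Content-Type" '
        b'CONTENT="text/html; charset=ISO-8859-7"></HEAD>'
        b'<BODY>\xe9</BODY></HTML>')
text, used = decode_html(page)
print(used)  # iso-8859-7
```

The first pass only works because the markup up to the tag is plain ASCII, and the second decoding pass is the likely reason a browser has to redraw the page.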

More Information