Character sets and encoding
- Damodar Chetty
- October 07, 2007

The topic of character sets is one that causes a lot of heartburn for Web developers trying to internationalize their sites. I've spent countless hours trying to get these concepts straight, and in the next few paragraphs, I'm going to attempt to unravel some of the arcana that this involves.

Languages have long been written to persistent storage (clay tablets, stone, parchment, paper, etc.) using their associated scripts. A language's script comprises a set of symbols that represent its consonants and vowels (e.g., the symbol 'a' in English, or the letter 'व' in Devanagari).

The complete set of all these symbols (or characters) for a given script is completely independent of any computer usage - and is what a child might learn in kindergarten.

In order for a symbol within a script to be represented on a computer, we need to perform the following tasks:

  1. Collect all the symbols together along with their natural ordering
  2. Assign a unique numeric code (aka a 'code point') to represent each symbol. This is typically based on their natural ordering (e.g., 'A' is 65, 'B' is 66, etc.) The set of all associations of symbols -> numbers for a given script is called a character set.
  3. Specify a rule describing how each code will be transformed into its byte representation (number of bytes that will be required, endian ordering, serialization mechanism, etc.) This rule is termed an encoding. The same number can be represented by different byte-combinations depending on the chosen encoding. This gives rise to a key concept - a given character set may have different encodings.
  4. Conjure up a graphical representation for each symbol, called a glyph. Note that a font is nothing but a collection of glyphs that are written in a particular point size, type face, and style. I.e., the letters 'a' through 'z' written in [10, courier, bold] form one font and have one set of glyphs, whereas another exists for [10, arial, bold]. Glyphs may be either bitmaps or vector drawings, and tell the computer how each code point should be rendered.

Computers work with numbers, so the only way they can work with natural languages is if we have a mechanism of encoding each character symbol into a byte-representation. This byte-representation (aka an encoding) would have to be a standard so that my computer interprets the characters in the same manner that yours does.

So far, we took written characters, mapped them to numbers (their code points), and determined an encoding. We also tied each code point to various glyphs, one for each point size, type face, and style combination that is supported.

Then, so long as we have a keyboard (or other input device) that can convert the symbols we want (as marked on the individual keys) into code points, and as long as our editor converts those code points into the appropriate byte representation (based on the current encoding), and as long as the graphics card is able to determine the glyph associated with that symbol - we are done.

In other words, the encoding process proceeds as follows:

  1. A character symbol is typed in by the user (using a physical or a virtual keyboard)
  2. A character set map is consulted to determine the code point (numeric value) associated with this symbol.
  3. The code point is passed to the desired encoding algorithm
  4. The encoding algorithm uses rules that define the number of bytes to be used, the endian-ness of the byte ordering, etc. to convert the code point into a byte sequence.
  5. The bytes are written to a file, or stored in a String

The reverse occurs during the decoding process (as a file is being read, as text is being rendered to a screen, etc.).

  1. A byte sequence is read from a file, or from a String.
  2. The encoding algorithm used defines how many bytes make up a single character, as well as the byte ordering, etc. This is used to convert the byte sequence back into a single code point.
  3. A character set map is consulted to determine the symbol associated with this decoded code point (numeric value).
  4. A character symbol is rendered to the user (using a monitor, printer, etc.)
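
To make this concrete, here is a minimal sketch of the round trip in Java (the class name is purely illustrative). It maps two symbols to bytes under two different encodings, then decodes the bytes back:

    import java.io.UnsupportedEncodingException;
    import java.util.Arrays;

    public class EncodingRoundTrip {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String text = "a\u0935";   // 'a' plus the Devanagari letter 'व' (code point U+0935)

            // Encoding: code points -> bytes, according to the chosen encoding.
            byte[] utf8  = text.getBytes("UTF-8");
            byte[] utf16 = text.getBytes("UTF-16BE");
            System.out.println(Arrays.toString(utf8));    // [97, -32, -92, -75]  (1 + 3 bytes)
            System.out.println(Arrays.toString(utf16));   // [0, 97, 9, 53]       (2 + 2 bytes)

            // Decoding: bytes -> code points, using the same encoding.
            System.out.println(new String(utf8, "UTF-8").equals(text));   // true

            // Decoding with the wrong encoding yields gibberish ("mojibake").
            System.out.println(new String(utf8, "ISO-8859-1"));
        }
    }

Note that the same byte sequence decodes to different characters depending on the encoding assumed - which is exactly why the encoding must travel with the bytes.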

At this point, I'm going to take a step back and hammer through a few more details in two critical areas: character sets and encodings.

Character Sets

As we saw earlier, a character set defines a mapping between virtual symbols ('0', 'a', 'Y', '@', etc.) and code points (numbers).

The primordial character set (at least in most everyone's living memory) is ASCII - the American Standard Code for Information Interchange. This character set was designed primarily to encode US English. To that end, the numbers '0' through '9' are represented by the logical code points 48 through 57; the letters 'A'-'Z' by 65-90, and 'a'-'z' by 97-122. Interspersed in there were other printable characters (:, -, +, etc.) as well as a number of control characters (mostly obsolete now) that had special hardware meaning to the computer (e.g., tab, bell, end of transmission, etc.)

The major drawback of ASCII was that while it was sufficient for writing American English, its support for other languages and locales was nonexistent. For example, it did not even have a pound sign (£) for UK English. Of course, this was hardly a problem when documents were primarily generated and consumed within a single locale.

However, all the standard characters used in US English were easily accommodated in 7 bits, leaving the code points above 127 up for grabs. Unfortunately, only 128 more code points were available - surely not enough room for all the languages of the world. One compromise was to use this upper area to hold code points for one additional language - that of the country in which the computer was likely to be used - each such assignment being known as a code page. This worked well when exchanging documents between computers that shared the same code page. However, any document sent to a computer that used a different code page would be rendered incorrectly. After all, the correct glyph can be chosen only when both computers agree on which symbol is represented by a particular code point.

It is interesting to note that computers across the world agreed on the code points below 128 - and so any two computers could safely exchange documents restricted to the ASCII set (US English). It was only when characters beyond 127 were used that the confusion arose. In addition, it was impossible to have a document that incorporated multiple languages that conflicted in their use of these code points. E.g., Latin-1 covers most Western European languages plus Icelandic, but Latin-5 replaces the Icelandic letters with Turkish ones. So, a document that needed to use both Icelandic and Turkish would have to be written in Unicode.

Unicode was introduced as an attempt to eliminate this confusion. The Unicode character set encompasses characters from almost every language in the world. This has the disadvantage that multiple bytes are now required to represent each code point (since it supports over a million characters, but a single byte can distinguish only 256 code points). For historical reasons, the first 256 characters of the Unicode character set map directly to the Latin-1 (ISO 8859-1) characters. The code points 256 to 383 support languages like Afrikaans, Czech, Turkish, Welsh, etc.; Tamil is encoded in the code points 2944 to 3071, Thai in 3584 to 3711, and Hiragana and Katakana in 12352 to 12543, and so on. Geometric shapes (9632 to 9727), box drawing elements (9472 to 9599), and Zapf Dingbats (9984 to 10175) also find representation in this set.

The major advantage, however, is that as long as you are using a Unicode encoding, you can mix characters from any of these languages in the same document, and the receiver will be able to decode it appropriately for rendering to output.

Character Encoding

To summarize, a "character set" encompasses two concepts: a collection of characters from one or more languages that you intend to use in a document, and a mapping of each of those characters to a code point - i.e., a numeric code that uniquely identifies each character within that character set. ASCII and Latin-1 are small maps (< 256 symbols), whereas Unicode is a mucho-grande map (> 1,000,000 code points).

So, if you just say "Unicode", all you are referring to is the mapping between the individual character symbols and their corresponding integer code points. Before a character set can be used by a computer, you need to specify an encoding - i.e., how these integers will be represented as bytes in memory.

A "character encoding" adds yet another element to this mix - an algorithm that determines how each code point will be represented in terms of bits and bytes. I.e., this comprises the number of bytes that will be required and the endian-ordering of the bytes as they are written.

The simplest character encodings are those where there is a trivial one-to-one mapping between each code point (number) in the character set and a single byte. E.g., all the characters in ASCII can be represented in a single byte.

In the late 80s, the ISO as well as the Unicode Consortium began work on a unified character set that would support multilingual software. The combined efforts bore fruit in 1993 with ISO 10646-1, which defines the Universal Character Set (UCS) - a set intended to contain the characters required to represent most known languages and many historic scripts (e.g., Hieroglyphs), as well as mathematical and graphical symbols. It was designed as a 31-bit character set (code points U-00000000 to U-7FFFFFFF), allowing just over 2 billion (2^31) characters.

The most commonly used characters (e.g., the Latin-1 repertoire) are found in the first plane - called the Basic Multilingual Plane (BMP) - where each plane comprises a group of 2^16 (65,536) characters, as identified by the least significant 16 bits. ISO 10646-2 (2001) defined characters outside the BMP. Until 2001 (Unicode 3.1), a common misconception was that Unicode only defined up to 65,536 characters, and so a 2-byte encoding (UCS-2) would suffice. Unfortunately, this misconception persists to this day.

The UCS-4 encoding can represent all Unicode characters; the UCS-2 encoding, however, can represent only those in the BMP (U+0000 to U+FFFF). This led to the popular misconception that Unicode only needed an unsigned 2-byte encoding. However, Unicode's valid code points far exceed 65,535 (over a million are possible), and as a result a fixed-width encoding needs 4 bytes per character (UTF-32 or UCS-4).

Full use of the 31 bits would allow over 2 billion different symbols to be represented. However, the standard as codified today restricts the code space to 21 bits, holding just over 1.1 million code points (0x000000 to 0x10FFFF). A Unicode code point is written with a U+ prefix, e.g., U+0041 represents the character 'A'.

This is unfortunately wasteful when dealing with English text, for example, where the characters lie between 0 and 127 (U+0000 to U+007F, which corresponds to ASCII) or 0 and 255 (U+0000 to U+00FF, which corresponds to Latin-1), and so can be represented in just 1 byte. I.e., a file encoded in Latin-1 grows to 4 times its size when re-encoded in UTF-32. Hence, additional encodings were proposed. In particular, UTF-8, which uses 1 byte to represent the standard ASCII set and up to 4 bytes to represent all Unicode characters; and UTF-16, which uses 2 bytes for characters in the Basic Multilingual Plane and 4 bytes for the supplementary characters.
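
To make those size trade-offs concrete, here is a small sketch comparing byte counts (the sample strings are arbitrary; the UTF-32BE charset, while not mandated by the Java specification, is available in mainstream JREs):

    import java.io.UnsupportedEncodingException;

    public class EncodingSizes {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String ascii = "hello";                                  // U+0000..U+007F only
            String thai  = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35";   // six Thai characters

            System.out.println(ascii.getBytes("UTF-8").length);      // 5  (1 byte per character)
            System.out.println(ascii.getBytes("UTF-32BE").length);   // 20 (4 bytes per character)
            System.out.println(thai.getBytes("UTF-8").length);       // 18 (3 bytes per character)
            System.out.println(thai.getBytes("UTF-16BE").length);    // 12 (2 bytes per character)
        }
    }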

In UTF-8, the original ASCII characters (U+0000 to U+007F) are encoded simply as bytes (0x00 to 0x7F), making it byte-compatible with the ASCII encoding. All UCS characters beyond this range are encoded using several bytes each.

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Of course, your choice of "encoding flavor" will depend on the specific language you will be using. I.e., if your document has a lot of characters that lie in the last range in the above table, you end up using 6 bytes per character, which is 50% worse than using UTF-32.
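
To make the bit patterns in the table above concrete, here is a small sketch that encodes a single code point by hand (restricted, for brevity, to the 1- to 4-byte forms, which cover all code points up to U+10FFFF):

    public class Utf8ByHand {

        // Encodes one code point into UTF-8 bytes, following the table above.
        static byte[] encode(int cp) {
            if (cp <= 0x7F) {
                return new byte[] { (byte) cp };
            } else if (cp <= 0x7FF) {
                return new byte[] { (byte) (0xC0 | (cp >> 6)),
                                    (byte) (0x80 | (cp & 0x3F)) };
            } else if (cp <= 0xFFFF) {
                return new byte[] { (byte) (0xE0 | (cp >> 12)),
                                    (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                    (byte) (0x80 | (cp & 0x3F)) };
            } else {
                return new byte[] { (byte) (0xF0 | (cp >> 18)),
                                    (byte) (0x80 | ((cp >> 12) & 0x3F)),
                                    (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                    (byte) (0x80 | (cp & 0x3F)) };
            }
        }

        public static void main(String[] args) {
            for (byte b : encode(0x05D0)) {              // Hebrew aleph (א)
                System.out.printf("%02X ", b & 0xFF);    // prints: D7 90
            }
        }
    }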

It bears repeating: a character set can have multiple encodings. For example, the Japanese character sets can be represented using either the EUC-JP or Shift_JIS encodings.

There is a lot of misinformation floating around regarding this. E.g., even the HTTP Content-Type header refers to the encoding as a charset:
Content-Type: text/html; charset=utf-8
when it really means encoding.

This has been corrected in the declaration used in XML files, where it is finally called an encoding.

When dealing with internationalization of Web applications, you need to clearly specify the character encoding to be used. Otherwise, the characters may be rendered as unrecognizable gibberish.

An interesting aside is that you can include any Unicode character in an ISO 8859-1 encoded HTML or XML document using character escapes (numeric character references). I.e., you simply enclose the hexadecimal code point value for that character within &#x and ;. E.g., &#x05D0; is rendered as א (the Hebrew letter aleph).
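
A servlet that writes a Latin-1 encoded response could emit such an escape itself. A hypothetical fragment (assuming out is the response's PrintWriter):

    int codePoint = 0x05D0;   // Hebrew aleph
    out.println("&#x" + Integer.toHexString(codePoint).toUpperCase() + ";");   // writes &#x5D0;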

Web Development and Character encodings

So far, we have discovered that a character encoding refers to how symbols in some script are converted to byte sequences, i.e.:
Character symbol --mapped to--> code point --converted to--> sequence of 1 or more bytes

The reverse process (decoding) converts a byte sequence back into the appropriate character symbol that should be rendered:
Sequence of 1 or more bytes --converted to--> code point --mapped to--> character symbol

In other words, to a computer, there really is no such thing as "text" - whether in a file or in a String object. All content is ["text" + "encoding"]. The encoding determines how character symbols are converted to bytes, and how the bytes are converted back into character symbols. Without this additional information, any conversion most often results in gibberish.

With a Web application, the two communicating parties are the Web server/container and the client browser. The client browser sends the server an HTTP request that either requests a page or passes in form field parameters, and the server returns an HTTP response with the requested information or form. For the information sent across the wire to be understandable to the receiver, each sender must clearly indicate the encoding being used for the content of the communication.

By default, Web applications assume that any HTTP request is encoded using ISO 8859-1 (Latin-1). Either party may choose to use a different encoding - but it is then up to that party to ensure that this fact is communicated to the other. This is usually done via the Content-Type HTTP header, which may be set to, say, text/plain;charset=UTF-8 to indicate that all characters in the request are encoded in UTF-8.

Unless both parties have a common understanding of the encoding being used over a connection, the decoding process will not reassemble the original text that was sent.

Note that in addition to knowing the encoding, the client browser must also have a font installed that can display the characters, so that it may render the symbols appropriately.

Server to Client (Multilingual Response)

JSP

To use a particular encoding within a JSP, set the page directive's pageEncoding and contentType attributes.

With JSPs, there are three encodings that come into play:

  1. the encoding that the Web container uses internally - i.e., Unicode (since this is a Java based container),
  2. the encoding of the JSP page itself, i.e., the encoding of the textual content of the JSP page. If this is a non-Unicode encoding, the container will decode the JSP file and convert it to Unicode prior to processing it. The JSP file's encoding is specified using the page directive's pageEncoding attribute, and
  3. the encoding that should be used for the response that is sent to the client. The container will transcode its internal Unicode encoding of the page into the encoding specified by the contentType attribute.

E.g., <%@ page pageEncoding="Shift_JIS" contentType="text/html;charset=UTF-8"%>

  1. If neither a pageEncoding nor a contentType attribute is specified, the default encoding of ISO 8859-1 is used both to decode the bytes of the JSP file and to encode the response returned to the client.
  2. If pageEncoding is not specified, the charset specified by the contentType attribute is used to decode the bytes of the JSP file as well.
  3. If pageEncoding is specified but contentType is not, the charset specified by pageEncoding is used for both.
  4. The Web container will raise a translation-time error if an unrecognized page encoding is specified.

Given its importance in correctly decoding a JSP file's contents, and in encoding the response to the client, the page directive along with its charset specification should ideally appear as the first line of the JSP page. At the latest, it should appear before any characters that can only be interpreted once the charset is known - i.e., before any non-ASCII characters are encountered.

Servlets

There are two mechanisms by which a server can inform the client browser about a non-default encoding being used in a response: it can set the Content-Type header's charset explicitly (via the page directive's contentType attribute in a JSP, or response.setContentType() in a servlet), or it can call response.setLocale() and let the container pick the charset that corresponds to that locale. For example, the following JSP relies on setLocale():

    <%@ page import="java.util.Locale" %>
    <html>
      <head>
        <% response.setLocale(Locale.KOREAN); %>
      </head>
      <body>
        <%= "\uc548\ub155\uc138\uc694" %>
      </body>
    </html>

This is equivalent to:

    <html>
      <head>
        <%@ page contentType='text/html;charset=EUC-KR' %>
        <% response.setHeader("Content-Language", "ko"); %>
      </head>
      <body>
        <%= "\uc548\ub155\uc138\uc694" %>
      </body>
    </html>
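
For completeness, here is a minimal servlet sketch that produces the same response programmatically; the class name and markup are illustrative, but the mechanism - response.setContentType() plus an explicit Content-Language header - is the standard servlet API:

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class KoreanGreetingServlet extends HttpServlet {
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // Set the charset before obtaining the writer, so that the
            // PrintWriter encodes the output as EUC-KR.
            response.setContentType("text/html;charset=EUC-KR");
            response.setHeader("Content-Language", "ko");
            PrintWriter out = response.getWriter();
            out.println("<html><body>");
            out.println("\uc548\ub155\uc138\uc694");   // the same Korean greeting
            out.println("</body></html>");
        }
    }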

Client to Server (Internationalized Requests)

The encoding of a request is the character encoding that should be used to decode the parameters contained in that request. An internationalized request is one that contains a form that allows users to enter characters from a non-Latin-1 character set - i.e., characters that are not supported in HTTP's default encoding.

In this case, the first step is for the server to inform the browser which encoding it should use to encode the user input.

For a JSP, the page directive's contentType attribute that we met earlier does double duty in this case. I.e., it not only informs the server which encoding should be used to encode the characters being returned to the client, but also tells the client which encoding is to be used to encode the characters being submitted to the server.
The same applies to servlets, which set the Content-Type header directly for this purpose.

E.g., <%@ page pageEncoding="Shift_JIS" contentType="text/html;charset=UTF-8"%>

An HTTP request can only contain parameter values made up of the characters defined by the ISO 8859-1 character set. Hence, the browser must encode all other characters entered in input fields in terms of the allowed characters. It encodes each non-standard character as a string starting with a % sign followed by a hex value. The problem is that the hex value only makes sense if you know which charset it comes from.

Luckily, most browsers use the charset of the response containing the form to encode the parameter values when the form is submitted. As long as you keep track of the response encoding, you can tell the container which encoding to use to decode the parameter values.

Assume that the encoding is UTF-8, i.e., the user can enter values using any Unicode character. Then, when the user submits the form, each character in the form field payload is first encoded into bytes using the UTF-8 encoding.
The browser then uses the standard URL encoding scheme to encode the resulting bytes. This scheme dictates that, even when the default ISO 8859-1 encoding is used, the bytes for all characters other than a-z, A-Z, and 0-9 must be converted into the byte's value in hexadecimal preceded by a % sign.

For a charset of UTF-8, each Japanese character symbol is represented by 3 bytes. Each of these bytes is converted to a % followed by its value in hex. E.g., %E4%BB%8A represents a single Japanese character (今, U+4ECA).
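
You can verify this with java.net.URLEncoder, which performs exactly this form-style encoding. A small sketch:

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    public class FormEncodingDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // U+4ECA (今); its UTF-8 bytes are E4 BB 8A.
            System.out.println(URLEncoder.encode("\u4eca", "UTF-8"));      // %E4%BB%8A
            // The same character submitted from a Shift_JIS page produces a
            // different (and differently sized) escape sequence.
            System.out.println(URLEncoder.encode("\u4eca", "Shift_JIS"));
        }
    }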

When the container receives this information, it must know which charset the browser used to encode it. In practice, most browsers don't include the charset in the request's Content-Type header - so it is up to you to keep track of which encoding was used by a particular form, and to use that encoding to process the input. Once the container is told which charset to use, it can decode any parameter values correctly.

    String value = request.getParameter("employeeFirstName");
    // The container decoded the parameter bytes as ISO 8859-1 (the default);
    // recover those bytes and re-decode them using the request's actual encoding.
    emp.firstName = new String(value.getBytes("ISO-8859-1"), request.getCharacterEncoding());

Here we use HttpServletRequest.getCharacterEncoding() to obtain the encoding being used in the request, as reported by the Content-Type header.

The String(byte[], String) constructor uses the specified character set to decode the specified array of bytes.

You can also use ServletRequest.setCharacterEncoding(String enc) to override the character encoding supplied by the container. This method must be called prior to parsing any request parameters.
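
For instance, a minimal fragment (the parameter name is illustrative):

    // Must be called before the first getParameter()/getReader() call.
    request.setCharacterEncoding("UTF-8");
    String firstName = request.getParameter("employeeFirstName");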

Entering Unicode characters

You have a number of options when trying to use Unicode characters in a document.

  1. Use a UTF-8 capable editor:
    This is the simplest option, since you simply type in the desired Unicode character symbol (using a physical or virtual keyboard). If this is a JSP, simply set the page directive's pageEncoding attribute to UTF-8. This ensures that the JSP will be parsed correctly. If this is an HTML file, set the Content-Type header appropriately.
  2. HTML Character Entities
    While only valid for HTML, they do provide a simple mnemonic-based mechanism for representing the Latin-1 characters beyond 127 (i.e., outside the ASCII range). In other words, they provide a means to represent the additional Latin-1 characters using plain ASCII.
    To use this mechanism, you first determine the nickname for the particular character you want, and then enclose it within a & and a semicolon. For a list of nicknames see 'Character Entities' in the references below. E.g., 'æ' represents a ligature, and is represented using '&aelig;'.
  3. Character Escapes
    These are preferable to character entities since the entities are only valid for HTML. On the other hand, character escapes can be used in both servlets and JSP pages.
    You can use character escapes to represent any Unicode character, by enclosing its hexadecimal code point within a &#x and a semicolon. E.g., &#x0BB0; represents the Tamil consonant 'ர' ('ra').
  4. Java Escape Sequence
    When the Java source file's encoding cannot represent a particular character, you can still express it by preceding the character's hex code point with a \u, e.g., \uXXXX.
    out.println("<h1>\uf460</h1>");

Determining Locales on the Web

An internationalized application must determine the encoding of the incoming request parameters. An HTML browser encodes each request using the encoding of the page that was the source of the request, but this is only useful if the original page's encoding is known. The following options are available for you to determine a request's locale:

  1. Define an application-wide encoding
    If an application transmits every one of its pages using a given encoding (e.g., UTF-8), then requests from those pages will always be in that encoding. This simplifies design, but requires that each page set this common encoding. A servlet filter can be used to set the request and response encodings to a single value before a servlet or JSP page handles the request, enforcing the common encoding application-wide (see the filter sketch after this list).
  2. Provide a separate entry point for each locale
    Map requests for a given locale to a separate servlet. E.g., http://myApp/login/en_US/ for US English, and http://myApp/login/de_CH for Swiss German.

References:

  1. XML Bible, Elliotte Rusty Harold
  2. http://www.joelonsoftware.com/articles/Unicode.html
  3. Servlets and JavaServer Pages: The J2EE Technology Web Tier. Faulkner, Jones
  4. JavaServer Pages, Bergsten
  5. The J2EE Tutorial, Bodoff et al.
  6. Advanced Java Server Pages, Geary.
  7. W3C Unicode tutorial
  8. Unicode FAQ
  9. Unicode character charts
  10. Character entities
