A Basic Guide About Encodings

As Joel Spolsky put it well^[1], it is almost unacceptable for a developer to ignore the existence of encodings: all we deal with is information, and encodings are the basis for representing that information.

I’ve faced lots of problems rooted in encodings without being aware of it. Even after I realized that encodings existed, I was told that it was almost impossible to learn about them. The subject is confusing, but I found out that one of the main reasons is the terminology, and the way many sources mix the terms up.

This post outlines a basic structure and a few definitions, then works through examples and the most common questions about encodings. Feel free to suggest more questions in the comments.

A basic structure

To understand what encodings are, it is necessary to understand the information flow in which the idea of a character turns into bytes (and vice versa). RFC 2130^[2] defines three abstraction levels. Here we put one more level before the others, based on the Unicode Model^[3], which makes four and improves the understanding:

Abstract Character Repertoire (ACR)
Coded Character Set (CCS)
Character Encoding Scheme (CES)
Transfer Encoding Syntax (TES)

The idea of a letter, for example, passes through these 4 layers to turn into a piece of information manageable by a computer. Let’s see what each of these layers means and, by the end, illustrate with a few examples.

Abstract Character Repertoire

An abstract character, or simply a character, is a minimal unit of text that has semantic value^[5]. An Abstract Character Repertoire is therefore the set of characters to be encoded, such as an alphabet or a symbol set.

{, ~, ã and a are examples of characters. A character is an abstract definition: the character is not this particular a, but any representation of the letter with the same meaning, so its glyph may differ, in another font, for instance. It is also possible for apparently identical glyphs to represent two different characters, like the Latin letter A and the Greek uppercase alpha Α (check it, they have different codes).
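A quick way to check the claim about look-alike glyphs is to ask Python for the code and the official name of each character. This is only a minimal sketch; the variable names are illustrative, and the Greek letter is written as an escape so the example does not depend on how this page is encoded.

```python
import unicodedata

latin_a = "A"             # LATIN CAPITAL LETTER A
greek_alpha = "\u0391"    # GREEK CAPITAL LETTER ALPHA, usually drawn like "A"

# Print the code point and the Unicode name of each character.
for ch in (latin_a, greek_alpha):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The glyphs may look identical, but the characters are not the same.
same_character = latin_a == greek_alpha
print(same_character)
```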

Examples of ACRs are the Unicode/10646 repertoire (below we explain why we treat Unicode and ISO 10646 as the same) and the Western European alphabets and symbols of Latin-1 (CS 00697). Note that these standards sometimes use the same name for the repertoire and for the coded character set.

Coded Character Set

A Coded Character Set (CCS) is a mapping from a set of abstract characters to a set of integers. Examples of coded character sets are ISO 10646, US-ASCII, and ISO-8859 series.^[2]
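To make the "character to integer" mapping concrete, the sketch below looks up the same abstract character in two different coded character sets. It assumes Python's built-in `iso8859_5` codec as a stand-in for the ISO 8859-5 table; because that set is single-byte, the integer and the resulting octet happen to coincide, which is not true in general.

```python
# "\u0411" is the Cyrillic capital letter BE, written as an escape so the
# example does not depend on the encoding of this file.
cyrillic_be = "\u0411"

# The same abstract character gets a different integer in each CCS.
unicode_number = ord(cyrillic_be)
iso8859_5_number = cyrillic_be.encode("iso8859_5")[0]

print(f"Unicode/10646: U+{unicode_number:04X}")
print(f"ISO 8859-5:    0x{iso8859_5_number:02X}")

# Going from the integer back to the character uses the same table in reverse.
from_unicode = chr(unicode_number)
from_iso8859_5 = bytes([iso8859_5_number]).decode("iso8859_5")
```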

Character Encoding Scheme

A Character Encoding Scheme (CES) is a mapping from a Coded Character Set or several coded character sets to a set of octets. Examples of Character Encoding Schemes are ISO 2022 and UTF-8. A given CES is typically associated with a single CCS; for example, UTF-8 applies only to ISO 10646.^[2]

The Unicode Model^[3] defines two different layers within this one, but this post will consider it as only one for simplicity’s sake.
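As a rough illustration of this layer, the sketch below takes one code point and serializes it with several encoding schemes of the same CCS. The codec names are Python's own; the variable names are just for the example.

```python
import codecs

a_acute = "\u00c1"   # code point U+00C1, Latin capital letter A with acute

# One code point, several Character Encoding Schemes, different octets.
schemes = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be"]
octets_by_scheme = {name: a_acute.encode(name).hex() for name in schemes}
for name, hex_octets in octets_by_scheme.items():
    print(f"{name:10} {hex_octets}")

# UTF-16 is often written with a byte order mark (BOM) in front, which tells
# the reader whether the little-endian or big-endian form follows.
with_bom = (codecs.BOM_UTF16_LE + a_acute.encode("utf-16-le")).hex()
print(f"{'utf-16+BOM':10} {with_bom}")
```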

Transfer Encoding Syntax

It is frequently necessary to transform encoded text into a format which is transmissible by specific protocols. The Transfer Encoding Syntax (TES) is a transformation applied to character data encoded using a CCS and possibly a CES to allow it to be transmitted. Examples of Transfer Encoding Syntaxes are Base64 Encoding, gzip encoding, and so forth.
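The two transfer syntaxes named above can be tried with Python's standard library. This is a minimal sketch under the assumption that the text was already encoded with UTF-8; note that both transformations work on octets and know nothing about characters.

```python
import base64
import gzip

# Octets produced by the previous layer (CES): the word "olá" in UTF-8.
octets = "ol\u00e1".encode("utf-8")

# Base64 turns arbitrary octets into a restricted set of ASCII characters.
as_base64 = base64.b64encode(octets)

# gzip compresses the octets; the result is binary, not text.
as_gzip = gzip.compress(octets)

# Both syntaxes are reversible and give back exactly the same octets.
from_base64 = base64.b64decode(as_base64)
from_gzip = gzip.decompress(as_gzip)
print(as_base64, from_base64 == octets, from_gzip == octets)
```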

Examples

Suppose we have the character Á, the Latin letter A with an acute accent. It belongs to several character repertoires, for example the Unicode/10646 and Latin-1 repertoires. Let’s use Unicode/10646.

Looking it up in the Coded Character Set table (also called a code page, charset table, etc.), we find that the integer representing it is U+00C1 (the Unicode Standard puts U+ before every code point). The next step is to convert it into octets using a Character Encoding Scheme (also known as encoding the string). We can use UTF-8 and get 0xC381, UTF-16 and get 0xFFFEC100 (a byte order mark followed by the little-endian form), or any other encoding scheme defined for the chosen CCS. For this example, let’s use UTF-8.

The Transfer Encoding Syntax usually depends on how the data is transmitted. A MIME message may declare the TES in the Content-Transfer-Encoding header^[4], for instance as Base64, and we would have w4E=.

Normally we deal mostly with the ACR, CCS, and CES, calling that the encoding process, and leave the TES to the machine.

So we have:

(ACR) Abstract Character Á
(CCS) Code point U+00C1
(CES) The octets 0xC381
(TES) Base64 w4E=

The decoding just follows the inverse path.
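The inverse path can be sketched in a few lines of Python, starting from the Base64 text and undoing one layer at a time. The variable names are only illustrative.

```python
import base64
import unicodedata

transferred = "w4E="                       # what travels on the wire (TES)

octets = base64.b64decode(transferred)     # undo the TES
character = octets.decode("utf-8")         # undo the CES
code_point = ord(character)                # the integer assigned by the CCS
name = unicodedata.name(character)         # the abstract character's name

print(octets.hex(), f"U+{code_point:04X}", name)
```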

Some elements that may cause confusion

The explanation above seems quite simple, so where does the confusion come from? I will try to list a few sources below.

The charset name in MIME header

According to RFC 2130^[2]:

The term ‘Character Set’ means many things to many people. Even the MIME registry of character sets registers items that have great differences in semantics and applicability.

This causes a lot of confusion, for example when you read Content-Type: text/plain; charset=ISO-8859-1 in the headers sent with a web page or an e-mail, because “in MIME, the Coded Character Set and Character Encoding Scheme are specified by the Charset parameter to the Content-Type header field”^[2]. A single charset name therefore stands for both layers at once.
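To see how one charset label drives the whole decoding, here is a small sketch that parses a made-up MIME message with Python's standard `email` package. The message itself is invented for the example; the header names come from the text above.

```python
from email import message_from_string

# A tiny, hypothetical MIME message: one header names the charset
# (CCS + CES), the other names the transfer encoding (TES).
raw_message = (
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "wQ==\n"
)

message = message_from_string(raw_message)

charset = message.get_content_charset()     # lower-cased charset parameter
octets = message.get_payload(decode=True)   # undoes the Base64 layer only
text = octets.decode(charset)               # charset picks the table and the scheme

print(charset, octets, text)
```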

What are the other Coded Character Sets? I only know Unicode!

There are lots of CCSs. In the beginning there were standards like EBCDIC and ASCII (where the name is the same across all the layers), and other CCSs that extended ASCII to cover more languages arose later, like Cyrillic (ISO 8859-5) and Latin-1 (ISO 8859-1). The effort to unify them gave birth to Unicode and ISO 10646, which encompass the other standards. There is also another CCS standard used in China, called GB 18030^[9].

It is important to emphasize that Unicode and ISO 10646 are now synchronized^[8] and the names are almost interchangeable.

Are there CCSs inside other CCSs? Is Latin-1 inside Unicode?

As the technical report for Unicode Character Encoding Model^[3] notes:

Subsetting is a major formal aspect of ISO/IEC 10646. The standard includes a set of internal catalog numbers for named subsets and further makes a distinction between subsets that are fixed collections and those that are open collections, defined by a range of code positions. Open collections are extended any time an addition to the repertoire gets encoded in a code position between the range limits defining the collection. When the last of its open code positions is filled, an open collection automatically becomes a fixed collection.

So, there are various Latin CCSs, and there is also a subset of Unicode/10646 that is called Latin. It is not that Latin-1 (ISO 8859-1) itself is inside Unicode, but all of its characters are, and they form a collection.
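One way to observe this relationship is to decode every possible Latin-1 octet and compare the result with the Unicode code points. The sketch assumes Python's `latin-1` codec as the ISO 8859-1 table; with it, each octet lands on the Unicode code point of the same number.

```python
# Every octet value from 0x00 to 0xFF, decoded as ISO 8859-1.
all_octets = bytes(range(256))
decoded = all_octets.decode("latin-1")

# Each Latin-1 character exists in Unicode, at the code point equal to its octet.
same_numbers = all(ord(ch) == octet for ch, octet in zip(decoded, all_octets))

print(same_numbers, f"U+{ord(decoded[0xC1]):04X}")
```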

And what is “plain text”?

The use of the expression “plain text” is controversial^[1], but it is still widely used, so it is useful to at least try to understand what it should mean.

ASCII defines only 128 code points, but an 8-bit byte can hold 256 values, so when ASCII was first deployed on such machines 128 positions were left spare. A lot of people had a lot of different ideas about what to do with the spare values. Western Europe created Latin-1 and others, the Russians created Cyrillic encodings, and almost every other culture made its own version. Despite this Tower of Babel, almost all of these encodings preserved the ASCII characters with the same octets. Newer encodings like UTF-8 also followed this convention.
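Both halves of that story can be checked with a short sketch: the ASCII range gives the same octets under several encodings, while an octet from the upper half is read as a different character by each one. The codec names are Python's; the choice of encodings is just for illustration.

```python
# ASCII-only text produces identical octets in all of these encodings.
text = "plain text"
ascii_octets = {text.encode(name) for name in ("ascii", "latin-1", "iso8859_5", "utf-8")}
print(len(ascii_octets))   # a set with one element means all results were equal

# The same high octet, read with three different 8-bit tables.
high_octet = b"\xc1"
readings = {
    name: high_octet.decode(name)
    for name in ("latin-1", "iso8859_5", "iso8859_7")
}
for name, ch in readings.items():
    print(f"{name:10} U+{ord(ch):04X}")
```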

According to Jukka Korpela^[6]:

ASCII has been used and is used so widely that often the word ASCII refers to “text” or “plain text” in general, even if the character code is something else! The words “ASCII file” quite often mean any text file as opposed to a binary file.

However, be aware that using only ASCII characters does not guarantee that your string will be read correctly by everyone: encodings such as EBCDIC or UTF-16 represent those same characters with different octets.
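As a small illustration of that warning, the sketch below encodes a single ASCII letter with two encodings that do not keep the ASCII octets. It assumes Python's `cp037` codec as one example of an EBCDIC code page.

```python
letter = "A"

ascii_octets = letter.encode("ascii")        # the familiar single octet 0x41
utf16_octets = letter.encode("utf-16-le")    # two octets per character here
ebcdic_octets = letter.encode("cp037")       # an EBCDIC code page: a different octet

# A reader that wrongly assumes Latin-1 sees another character entirely.
misread = ebcdic_octets.decode("latin-1")

print(ascii_octets.hex(), utf16_octets.hex(), ebcdic_octets.hex(), misread)
```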

Conclusion

Seems like encodings are not so hard to understand, right? Next time your text looks like a mess on the screen, you may know what to do.
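For instance, one common kind of mess appears when UTF-8 octets are decoded with a Latin-1 table. The sketch below reproduces it and then undoes it; it only works under the assumption that you know (or correctly guess) which wrong encoding was applied.

```python
original = "S\u00e3o Paulo"

# The writer used UTF-8, but the reader assumed Latin-1.
octets = original.encode("utf-8")
garbled = octets.decode("latin-1")
print(garbled)

# Undo the wrong step to get the octets back, then decode them properly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```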

Also, the next time you have to choose which encoding to use, remember that RFC 2130 recommends ISO 10646 as the default CCS and UTF-8 as the default encoding.

References

[1] The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky
[2] RFC 2130 - The Report of the IAB Character Set Workshop
[3] Unicode Character Encoding Model
[4] RFC 1341 - The Content-Transfer-Encoding Header Field
[5] Wikipedia - Character encoding
[6] A tutorial on character code issues, by Jukka “Yucca” Korpela
[7] RFC 3629 - UTF-8, a transformation format of ISO 10646
[8] Wikipedia - Unicode
[9] Wikipedia - GB 18030
