Character encodings

The topic of characters and fonts is very complex. It started out easy, when all (American) computers used ASCII characters displayed in a terminal window. Now we have graphical displays that can draw any character from any alphabet in the world. But we don't sit at keyboards with thousands of keys. That just begins to hint at the complexity of letting documents use any character from any alphabet.

This document is a short introduction to characters, character sets, character encodings, glyphs, and fonts.

All the example code mentioned in this document is in the subfolder called "character_sets" from the following zip file.

Code page

We mentioned in the last document that ASCII uses 7 bits to encode 128 characters, which covers the characters on a standard keyboard. Since a byte has one more bit, Extended ASCII uses that eighth bit to encode 128 more characters (which are not on a standard keyboard). Unlike ASCII, which is an international standard, Extended ASCII was never standardized. There are hundreds of ways to choose 128 characters and create an Extended ASCII character set.

A choice of 128 characters, along with a choice of which number between 128 and 255 will represent each of those characters, is called a code page. Every code page uses ASCII for the character numbers from 0 to 127.

Code pages are not important to GUI programs, though some text editors use them. Code pages are important when you open a console window. A console window will display text data using a particular code page. The same text data may appear different in different console windows if the consoles are using different code pages. As we have seen many times, byte data is always ambiguous. There must always be an agreement about how to interpret bytes. A code page is an agreement about how to display character bytes.

Do the following experiments in a console window opened to the "character_sets" folder.

Windows has a command-line program, "chcp", that tells you which code page the console is currently using and also lets you change it.

    character_sets> chcp

In the "character_sets" folder there is a data file that contains character bytes with values in the Extended ASCII range from 128 to 254. The contents of this file will look different if you look at it using different code pages.
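If you want to recreate such a data file yourself, here is a minimal Java sketch (the class name is made up for this example; it simply writes one byte for each value from 128 to 254):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MakeExtendedAscii {
    public static void main(String[] args) throws IOException {
        // One byte for each Extended ASCII value from 128 to 254.
        byte[] data = new byte[127];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (128 + i);
        }
        Files.write(Path.of("CharacterData_Ex_ASCII.txt"), data);
    }
}
```

Notice that the file contains no code page information at all; the bytes only become visible characters when a console (or an editor) interprets them with some code page.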

    character_sets> chcp 437
    character_sets> type CharacterData_Ex_ASCII.txt
    character_sets> chcp 1252
    character_sets> type CharacterData_Ex_ASCII.txt
    character_sets> chcp 869
    character_sets> type CharacterData_Ex_ASCII.txt
    character_sets> chcp 65000
    character_sets> type CharacterData_Ex_ASCII.txt
    character_sets> chcp 65001
    character_sets> type CharacterData_Ex_ASCII.txt

The default code page for the cmd terminal is usually code page 437 (on U.S. systems); sometimes it is code page 1252. Ideally it would be code page 65001, the UTF-8 "code page" (which is not really a code page, since UTF-8 is not a single-byte encoding).
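You can reproduce the chcp experiment inside a Java program: decode the same byte with two different code pages and you get two different characters. A small sketch ("IBM437" and "windows-1252" are the JDK's charset names for code pages 437 and 1252):

```java
import java.nio.charset.Charset;

public class CodePageDemo {
    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xE1 };  // one byte, value 225

        // Code page 437 maps 0xE1 to 'ß'; code page 1252 maps it to 'á'.
        System.out.println(new String(bytes, Charset.forName("IBM437")));
        System.out.println(new String(bytes, Charset.forName("windows-1252")));
    }
}
```

The byte data never changes; only the agreement about how to interpret it changes.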

Let's look at an example of why understanding code pages can be useful.

A Java source file (a file with the extension ".java") is usually thought of as a "text file". But there is no such thing as a "text file": every file is just a sequence of bytes stored in a file system. As the compiler reads bytes from a source file, it needs to follow some agreement on how to interpret those bytes as a sequence of text characters. The Java compiler uses a code page as its agreement on how to interpret a sequence of bytes as a sequence of characters.

Which code page the Java compiler uses depends on both the version of Java and the operating system. Starting with Java 18, the Java compiler uses UTF-8 as its agreement for translating byte sequences into character sequences. Before Java 18, the Java compiler used the platform's default character set, which on Windows is usually code page 1252 (Cp1252).

We can tell the Java compiler to use a specific code page by using the compiler's -encoding command-line option.

We can tell the JVM to use a specific default code page by using the JVM's -Dfile.encoding command-line option.
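The consequence of choosing the wrong agreement is easy to demonstrate in code. The byte 0x94 is a perfectly good character in Cp1252, but it is not a valid byte sequence in UTF-8, so a UTF-8 decoder replaces it with the replacement character U+FFFD. This is essentially what happens when a Cp1252 source file is compiled as if it were UTF-8:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongEncoding {
    public static void main(String[] args) {
        byte[] bytes = { (byte) 0x94 };  // a right double quote in Cp1252

        // Decoded with the right agreement: the character '”' (U+201D).
        System.out.println(new String(bytes, Charset.forName("windows-1252")));

        // Decoded with the wrong agreement: 0x94 is an illegal start byte
        // in UTF-8, so the decoder produces U+FFFD instead.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

On the command line, these agreements are chosen with, for example, javac -encoding Cp1252 MyProgram.java for the compiler and java -Dfile.encoding=UTF-8 MyProgram for the JVM.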

Character set

A "character set" is a choice of characters. A character set is usually some alphabet (like the Latin, Greek, or Cyrillic alphabets) combined with useful symbols (like punctuation or arithmetic symbols).

Each code page is a character set. Many code pages were created by countries to contain their native alphabet along with their most important symbols.

  • List of code pages
  • The Java language has a notion of a "charset". Most people read charset as "character set", but a Java charset is not exactly a character set. It is more like a "character encoding".

Character encoding

Once you have chosen the characters that will be in your character set, you then need to choose a binary encoding for each of those characters: exactly what byte value (or combination of byte values) will represent each character.

Choosing a binary representation for each character in a character set is called a character encoding.

Every code page is both a character set and a character encoding. In a code page, every character is encoded as a single byte. There are other character encodings that use more than one byte to encode a character.

Coded character set

If you take the characters in a character set and put them in a specific order, so you have a character that comes first, a character that comes second, and a character that comes last, then that ordering makes the character set into a coded character set.

This terminology can be confusing, because a "coded character set" is not a "character encoding". A "character encoding" means that you have assigned a binary value to represent each character in your character set. A "coded character set" means that you have put your characters in an ordering.

When you have a coded character set, you have assigned a number to each character (but not a binary representation). The number assigned to a character is called its code point (in that coded character set).

If we go back to Extended ASCII, every code page is simultaneously a character set, a coded character set, and a character encoding. If you look at the code page's table, the table shows you the character set. The table puts the characters in an ordering, from the table's upper left-hand corner to the table's lower right-hand corner. The code point for each character is determined by its position in the table, starting with 0 at the upper left-hand corner. Each character's binary encoding is just the 8-bit binary number for its code point.

When we get to a more complex character set, like Unicode, the difference between "code point" and "character encoding" becomes important.

Unicode

Unicode is a character set. It is supposed to be the set of all characters and symbols used by any, and every, language on Earth. Unicode presently has 159,801 characters in its character set.

Unicode is a coded character set. The 159,801 characters in Unicode are in a specific order. A character's position in this ordering is called the character's code point.

Unicode is also a character encoding, or rather several different encodings. The two most common encodings of Unicode are UTF-8 and UTF-16; there is also UTF-32, but it is not used much. Java, JavaScript, and C# all define their char data type in terms of UTF-16. On the other hand, the Internet uses UTF-8, and most newer programming languages are based on the UTF-8 character encoding.

Let's look at Unicode code points first, and then we will look at Unicode encodings.

Since Unicode is a coded character set, all 159,801 characters are in an ordering from the first character to the last one. The first 128 characters (code points 0 to 127) in the Unicode ordering are the 128 ASCII characters. The next 128 characters (code points 128 to 255) are called "Latin-1". These are the most common characters used in Europe. The first 256 code points in Unicode are exactly Windows code page 28591 (ISO-8859-1), and very nearly code page 1252.

Unicode has a notation for Unicode code points. The notation uses the hexadecimal value of the code point, padded to at least four digits. For example, the letter 'a' in ASCII has the code 97 in decimal, which is 0x61 in hexadecimal. Since the first 128 code points in Unicode are the same as ASCII, the Unicode code point for the letter 'a' is written U+0061.
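Java can report a character's code point directly, and String.format with %04X produces the conventional U+ notation:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // The code point of the first character in the string "a".
        int cp = "a".codePointAt(0);
        System.out.println(cp);                           // 97
        System.out.println(String.format("U+%04X", cp));  // U+0061
    }
}
```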

UTF-8

UTF-8 is a binary encoding of all the code points in Unicode. UTF-8 uses an 8-bit code unit. That means that every Unicode code point is encoded as either one, two, three, or four bytes. UTF-8 is called a "variable length encoding" because the encoding of a code point can have a length that is between 1 and 4 bytes. If a string has five characters, the UTF-8 encoding of that string can be between 5 and 20 bytes long, depending on the exact characters in the string.
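You can check the variable-length behavior in Java by encoding single characters and counting the bytes. In this sketch, each string holds exactly one Unicode character (the characters are written as escape sequences so the source file itself stays pure ASCII):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // 'a' (U+0061): 1 byte in UTF-8
        System.out.println("a".getBytes(StandardCharsets.UTF_8).length);
        // 'é' (U+00E9): 2 bytes
        System.out.println("\u00E9".getBytes(StandardCharsets.UTF_8).length);
        // '€' (U+20AC): 3 bytes
        System.out.println("\u20AC".getBytes(StandardCharsets.UTF_8).length);
        // '😀' (U+1F600): 4 bytes
        System.out.println("\uD83D\uDE00".getBytes(StandardCharsets.UTF_8).length);
    }
}
```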

UTF-16

UTF-16 is a binary encoding of all the code points in Unicode. UTF-16 uses a 16-bit code unit. That means that every Unicode code point is encoded as either one or two 16-bit (two byte) words. Because the UTF-16 code unit is two bytes, byte ordering is important. That leads to two additional encodings, UTF-16BE and UTF-16LE. In UTF-16BE, the big-endian byte order is always used. In UTF-16LE, the little-endian byte order is always used. In UTF-16, the byte order used by a sequence of bytes is declared by a byte order mark (BOM) at the beginning of the sequence.

Java uses UTF-16. The char data type represents a UTF-16 code unit (not a Unicode character!). If a Unicode code point is represented in UTF-16 by a single code unit, then that char value does represent a Unicode character. But some Unicode code points require two code units in UTF-16. In that case, we need two char values to represent that character.
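The two-code-unit case is easy to see in Java. The emoji U+1F600 needs two char values (called a surrogate pair), so length() reports 2 even though the string holds only one character:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00";  // U+1F600, one Unicode character (😀)

        System.out.println(s.length());                       // 2 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
    }
}
```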

Fonts

When we work with computer representations of text (or what are generally called "writing systems") we need to make many detailed distinctions in order to be clear about what we are saying. We need to define a number of terms that describe text and its appearance: character, glyph, font, and typeface. We will not give precise definitions of these terms, just definitions that are good enough to talk reasonably about text.

A character is a letter from an alphabet, or a common symbol like a numeral or a punctuation mark.

A glyph is a drawing of a character. Here is a drawing of the letter 'a'. But that is not the only way to draw the letter 'a'. Here are nine different glyphs for the letter 'a'.

a a a a a a a a a

You could use any one of those glyphs to spell the word "cat". The choice of a glyph does not change the meaning of the letter or the word. Most importantly, all nine of these glyphs have the same character encoding. They are all the ASCII character with hexadecimal code 0x61 (decimal code 97).

Here are nine glyphs for the letter 'A'. Every one of these glyphs is represented by the ASCII character code hexadecimal 0x41 (decimal code 65).

A A A A A A A A A

If every one of those glyphs has the same ASCII code, then how does the display system know that it should draw them differently? The letters 'a' and 'A' have different ASCII codes, so we expect the display system to draw something different for each one. But if we use the same ASCII code, 0x41, nine times, how do we get nine different drawings? To (partially) answer this question, we need some more terminology.

When you choose a specific glyph for each letter in an alphabet, that collection of glyphs is called a font. Usually, all the glyphs in a font have the same size and are in a similar style.

A collection of fonts in different sizes, where each font is, more or less, a scaled version of the other fonts, is called a typeface.

In most current uses of these terms, the distinction between a font and a typeface is blurred. When people talk about a font, they usually mean a typeface.

In the above display of the nine glyphs for the letter 'A', just before each character 'A' there is a special code embedded into this text that tells the display system to switch to a different font. Once the display system is told to switch to a different font, it will use that font to display every character it decodes. In the above display, the font was changed nine times, once just before each occurrence of the character 'A'.

If you want to see the hidden code that changes the font for each 'A', use your mouse to point at one of the glyphs, then right-click on the glyph and choose the menu item "Inspect".

Not all display systems can switch fonts character by character. For example, most text editors cannot do that (but word processing programs can). In a text editor, you choose a global font and then all the text in your document is displayed using that font. Similarly for a console window: the font is a global setting and all the text in the console window is shown in the same font. In some console programs, you need to restart the program in order to change the display font.

Exercise: How does "changing the font" compare with "changing the code page"? Both change the appearance of what you see on the screen without changing the underlying data. Are they similar ideas? Are they the same thing? (Hint: This is a subtle question.)

Exercise: Find out how to change the font in your text editor. Find out how to change the code page in your text editor.

Exercise: Look up "ligatures". Install into your text editor a "programmer's font" that uses ligatures (such as Fira Code). How do ligatures fit into the ideas of character encodings, code points, and glyphs? Does Unicode have code points for ligatures?