CITS2002 Systems Programming, Lecture 22, p11

Unicode support in C11

One of the long-overdue features added to the C11 standard is support for Unicode character sets, through UTF-8, UTF-16, and UTF-32 encodings.

C was missing this feature for a long time, and C programmers had to use third-party libraries such as IBM's International Components for Unicode (ICU).

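Before C11, text was normally held in the char and unsigned char types, 8-bit integer variables (on almost all platforms) used to store ASCII and Extended-ASCII characters. By creating arrays of these characters, we could create ASCII strings. (The older wchar_t type also exists, but its width and encoding vary between platforms.)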

Portable programs should not be limited to communicating only in English, or ISO-Latin languages. There are thousands of other natural languages, employing character sets other than the English alphabet. Portable programs should support these without requiring a different program, or source-code base, for each language.

ASCII and Extended-ASCII - 8-bit character sets

The ASCII standard has 128 characters, each stored in 7 bits. Extended-ASCII adds another 128 characters for a total of 256, so a single 8-bit (one-byte) variable is sufficient. See man ascii.
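As a small illustration of the 7-bit limit, the following sketch tests whether a string contains only 7-bit ASCII values. The function name is_7bit_ascii is a hypothetical helper chosen for this example, not part of the standard library.

```c
#include <stdbool.h>

/* Hypothetical helper: return true if every byte of the null-terminated
   string s is a 7-bit ASCII value (0..127), i.e. its top bit is clear. */
bool is_7bit_ascii(const char *s)
{
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; ++p) {
        if (*p & 0x80) {        /* top bit set: outside the ASCII range */
            return false;
        }
    }
    return true;
}
```

Any byte with its top bit set belongs either to an Extended-ASCII character or to part of a multi-byte encoding.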

Support for ASCII characters and strings is fundamental, and will never be removed from C. C11 adds support for the Unicode encodings and, therefore, for strings in which a character may require more than one byte.
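The different storage requirements can be seen with C11's string-literal prefixes u8, u and U, and the char16_t and char32_t types from <uchar.h>. This is a minimal sketch that assumes the platform's C library provides <uchar.h>; the array and function names are hypothetical, chosen for this example.

```c
#include <stddef.h>
#include <uchar.h>      /* C11: char16_t and char32_t */

/* The euro sign, U+20AC, written with each C11 string-literal prefix. */
static const char     euro_utf8[]  = u8"\u20AC";  /* UTF-8 code units  */
static const char16_t euro_utf16[] = u"\u20AC";   /* 16-bit code units */
static const char32_t euro_utf32[] = U"\u20AC";   /* 32-bit code units */

/* Number of array elements (code units), including the terminator. */
size_t euro_utf8_units(void)  { return sizeof euro_utf8  / sizeof euro_utf8[0];  }
size_t euro_utf16_units(void) { return sizeof euro_utf16 / sizeof euro_utf16[0]; }
size_t euro_utf32_units(void) { return sizeof euro_utf32 / sizeof euro_utf32[0]; }
```

The same single character occupies three one-byte code units in UTF-8, but only one code unit in each of the wider types. Each array also holds a terminating zero element.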

Suddenly, characters may be of different lengths (1 to 4 bytes in UTF-8, 2 or 4 bytes in UTF-16), and it's the value of the character that determines its length. Consider how this affects the standard strlen() function, which just counts the bytes found until the null byte: for a UTF-8 string it reports the number of bytes, not the number of characters!
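To make the difference concrete, here is a minimal sketch of a function that counts characters (Unicode code points) rather than bytes. It assumes its input is valid, null-terminated UTF-8, and the name utf8_codepoints is hypothetical. It relies on the fact that every continuation byte in UTF-8 has the bit pattern 10xxxxxx, so counting only the other bytes counts one per character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the code points in a valid, null-terminated
   UTF-8 string by skipping continuation bytes (bit pattern 10xxxxxx). */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; ++p) {
        if ((*p & 0xC0) != 0x80) {   /* not a continuation byte */
            ++count;
        }
    }
    return count;
}
```

For the string "na\xC3\xAFve" (the word "naïve" encoded in UTF-8), strlen() returns 6 while utf8_codepoints() returns 5, because the "ï" occupies two bytes.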

CITS2002 Systems Programming, Lecture 22, p11, 16th October 2023.