CITS2002 Systems Programming  

Unicode support in C11, continued

The Unicode standard introduced mechanisms that use more than one byte to encode every character: those in ASCII and Extended-ASCII, as well as the 'wide' characters of thousands of different natural languages. These mechanisms are termed encodings.

Unicode defines 3 well-known encodings: UTF-8, UTF-16, and UTF-32:

  • UTF-8 uses a single byte to store each of the 128 ASCII characters (the lower half of the single-byte range), and a sequence of two to four bytes for every other character, including the Extended-ASCII characters and all wide characters. Hence, UTF-8 is considered a variable-sized encoding.

  • Like UTF-8, UTF-16 is a variable-sized encoding: it uses one or two words (each word occupying 16 bits) to store each character. Most commonly used characters fit in a single word (two bytes); the remainder require two words (four bytes). In both UTF-8 and UTF-16, fewer bytes are used for the more frequent characters.

  • UTF-32 uses exactly 4 bytes to store the value of every character, even ASCII characters; therefore, it is a fixed-sized encoding. This costs more space, but does restore our idea of 'counting' characters, and enables individual characters to act as array indices.
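The size differences between the three encodings can be sketched in a few lines of C. The function names below (utf8_bytes, utf16_bytes, utf32_bytes) are hypothetical helpers invented for this illustration, not part of any standard library; each takes a Unicode code point and reports how many bytes that encoding would need, returning 0 for values that are not valid characters.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helpers (not from any standard header): the number of bytes
// needed to store code point cp in each encoding, or 0 if cp is invalid.

// Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not characters.
static int is_valid_code_point(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

size_t utf8_bytes(uint32_t cp)
{
    if (!is_valid_code_point(cp)) return 0;
    if (cp <= 0x7F)   return 1;     // ASCII
    if (cp <= 0x7FF)  return 2;     // includes the Extended-ASCII range
    if (cp <= 0xFFFF) return 3;
    return 4;
}

size_t utf16_bytes(uint32_t cp)
{
    if (!is_valid_code_point(cp)) return 0;
    return (cp <= 0xFFFF) ? 2 : 4;  // one or two 16-bit words
}

size_t utf32_bytes(uint32_t cp)
{
    return is_valid_code_point(cp) ? 4 : 0;   // always 4 bytes
}
```

For example, the letter 'A' (U+0041) needs 1, 2, and 4 bytes in UTF-8, UTF-16, and UTF-32 respectively, while the euro sign (U+20AC) needs 3, 2, and 4.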

Note that C11 does not define new standard functions to operate on Unicode strings. Existing functions such as strlen() count bytes, not characters, so we have to write a new strlen() function for them.
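One possible way to write such a function for UTF-8 strings is sketched below. The name utf8_strlen is our own invention, and the sketch assumes its argument is a valid, NUL-terminated UTF-8 string. It relies on the fact that every continuation byte in UTF-8 has the bit pattern 10xxxxxx, so counting the bytes that do not match that pattern counts the characters.

```c
#include <stddef.h>

// Hypothetical helper: count the characters (code points), not the bytes,
// in a NUL-terminated string that is assumed to be valid UTF-8.
size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; ++p) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // begins a new character.
        if ((*p & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Given the 5-byte string "h\xE2\x82\xAC!" (an 'h', a euro sign, and a '!'), strlen() reports 5 whereas utf8_strlen() reports 3.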

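However, C11 does define Unicode conversion functions in the new <uchar.h> header file: mbrtoc16(), c16rtomb(), mbrtoc32(), and c32rtomb() convert between multibyte characters and the char16_t and char32_t types.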
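As a minimal sketch of how one of the <uchar.h> functions might be used, the wrapper below converts a single char32_t character to its multibyte form with c32rtomb(). The wrapper name c32_to_multibyte is hypothetical. Note the assumption that matters most: the multibyte encoding produced depends on the program's current locale, so non-ASCII characters will typically convert to UTF-8 only after a call such as setlocale(LC_ALL, "") in an environment that uses a UTF-8 locale.

```c
#include <limits.h>   // MB_LEN_MAX
#include <string.h>   // memset
#include <uchar.h>    // char32_t, c32rtomb, mbstate_t

// Hypothetical wrapper: convert one char32_t character to the multibyte
// encoding of the current locale. Returns the number of bytes written
// to out, or 0 if the character cannot be represented in that locale.
size_t c32_to_multibyte(char32_t c, char out[MB_LEN_MAX])
{
    mbstate_t state;
    memset(&state, 0, sizeof state);      // start in the initial state

    size_t n = c32rtomb(out, c, &state);
    return (n == (size_t)-1) ? 0 : n;     // (size_t)-1 signals an error
}
```

An ASCII character such as U'A' converts to a single byte in any locale; the result for wider characters depends on the locale in effect.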


An excellent introduction to Unicode - unicodebook.readthedocs.io/unicode_encodings.html
some thoughts on their support in C11: Unicode operators for C,
and some example code: Unicode in C11

 


CITS2002 Systems Programming, Lecture 22, p12, 16th October 2023.