UTF-12

I've recently been reading about the PDP-8, which is a 12-bit computer. This made me think about non-8-bit character encodings. For computers whose word size is a multiple of 9 (18-bit, 36-bit) there already exist UTF-9 and UTF-18, but they aren't any better suited for 12-bit computers than UTF-8. Additionally, they have some design decisions that make them bad models for other non-8-bit Unicode Transformation Formats [1]. Because of this, I decided to make my own variable-length self-synchronizing Unicode Transformation Format for natively 12-bit computers.

Unicode is logically divided into 17 planes of 65536 codepoints each. This means that, rounding up to the next whole bit, 21 bits are required to store a single Unicode codepoint. This fits in two 12-bit words, which was thus my goal for the maximum length of a UTF-12 sequence. All sequences in UTF-12 are therefore either one or two 12-bit words long.
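The arithmetic behind the 21-bit figure can be checked with a few lines of Python. This is only a sketch of the calculation in the paragraph above; the variable names are made up for illustration.

```python
# 17 planes of 65536 codepoints each, as stated above.
PLANES = 17
CODEPOINTS_PER_PLANE = 65536
WORD_BITS = 12  # PDP-8 word size

max_codepoint = PLANES * CODEPOINTS_PER_PLANE - 1  # highest codepoint value
bits_needed = max_codepoint.bit_length()           # bits to store that value
words_needed = -(-bits_needed // WORD_BITS)        # ceiling division

print(hex(max_codepoint), bits_needed, words_needed)
```

Two 12-bit words give 24 bits in total, so up to 3 bits can be spent on tags while still leaving room for the 21 payload bits.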

For a self-synchronizing code, the one-word sequences must be distinct from both the leading and the trailing words used in two-word sequences. Following UTF-8, I decided to use the high bits of the word for distinguishing between these three categories. 2² = 4 is the smallest power of two ≥ 3, so I dedicated 2 high bits for the tag and the 10 low bits for the payload.

With 10 bits of payload, one-word sequences can represent the range U+0000 to U+03FF. I chose the tag bits 00 for one-word sequences, to further allow compatibility with 7-bit ASCII and 8-bit ISO-8859-1 stored zero-extended.

Under this scheme two-word sequences have a total payload of 2×10 = 20 bits, which is too few to represent the full Unicode range, even if we apply an offset of 0x400 (the number of codepoints already covered by one-word sequences) so that there are no overlong encodings: the highest reachable codepoint would be U+1003FF, short of U+10FFFF. However, for now let's ignore this problem and assign the tag bits 10 to the leading word and 01 to the trailing word.

Preliminary design for UTF-12, which cannot encode plane 16 (the 17th and last plane)
bit                11   10   9    8    7    6    5    4    3    2    1    0
one-word           0    0    b9   b8   b7   b6   b5   b4   b3   b2   b1   b0
two-word trailing  0    1    b9   b8   b7   b6   b5   b4   b3   b2   b1   b0
two-word leading   1    0    b19  b18  b17  b16  b15  b14  b13  b12  b11  b10
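How far the preliminary design falls short can be worked out directly. The sketch below assumes the offset equals the size of the one-word range (2^10 codepoints), so that every codepoint has exactly one encoding; the names are illustrative only.

```python
PAYLOAD_BITS = 10       # payload bits per word in the preliminary design
UNICODE_MAX = 0x10FFFF  # last codepoint of the last plane

one_word_count = 1 << PAYLOAD_BITS        # codepoints covered by one-word sequences
two_word_count = 1 << (2 * PAYLOAD_BITS)  # distinct two-word payloads
offset = one_word_count                   # assumed offset to avoid overlong encodings

highest_encodable = offset + two_word_count - 1
unreachable = UNICODE_MAX - highest_encodable

print(hex(highest_encodable), unreachable)
```

Even with the offset, most of the last plane stays out of reach, which is why one more payload bit is needed.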

Looking at the table, there is a simple modification to the design that allows two-word sequences to have 21 bits of total payload – I changed the leading word of a two-word sequence to have the tag bit 1 and 11 bits of payload. Since the other tag bits start with 0, the encoding remains self-synchronizing.
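The self-synchronization claim can be illustrated with a small sketch. It assumes each 12-bit word is held in a Python int in the range 0 to 0xFFF; `word_kind` and `next_sequence_start` are hypothetical helper names, not part of any existing library.

```python
LEADING_BIT = 0x800   # bit 11 set: leading word of a two-word sequence
TRAILING_BIT = 0x400  # bit 11 clear, bit 10 set: trailing word

def word_kind(word: int) -> str:
    """Classify a single 12-bit word by its tag bits alone."""
    if word & LEADING_BIT:
        return "leading"
    if word & TRAILING_BIT:
        return "trailing"
    return "single"

def next_sequence_start(words: list[int], index: int) -> int:
    """Return the first position >= index that begins a sequence."""
    while index < len(words) and word_kind(words[index]) == "trailing":
        index += 1
    return index

# 'a', then a two-word sequence for U+20AC, then U+0161 as a single word.
words = [0x061, 0x808, 0x4AC, 0x161]
print([word_kind(w) for w in words])
print(next_sequence_start(words, 2))  # started in the middle of a sequence
```

Because the three kinds of word never share a bit pattern, a reader dropped at an arbitrary position has to skip at most one word to find a sequence boundary, and a search for a one-word character can never match a trailing word.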

Final design for UTF-12
bit                11   10   9    8    7    6    5    4    3    2    1    0
one-word           0    0    b9   b8   b7   b6   b5   b4   b3   b2   b1   b0
two-word trailing  0    1    b9   b8   b7   b6   b5   b4   b3   b2   b1   b0
two-word leading   1    b20  b19  b18  b17  b16  b15  b14  b13  b12  b11  b10
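A minimal encoder and decoder for the final layout might look like the sketch below. Several things here are assumptions rather than part of the design as described: the leading word is stored before the trailing word, codepoints below 0x400 always use the one-word form, no offset is applied to two-word payloads, and words are represented as Python ints. The function names are hypothetical.

```python
def encode_utf12(text: str) -> list[int]:
    """Encode a string into a list of 12-bit words."""
    words = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x400:
            words.append(cp)                   # tag 00, payload b9..b0
        else:
            words.append(0x800 | (cp >> 10))   # tag 1, payload b20..b10
            words.append(0x400 | (cp & 0x3FF)) # tag 01, payload b9..b0
    return words

def decode_utf12(words: list[int]) -> str:
    """Decode a list of 12-bit words; raise ValueError on malformed input."""
    chars = []
    i = 0
    while i < len(words):
        word = words[i]
        if word & 0x800:  # leading word: must be followed by a trailing word
            if i + 1 >= len(words) or (words[i + 1] & 0xC00) != 0x400:
                raise ValueError("leading word without trailing word")
            cp = ((word & 0x7FF) << 10) | (words[i + 1] & 0x3FF)
            i += 2
        elif word & 0x400:
            raise ValueError("trailing word without leading word")
        else:             # one-word sequence
            cp = word
            i += 1
        chars.append(chr(cp))  # chr() rejects values above U+10FFFF
    return "".join(chars)

sample = "a\u0161\u20ac\U0001F600"
encoded = encode_utf12(sample)
print([format(w, "04o") for w in encoded])  # octal, as on the PDP-8
print(decode_utf12(encoded) == sample)
```

This sketch does not reject overlong two-word encodings of codepoints below 0x400 or surrogate codepoints; a stricter decoder would need to decide how to treat both.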

Footnotes

  1. UTF-9 is not self-synchronizing, since it only distinguishes between non-final and final nonets in an encoding that can have anywhere between 1 and 3 nonets per codepoint. For example, 'a' is encoded (in octal) as 141, while 'š' is encoded as 401 141, so a simple string search for 'a' would also match the latter half of 'š'. UTF-18 can only encode codepoints from planes 0, 1, 2, and 14, which means that it cannot even encode all the non-private-use codepoints of Unicode 13.0, which uses plane 3.
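The false match described in the footnote can be reproduced with a short sketch. It assumes the UTF-9 layout as described above: the codepoint is split into octets, each stored in the low 8 bits of a nonet, with the high bit set on every nonet except the last. `encode_utf9` and `find_all` are hypothetical helper names.

```python
def encode_utf9(text: str) -> list[int]:
    """Encode a string into 9-bit nonets (continuation bit = 0x100)."""
    nonets = []
    for ch in text:
        cp = ord(ch)
        octets = cp.to_bytes(max(1, (cp.bit_length() + 7) // 8), "big")
        for i, octet in enumerate(octets):
            is_last = i == len(octets) - 1
            nonets.append(octet if is_last else 0x100 | octet)
    return nonets

def find_all(haystack: list[int], needle: list[int]) -> list[int]:
    """Naive search: every index where needle occurs in haystack."""
    n = len(needle)
    return [i for i in range(len(haystack) - n + 1) if haystack[i:i + n] == needle]

text = "x\u0161"  # contains 'š' but no 'a'
print([format(n, "o") for n in encode_utf9(text)])
print(find_all(encode_utf9(text), encode_utf9("a")))  # non-empty: false match
```

The search reports a hit inside 'š' because a final nonet looks the same whether it stands alone or ends a longer sequence.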