I've recently been reading about the PDP-8, which is a 12-bit computer. This made me think about non-8-bit character encodings. For computers whose word size is a multiple of 9 (18-bit, 36-bit) there already exist UTF-9 and UTF-18, but they aren't any better suited for 12-bit computers than UTF-8. Additionally, they have some design decisions that make them bad models for other non-8-bit Unicode Transformation Formats^{1}. Because of this, I decided to make my own variable-length self-synchronizing Unicode Transformation Format for natively 12-bit computers.
Unicode is logically divided into 17 planes of 65536 codepoints each. This means that, rounding up to the next whole bit, 21 bits are required to store a single Unicode codepoint. This fits in two 12-bit words, which was thus my goal for the maximum length of a UTF-12 sequence. All sequences in UTF-12 are therefore either one or two 12-bit words long.
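The 21-bit figure can be checked with a few lines of arithmetic. This is only a sanity check of the numbers above; the constant names are illustrative.

```python
PLANES = 17
CODEPOINTS_PER_PLANE = 65536  # 2**16

total = PLANES * CODEPOINTS_PER_PLANE  # size of the Unicode codespace
highest = total - 1                    # highest codepoint, U+10FFFF
bits_needed = highest.bit_length()     # bits required to store it

print(total, hex(highest), bits_needed)  # 1114112 0x10ffff 21
```

Two 12-bit words hold 24 bits, so 21 bits fit with 3 bits to spare for tagging.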
For a self-synchronizing code, the one-word sequences must be distinct from both the leading and the trailing words used in two-word sequences. Following UTF-8, I decided to use the high bits of the word to distinguish between these three categories. 2^{2} = 4 is the smallest power of two ≥3, so I dedicated the 2 high bits to the tag and the 10 low bits to the payload.
With 10 bits of payload, one-word sequences can represent the range U+0000 to U+03FF (2^{10} = 1024 codepoints). I chose the tag bits 00 for one-word sequences, to further allow compatibility with 7-bit ASCII and 8-bit ISO-8859-1 stored zero-extended.
Under this scheme two-word sequences have a total payload of 2×10 = 20 bits, which is too little to represent the full Unicode range even if we apply an offset of 0x400 so that there are no overlong encodings. However, for now let's ignore this problem and assign the tag bits 10 to the leading word and 01 to the trailing word.
| | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| one-word | 0 | 0 | b_{9} | b_{8} | b_{7} | b_{6} | b_{5} | b_{4} | b_{3} | b_{2} | b_{1} | b_{0} |
| two-word trailing | 0 | 1 | b_{9} | b_{8} | b_{7} | b_{6} | b_{5} | b_{4} | b_{3} | b_{2} | b_{1} | b_{0} |
| two-word leading | 1 | 0 | b_{19} | b_{18} | b_{17} | b_{16} | b_{15} | b_{14} | b_{13} | b_{12} | b_{11} | b_{10} |
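To put a number on the 20-bit shortfall, here is a rough check. It computes how large an offset would be needed for a 20-bit payload to reach U+10FFFF and compares that with the number of values a 10-bit one-word payload can hold. The variable names are illustrative.

```python
MAX_CODEPOINT = 0x10FFFF
TWO_WORD_PAYLOAD_BITS = 2 * 10

span = 1 << TWO_WORD_PAYLOAD_BITS        # distinct values in a 20-bit payload
min_offset = MAX_CODEPOINT - (span - 1)  # smallest offset that reaches U+10FFFF
one_word_values = 1 << 10                # values a 10-bit one-word payload holds

print(hex(min_offset), hex(one_word_values))  # 0x10000 0x400
```

An offset only avoids overlong encodings if it equals the number of codepoints already covered by one-word sequences, which is far below 0x10000, so no such offset can rescue the 20-bit layout.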
Looking at the table, there is a simple modification to the design that allows two-word sequences to have 21 bits of total payload: I changed the leading word of a two-word sequence to have a single tag bit 1 followed by 11 bits of payload. Since both of the other tags start with 0, the three kinds of word stay distinct and the encoding remains self-synchronizing.
| | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| one-word | 0 | 0 | b_{9} | b_{8} | b_{7} | b_{6} | b_{5} | b_{4} | b_{3} | b_{2} | b_{1} | b_{0} |
| two-word trailing | 0 | 1 | b_{9} | b_{8} | b_{7} | b_{6} | b_{5} | b_{4} | b_{3} | b_{2} | b_{1} | b_{0} |
| two-word leading | 1 | b_{20} | b_{19} | b_{18} | b_{17} | b_{16} | b_{15} | b_{14} | b_{13} | b_{12} | b_{11} | b_{10} |
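As a minimal sketch of how this layout could be encoded and decoded, consider the following. The function names `utf12_encode` and `utf12_decode` are hypothetical, and the sketch makes several assumptions not settled by the table: the leading word is stored before the trailing word, no offset is applied to two-word sequences (so the decoder does not reject overlong forms), and words are held as plain Python integers.

```python
def utf12_encode(cp: int) -> list[int]:
    """Encode one codepoint as one or two 12-bit words (sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode codepoint: {cp:#x}")
    if cp < 0x400:
        # One-word form: tag 00, payload b9..b0.
        return [cp]
    # Two-word form, no offset applied (assumption).
    leading = 0x800 | (cp >> 10)     # tag 1, payload b20..b10
    trailing = 0x400 | (cp & 0x3FF)  # tag 01, payload b9..b0
    return [leading, trailing]


def utf12_decode(words: list[int]) -> list[int]:
    """Decode a list of 12-bit words into codepoints (sketch)."""
    out = []
    i = 0
    while i < len(words):
        w = words[i]
        if w & 0x800:
            # Leading word: the next word must carry the trailing tag 01.
            if i + 1 >= len(words) or (words[i + 1] & 0xC00) != 0x400:
                raise ValueError("leading word without trailing word")
            out.append(((w & 0x7FF) << 10) | (words[i + 1] & 0x3FF))
            i += 2
        elif w & 0x400:
            # Tag 01 with no preceding leading word.
            raise ValueError("unexpected trailing word")
        else:
            # Tag 00: one-word sequence, the payload is the codepoint.
            out.append(w)
            i += 1
    return out
```

For example, `utf12_encode(0x41)` gives `[0x41]`, while `utf12_encode(0x10FFFF)` gives `[0xC3F, 0x7FF]`. Because each word's category is determined by its own top bits, a decoder that starts in the middle of a stream can resynchronize by skipping at most one trailing word.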