Thursday, March 17, 2016

d800 + dc00 = 10000

Unicode strings do not always have a length equal to the number of characters in them.  (This depends on how your Python was built.)

Two one-character unicode strings:
>>> u'\U00010000'
u'\U00010000'

>>> u'\U00008000'
u'\U00008000'


But they aren't exactly the same:
>>> len(u'\U00008000')
1
>>> len(u'\U00010000')
2


Can you guess what the two characters will be?
>>> u'\U00010000'[0]
u'\ud800'
>>> u'\U00010000'[1]
u'\udc00'

>>> u'\ud800' + u'\udc00'
u'\U00010000'
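
The arithmetic behind the title can be sketched as follows. This is a minimal illustration of the standard UTF-16 surrogate-pair calculation, not something taken from Python's internals; the helper names to_surrogates and from_surrogates are made up for this example.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000                # 20 bits left after removing the base
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+10000 is the first code point past the 16-bit range, so both offsets are zero.
print([hex(u) for u in to_surrogates(0x10000)])
print(hex(from_surrogates(0xD800, 0xDC00)))
```

So d800 and dc00 are not added in the ordinary sense: each surrogate carries 10 bits of the offset from 0x10000.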


(Mahmoud):

The length of a unicode string is actually the number of code units used to represent it in memory, not the number of characters. On a narrow build each code unit is 16 bits, so the first character (耀 for the curious) takes half the space of the second character (𐀀). They were chosen because one fits into a single two-byte code unit, while the other spills over into a second code unit (a surrogate pair, four bytes in total).

You can check how your Python build stores these characters in memory by running

>>> import sys
>>> sys.maxunicode

If it's 1114111 (0x10FFFF), then you've got a UCS-4 (wide) in-memory representation and will get a len of 1 for both characters above. If it's 65535 (0xFFFF), then you've got UCS-2 (narrow), and you'll get the confusing and arguably wrong lengths.
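
As a small sketch, the check can be wrapped in a function. The name build_kind is hypothetical, and the only thing it relies on is sys.maxunicode.

```python
import sys


def build_kind():
    """Return 'narrow' or 'wide' based on the largest supported code point."""
    # 0xFFFF (65535) means 16-bit code units, so characters above U+FFFF
    # are stored as surrogate pairs. Anything larger means every code
    # point fits in a single unit.
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"


print(build_kind())
```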

These settings are configured when Python is built, and cannot be changed at runtime. Python 3.3 and later eliminate this distinction altogether (PEP 393).
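
Since the build can't be changed at runtime, one workaround is to count characters yourself. The following is only a sketch under the assumption that a high surrogate immediately followed by a low surrogate should count as one character; count_code_points is an illustrative name, not a standard library function.

```python
def count_code_points(s):
    """Count characters in s, treating each surrogate pair as one."""
    count = 0
    i = 0
    n = len(s)
    while i < n:
        unit = ord(s[i])
        is_high = 0xD800 <= unit <= 0xDBFF
        # A high surrogate followed by a low surrogate is a single character.
        if is_high and i + 1 < n and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF:
            i += 2
        else:
            i += 1
        count += 1
    return count


# Same answer on narrow and wide builds.
print(count_code_points(u'\U00010000'))
print(count_code_points(u'\U00008000'))
```

Lone surrogates are counted as one character each, which may or may not be what you want.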

unicode + ord

The ord() built-in may return values above 255 when handed a 1-character unicode string:

>>> ord(u'\U00008000')
32768


This means that chr(ord(s)) will not always work, because chr() only accepts values in range(256). The unicode counterpart is unichr().

>>> chr(ord(u'\U00008000'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)
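
A round trip that does work uses unichr() instead of chr(). The sketch below falls back to chr() where unichr() doesn't exist (Python 3), and the roundtrip helper is a made-up name for illustration. It sticks to a character below U+10000, so it sidesteps the surrogate-pair issue above.

```python
try:
    unichr            # Python 2: built-in counterpart of chr() for unicode
except NameError:
    unichr = chr      # Python 3: chr() already covers all code points


def roundtrip(ch):
    """Convert a 1-character unicode string to its code point and back."""
    return unichr(ord(ch))


print(roundtrip(u'\U00008000') == u'\U00008000')
```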