Tuesday, April 12, 2016

Base64 vs UTF-8

Often when dealing with binary data in a Unicode context (e.g. JSON serialization), the data is first base64 encoded.  However, Python unicode objects can also hold such data directly, using escape sequences for the values that cannot be written literally.
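To make the two options concrete, here is a minimal Python 3 sketch of both round trips through JSON. It assumes the bytes are mapped to the code points 0-255 one-to-one, which is what the latin-1 codec does; the post itself does not name a codec, and the variable names are only illustrative.

```python
import base64
import json

payload = bytes(range(256))  # every possible byte value

# Option 1: base64 text carried inside a JSON string.
as_b64 = json.dumps(base64.b64encode(payload).decode('ascii'))
from_b64 = base64.b64decode(json.loads(as_b64))

# Option 2 (assumption: latin-1 mapping): each byte becomes the code
# point with the same number, and JSON escapes whatever it has to.
as_text = json.dumps(payload.decode('latin-1'))
from_text = json.loads(as_text).encode('latin-1')

# Both options should give back the original bytes unchanged.
round_trip_ok = (from_b64 == payload) and (from_text == payload)
print(round_trip_ok)
```

Both paths are lossless under these assumptions, so the choice between them comes down to size and to how readable the serialized form needs to be.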

What is the size relationship for high-entropy (e.g. compressed) binary data?

>>> every_byte = ''.join([chr(i) for i in range(256)])
>>> every_unichr = u''.join([unichr(i) for i in range(256)])
>>> import base64
>>> len(every_unichr.encode('utf-8'))
384
>>> len(base64.b64encode(every_byte))
344

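Surprisingly close!  UTF-8 has the advantage that half of the byte values (0-127) are encoded 1:1; however, every value from 128 to 255 expands 1:2, as opposed to the uniform 3:4 of base64.  JSON serializing shifts the balance dramatically in favor of base64, however, because by default json.dumps escapes control and non-ASCII characters as six-character \uXXXX sequences: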

>>> import json
>>> len(json.dumps(every_unichr))
1045
>>> len(json.dumps(base64.b64encode(every_byte)))
346
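The session above is Python 2. For readers on Python 3, the same measurements can be sketched as follows, with the size arithmetic spelled out in the assertions. The name `every_char` stands in for `every_unichr` and is an assumption of this sketch, not something from the original session.

```python
import base64
import json

every_byte = bytes(range(256))
every_char = ''.join(chr(i) for i in range(256))

utf8_len = len(every_char.encode('utf-8'))
b64_len = len(base64.b64encode(every_byte))

# UTF-8: code points 0-127 take one byte each, 128-255 take two.
assert utf8_len == 128 * 1 + 128 * 2
# base64: every group of 3 input bytes (rounded up) becomes 4 characters.
assert b64_len == 4 * ((256 + 2) // 3)

# json.dumps needs str input, so the base64 bytes are decoded first.
json_text_len = len(json.dumps(every_char))
json_b64_len = len(json.dumps(base64.b64encode(every_byte).decode('ascii')))

print(utf8_len, b64_len)
print(json_text_len, json_b64_len)
```

The JSON gap comes from the default `ensure_ascii=True`: every character outside printable ASCII costs up to six output characters, while base64 output only gains the two surrounding quotes.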

For the curious, here is what the encoded bytes look like:

>>> every_unichr.encode('utf-8')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\xc2\x80\xc2\x81\xc2\x82\xc2\x83\xc2\x84\xc2\x85\xc2\x86\xc2\x87\xc2\x88\xc2\x89\xc2\x8a\xc2\x8b\xc2\x8c\xc2\x8d\xc2\x8e\xc2\x8f\xc2\x90\xc2\x91\xc2\x92\xc2\x93\xc2\x94\xc2\x95\xc2\x96\xc2\x97\xc2\x98\xc2\x99\xc2\x9a\xc2\x9b\xc2\x9c\xc2\x9d\xc2\x9e\xc2\x9f\xc2\xa0\xc2\xa1\xc2\xa2\xc2\xa3\xc2\xa4\xc2\xa5\xc2\xa6\xc2\xa7\xc2\xa8\xc2\xa9\xc2\xaa\xc2\xab\xc2\xac\xc2\xad\xc2\xae\xc2\xaf\xc2\xb0\xc2\xb1\xc2\xb2\xc2\xb3\xc2\xb4\xc2\xb5\xc2\xb6\xc2\xb7\xc2\xb8\xc2\xb9\xc2\xba\xc2\xbb\xc2\xbc\xc2\xbd\xc2\xbe\xc2\xbf\xc3\x80\xc3\x81\xc3\x82\xc3\x83\xc3\x84\xc3\x85\xc3\x86\xc3\x87\xc3\x88\xc3\x89\xc3\x8a\xc3\x8b\xc3\x8c\xc3\x8d\xc3\x8e\xc3\x8f\xc3\x90\xc3\x91\xc3\x92\xc3\x93\xc3\x94\xc3\x95\xc3\x96\xc3\x97\xc3\x98\xc3\x99\xc3\x9a\xc3\x9b\xc3\x9c\xc3\x9d\xc3\x9e\xc3\x9f\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\xc3\xa6\xc3\xa7\xc3\xa8\xc3\xa9\xc3\xaa\xc3\xab\xc3\xac\xc3\xad\xc3\xae\xc3\xaf\xc3\xb0\xc3\xb1\xc3\xb2\xc3\xb3\xc3\xb4\xc3\xb5\xc3\xb6\xc3\xb7\xc3\xb8\xc3\xb9\xc3\xba\xc3\xbb\xc3\xbc\xc3\xbd\xc3\xbe\xc3\xbf'


  1. Are you *sure* you want to use `every_byte.decode()`, and then pass the lossy 'replace' error-handling strategy to it?

    1. Thanks! I somehow got turned around and published the wrong draft. I fixed it and included the encoded string.

  2. @Marius, I very much doubt that he did. I'll check with Kurt!
