Tuesday, April 12, 2016

Base64 vs UTF-8

Often when dealing with binary data in a unicode context (e.g. JSON serialization) the data is first base64 encoded.  However, Python unicode objects can also use escape sequences.

What is the size relationship for high-entropy (e.g. compressed) binary data?

>>> every_byte = ''.join([chr(i) for i in range(256)])
>>> every_unichr = u''.join([(unichr(i) for i in range(256)])
>>> import base64
>>> len(every_unichr.encode('utf-8'))
384
>>> len(base64.b64encode(every_byte))
344

Surprisingly close!  Unicode has the advantage that many byte values are encoded 1:1; however, if it does have to encode it will be 2:1 as opposed to 3:4 of base64.  JSON serializing shifts the balance dramatically in favor of base64 however:

>>> import json
>>> len(json.dumps(every_unichr))
1045
>>> len(json.dumps(base64.b64encode(every_byte))
346

For the curious, here is what the encoded bytes looks like:

>>> every_unichr.encode('utf-8')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\xc2\x80\xc2\x81\xc2\x82\xc2\x83\xc2\x84\xc2\x85\xc2\x86\xc2\x87\xc2\x88\xc2\x89\xc2\x8a\xc2\x8b\xc2\x8c\xc2\x8d\xc2\x8e\xc2\x8f\xc2\x90\xc2\x91\xc2\x92\xc2\x93\xc2\x94\xc2\x95\xc2\x96\xc2\x97\xc2\x98\xc2\x99\xc2\x9a\xc2\x9b\xc2\x9c\xc2\x9d\xc2\x9e\xc2\x9f\xc2\xa0\xc2\xa1\xc2\xa2\xc2\xa3\xc2\xa4\xc2\xa5\xc2\xa6\xc2\xa7\xc2\xa8\xc2\xa9\xc2\xaa\xc2\xab\xc2\xac\xc2\xad\xc2\xae\xc2\xaf\xc2\xb0\xc2\xb1\xc2\xb2\xc2\xb3\xc2\xb4\xc2\xb5\xc2\xb6\xc2\xb7\xc2\xb8\xc2\xb9\xc2\xba\xc2\xbb\xc2\xbc\xc2\xbd\xc2\xbe\xc2\xbf\xc3\x80\xc3\x81\xc3\x82\xc3\x83\xc3\x84\xc3\x85\xc3\x86\xc3\x87\xc3\x88\xc3\x89\xc3\x8a\xc3\x8b\xc3\x8c\xc3\x8d\xc3\x8e\xc3\x8f\xc3\x90\xc3\x91\xc3\x92\xc3\x93\xc3\x94\xc3\x95\xc3\x96\xc3\x97\xc3\x98\xc3\x99\xc3\x9a\xc3\x9b\xc3\x9c\xc3\x9d\xc3\x9e\xc3\x9f\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\xc3\xa6\xc3\xa7\xc3\xa8\xc3\xa9\xc3\xaa\xc3\xab\xc3\xac\xc3\xad\xc3\xae\xc3\xaf\xc3\xb0\xc3\xb1\xc3\xb2\xc3\xb3\xc3\xb4\xc3\xb5\xc3\xb6\xc3\xb7\xc3\xb8\xc3\xb9\xc3\xba\xc3\xbb\xc3\xbc\xc3\xbd\xc3\xbe\xc3\xbf'

12 comments:

  1. Are you *sure* you want to use `every_byte.decode()`, and then pass the lossy 'replace' error-handling strategy to it?

    ReplyDelete
    Replies
    1. Thanks! I somehow got turned around and published the wrong draft. I fixed it and included the encoded string.

      Delete
  2. @Marius, I very much doubt that he did. I'll check with Kurt!

    ReplyDelete
  3. Such a very useful article. Very interesting to read this article. I would like to thank you for the efforts you had made for writing this awesome article.
    Data Science Course in Pune
    Data Science Training in Pune

    ReplyDelete
  4. Hey, great blog, but I don’t understand how to add your site in my rss reader. Can you Help me please?
    Data Science Course in Bangalore

    ReplyDelete
  5. Hi, I log on to your new stuff like every week. Your humoristic style is witty, keep it up
    Data Science Training in Bangalore

    ReplyDelete
  6. I feel very grateful that I read this. It is very helpful and very informative and I really learned a lot from it.
    Data Science Institute in Bangalore

    ReplyDelete
  7. I am really enjoying reading your well written articles. It looks like you spend a lot of effort and time on your blog. I have bookmarked it and I am looking forward to reading new articles. Keep up the good work.
    Data Science Certification in Bangalore

    ReplyDelete
  8. Just contact our siteassignment help companies, and you will approach the schoolwork task help of these extraordinary experts from the UK, the USA, and different nations. It doesn't make any difference whether you are in Singapore or Canada, you'll find the opportunity to have your task in on schedule.

    ReplyDelete
  9. This blog contains valuable information on Python. I just got done with giving the best medical essay writing service, and now I have time to focus on learning this software. I have postponed it for so long because all the time people are telling me that it is an impossible task, but upon learning it, the case is not like that at all. So far, I am enjoying my learning!

    ReplyDelete
  10. Just contact our sitePay Someone To Write My Essay, and you will approach the schoolwork task help of these extraordinary experts from the UK, the USA, and different nations. It doesn't make any difference whether you are in Singapore or Canada, you'll find the opportunity to have your task in on schedule.

    ReplyDelete
  11. Shop with each day reestablished progress codes for 4000 retailers offering unequivocal theory holds. Free working improvement codes for markdown horse tack.

    ReplyDelete