Tuesday, April 12, 2016

Base64 vs UTF-8

Often when dealing with binary data in a unicode context (e.g. JSON serialization) the data is first base64 encoded.  However, Python unicode objects can also use escape sequences.

What is the size relationship for high-entropy (e.g. compressed) binary data?

>>> every_byte = ''.join([chr(i) for i in range(256)])
>>> every_unichr = u''.join([(unichr(i) for i in range(256)])
>>> import base64
>>> len(every_unichr.encode('utf-8'))
384
>>> len(base64.b64encode(every_byte))
344

Surprisingly close!  Unicode has the advantage that many byte values are encoded 1:1; however, if it does have to encode it will be 2:1 as opposed to 3:4 of base64.  JSON serializing shifts the balance dramatically in favor of base64 however:

>>> import json
>>> len(json.dumps(every_unichr))
1045
>>> len(json.dumps(base64.b64encode(every_byte))
346

For the curious, here is what the encoded bytes looks like:

>>> every_unichr.encode('utf-8')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\xc2\x80\xc2\x81\xc2\x82\xc2\x83\xc2\x84\xc2\x85\xc2\x86\xc2\x87\xc2\x88\xc2\x89\xc2\x8a\xc2\x8b\xc2\x8c\xc2\x8d\xc2\x8e\xc2\x8f\xc2\x90\xc2\x91\xc2\x92\xc2\x93\xc2\x94\xc2\x95\xc2\x96\xc2\x97\xc2\x98\xc2\x99\xc2\x9a\xc2\x9b\xc2\x9c\xc2\x9d\xc2\x9e\xc2\x9f\xc2\xa0\xc2\xa1\xc2\xa2\xc2\xa3\xc2\xa4\xc2\xa5\xc2\xa6\xc2\xa7\xc2\xa8\xc2\xa9\xc2\xaa\xc2\xab\xc2\xac\xc2\xad\xc2\xae\xc2\xaf\xc2\xb0\xc2\xb1\xc2\xb2\xc2\xb3\xc2\xb4\xc2\xb5\xc2\xb6\xc2\xb7\xc2\xb8\xc2\xb9\xc2\xba\xc2\xbb\xc2\xbc\xc2\xbd\xc2\xbe\xc2\xbf\xc3\x80\xc3\x81\xc3\x82\xc3\x83\xc3\x84\xc3\x85\xc3\x86\xc3\x87\xc3\x88\xc3\x89\xc3\x8a\xc3\x8b\xc3\x8c\xc3\x8d\xc3\x8e\xc3\x8f\xc3\x90\xc3\x91\xc3\x92\xc3\x93\xc3\x94\xc3\x95\xc3\x96\xc3\x97\xc3\x98\xc3\x99\xc3\x9a\xc3\x9b\xc3\x9c\xc3\x9d\xc3\x9e\xc3\x9f\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\xc3\xa6\xc3\xa7\xc3\xa8\xc3\xa9\xc3\xaa\xc3\xab\xc3\xac\xc3\xad\xc3\xae\xc3\xaf\xc3\xb0\xc3\xb1\xc3\xb2\xc3\xb3\xc3\xb4\xc3\xb5\xc3\xb6\xc3\xb7\xc3\xb8\xc3\xb9\xc3\xba\xc3\xbb\xc3\xbc\xc3\xbd\xc3\xbe\xc3\xbf'

6 comments:

  1. Are you *sure* you want to use `every_byte.decode()`, and then pass the lossy 'replace' error-handling strategy to it?

    ReplyDelete
    Replies
    1. Thanks! I somehow got turned around and published the wrong draft. I fixed it and included the encoded string.

      Delete
  2. @Marius, I very much doubt that he did. I'll check with Kurt!

    ReplyDelete
  3. Machacx
    Get unlimited free tools and cracks for mac os. You can download all premium mac and windows files
    machacx

    ReplyDelete
  4. Torrentsformac
    A free download hub to activate, crack your softwares for windows and you mac. you can download keygen, generator and activators to crack your desired VCE Exam Simulator Crack

    ReplyDelete