When binary data has to pass through a unicode context (e.g. JSON serialization), it is usually base64 encoded first. However, a Python unicode object can also hold each byte as a code point and rely on escape sequences.
How do the two approaches compare in size for high-entropy (e.g. compressed) binary data?
>>> every_byte = ''.join([chr(i) for i in range(256)])
>>> every_unichr = u''.join([unichr(i) for i in range(256)])
>>> import base64
>>> len(every_unichr.encode('utf-8'))
384
>>> len(base64.b64encode(every_byte))
344
Surprisingly close! UTF-8 has the advantage that byte values below 128 are encoded 1:1, but every value from 128 up takes two bytes (2:1), versus a uniform 4:3 for base64. JSON serialization shifts the balance dramatically in favor of base64, however:
>>> import json
>>> len(json.dumps(every_unichr))
1045
>>> len(json.dumps(base64.b64encode(every_byte)))
346
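The session above is Python 2 (`unichr`, byte `str`). As a rough sketch, the same four measurements can be reproduced in Python 3, assuming Latin-1 decoding as the stand-in for "byte i becomes code point i"; the `*_len` names are only illustrative and not part of the original session.

```python
import base64
import json

# Python 3 port of the measurements above.
# Assumption: Latin-1 maps byte value i to code point i, like unichr(i) in Python 2.
every_byte = bytes(range(256))
every_unichr = every_byte.decode('latin-1')

# Raw encoded sizes.
utf8_len = len(every_unichr.encode('utf-8'))
b64_len = len(base64.b64encode(every_byte))

# Sizes after JSON serialization (default ensure_ascii=True).
# b64encode returns bytes in Python 3, so decode to str before dumping.
json_unicode_len = len(json.dumps(every_unichr))
json_b64_len = len(json.dumps(base64.b64encode(every_byte).decode('ascii')))

print(utf8_len, b64_len, json_unicode_len, json_b64_len)
```

The base64 text only gains the two surrounding quotes under JSON, because its alphabet never needs escaping.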
For the curious, here is what the encoded bytes look like:
>>> every_unichr.encode('utf-8')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\xc2\x80\xc2\x81\xc2\x82\xc2\x83\xc2\x84\xc2\x85\xc2\x86\xc2\x87\xc2\x88\xc2\x89\xc2\x8a\xc2\x8b\xc2\x8c\xc2\x8d\xc2\x8e\xc2\x8f\xc2\x90\xc2\x91\xc2\x92\xc2\x93\xc2\x94\xc2\x95\xc2\x96\xc2\x97\xc2\x98\xc2\x99\xc2\x9a\xc2\x9b\xc2\x9c\xc2\x9d\xc2\x9e\xc2\x9f\xc2\xa0\xc2\xa1\xc2\xa2\xc2\xa3\xc2\xa4\xc2\xa5\xc2\xa6\xc2\xa7\xc2\xa8\xc2\xa9\xc2\xaa\xc2\xab\xc2\xac\xc2\xad\xc2\xae\xc2\xaf\xc2\xb0\xc2\xb1\xc2\xb2\xc2\xb3\xc2\xb4\xc2\xb5\xc2\xb6\xc2\xb7\xc2\xb8\xc2\xb9\xc2\xba\xc2\xbb\xc2\xbc\xc2\xbd\xc2\xbe\xc2\xbf\xc3\x80\xc3\x81\xc3\x82\xc3\x83\xc3\x84\xc3\x85\xc3\x86\xc3\x87\xc3\x88\xc3\x89\xc3\x8a\xc3\x8b\xc3\x8c\xc3\x8d\xc3\x8e\xc3\x8f\xc3\x90\xc3\x91\xc3\x92\xc3\x93\xc3\x94\xc3\x95\xc3\x96\xc3\x97\xc3\x98\xc3\x99\xc3\x9a\xc3\x9b\xc3\x9c\xc3\x9d\xc3\x9e\xc3\x9f\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\xc3\xa6\xc3\xa7\xc3\xa8\xc3\xa9\xc3\xaa\xc3\xab\xc3\xac\xc3\xad\xc3\xae\xc3\xaf\xc3\xb0\xc3\xb1\xc3\xb2\xc3\xb3\xc3\xb4\xc3\xb5\xc3\xb6\xc3\xb7\xc3\xb8\xc3\xb9\xc3\xba\xc3\xbb\xc3\xbc\xc3\xbd\xc3\xbe\xc3\xbf'
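To see where the 384 and 1045 figures come from, the cost of each code point can be tallied by range. This is a minimal Python 3 sketch; the helpers `utf8_cost` and `json_cost` and the range labels are hypothetical names introduced here, and `json.dumps` is used with its default `ensure_ascii=True`.

```python
import json

def utf8_cost(code_point):
    """Bytes needed to encode one code point as UTF-8."""
    return len(chr(code_point).encode('utf-8'))

def json_cost(code_point):
    """Characters json.dumps emits for one code point, minus the two quotes."""
    return len(json.dumps(chr(code_point))) - 2

# Hypothetical grouping of the 256 byte values by how they get escaped.
ranges = {
    'control (0-31)': range(0, 32),
    'printable ASCII (32-126)': range(32, 127),
    'DEL (127)': range(127, 128),
    'high (128-255)': range(128, 256),
}

# Map each label to (total UTF-8 bytes, total JSON characters).
totals = {
    name: (sum(utf8_cost(i) for i in r), sum(json_cost(i) for i in r))
    for name, r in ranges.items()
}

for name, (utf8_total, json_total) in totals.items():
    print(name, utf8_total, json_total)
```

Most of the JSON overhead comes from the upper half of the byte range: each of those 128 values becomes a six-character `\u00XX` escape, as do most control characters.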
Are you *sure* you want to use `every_byte.decode()`, and then pass the lossy 'replace' error-handling strategy to it?
Thanks! I somehow got turned around and published the wrong draft. I fixed it and included the encoded string.
@Marius, I very much doubt that he did. I'll check with Kurt!