Python Does What?!?: Base64 vs UTF-8

Tuesday, April 12, 2016

Base64 vs UTF-8

When binary data has to pass through a unicode context (e.g. JSON serialization), it is often base64 encoded first. However, Python unicode objects can also hold each byte as the code point with the same value, falling back on escape sequences where needed.
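As a rough illustration of the two routes, here is a minimal round-trip sketch. It is written for Python 3 (the snippets in this post are Python 2), the sample `payload` is made up, and using latin-1 for the byte-to-code-point mapping is an assumption chosen because it maps byte value N to code point N.

```python
import base64
import json

# Arbitrary sample bytes (hypothetical), including a NUL, a quote,
# a backslash, and two values >= 0x80.
payload = bytes([0, 255, 34, 92, 128])

# Route 1: base64 encode, then embed the resulting ASCII text in JSON.
as_b64 = json.dumps(base64.b64encode(payload).decode('ascii'))
from_b64 = base64.b64decode(json.loads(as_b64))

# Route 2: map each byte to the code point with the same value (latin-1)
# and let json.dumps escape whatever it has to.
as_escaped = json.dumps(payload.decode('latin-1'))
from_escaped = json.loads(as_escaped).encode('latin-1')

# Both routes recover the original bytes.
assert from_b64 == payload
assert from_escaped == payload

print(as_b64)
print(as_escaped)
```

Both routes are lossless here; the rest of the post is about which one is smaller.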

What is the size relationship for high-entropy (e.g. compressed) binary data?

>>> every_byte = ''.join([chr(i) for i in range(256)])
>>> every_unichr = u''.join([unichr(i) for i in range(256)])
>>> import base64
>>> len(every_unichr.encode('utf-8'))
384
>>> len(base64.b64encode(every_byte))
344

Surprisingly close! UTF-8 has the advantage that the 128 values below 0x80 are encoded 1:1; however, each value from 0x80 to 0xFF takes two bytes (a 1:2 expansion), whereas base64 expands everything uniformly at 3:4. JSON serialization, however, shifts the balance dramatically in favor of base64:

>>> import json
>>> len(json.dumps(every_unichr))
1045
>>> len(json.dumps(base64.b64encode(every_byte)))
346
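To see where the 1045 comes from, the sketch below reruns the measurements and counts how many characters `json.dumps` emits per code point. It is a Python 3 port, not the original session: `every_char`, `escaped_len`, and `histogram` are names introduced here for illustration, and it assumes the default `ensure_ascii=True` behavior.

```python
import base64
import json

every_byte = bytes(range(256))                     # all 256 byte values
every_char = ''.join(chr(i) for i in range(256))   # code points U+0000..U+00FF

utf8_len = len(every_char.encode('utf-8'))
b64_len = len(base64.b64encode(every_byte))

# json.dumps needs text in Python 3, so the base64 bytes are decoded first.
json_text_len = len(json.dumps(every_char))
json_b64_len = len(json.dumps(base64.b64encode(every_byte).decode('ascii')))

def escaped_len(code_point):
    """Characters json.dumps emits for one code point, excluding the quotes."""
    return len(json.dumps(chr(code_point))) - 2

# Count how many code points serialize to 1, 2, or 6 characters.
histogram = {}
for cp in range(256):
    n = escaped_len(cp)
    histogram[n] = histogram.get(n, 0) + 1

print(utf8_len, b64_len, json_text_len, json_b64_len)
print(histogram)
```

With the default settings, 93 printable ASCII characters pass through unchanged, 7 get a two-character escape (`\b`, `\t`, `\n`, `\f`, `\r`, `\"`, `\\`), and the remaining 156 become six-character `\uXXXX` escapes: 93 + 14 + 936 = 1043, plus the two surrounding quotes.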

For the curious, here is what the encoded bytes look like:

>>> every_unichr.encode('utf-8')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\xc2\x80\xc2\x81\xc2\x82\xc2\x83\xc2\x84\xc2\x85\xc2\x86\xc2\x87\xc2\x88\xc2\x89\xc2\x8a\xc2\x8b\xc2\x8c\xc2\x8d\xc2\x8e\xc2\x8f\xc2\x90\xc2\x91\xc2\x92\xc2\x93\xc2\x94\xc2\x95\xc2\x96\xc2\x97\xc2\x98\xc2\x99\xc2\x9a\xc2\x9b\xc2\x9c\xc2\x9d\xc2\x9e\xc2\x9f\xc2\xa0\xc2\xa1\xc2\xa2\xc2\xa3\xc2\xa4\xc2\xa5\xc2\xa6\xc2\xa7\xc2\xa8\xc2\xa9\xc2\xaa\xc2\xab\xc2\xac\xc2\xad\xc2\xae\xc2\xaf\xc2\xb0\xc2\xb1\xc2\xb2\xc2\xb3\xc2\xb4\xc2\xb5\xc2\xb6\xc2\xb7\xc2\xb8\xc2\xb9\xc2\xba\xc2\xbb\xc2\xbc\xc2\xbd\xc2\xbe\xc2\xbf\xc3\x80\xc3\x81\xc3\x82\xc3\x83\xc3\x84\xc3\x85\xc3\x86\xc3\x87\xc3\x88\xc3\x89\xc3\x8a\xc3\x8b\xc3\x8c\xc3\x8d\xc3\x8e\xc3\x8f\xc3\x90\xc3\x91\xc3\x92\xc3\x93\xc3\x94\xc3\x95\xc3\x96\xc3\x97\xc3\x98\xc3\x99\xc3\x9a\xc3\x9b\xc3\x9c\xc3\x9d\xc3\x9e\xc3\x9f\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\xc3\xa6\xc3\xa7\xc3\xa8\xc3\xa9\xc3\xaa\xc3\xab\xc3\xac\xc3\xad\xc3\xae\xc3\xaf\xc3\xb0\xc3\xb1\xc3\xb2\xc3\xb3\xc3\xb4\xc3\xb5\xc3\xb6\xc3\xb7\xc3\xb8\xc3\xb9\xc3\xba\xc3\xbb\xc3\xbc\xc3\xbd\xc3\xbe\xc3\xbf'

8 comments:

Marius Gedminas, April 13, 2016 at 11:53 PM
Are you *sure* you want to use `every_byte.decode()`, and then pass the lossy 'replace' error-handling strategy to it?
Mahmoud Hashemi, April 14, 2016 at 2:59 PM
@Marius, I very much doubt that he did. I'll check with Kurt!