Monday, November 21, 2016

first hash past the post

Numeric types in Python have the interesting property that their hash() is often their value:

>>> hash(1)
1
>>> hash(1.0)
1

Python also considers floating point and integers of the same value to be equal:

>>> 1 == 1.0
True

Two things with the same hash that are equal count as the same key in a dict:

>>> 1 in {1.0: 'cat'}
True

However, it is possible for the key to be either an int or a float:

>>> {1: 'first'}
{1: 'first'}
>>> {1.0: 'second'}
{1.0: 'second'}

Whichever key is used first sticks.  Later writes to the dict can change the value, but the int or float key remains:

>>> {1: 'first', 1.0: 'second'}
{1: 'second'}
>>> {1.0: 'first', 1: 'second'}
{1.0: 'second'}
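
The same "first key sticks" behavior shows up when keys arrive one assignment at a time, and it extends to bool, since True == 1 and hash(True) == 1. A minimal sketch (the variable names are illustrative only):

```python
# Build the dict incrementally: the int key is inserted first,
# then a float that compares equal overwrites only the value.
d = {}
d[1] = 'first'
d[1.0] = 'second'

key = next(iter(d))          # the single surviving key
key_type = type(key)         # still int, the first key used

# bool is a subclass of int, so True collides with 1 and 1.0 as well.
b = {True: 'first', 1: 'second', 1.0: 'third'}
bool_key = next(iter(b))     # the first key written, paired with the last value
```

If the type of the key matters to later code (for example when serializing the dict), it may be worth normalizing keys before insertion rather than relying on insertion order.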

Wednesday, June 8, 2016

imports of no import

The Python standard library has hundreds of built-in modules (mine has 688 by one count). Some are more useful than others.

The most famous is "import this", which prints out the Zen of Python. The next most popular fun module could well be antigravity. Try importing it on a browser-capable machine, and (spoiler), you'll be taken to xkcd comic 353, "Python".

And Randall is right. As most people know, Python's "Hello world" is just one line: "print 'Hello world'"

But what if there were another, more confusing way to do it? How about:

>>> import __hello__
"Hello world..."

And, because it's a one-time module import, this super-useful module only works the one time:

>>> import __hello__
>>>
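
The second import is silent because the module is already cached in sys.modules, so its body does not run again. A minimal sketch for Python 3, using the standard-library module this in place of __hello__ (a substitution made here because it also prints on import); the helper name import_output is hypothetical:

```python
import contextlib
import importlib
import io
import sys

def import_output(name):
    """Import a module by name and return whatever it printed."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        importlib.import_module(name)
    return buf.getvalue()

first = import_output('this')    # module body runs (if not imported before)
second = import_output('this')   # served from the sys.modules cache: silent
cached = 'this' in sys.modules
```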

The output even changes on Python 3:

>>> import __hello__
"Hello world!"

And if that wasn't esoteric enough of an import, how about even more new syntax in Python 3:

>>> from __future__ import barry_as_FLUFL
>>> 'a' <> 'b'
True
>>> 'a' <> 'a'
False
 
This unfortunate syntax is the result of an April Fools' PEP from 2009 (PEP 401). Before this, previous attempts at introducing new syntax were met with a stiffer upper lip:

>>> from __future__ import braces
  File "<stdin>", line 1
SyntaxError: not a chance


All of which raises the question: how many undocumented jokes have made their way into Python?

Mahmoud
https://github.com/mahmoud
https://twitter.com/mhashemi

Credit to Python core dev Raymond Hettinger and other Twitter friends for details and inspiration.

Thursday, May 5, 2016

String optimization in Python

Strings are terribly important in programming. A program without some form of string input, manipulation, and output is a rarity.

Of course this means that speed and sanity surrounding string features are important. One important feature of Python is string immutability. This opens up dozens of features, such as using strings as dictionary keys, but there are some downsides.

Immutable strings mean that any string manipulation, such as splitting or appending, makes a copy of that string. This can become a performance problem, especially in a world where zero-copy is one of the favorite general optimization techniques. If you've done enough string mutation, you're probably aware of the following techniques:

But in some cases Python uses the immutability to avoid making copies:

>>> a = 'a' * 1024 * 1024  # a 1 megabyte string
>>> z = '' + a
>>> z is a
True

Here, because adding an empty string does not change the value, z is the same exact string object as a. And it doesn't matter how many times you append an empty string:

>>> z = '' + '' + '' + a
>>> z is a
True

It even works when a is the only item in a list:

>>> z = ''.join([a])
>>> z is a
True

But it falls apart when you put an empty string in the list with a:

>>> z = ''.join(['', a])
>>> z is a
False

And unfortunately even the first example seems to make a copy on PyPy:

>>>> a = 'a' * 1024 * 1024  # a 1 megabyte string again
>>>> z = '' + a
>>>> z is a
False

Although something more advanced may be going on under the covers, as is often the case with PyPy.

I'm almost done stringing you along, but as a corollary reminder:
Never rely on "is" checks with ints, floats, and strings. "==" and other value checks are what you need. As a general rule, "is" is for objects, None, and sometimes True/False.
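
A short sketch of why value comparison is the safe choice: the two strings below are equal by construction, but whether they are the same object depends on the interpreter and version, so the code only ever relies on "==". The names are illustrative.

```python
a = 'a' * 1024
b = ''.join(['', a])    # same characters; may or may not be the same object

equal_by_value = (a == b)     # True on any interpreter
same_object = (a is b)        # implementation detail: do not branch on this

# "is" remains the right tool for singletons such as None.
def describe(value):
    if value is None:
        return 'missing'
    return 'present'
```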

Keep on stringifying!

Mahmoud
http://sedimental.org/
https://github.com/mahmoud
https://twitter.com/mhashemi

Tuesday, April 12, 2016

Base64 vs UTF-8

Often when dealing with binary data in a unicode context (e.g. JSON serialization) the data is first base64 encoded.  However, Python unicode objects can also use escape sequences.

What is the size relationship for high-entropy (e.g. compressed) binary data?

>>> every_byte = ''.join([chr(i) for i in range(256)])
>>> every_unichr = u''.join([unichr(i) for i in range(256)])
>>> import base64
>>> len(every_unichr.encode('utf-8'))
384
>>> len(base64.b64encode(every_byte))
344

Surprisingly close!  UTF-8 has the advantage that many byte values (0 through 127) are encoded 1:1; however, when it does have to encode a value, the cost is 2:1 as opposed to base64's 4:3.  JSON serializing shifts the balance dramatically in favor of base64, however:

>>> import json
>>> len(json.dumps(every_unichr))
1045
>>> len(json.dumps(base64.b64encode(every_byte)))
346
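
For readers on Python 3, where str is already unicode and bytes is a separate type, a sketch of the same measurement might look like the following. The variable names are chosen here for illustration, and the base64 output is decoded to ASCII text before JSON serialization because json.dumps does not accept bytes.

```python
import base64
import json

every_byte = bytes(range(256))
every_char = ''.join(chr(i) for i in range(256))

# Code points 0-127 take one byte each in UTF-8, 128-255 take two.
utf8_len = len(every_char.encode('utf-8'))

# base64 emits 4 output characters per 3 input bytes, padded to a multiple of 4.
b64_text = base64.b64encode(every_byte).decode('ascii')
b64_len = len(b64_text)

# By default json.dumps escapes control and non-ASCII characters as \uXXXX.
json_utf8_len = len(json.dumps(every_char))

# base64 text needs no escaping; only the two surrounding quotes are added.
json_b64_len = len(json.dumps(b64_text))
```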

For the curious, here is what the encoded bytes look like:

>>> every_unichr.encode('utf-8')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\xc2\x80\xc2\x81\xc2\x82\xc2\x83\xc2\x84\xc2\x85\xc2\x86\xc2\x87\xc2\x88\xc2\x89\xc2\x8a\xc2\x8b\xc2\x8c\xc2\x8d\xc2\x8e\xc2\x8f\xc2\x90\xc2\x91\xc2\x92\xc2\x93\xc2\x94\xc2\x95\xc2\x96\xc2\x97\xc2\x98\xc2\x99\xc2\x9a\xc2\x9b\xc2\x9c\xc2\x9d\xc2\x9e\xc2\x9f\xc2\xa0\xc2\xa1\xc2\xa2\xc2\xa3\xc2\xa4\xc2\xa5\xc2\xa6\xc2\xa7\xc2\xa8\xc2\xa9\xc2\xaa\xc2\xab\xc2\xac\xc2\xad\xc2\xae\xc2\xaf\xc2\xb0\xc2\xb1\xc2\xb2\xc2\xb3\xc2\xb4\xc2\xb5\xc2\xb6\xc2\xb7\xc2\xb8\xc2\xb9\xc2\xba\xc2\xbb\xc2\xbc\xc2\xbd\xc2\xbe\xc2\xbf\xc3\x80\xc3\x81\xc3\x82\xc3\x83\xc3\x84\xc3\x85\xc3\x86\xc3\x87\xc3\x88\xc3\x89\xc3\x8a\xc3\x8b\xc3\x8c\xc3\x8d\xc3\x8e\xc3\x8f\xc3\x90\xc3\x91\xc3\x92\xc3\x93\xc3\x94\xc3\x95\xc3\x96\xc3\x97\xc3\x98\xc3\x99\xc3\x9a\xc3\x9b\xc3\x9c\xc3\x9d\xc3\x9e\xc3\x9f\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\xc3\xa6\xc3\xa7\xc3\xa8\xc3\xa9\xc3\xaa\xc3\xab\xc3\xac\xc3\xad\xc3\xae\xc3\xaf\xc3\xb0\xc3\xb1\xc3\xb2\xc3\xb3\xc3\xb4\xc3\xb5\xc3\xb6\xc3\xb7\xc3\xb8\xc3\xb9\xc3\xba\xc3\xbb\xc3\xbc\xc3\xbd\xc3\xbe\xc3\xbf'

Thursday, March 17, 2016

d800 + dc00 = 10000

Unicode strings will not always have length equal to the number of characters inside them.  (This probably depends on the unicode library Python was compiled with.)

Two one-character unicode strings:
>>> u'\U00010000'
u'\U00010000'

>>> u'\U00008000'
u'\u8000'


But they aren't exactly the same:
>>> len(u'\U00008000')
1
>>> len(u'\U00010000')
2


Can you guess what the two characters will be?
>>> u'\U00010000'[0]
u'\ud800'
>>> u'\U00010000'[1]
u'\udc00'

>>> u'\ud800' + u'\udc00'
u'\U00010000'


(Mahmoud):

The length of a unicode string is actually its length as represented in memory, counted in code units. The first character (耀 for the curious) is half the size of the second character (𐀀). They were arbitrarily chosen because one fits into a single two-byte code unit, and the other needs 17 bits and so spills over into a second code unit (four bytes in all on a narrow build).

You can check how your Python build stores these characters in memory by running

>>> import sys
>>> sys.maxunicode

If it's 1114111 (0x10FFFF), then you've got the UCS-4 (wide) in-memory representation and will get a len of 1 for both characters above. If it's 65535 (0xFFFF), then you've got UCS-2 (narrow), and you'll get the confusing and arguably wrong lengths.

These settings are configured when Python is built, and cannot be changed at runtime. Python 3.3 and later eliminate this distinction altogether (PEP 393).
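
The arithmetic in this post's title can be made concrete. A minimal sketch of the UTF-16 surrogate-pair calculation, which is what a narrow build stores for code points above 0xFFFF; the function names are hypothetical and the code targets Python 3:

```python
def to_surrogates(code_point):
    """Split a code point above 0xFFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point does not need a surrogate pair')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits of the offset
    return high, low

def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

pair = to_surrogates(0x10000)                 # the pair from the title
utf16 = '\U00010000'.encode('utf-16-be')      # the same pair as four bytes
```

Code point 0x10000 is the very first one past the two-byte range, which is why its offset is zero and the pair comes out as exactly d800 and dc00.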

unicode + ord

The ord() built-in may return very large values when handed a 1-character unicode string:

>>> ord(u'\U00008000')
32768


This means that chr(ord(s)) will not always work.

>>> chr(ord(u'\U00008000'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)
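
On Python 2 the inverse of ord() for unicode strings is unichr(), not chr(); on Python 3, chr() covers the full range. A small sketch that should work on either, with to_char and roundtrip as hypothetical helper names:

```python
try:
    to_char = unichr    # Python 2: chr() only covers range(256)
except NameError:
    to_char = chr       # Python 3: chr() accepts any code point

def roundtrip(s):
    """Rebuild a one-character string from its code point."""
    return to_char(ord(s))

result = roundtrip(u'\u8000')
```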

Tuesday, February 9, 2016

Undecoratable

Decorators are one of Python's bigger success stories, and many programmers' first experience with higher-order programming. Most practiced and prolific Python programmers will find themselves making good use of them regularly.

But every feature has its limits, and here's a new one to try on for size:

>>> @x().y()
  File "<stdin>", line 1
    @x().y()
        ^
SyntaxError: invalid syntax

That's right, decoration is not an arbitrary Python expression. It doesn't matter what x and y were, or even if they were defined. You can't follow a function call with a dot. @x() works fine, @x.y() would work fine, too. But @x().y(), that's only for mad Pythonists who would take things TOO FAR.

Decorator invocations, defined at the top of the Python grammar, can only be followed by class definitions and function definitions.
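
One workaround is to evaluate the expression first and decorate with the resulting name. A minimal sketch; x, y, and tag are stand-ins invented for illustration. (Python 3.9 and later relax the grammar via PEP 614, so @x().y() is accepted there directly.)

```python
class x(object):
    """Hypothetical factory whose method returns a decorator."""
    def y(self):
        def tag(func):
            func.tagged = True   # mark the function, then return it unchanged
            return func
        return tag

# Bind the result of the call chain to a plain name first...
decorator = x().y()

# ...then decorate with that name, which the older grammar accepts too.
@decorator
def greet():
    return 'hello'
```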

Well, now we know, and now we can all say we've been there.

-- Mahmoud
http://sedimental.org/
https://github.com/mahmoud
https://twitter.com/mhashemi