Thursday, May 5, 2016

String optimization in Python

Strings are terribly important in programming. A program without some form of string input, manipulation, and output is a rarity.

Of course this means that speed and sanity surrounding string features is important. One important feature of Python is string immutability. This opens up dozens of features, such as using strings as dictionary keys, but there are some downsides.

Immutable strings means that any string manipulation, such as splitting or appending, is making a copy of that string. This can become a performance problem, especially in a world where zero-copy is one of the favorite general optimization techniques. If you've done enough string mutation, you're probably aware of the following techniques:
But in some cases Python uses the immutability to avoid making copies:
>>> a = 'a' * 1024 * 1024  # a 1 megabyte string
>>> z = '' + a
>>> z is a
True
Here, because adding an empty string does not change the value, z is the same exact string object as a. And it doesn't matter how many times you append an empty string:
>>> z = '' + '' + '' + a
>>> z is a
True
It even works when a is the only item in a list:
>>> z = ''.join([a])
>>> z is a
True
But it falls apart when you put an empty string in the list with a:
>>> z = ''.join(['', a])
>>> z is a
False
And unfortunately even the first example seems to make a copy on PyPy:
>>>> a = 'a' * 1024 * 1024  # a 1 megabyte string again
>>>> z = '' + a
>>>> z is a 
False 
Although something more advanced may be going on under the covers, as is often the case with PyPy.

I'm almost done stringing you along, but as a corollary reminder:
Never rely on "is" checks with ints, floats, and strings. "==" and other value checks are what you need. As a general rule, "is" is for objects, None, and sometimes True/False.

Keep on stringifying!

Mahmoud
http://sedimental.org/
https://github.com/mahmoud
https://twitter.com/mhashemi

24 comments:

  1. I think the cpython behaviour is an optimisation possible because of reference counting. Cpython can tell when adding strings if it's the only reference, and can reuse the memory rather than copying in cases like above. Pypy doesn't use reference counting, so can't do the same trick, aiui

    ReplyDelete
    Replies
    1. While it's true that PyPy does not use CPython-style reference counting, I don't think that implies that PyPy can't have this optimization, per se. PyPy still has immutable strings, as that's a property of the Python language, not the runtime.

      Delete
  2. This could be related to interned strings: you can force any string to be unique and in a global table using the intern() function.

    ReplyDelete
    Replies
    1. You are correct about intern() and that may be the culprit with shorter strings, but you'll notice I created a 1MB string to start with. Very much hope our relatively lightweight CPython doesn't magically intern that :)

      Delete
  3. Thanks for sharing the information about the Python and keep updating us.This information is really useful to me.

    ReplyDelete
  4. I believe there are many more pleasurable opportunities ahead for individuals that looked at your site.
    python training in chennai

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. Very nice post here and thanks for it .I always like and such a super contents of these post.Excellent and very cool idea and great content of different kinds of the valuable information's.
    microsoft azure training in bangalore
    rpa training in bangalore
    best rpa training in bangalore
    rpa online training

    ReplyDelete

  7. Inspiring writings and I greatly admired what you have to say , I hope you continue to provide new ideas for us all and greetings success always for you.
    Keep update more information..


    Selenium training in bangalore
    Selenium training in Chennai
    Selenium training in Bangalore
    Selenium training in Pune
    Selenium Online training
    Selenium interview questions and answers

    ReplyDelete
  8. Алюминиевый светодиодный профиль для led ленты я обычно беру в Ekodio там достойное качество и отличные цены.

    ReplyDelete
  9. Really very great information for that post, am amazed and then more new information are get after refer that post. I like that post.
    Python Training in Chennai | Python Training Institute in Chennai

    ReplyDelete
  10. You are doing a great job. I would like to appreciate your work for good accuracy

    Machine Learning in Chennai

    ReplyDelete
  11. definately enjoy every little bit of it and I have you bookmarked to check out new stuff of your blog a must read blog! https://thebestvpn.uk

    ReplyDelete
  12. I love the blog. Great post. It is very true, people must learn how to learn before they can learn. lol i know it sounds funny but its very true. . .

    informatica mdm online training

    apache spark online training

    apache spark online training

    devops online training

    aws online training

    ReplyDelete
  13. This is only the data I am discovering all over the place. A debt of gratitude is in order for your website, I simply subscribe your online journal. This is a decent blog.. lemigliorivpn.com

    ReplyDelete
  14. Thanks For Sharing The Information The Information Shared Is Very Valuable Please Keep Updating Us Time Just Went On Reading The article Python Online Course

    ReplyDelete
  15. This comment has been removed by the author.

    ReplyDelete