Allison Kaptur

An occasional blog on programming

Python Bytecode: Fun With Dis

Last week at Hacker School I did a quick presentation on python bytecode and the dis module. The disassembler is a very powerful tool with a gentle learning curve – that is, you can get a fair amount out of it without really knowing much about what’s going on. This post is a quick introduction to how and why you should use it.

What’s bytecode?

Bytecode is the internal representation of a python program in the compiler. Here, we’ll be looking at bytecode from cpython, the default compiler. If you don’t know what compiler you’re using, it’s probably cpython.

How do I get bytecode?

You already have it! Bytecode is what’s contained in those .pyc files you see when you import a module. It’s also created on the fly by running any python code.

Disassembling

Ok, so you have some bytecode, and you want to understand it. Let’s look at it without using the dis module first.

1
2
3
4
5
6
7
8
9
10
11
>>> def foo():
...     a = 2
...     b = 3
...     return a + b
...
>>> foo.func_code
<code object foo at 0x106353530, file "<stdin>", line 1>
>>> foo.func_code.co_code
'd\x01\x00}\x00\x00d\x02\x00}\x01\x00|\x00\x00|\x01\x00\x17S'
>>> print [ord(x) for x in foo.func_code.co_code]
[100, 1, 0, 125, 0, 0, 100, 2, 0, 125, 1, 0, 124, 0, 0, 124, 1, 0, 23, 83]

Hmm, that was … not very enlightening. We can see that we have a bunch of bytes (some printable, others not), but we have no idea what they mean.

Let’s run it through dis.dis instead.

1
2
3
4
5
6
7
8
9
10
11
12
>>> import dis
>>> dis.dis(foo)
  2           0 LOAD_CONST               1 (2)
              3 STORE_FAST               0 (a)

  3           6 LOAD_CONST               2 (3)
              9 STORE_FAST               1 (b)

  4          12 LOAD_FAST                0 (a)
             15 LOAD_FAST                1 (b)
             18 BINARY_ADD
             19 RETURN_VALUE

Now this starts to make some sense. dis takes each byte, finds the opcode that corresponds to it in opcodes.py, and prints it as a nice, readable constant. If we look at opcodes.py we see that LOAD_CONST is 100, STORE_FAST is 125, etc. dis also shows the line numbers on the left and the values or names on the right. So without ever seeing something like before, we have an idea what’s going on: we first load a constant, 2, then somehow store it as a. Then we repeat this with 3 and b. We load a and b back up, do BINARY_ADD, which presumably adds the numbers, and then do RETURN_VALUE.

Examining the bytecode can sometimes increase your understanding of python code. Here is one example.

elif

elif is identical in bytecode to else ... if. Take a look:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
>>> def flat(num):
...     if num % 3 == 0:
...         print "Fizz"
...     elif num % 5 == 0:
...         print "Buzz"
...     else:
...         print num
>>>
>>> def nested(num):
...     if num % 3 == 0:
...         print "Fizz"
...     else:
...         if num % 5 == 0:
...             print "Buzz"
...         else:
...             print num

We’ve read PEP 8 so we know that flat is better than nested for style and readability. But is there a performance difference? Not at all – in fact, these two functions have identical bytecode.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
  2           0 LOAD_FAST                0 (num)
              3 LOAD_CONST               1 (3)
              6 BINARY_MODULO
              7 LOAD_CONST               2 (0)
             10 COMPARE_OP               2 (==)
             13 POP_JUMP_IF_FALSE       24

  3          16 LOAD_CONST               3 ('Fizz')
             19 PRINT_ITEM
             20 PRINT_NEWLINE
             21 JUMP_FORWARD            29 (to 53)

  4     >>   24 LOAD_FAST                0 (num)
             27 LOAD_CONST               4 (5)
             30 BINARY_MODULO
             31 LOAD_CONST               2 (0)
             34 COMPARE_OP               2 (==)
             37 POP_JUMP_IF_FALSE       48

  5          40 LOAD_CONST               5 ('Buzz')
             43 PRINT_ITEM
             44 PRINT_NEWLINE
             45 JUMP_FORWARD             5 (to 53)

  7     >>   48 LOAD_FAST                0 (num)
             51 PRINT_ITEM
             52 PRINT_NEWLINE
        >>   53 LOAD_CONST               0 (None)
             56 RETURN_VALUE

That makes sense – else just means “start executing here if the if was false” – there’s no more computation to do. elif is just syntactic sugar.

Further reading:

This just scratches the surface of what’s interesting about python bytecode.

If you enjoyed this, you might enjoy diving into Yaniv Aknin’s series on python internals. If you’re excited about bytecode, you should contribute to Ned Batchelder’s byterun.

Systematic Debugging

We’ve asked many of our residents at Hacker School what qualities all great programmers share. There’s very little agreement – clearly, there are a multitude of ways to be a great programmer, and you can think of a counter-example for almost every quality you can name. One of the rare non-controversial statements came first from Jessica McKellar, who identified systematic debugging as a key skill.

What does systematic debugging look like? At this point, I’m focused on two aspects: asking a clear question, and keeping track of my mental “stack”. I had a particularly fun and interesting bug yesterday that I think illustrates this nicely.

The problem at hand was a brainteaser from Jessica:

Using the official SOWPODS Scrabble dictionary, what letters, if any, never appear doubled? (By that I mean — “AA” does appear doubled because it is in “AARDVARK”, “BB” does appear doubled because it is in “BUBBLE” — are there any letters that never appear doubled in a word?)

I came up with a solution, but it was kind of slow, so I was trying to find a faster one. Also, my original solution used a regular expression to match a doubled letter in a word, but it would only find the first pair (e.g. in BOOKKEEPER only the ‘OO’ would be caught).

I decided to try taking the dictionary as a single string, rather than a list of words, and consuming it one letter-pair at a time, as follows.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
def find_all_doubles_as_word_mash():
    words = open('sowpods.txt').read()
    letters = set(string.uppercase)
    dub_re = re.compile(r"([A-Z])(\1)")
    end = 0
    dubbed = re.search(dub_re, words[end:])

    while dubbed and letters:
        letters -= set(dubbed.group())
        end += dubbed.end()
        dubbed = re.search(dub_re, words[end:])
        print end, '/', len(words)

    return letters

If you’re thinking, “Hmm, that looks like it might be slow,” congratulations! It is indeed quite slow. I had two guesses for why it was slow. (It wasn’t the print statement.) First, maybe I was copying the string over and over, and that was slowing down the function. Second, maybe I misunderstodd the regular expression – for example, maybe it was taking a longer time to operate on a longer string. I decided to investigate the second possibility first, since regular expressions were more of a mystery to me.

At this point in the problem solving, my mental stack looks something like this:

 ---------------------------
| regex taking longer on a  | (newest)
| longer-length string?     |
 ---------------------------
| reg exp? or string copy?  |
 ---------------------------
| function is so slow, why? |
 ---------------------------
|   find a faster solution  |
 ---------------------------
|  solve wordplay problem   | (oldest)
 ---------------------------

The next step was to test my theory that my regular expression should not take longer on a longer string. Importantly, the regex had no “greedy” elements – it should match and return the first time it encounters two identical characters, regardless of how long the string is after those matching characters. Here was the experiment I wrote:

test_with_timeit.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import timeit

SETUP = """import re;
import string;
double = re.compile(r"([a-z])(\1)");
short_test = "hee" + string.lowercase;
long_test = "hee" + string.lowercase*100
"""

N_TIMES = 10000

def test():
    short_time = timeit.timeit(stmt="double.search(short_test)", setup=SETUP, number=N_TIMES)
    long_time = timeit.timeit(stmt="double.search(long_test)", setup=SETUP, number=N_TIMES)

    print "short:", short_time
    print "long:", long_time
    print "time ratios:", long_time / short_time, "x"

if __name__ == '__main__':
    test()

In both cases, I thought, the regex should check “he”, fail, check “ee”, succeed, and return. To my great surprise, the long string test took 70-80x as long to run as the short string test!

test_with_timeit output
1
2
3
short: 0.0243809223175
long: 1.77788496017
time ratios: 72.9211527366 x

Clearly, I was missing something about regular expressions. I googled around and learned some details about backreferences, the (\1) in my code (which makes my regular expression not actually “regular”), but that wasn’t very illuminating. (You can picture “something about Finite State Automata” being pushed onto and quickly popped back off the stack.)

My next step was to run timing experiments with a variety of regular expressions to see if I could find an element that made the difference. At this point yesterday, I was still trying to pass strings to timeit, which was getting unwieldy. (Stay tuned to learn two better ways to do timing.) I switched to a simpler timing script, just taking time.time() at the start and end of the algorithm.

test_without_timeit.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
import re
import string
import time
from collections import defaultdict

everybody_stand_back = {
    "double" : re.compile(r"([a-z])(\1)"),
    "double_var" : re.compile(r"([a-z])\1"),
    "single" : re.compile(r"([a-z])"),
    "verbose" : re.compile("|".join(let*2 for let in string.lowercase)),
}

test_strings = {
    "short" : "hee",
    "long" : "hee" + string.lowercase*100,
}

outcomes = defaultdict(dict)

N_TIMES = 100000

def test():
    for name, regex in everybody_stand_back.iteritems():
        for string_name, test_string in test_strings.iteritems():
            start = time.time()
            for _ in range(N_TIMES):
                re.search(regex, test_string)
            end = time.time()
            delta = end - start
            outcomes[name][string_name] = delta

    return outcomes

if __name__ == '__main__':
    outcomes = test()
    for name, regex in outcomes.iteritems():
        print name, regex["long"]/regex["short"]

Hopefully the differences would start to become clear, and I could easily extend this to add more variations on the regular expression.

It printed:

test_without_timeit.py
1
2
3
4
double 0.953508019384
single 1.15989795352
double_var 1.04188410251
verbose 0.861312402506

Uh-oh, wait a minute, I thought this was identical code. Where did the 70x time difference go?

I couldn’t see any obvious errors, so I went back to the timeit code. Could I reproduce these results while still using the timeit module?

Let’s look back at the problem stack.

 ---------------------------
| what's up with timeit?    |    
 ---------------------------
| is this code different?   |    
 ---------------------------
| timing programs give      |
| different results, uh-oh  |
 ---------------------------
| regex taking longer on a  |
| longer-length string      |
 ---------------------------
| reg exp? or string copy?  |
 ---------------------------
| function is so slow, why? |
 ---------------------------
|   find a faster solution  |
 ---------------------------
|  solve wordplay problem   | (oldest)
 ---------------------------

At this point Brandon Rhodes joined me in pair-debugging, and we made a couple of discoveries. The most interesting of these was that two different ways of calling timeit generated wildly different results.

test_with_timeit.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
import re
import timeit
import string


double = re.compile(r"([a-z])(\1)")


SETUP = """import re;
import string;
double = re.compile(r"([a-z])(\1)");
short_test = "hee" + string.lowercase;
long_test = "hee" + string.lowercase*200
"""

N_TIMES = 10000

short_test = "hee" + string.lowercase
long_test = "hee" + string.lowercase*200

def test():
    short_time = timeit.timeit(stmt="double.search(short_test)", setup=SETUP, number=N_TIMES)
    long_time = timeit.timeit(stmt="double.search(long_test)", setup=SETUP, number=N_TIMES)

    print "short:", short_time
    print "long:", long_time


def short_string_test():
    double.search(short_test)

def long_string_test():
    double.search(long_test)

if __name__ == '__main__':
    print "method 1: strings"
    test()

    print "method 2: functions"
    print timeit.timeit(short_string_test, number=N_TIMES)
    print timeit.timeit(long_string_test, number=N_TIMES)

    print "method 3: build timer explicitly"
    s = timeit.Timer(stmt="double.search(long_test)", setup=SETUP)
    print "short:", s.timeit(number=N_TIMES)
    t = timeit.Timer(stmt="double.search(long_test)", setup=SETUP)
    print "long:", t.timeit(number=N_TIMES)
test_with_timeit output
1
2
3
4
5
6
7
8
9
method 1: strings
short: 0.0235750675201
long: 3.41090488434
method 2: functions
0.0102179050446
0.00959801673889
method 3: build timer explicitly
short: 0.0263011455536
long: 3.42960190773

So now we know something is up with passing strings to timeit.

 ---------------------------
| timeit string version - ? |        
 ---------------------------
| what's up with timeit?    |    
 ---------------------------
| is this code different?   |    
 ---------------------------
          ....

Ok, we’ve narrowed it down this far – it’s time to take a look at the source. timeit is a module written in python (yay!), and the relevant parts look pretty straightforward:

timeit.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
template = """
def inner(_it, _timer):
    %(setup)s
    _t0 = _timer()
    for _i in _it:
        %(stmt)s
    _t1 = _timer()
    return _t1 - _t0
"""

class Timer:
    """Class for timing execution speed of small code snippets.

    The constructor takes a statement to be timed, an additional
    statement used for setup, and a timer function.  Both statements
    default to 'pass'; the timer function is platform-dependent (see
    module doc string).

    To measure the execution time of the first statement, use the
    timeit() method.  The repeat() method is a convenience to call
    timeit() multiple times and return a list of results.

    The statements may contain newlines, as long as they don't contain
    multi-line string literals.
    """

    def __init__(self, stmt="pass", setup="pass", timer=default_timer):
        """Constructor.  See class doc string."""
        self.timer = timer
        ns = {}
        if isinstance(stmt, basestring):
            stmt = reindent(stmt, 8)
            if isinstance(setup, basestring):
                setup = reindent(setup, 4)
                src = template % {'stmt': stmt, 'setup': setup}
            elif hasattr(setup, '__call__'):
                src = template % {'stmt': stmt, 'setup': '_setup()'}
                ns['_setup'] = setup
            else:
                raise ValueError("setup is neither a string nor callable")
            self.src = src # Save for traceback display
            code = compile(src, dummy_src_name, "exec")
            exec code in globals(), ns
            self.inner = ns["inner"]
        elif hasattr(stmt, '__call__'):
            self.src = None
            if isinstance(setup, basestring):
                _setup = setup
                def setup():
                    exec _setup in globals(), ns
            elif not hasattr(setup, '__call__'):
                raise ValueError("setup is neither a string nor callable")
            self.inner = _template_func(setup, stmt)
        else:
            raise ValueError("stmt is neither a string nor callable")

    def timeit(self, number=default_number):
        """Time 'number' executions of the main statement.

        To be precise, this executes the setup statement once, and
        then returns the time it takes to execute the main statement
        a number of times, as a float measured in seconds.  The
        argument is the number of times through the loop, defaulting
        to one million.  The main statement, the setup statement and
        the timer function to be used are passed to the constructor.
        """
        if itertools:
            it = itertools.repeat(None, number)
        else:
            it = [None] * number
        gcold = gc.isenabled()
        gc.disable()
        timing = self.inner(it, self.timer)
        if gcold:
            gc.enable()
        return timing

def timeit(stmt="pass", setup="pass", timer=default_timer,
           number=default_number):
    """Convenience function to create Timer object and call timeit method."""
    return Timer(stmt, setup, timer).timeit(number)

Let’s trace the part that’s giving us the long execution string. We call the module-level function timeit.timeit, which creates a Timer object and returns the result of calling its timeit method. No surprise that we got identical results between our method #1 and method #3 above.

The Timer object checks to see if you’ve passed it strings or callables as its stmt and setup parameters. In our case both are strings. The init method splices the strings into the source code here:

timeit.py
1
src = template % {'stmt': stmt, 'setup': setup}

We can isolate this in the REPL if we want to be extra-sure of how it works:

timeit.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
>>> template = """
... def inner(_it, _timer):
...     %(setup)s
...     _t0 = _timer()
...     for _i in _it:
...         %(stmt)s
...     _t1 = _timer()
...     return _t1 - _t0
... """
>>> stmt = "pass"
>>> setup = "print 'here we go'"
>>> print template % {'stmt': stmt, 'setup': setup}

def inner(_it, _timer):
    print 'here we go'
    _t0 = _timer()
    for _i in _it:
        pass
    _t1 = _timer()
    return _t1 - _t0

Got it – simple, legal python code is generated. So where is our regular expression getting screwed up? Let’s take a look at the code generated in our actual test by inserting a print statement into the timeit module. (First we’ll copy the module over to our current working directory, to avoid modifying our actual standard library.)

timeit.py
1
2
src = template % {'stmt': stmt, 'setup': setup}
print src

And now calling test_with_timeit.py again, it prints:

1
2
3
4
5
6
7
8
9
10
11
12
def inner(_it, _timer):
    import re;
    import string;
    double = re.compile(r"([a-z])()");
    short_test = "hee" + string.lowercase;
    long_test = "hee" + string.lowercase*200

    _t0 = _timer()
    for _i in _it:
        double.search(short_test)
    _t1 = _timer()
    return _t1 - _t0

This looks mostly reasonable … but hey, wait, what happened to the regular expression?

1
double = re.compile(r"([a-z])()");

That’s not right! Changing print src to print repr(src) gives us a better idea what’s up – the relevant line is now:

1
double = re.compile(r"([a-z])(\x01)");

And looking back at the SETUP string … uh-oh, this isn’t a raw string, and it contains an escape character ((\1)). We can tell we’re in trouble immediately in the REPL:

1
2
3
4
>>> print "\1"

>>> repr("\1")
"'\\x01'"

Why does this happen? We can consult the docs on string literals to find out. Unless you’re working with a raw string (one prefixed with r or R), escape characters are interpreted the same way they are in C. Our text, a backslash followed by a number, matches the pattern for an octal digit: a backslash followed by 1-3 integers. \ooo indicates a character with the octal value ooo. And that’s exactly what we’ve done:

1
2
3
4
>>> ord("\1")
1
>>> chr(1)
'\x01'

\x01 is a legal (though unprintable) character, and it doesn’t appear in our long string to test the regular expression. So of course the longer one took a longer time – it had to traverse the entire string looking for \x01, which it never found.

The fix is quite easy – one character, in fact. Make the SETUP string raw:

1
2
3
4
5
6
SETUP = r"""import re;
import string;
double = re.compile(r"([a-z])(\1)");
short_test = "hee" + string.lowercase;
long_test = "hee" + string.lowercase*200
"""

… and the problem is solved. So we can pop a bunch of frame off our problem stack and go investigate the string copying, which is in fact the issue.

 ---------------------------
| reg exp? or string copy?  |
 ---------------------------
| function is so slow, why? |
 ---------------------------
|   find a faster solution  |
 ---------------------------
|  solve wordplay problem   | (oldest)
 ---------------------------

Lessons learned

Interesting bugs are a ton of fun! Now for a bit of introspection: what could have gone better?

  • Most obviously, I could have spent less time on the regular-expression branch of why the function was slow, especially as the stack grew. Whether I should have done this depends on your perspective: at Hacker School, I’m almost 100% learning motivated and about 0% get-it-done motivated, so my time to solve the bug is virtually unbounded.

  • I was slightly overconfident about understanding how the timeit code splicing happened. The code looked so simple that I didn’t bother to check my understanding at first.

  • Passing timeit.timeit strings is an unwieldy way of doing timing. I should have been using either the callable option, or just timing with time python test_file.py from the command line. Both of these would have been easier to work with and would have sidestepped the bug.

  • My mental model of regular expressions was largely correct. That’s satisfying.

  • Strings! Ouch. This may not seem like much of a lesson, but it’s useful to add to my mental list of Places That Bugs Lurk.

More Side Effecting

Nick Coghlan added another interesting note in response to my last post about creating side-effecting assignments in python. Here’s Nick:

Let’s look at that gist:

1
2
3
4
5
6
7
8
9
10
11
12
13
>>> class Madness(dict):
...     def __setitem__(self, attr, value):
...         if isinstance(value, type):
...             value.__name__ = attr
...         dict.__setitem__(self, attr, value)
...
>>> code = """\
... class Example(object): pass
... NewName = Example
... print(Example.__name__)
... """
>>> exec code in Madness()
NewName

This is really fun stuff, and nicely illustrates a core feature of python – that namespaces are basically just dictionaries.

So what’s going on here? In this case, I think it’s easiest to work backwards from the end.

Last we have exec code in Madness(). The exec keyword in python 21 executes a string as code.

1
2
3
>>> a = "hello"
>>> exec "print a"
hello

exec optionally takes a context in which to execute the code. A dictionary is a perfectly legal context to use:

1
2
>>> exec "print b" in {'b' : "hi there"}
hi there

All the code is shown here – b is not defined elsewhere in the REPL session.

If a dictionary is provided, it’s presumed to contain both the global and the local variables. This dictionary is the only context in which the code will be executed (it won’t check the scope from which the exec statement was made).

1
2
3
4
5
6
>>> a = "hello"
>>> exec "print a" in {'b': "hi there"}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1, in <module>
NameError: name 'a' is not defined

So without knowing anything about the Madness() object, we’d expect it to be a dictionary-like thing.

And sure enough, it is:

1
2
>>> class Madness(dict):
...     def __setitem__(self, attr, value):

__setitem__ is the standard way to override setting a key:value pair in a dictionary. It works just like you’d expect:

1
2
3
4
>>> d = {}
>>> d.__setitem__('hello', 2)
>>> d
{'hello': 2}

We’re intercepting any attempt to write to the Madness dictionary-like object.

1
2
3
4
...     def __setitem__(self, attr, value):
...         if isinstance(value, type):
...             value.__name__ = attr
...         dict.__setitem__(self, attr, value)

If the value of the key:value pair that we’re setting is an instance of the type type – that is, if the object in question is a class – then grab that class’s name and set it to the key. (This step will blow up if the key in question isn’t a legal name in the first place.) Then use the parent class (dict)’s __setitem__ to actually write to the Madness object.

We can execute the code string one line at a time to better see what’s happening.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
>>> m = Madness()
>>> m.keys()
[]
>>> exec "class Example(object): pass" in m
>>> m.keys()
['__builtins__', 'Example']
>>> m['Example']
<class 'Example'>
>>> exec "NewName = Example" in m
>>> m.keys()
['__builtins__', 'NewName', 'Example']
>>> m['Example']
<class 'NewName'>
>>> m['NewName']
<class 'NewName'
>>> exec "OtherName = Example" in m
>>> m['Example']
<class 'OtherName'>

(I’m using m.keys() to examine the state of the namespace instead of just printing m because the reference to __builtins__ pukes a bunch of tiresome definitions and copyright statements into the REPL. exec adds a reference to __builtins__ on the execution of any statement.)

We’re now quite close to the original Ruby behavior – with a deeper understanding of python namespaces!


  1. All of this works in python 3, too, with slightly different syntax because exec is a function, not a keyword. Nick’s gist includes the python 3 version too.

Side-effecting Assignment

Surprises

My colleague Dave told me about an interesting bit of Ruby the other day. If you make a new Class object, it initially has no name. If you then assign a name to it, with your new class on the right-hand side of the assignment, the name is attached to the Class object.

1
2
3
4
5
6
7
8
9
10
11
12
[1] pry(main)> c = Class.new
=> #<Class:0x007ff19bc46988>
[3] pry(main)> c.name
=> nil
[4] pry(main)> Foo = c
=> Foo
[5] pry(main)> Foo
=> Foo
[6] pry(main)> c
=> Foo
[7] pry(main)> c.name
=> "Foo"

I found this shockingly non-intuitive. Really, an assignment statement permanently modified the object on the right-hand side?

In a word, yes.

1
2
3
4
5
6
7
8
9
10
11
12
13
[8] pry(main)> Bar = c
=> Foo
[9] pry(main)> c
=> Foo
[10] pry(main)> Foo
=> Foo
[11] pry(main)> Bar
=> Foo

[12] pry(main)> Baz = Foo
=> Foo
[13] pry(main)> Baz
=> Foo

To recap:

Step 1: c is some object

Step 2: set Foo = c

Step 3: c is permanently altered.

Wow.

Luckily for our intuition, this is a pretty special case, and not a way that Ruby generally behaves. But it got me wondering if there’s something similar in python.

What about python?

The simplest answer is “no”. In python, assignment is a simple statement, not an operator, so you can’t do things like operator overloading. This also means we can’t somehow add a hook to assignment.

Ok, fine. But wait – we’re programmers! We don’t give up that easily.

What about hooking on __getattribute__? (Defining your own __getattribute__ is “the bad way” to do it. You almost always want to define your own __getattr__, unless you’re doing something silly and/or malicious, like we are. Note that we have to use the parent class (object)’s __getattribute__ to extract and save the attribute we want. If we didn’t do that, we’d trigger an infinitely recursive lookup.)

1
2
3
4
5
6
7
8
class Foo(object):
    def __init__(self):
        self.some_attr = "sup foo"

    def __getattribute__(self, attr):
        attribute = object.__getattribute__(self, attr)
        object.__setattr__(self, attr, "you're too late!")
        return attribute

This looks like it works!

1
2
3
4
5
6
7
8
>>> f = Foo()
>>> f
<obj.Foo object at 0x10253dd10>
>>> word = f.some_attr
>>> print word
sup foo
>>> print f.some_attr
you're too late!

Unfortunately, this behavior has nothing to do with assignment. The object was mutated when __getattribute__ was called, which happened to be in an assignment statement in this code above. We’d get exactly the same behavior without the assignment:

1
2
3
4
5
6
7
>>> g = Foo()
>>> g
<obj.Foo object at 0x10253dd10>
>>> g.some_attr
sup foo
>>> g.some_attr
you're too late!

Darn.

But we’re programmers, right? We don’t give up that easily.

Trace functions to the rescue

There’s a settrace function in python’s sys module that we can use to examine stack frames, events, and lots of other code data while our program is running. sys.settrace(fn) takes a trace function as an argument, and that trace function must take frame, event, args as arguments. It will then get called every time an event happens. A line of code, a function call, a function return, and an exception are all “events”. We can use this to inspect each line of code before it runs and check manually if it’s an assignment statement.

(I know this is silly, but isn’t it fun?)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import linecache

class Tracer(object):
    def __init__(self, program):
        self.program = program

    def traceit(self, frame, event, arg):
        if event == 'line':
            linenum = frame.f_lineno
            linetext = linecache.getline(self.program, linenum)
            vprint( 'line', linenum, linetext )

            is_assignment = "=" in linetext and "==" not in linetext

            if is_assignment:
                self.mess_up_on_assignment(frame, linetext)

        return self.traceit

(The trace function returns a reference to itself to indicate that tracing should continue.)

Now let’s write our code to mess up assignment.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
class Tracer(object):
    [...]

    def mess_up_on_assignment(self, frame, linetext):
        rhs = linetext[linetext.index("=")+1:]
        names = rhs.split()
        local_objs = [name for name in names if name in frame.f_locals.keys() + frame.f_globals.keys()]
        for name in local_objs:
            if name not in JANKY_NAMESPACE_MANAGER:
                box = Box(frame.f_locals[name], name)
                frame.f_locals[name] = box
            else:
                JANKY_NAMESPACE_MANAGER[name].assignments += 1

JANKY_NAMESPACE_MANAGER = {}

class Box(object):
    def __init__(self, obj, name):
        self.obj = obj
        self.name = name
        self.assignments = 1
        JANKY_NAMESPACE_MANAGER[name] = self

    def __repr__(self):
        raise NameError("this message has self-destructed.")

Here our “box” object is pretty silly. It takes the original object and wraps it in a box. (You could implement __getattr__ and __setattr__ so that the box behaves basically like the original object, which I haven’t done.) The box then explodes when you try to print it. We’re also tracking which names we’ve seen with the aptly-named JANKY_NAMESPACE_MANAGER (which has all kinds of scope issues, but whatever).

Meanwhile, in the mess_up_on_assignment code, we check to see if any of the whitespace-separated words on the right-hand side of the equation are names we recognize from frame.f_locals or frame.f_globals. If so, and if we haven’t seen them before, throw them in a self-destructing box!

Add a simple helper to get things set up.

1
2
3
def assignment(orig_file):
    trace_obj = Tracer(orig_file)
    sys.settrace(trace_obj.traceit)

We can now sabotage simple-looking programs!

This one runs fine:

1
2
3
4
5
6
7
8
9
10
11
import sabotage
sabotage.assignment(__file__)

def greet():
    name = raw_input("Enter your name: ")
    print "hello,", name

    print "goodbye, ", name

if __name__ == '__main__':
    greet()

And this one blows up:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import sabotage
sabotage.assignment(__file__)

print "done sabotaging"
raw_input("Hit enter to continue")

def greet():
    name = raw_input("Enter your name: ")
    print "hello,", name

    name_copy = name

    print "goodbye, ", name

if __name__ == '__main__':
    greet()

And there you have it – a (incredibly goofy) side-effecting assignment that mutates the object on the right-hand side of the assignment statement.

New Git Tricks

I picked up a couple of new tricks at Hacker School’s intermediate/advanced Git seminar, led by Peter Bell.

1
$ git checkout -

This checks out whatever you had checked out last. Thanks, Nina!

Draw a graph of the commit history:

1
2
3
4
5
6
7
8
9
10
git-advanced [master]\ $ git log --graph --oneline --all
*   b2da179 Merge branch 'test-graph'
|\  
| * d71e4fc testing graph thing
* | 8535311 still testing graph
|/  
* b90b105 Added contact form
* b36ba51 added gitignore to ignore generated log files
* 70461dc added about
* c57f995 added homepage

You can alias this with git config --global alias.plog "log --graph --oneline --all". (Looking at this graph may motivate you to start using git merge --no-ff.) You can substitute whatever you want for plog above.

If you get really screwed up, it’s reflog time! (This is pronounced ref-log, not re-flog as I’d been reading it. On the other hand, as my colleague Alan notes, “You damn well ought to feel penitent if you’re wedged enough to need the reflog.”) Here, we’re using git reset to undo the last commit. Even though the commit doesn’t appear in git log anymore, there’s still a references to it in the reflog.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
git-advanced [master %]\ $ git add tester.css
git-advanced [master +]\ $ git commit -m "add some file"
git-advanced [master]\ $ git reset HEAD~1
git-advanced [master %]\ $ git add tester.css
git-advanced [master +]\ $ git commit -m "redoing some file"

git-advanced [master]\ $ git log --oneline
8039ec5 redoing some file
b2da179 Merge branch 'test-graph'
8535311 still testing graph
d71e4fc testing graph thing

git-advanced [master]\ $ git reflog
8039ec5 HEAD@{0}: commit: redoing some file
b2da179 HEAD@{1}: reset: moving to HEAD~1
848874d HEAD@{2}: commit: add some file

It turns out that there is a better way to get a commit sha than running git log and copying and pasting the relevant shas. You can use rev-parse instead.

1
2
3
4
git-advanced [master]\ $ git rev-parse HEAD~2
0fafe464f4b7d524c8ba707ea2115950d6dd4bdf
git-advanced [master]\ $ git rev-parse master
8039ec584485540f07aac1209389022f627bc8c5

Then you can inline with backticks!

1
git-advanced [master]\ $ git diff `git rev-parse HEAD~2` `git rev-parse HEAD`

Now go forth and git in peace.

Surprised by Stable Sort

I encountered some surprising behavior in python’s list.sort() method. I was calling list.sort() with a function, then list.reverse(). This was silly – I forgot I could call list.sort(reverse=True). As it turns out, these are not the always the same, or even usually the same.

Here’s the code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
def two_sort(ex):
    ex_copy = ex[:]
 
    ex.sort(key=lambda tup: tup[1])
    ex.reverse()
    print "Ex:  ", ex
 
    ex_copy.sort(key=lambda tup: tup[1], reverse=True)
    print "Copy:", ex_copy
 
    assert ex == ex_copy # fails
 
ex = [('a', 0), ('b', 0), ('c', 2), ('d', 3)]
two_sort(ex)
# Ex:   [('d', 3), ('c', 2), ('b', 0), ('a', 0)]
# Copy: [('d', 3), ('c', 2), ('a', 0), ('b', 0)]

What’s happening here is that .sort() and sorted() in python are “stable sorts”, meaning that it’ll preserve the existing order when faced with a tie. This leads to a nice side effect, as described in the python wiki: you can sort by multiple criteria. So if we wanted to sort primarily by number (ascending), and secondarily by letter, we could do that quite easily. Just sort by the secondary key first, then by the primary key. Since the sort is stable, you know that a tie in the second (primary) sort will preserve the order you had going in.

1
2
3
4
5
6
7
>>> a = [('d', 3), ('c', 5), ('b', 0), ('a', 0)]
>>> a.sort(key=lambda tup: tup[0])
>>> a
[('a', 0), ('b', 0), ('c', 5), ('d', 3)]
>>> a.sort(key=lambda tup:tup[1])
>>> a
[('a', 0), ('b', 0), ('d', 3), ('c', 5)]

Some Wacky Shell Scripting

I’m currently digging into the source of virtualenv, a python utility for managing environments. It’s incredibly handy, and I’d been using it for a while without really understanding how it worked. I’ll have more later on the details, but in the meantime, I hit a particularly puzzling bit of code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
if [ -z "$VIRTUAL_ENV_DISABLE_PROMPT" ] ; then
    _OLD_VIRTUAL_PS1="$PS1"
    if [ "x" != x ] ; then
        PS1="$PS1"
    else
    if [ "`basename \"$VIRTUAL_ENV\"`" = "__" ] ; then
        # special case for Aspen magic directories
        # see http://www.zetadev.com/software/aspen/
        PS1="[`basename \`dirname \"$VIRTUAL_ENV\"\``] $PS1"
    else
        PS1="(`basename \"$VIRTUAL_ENV\"`)$PS1"
    fi
    fi
    export PS1
fi

This is largely intelligible – clearly we’re modifying the $PS1 variable, which sets the bash prompt. But what on earth is this line?

1
2
if [ "x" != x ] ; then
    PS1="$PS1"

x is not a variable defined elsewhere in the script, and bash will interpret both "x" and x as string literals anyway. Is there some bizarre shell in which this test can possibly return true?

As it turns out, there isn’t. The reason this code looks so bizarre is that it’s automatically generated by virtualenv. The original line of code is this:

1
2
if [ "x__VIRTUAL_PROMPT__" != x ] ; then
    PS1="__VIRTUAL_PROMPT__$PS1"

And then elsewhere we find its complement:

1
    content = content.replace('__VIRTUAL_PROMPT__', prompt or '')

Together, this code is entirely reasonable.

Thanks to user ruakh on Stack Overflow for the key insight here.

Git Add -p: The Wave of the Future

I recently started using git’s patch mode via git add -p. Patch mode makes my commits more granular, which means my commit messages are more accurate. Better still, patch mode allows you to abstract away from adding “files” – you’re adding changes to be committed. This is a closer mental model to what git is actually doing. I’m finding it particularly useful for code review: make several changes in a friend’s program, group the changes by concept, and commit one concept at a time.

That’s the why – here’s the how.

git add -p launches what git calls the “interactive hunk selector.” This leads me to my other favorite reason to use -p: besides being useful, all the docs sound like an episode of The Bachelorette. “Once you have decided the fate of all the hunks, if there is any hunk that is chosen, the index is updated with the selected hunks.” And if not, our bachelorette takes the million dollars!

In practice, I only end up using three of the interactive hunk selector options: [y]es, [n]o, and [s]plit. Split takes the current hunk and divides it into smaller hunks, then presents them for selection one at a time. (If you’re attempting to continue the Bachelorette metaphor, now is a good time to stop.)

Let’s look at an example. Suppose I’m working on a function that recursively calculates the sum of element of a list. My first draft looks like this:

1
2
3
4
5
6
def sum(list_in):
    """ Returns the sum of elements of a list."""
    if list_in == []:
        return 0
    else:
        return list_in.pop() + sum(list_in)

Not bad, but it’s a little verbose. Let’s take advantage of the fact that an empty list is falsey. And, oops, we’re overwriting the built-in sum function in python – we probably don’t want to do that.

1
2
3
4
5
6
def summer(list_in):
    """ Returns the sum of elements of a list."""
    if not list_in:
        return 0
    else:
        return list_in.pop() + summer(list_in)

Ok, these are two different thoughts, but we didn’t commit in between. Interactive mode to the rescue – let’s stage those hunks.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
sum [master]\ ⚲git add -p
diff --git a/sum.py b/sum.py
index fdccd3f..6d000f7 100644
--- a/sum.py
+++ b/sum.py
@@ -1,6 +1,6 @@
-def sum(list_in):
+def summer(list_in):
     """ Returns the sum of elements of a list."""
-    if list_in == []:
+    if not list_in:
         return 0
     else:
-        return list_in.pop() + sum(list_in)
+        return list_in.pop() + summer(list_in)
Stage this hunk [y,n,q,a,d,/,s,e,?]? 

Our changes aren’t in the right hunks, so we’ll pick s to split them.

1
2
3
4
5
6
7
Stage this hunk [y,n,q,a,d,/,s,e,?]? s
Split into 3 hunks.
@@ -1,2 +1,2 @@
-def sum(list_in):
+def summer(list_in):
     """ Returns the sum of elements of a list."""
Stage this hunk [y,n,q,a,d,/,j,J,g,e,?]? 

Let’s deal with the naming issue first. y. Next up:

1
2
3
4
5
6
@@ -2,4 +2,4 @@
     """ Returns the sum of elements of a list."""
-    if list_in == []:
+    if not list_in:
         return 0
     else:

This one’s different. n.

1
2
3
4
5
6
Stage this hunk [y,n,q,a,d,/,K,j,J,g,e,?]? n
@@ -4,3 +4,3 @@
         return 0
     else:
-        return list_in.pop() + sum(list_in)
+        return list_in.pop() + summer(list_in)

Back to the naming issue. y.

1
2
3
sum [master *+]\ ⚲git commit -m "fix naming to not conflict with builtin"
[master caaa300] fix naming to not conflict with builtin
 1 file changed, 2 insertions(+), 2 deletions(-)

Now we can repeat this process for our other change.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
 sum [master *]\ ⚲git add -p
diff --git a/sum.py b/sum.py
index 77676b1..6d000f7 100644
--- a/sum.py
+++ b/sum.py
@@ -1,6 +1,6 @@
 def summer(list_in):
     """ Returns the sum of elements of a list."""
-    if list_in == []:
+    if not list_in:
         return 0
     else:
         return list_in.pop() + summer(list_in)
Stage this hunk [y,n,q,a,d,/,e,?]? 

This is our only hunk now, so y.

1
2
3
sum [master +]\ ⚲git commit -m "use falseyness of empty list"
[master 5209728] use falseyness of empty list
 1 file changed, 1 insertion(+), 1 deletion(-)

Of course, you may not want your commits to be quite this granular – but hopefully the example has demonstrated the strength of git add -p in creating single-concept commits.