Last Thursday I gave a lightning talk at Hacker School about the peephole optimizer in Python. A “peephole optimization” is a compiler optimization that looks at a small chunk of code at a time and optimizes in that little spot. This post explains one surprising side-effect of an optimization in CPython.
Writing a test coverage tool
Suppose that we’re setting out to write a test coverage tool. Python provides an easy way to trace execution using sys.settrace, so a simple version of a coverage analyzer isn’t too hard.
Our code to test is one simple function:
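The original listing isn’t preserved here, so here’s a stand-in with the right shape: five lines, with the else on line 4. (The name and the constants are my invention; everything below depends only on the shape.)

```python
def function(condition):
    if condition:
        return 3
    else:
        return -3
```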
Then we’ll write the world’s simplest testing framework:
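Again a sketch, since the original didn’t survive: a test compares a result to an expected value and prints a verdict, and a runner exercises both branches.

```python
def test(got, expected):
    # The world's simplest test: compare and report.
    if got == expected:
        print('ok')
    else:
        print('FAIL: got %r, expected %r' % (got, expected))

def run_tests():
    test(function(True), 3)
    test(function(False), -3)
```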
Now for the simplest possible coverage tool. We can pass sys.settrace any tracing function, and it’ll be called with the arguments frame, event, and arg every time an event happens in the execution. Lines of code being executed, function calls, function returns, and exceptions are all events. We’ll filter out everything but line and call events, then keep track of what line of code was executing.1 Then we run the tests while the trace function is tracing, and finally report which non-empty lines failed to execute.
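The original tool ran about 45 lines; this sketch compresses it but keeps the pieces the rest of the post relies on: a trace function that records line and call events for our source file, and an unexecuted_code method that reports non-empty lines that never ran. The class name and the filename are my assumptions; only unexecuted_code is named in the original text.

```python
import sys

class CoverageAnalyzer(object):
    def __init__(self, source_filename):
        self.source_filename = source_filename
        self.executed_lines = set()

    def trace(self, frame, event, arg):
        # 'call' events mark the `def` line of each function entered;
        # 'line' events fire as each line begins to execute.
        interesting = event in ('call', 'line')
        if interesting and frame.f_code.co_filename == self.source_filename:
            self.executed_lines.add(frame.f_lineno)
        return self.trace  # keep receiving 'line' events in this frame

    def run(self, fn):
        sys.settrace(self.trace)
        try:
            fn()
        finally:
            sys.settrace(None)

    def unexecuted_code(self):
        # Non-empty source lines that never executed.
        with open(self.source_filename) as f:
            lines = f.readlines()
        return [num for num, line in enumerate(lines, 1)
                if num not in self.executed_lines and line.strip()]

# Usage, assuming the function and tests above are saved as example.py:
analyzer = CoverageAnalyzer('example.py')  # hypothetical filename
analyzer.run(run_tests)
print('Unexecuted lines: %s' % analyzer.unexecuted_code())
```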
Let’s try it. We’re pretty confident in our test coverage – there are only two branches in the code, and we’ve tested both of them.
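With the sketch above, the run would look something like this: both tests pass, but the report flags line 4, the line with the else.

```
ok
ok
Unexecuted lines: [4]
```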
Why didn’t the else line execute? To answer this, we’ll run our function through the disassembler.2
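The original output isn’t preserved; here’s roughly what dis.dis prints for the stand-in function under CPython 2.7 (exact constants and offsets depend on the real source):

```
  2           0 LOAD_FAST                0 (condition)
              3 POP_JUMP_IF_FALSE       10

  3           6 LOAD_CONST               1 (3)
              9 RETURN_VALUE

  5     >>   10 LOAD_CONST               2 (-3)
             13 RETURN_VALUE
```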
You don’t need to follow exactly what’s going on in this bytecode, but note that the first column is the line number in the source code, and line 4, the one containing the else, doesn’t appear. Why not? Well, there’s nothing to do with an else statement – it’s just a separator between two branches of an if statement. The second line in the disassembly, POP_JUMP_IF_FALSE 10, means that the interpreter will pop the top thing off of the virtual machine stack, jump to bytecode index ten if that thing is false, or continue with the next instruction if it’s true.
From the bytecode’s perspective, there’s no difference at all between writing this (shown with our stand-in function):
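```python
def function(condition):
    if condition:
        return 3
    else:
        return -3
```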
and this:
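```python
def function(condition):
    if condition:
        return 3
    return -3
```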
(even though the second is better style).
We’ve learned we need to special-case else statements in our code coverage tool. Since there’s no logic in them, let’s just drop lines that only contain else:. We can revise our unexecuted_code method accordingly:
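Continuing the sketch, the revision just adds one more filter:

```python
    def unexecuted_code(self):
        # Non-empty source lines, other than bare else:, that never executed.
        with open(self.source_filename) as f:
            lines = f.readlines()
        return [num for num, line in enumerate(lines, 1)
                if num not in self.executed_lines
                and line.strip()
                and line.strip() != 'else:']
```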
Then run it again:
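This time the report from our sketch comes back clean:

```
ok
ok
Unexecuted lines: []
```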
Success!
Complications arise
Our previous example was really simple. Let’s add a more complex one.
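The original file isn’t preserved either; a stand-in consistent with the description below might look like this. (In the original file, the continue sat on line 13, below the first example function; the line numbers reported later reflect that layout.)

```python
def continuer():
    a = b = c = 0
    for n in range(100):
        if n % 2:
            a += 1
            continue
        b += 1
        c += 1
    return a, b, c
```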
continuer will increment a on all odd numbers and increment b and c for all even numbers. Don’t forget to add a test:
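A reconstruction in the same spirit: range(100) contains fifty odd and fifty even numbers, so all three counters should end up at 50.

```python
def test_continuer():
    test(continuer(), (50, 50, 50))
```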
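The tests pass, but the coverage report (again from our sketch) flags line 13, the continue:

```
ok
ok
ok
Unexecuted lines: [13]
```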
Hmm. The test we wrote certainly did involve the continue statement – if the interpreter hadn’t skipped the bottom half of the loop, the test wouldn’t have passed. Let’s use the strategy we used before to understand what’s happening: examining the output of the disassembler.
[disassembly of continuer: 52 lines of bytecode]
There’s a lot more going on here, but you don’t need to understand all of it to proceed. Here are the things we need to know to make sense of this:
- The second column in the output is the index in the bytecode, the third is the name of the instruction, and the fourth is its argument. The fifth, when present, is a hint about the meaning of the argument.
- POP_JUMP_IF_FALSE, POP_JUMP_IF_TRUE, and JUMP_ABSOLUTE have the jump target as their argument. So, e.g., POP_JUMP_IF_TRUE 27 means “if the popped expression is true, jump to position 27.” JUMP_FORWARD’s argument specifies the distance to jump forward in the bytecode, and the fifth column shows where the jump will end.
- When an iterator is done, FOR_ITER jumps forward the number of bytes specified in its argument.
Unlike the else case, the line containing the continue does appear in the bytecode. But trace through the bytecode using what you know about jumps: no matter how hard you try, you can’t end up on bytes 66 or 69, the two that belong to line 13.
The continue is unreachable because of a compiler optimization. In this particular optimization, the compiler notices that two instructions in a row are jumps, and it combines these two hops into one larger jump. So, in a very real sense, the continue line didn’t execute – it was optimized out – even though the logic of the continue is still reflected in the bytecode.
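In pseudo-bytecode, the combined hop looks something like this (the labels and layout are invented for illustration; the real targets appear in the disassemblies below):

```
Before the peephole pass:              After:

POP_JUMP_IF_FALSE  L1                  POP_JUMP_IF_FALSE  L2
...                                    ...
L1: JUMP_ABSOLUTE  L2                  L1: JUMP_ABSOLUTE  L2   (now unreachable)
```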
What would this bytecode have looked like without the optimizations? There’s not currently an option to disable the peephole bytecode optimizations, although there will be in a future version of Python (following an extensive debate on the python-dev list). For now, the only way to turn off optimizations is to comment out the relevant line in compile.c, the call to PyCode_Optimize, and recompile Python. Here’s the diff, if you’re playing along at home.
[diff to Python/compile.c commenting out the call to PyCode_Optimize]
[disassembly of continuer under the rebuilt, non-optimizing Python: 52 lines of bytecode]
Just as we expected, the jump targets have changed. The instruction at position 50, POP_JUMP_IF_FALSE, now has 66 as its jump target – a previously unreachable instruction associated with the continue. Instruction 63, JUMP_FORWARD, is also targeting 66. In both cases, the only way to reach this instruction is to jump to it, and the instruction itself jumps away.3
Now we can run our coverage tool with the unoptimized Python:
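With our sketch of the tool running on the rebuilt interpreter:

```
ok
ok
ok
Unexecuted lines: []
```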
Complete success!
So is this a good idea or not?
Compiler optimizations are often a straightforward win. If the compiler can apply simple rules that make my code faster without requiring work from me, that’s great. Almost nobody requires a strict mapping from the code they write to the code that ends up executing. So, peephole optimization in general: yes! Great!
But “almost nobody” is not nobody, and one group of people who do require strict reasoning about executed code is the authors of test coverage software. In the python-dev thread I linked to earlier, there was an extensive discussion over whether or not serving this demographic by providing an option to disable the optimizations was worth increasing the complexity of the codebase. Ultimately it was decided that it was worthwhile, but this is a fair question to ask.
Further reading
Beyond the interesting python-dev thread linked above, my other suggestions are mostly code. CPython’s peephole.c is pretty readable C code, and I encourage you to take a look at it. (“Constant folding” is a great place to start.) There’s also a website, compileroptimizations.com, which has short examples and discussion of 45 different optimizations. If you’d like to play with these code examples, they’re all available on my github.
1. We need to include call events to capture the first line of a function declaration, def fn(...): ↩
2. I’ve previously written an introduction to the disassembler here. ↩
3. You may be wondering what the JUMP_ABSOLUTE instruction at position 66 is doing. This instruction does nothing unless a particular compiler optimization is turned on. The optimization supports faster loops, but creates restrictions on what those loops can do. See ceval.c for more. Edit: This footnote previously incorrectly referenced JUMP_FORWARD. ↩