[Edit: A significantly expanded version of this series appears as a chapter in The Architecture of Open Source Applications, volume 4, as A Python Interpreter Written in Python.]
This is Part 3 in a series on the Python interpreter. Part 1 here, Part 2 here. If you’re enjoying this series, consider applying to Hacker School, where I work as a facilitator.
Bytecode
When we left our heroes, they had come across some odd-looking output: the contents of a code object’s co_code attribute, printed as a string of mostly unreadable characters.
This is Python bytecode.
You recall from Part 2 that “Python bytecode” and “a Python code object” are not the same thing: the bytecode is an attribute of the code object, among many other attributes. Bytecode is found in the co_code attribute of the code object, and contains instructions for the interpreter.
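Here is a minimal sketch of where the bytecode lives, assuming a recent CPython 3, where a function’s code object is reached through __code__. The definition of foo is a stand-in for the function from the earlier parts, since it isn’t repeated here.

```python
# Stand-in for the series' function foo (assumed definition).
def foo(a):
    x = 3
    return x + a

code = foo.__code__        # the code object (Python 3 spelling)
bytecode = code.co_code    # the bytecode is one attribute among many

print(type(code).__name__)      # code
print(type(bytecode).__name__)  # bytes

# A few of the other attributes that live beside the bytecode:
print(code.co_name, code.co_argcount, code.co_varnames)  # foo 1 ('a', 'x')
```

The point is only that co_code sits next to attributes like co_name and co_varnames; it is not the code object itself.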
So what is bytecode? Well, it’s just a series of bytes. They look wacky when we print them because some bytes are printable and others aren’t, so let’s take the ord of each byte to see that they’re just numbers.
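A quick way to try this yourself, sketched under the assumption that you’re on Python 3: there, co_code is a bytes object, and iterating over bytes already yields integers, so no call to ord is needed. The function foo is again a stand-in, and the exact numbers you get depend on your interpreter version.

```python
# Stand-in for the series' function foo (assumed definition).
def foo(a):
    x = 3
    return x + a

raw = foo.__code__.co_code

# In Python 3, iterating over a bytes object gives ints in range(256).
numbers = list(raw)
print(numbers)
```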
These numbers are the bytes that make up Python bytecode. The interpreter will loop through each byte, look up what it should do for each one, and then do that thing. Notice that the bytecode itself doesn’t include any Python objects, or references to objects, or anything like that.
One way to understand Python bytecode would be to find the CPython interpreter file (it’s ceval.c), and flip through it looking up what 100 means, then 1, then 0, and so on. We’ll do this later in the series! For now, there’s a simpler way: the dis module.
Disassembling bytecode
Disassembling bytecode means taking this series of bytes and printing out something we humans can understand. It’s not a step in Python execution; the dis module just helps us understand an intermediate state of Python internals. I can’t think of a reason why you’d ever want to use dis in production code: it’s for humans, not for machines.
Today, however, taking some bytecode and making it human-readable is exactly what we’re trying to do, so dis is a great tool. We’ll use the function dis.dis to analyze the code object of our function foo.
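As a rough sketch of that call on a current CPython 3 (an assumption; the spelling foo.__code__ and the stand-in definition of foo are not taken from the earlier parts), the listing can be captured through the file= parameter of dis.dis and then printed. The instruction names and offsets you see depend on the interpreter version, so no output is shown here.

```python
import dis
import io

# Stand-in for the series' function foo (assumed definition).
def foo(a):
    x = 3
    return x + a

buffer = io.StringIO()
dis.dis(foo.__code__, file=buffer)  # file= redirects the listing into buffer
listing = buffer.getvalue()
print(listing)
```

Calling dis.dis(foo.__code__) with no file= argument prints straight to standard output, which is how you’d normally use it at the prompt.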
(You usually see this called as dis.dis(foo), directly on the function object. That’s just a convenience: dis is really analyzing the code object. If it’s passed a function, it just gets its code object.)
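One way to convince yourself of this, assuming CPython 3 and a stand-in foo: disassemble the function and its code object separately and compare the two listings. The helper name listing_of is hypothetical, introduced only for this sketch.

```python
import dis
import io

# Stand-in for the series' function foo (assumed definition).
def foo(a):
    x = 3
    return x + a

def listing_of(target):
    """Return the text that dis.dis would print for target."""
    buffer = io.StringIO()
    dis.dis(target, file=buffer)
    return buffer.getvalue()

same = listing_of(foo) == listing_of(foo.__code__)
print(same)  # True
```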
The numbers in the left-hand column are line numbers in the original source code. The second column is the offset into the bytecode: LOAD_CONST appears at position 0, STORE_FAST at position 3, and so on. The middle column shows the names of the instructions those bytes stand for. These names are just for our (human) benefit; the interpreter doesn’t need the names.
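If you’d rather pick the columns apart programmatically, Python 3’s dis.get_instructions yields one record per instruction. This is a sketch under the same assumptions as before (CPython 3, stand-in foo); the offsets and names it prints will not match the listing discussed here on newer interpreters.

```python
import dis

# Stand-in for the series' function foo (assumed definition).
def foo(a):
    x = 3
    return x + a

instructions = list(dis.get_instructions(foo))
for instr in instructions:
    # offset:  position of the instruction inside co_code
    # opname:  human-readable name of the instruction
    # arg:     raw numeric argument, or None if there is none
    # argrepr: dis's human-readable description of the argument
    print(instr.offset, instr.opname, instr.arg, instr.argrepr)
```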
The last two columns give details about the instruction’s argument, if there is an argument. The fourth column shows the argument itself, which represents an index into other attributes of the code object. In the example, LOAD_CONST’s argument is an index into the list co_consts, and STORE_FAST’s argument is an index into co_varnames. Finally, in the fifth column, dis has looked up the constants or names in the place the fourth column specified and told us what it found there. We can easily verify this by indexing into co_consts and co_varnames ourselves.
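A sketch of that check, assuming CPython 3. The stand-in foo uses the constant 1000 rather than a tiny number; that is my assumption, made so the constant is stored in co_consts on every CPython 3 release (some newer releases may encode very small integers directly in the instruction).

```python
import dis

# Stand-in function (assumed definition, with a larger constant on purpose).
def foo(a):
    x = 1000
    return x + a

code = foo.__code__
for instr in dis.get_instructions(foo):
    if instr.opname == "LOAD_CONST":
        # The argument is an index into co_consts.
        print("LOAD_CONST", instr.arg, "->", code.co_consts[instr.arg])
    elif instr.opname == "STORE_FAST":
        # The argument is an index into co_varnames.
        print("STORE_FAST", instr.arg, "->", code.co_varnames[instr.arg])
```

In each case the value found by indexing should match what dis shows in its last column.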
This also explains why the second instruction, STORE_FAST, is found at bytecode position 3. If an instruction takes an argument, the two bytes after its opcode hold that argument, so LOAD_CONST occupies positions 0 through 2. It’s the interpreter’s job to handle this correctly.
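The layout differs on newer interpreters, so here is a hedged sketch for CPython 3.6 and later. There, every instruction is a two-byte unit (one opcode byte, one argument byte), and recent versions also reserve hidden cache slots after some instructions, so you will see even offsets rather than the 0, 3, 6 pattern. What stays the same is that the byte at each instruction’s offset is its opcode. The function foo is a stand-in.

```python
import dis

# Stand-in for the series' function foo (assumed definition).
def foo(a):
    x = 3
    return x + a

raw = foo.__code__.co_code
instructions = list(dis.get_instructions(foo))
for instr in instructions:
    # raw[instr.offset] is the opcode byte stored at that position.
    print(instr.offset, raw[instr.offset], instr.opname)
```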
(You may be surprised that BINARY_ADD doesn’t have arguments. We’ll come back to this in a future installment, when we get to the interpreter itself.)
People often say that dis is a disassembler of Python bytecode. This is true enough (the dis module’s docs say it), but dis knows about more than just the bytecode, too: it uses the whole code object to give us an understandable printout. The middle three columns show information actually encoded in the bytecode, while the first and the last columns show other information. Again, the bytecode itself is really limited: it’s just a series of numbers, and things like names and constants are not a part of it.
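To see how little the bytecode itself carries, here is a small sketch assuming CPython 3. The two functions are hypothetical and differ only in their variable name and constant; their bytecode comes out byte-for-byte identical, because the instructions hold only indexes, while the names and constants live in other attributes of the code object.

```python
# Two hypothetical functions with the same shape but different
# variable names and constants.
def add_thousand(a):
    return a + 1000

def add_two_thousand(b):
    return b + 2000

f = add_thousand.__code__
g = add_two_thousand.__code__

print(f.co_code == g.co_code)        # True: same instructions, same indexes
print(f.co_consts == g.co_consts)    # False: the constants differ
print(f.co_varnames, g.co_varnames)  # ('a',) ('b',)
```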
How does the dis module get from bytes like 100 to names like LOAD_CONST and back? Try to think of a way you’d do it. If you thought “Well, you could have a list that has the byte names in the right order,” or you thought, “I guess you could have a dictionary where the names are the keys and the byte values are the values,” then congratulations! That’s exactly what’s going on. The file opcode.py defines the list and the dictionary. It’s full of calls to def_op, which inserts each mapping into both the list and the dictionary.
Many of those lines even carry a friendly comment telling us what the instruction’s argument means.
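Both structures are re-exported by the dis module as dis.opname (the list) and dis.opmap (the dictionary), so the round trip can be sketched as below. This assumes CPython 3 and a stand-in foo; the numeric values differ between interpreter versions.

```python
import dis

# Stand-in for the series' function foo (assumed definition).
def foo(a):
    x = 3
    return x + a

for instr in dis.get_instructions(foo):
    number = dis.opmap[instr.opname]  # dictionary: name -> byte value
    name = dis.opname[number]         # list: byte value -> name
    print(number, name)
```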
OK, now we understand what Python bytecode is (and isn’t), and how to use dis to make sense of it. In Part 4, we’ll look at another example to see how Python can compile down to bytecode but still be a dynamic language.