Allison Kaptur

An occasional blog on programming

Of Syntax Warnings and Symbol Tables

A Hacker Schooler hit an interesting bug today: her program would sometimes emit the message SyntaxWarning: import * only allowed at module level. I had never seen a SyntaxWarning before, so I decided to dig in.

The wording of the warning is strange: it says that star-import is only allowed at the module level, but it’s not a syntax error, just a warning. In fact, you can use a star-import in a scope that isn’t a module (in Python 2):

>>> def nope():
...     from random import *
...     print randint(1,10)
...
<stdin>:1: SyntaxWarning: import * only allowed at module level
>>> nope()
7

The Python spec gives some more details:

The from form with * may only occur in a module scope. If the wild card form of import — import * — is used in a function and the function contains or is a nested block with free variables, the compiler will raise a SyntaxError.

Just having import * in a function isn’t enough to raise a syntax error – we also need free variables. The Python execution model refers to three kinds of variables, ‘local,’ ‘global,’ and ‘free’, defined as follows:

If a name is bound in a block, it is a local variable of that block. If a name is bound at the module level, it is a global variable. (The variables of the module code block are local and global.) If a variable is used in a code block but not defined there, it is a free variable.
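To make these definitions concrete, here's a small sketch (ours, not the spec's) with one variable of each kind:

>>> x = 1                # bound at module level: x is a global variable
>>> def outer():
...     y = 2            # bound in outer's block: y is local to outer
...     def inner():
...         print x, y   # y is used in inner but not bound there: free
...     inner()
...
>>> outer()
1 2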

Now we can see how to trigger a syntax error from our syntax warning:

>>> def one():
...     def two():
...         from random import *
...         print randint(1,10)
...     two()
...
<stdin>:2: SyntaxWarning: import * only allowed at module level
  File "<stdin>", line 3
SyntaxError: import * is not allowed in function 'two' because it is a nested function

and similarly,

>>> def one():
...     from random import *
...     def two():
...         print randint(1,10)
...     two()
...
<stdin>:1: SyntaxWarning: import * only allowed at module level
  File "<stdin>", line 2
SyntaxError: import * is not allowed in function 'one' because it contains a nested function with free variables

As Python programmers, we’re used to our lovely dynamic language, and it’s unusual to hit compile-time constraints. As Amy Hanlon points out, it’s particularly weird to hit a compile-time error for code that wouldn’t raise a NameError when it ran – randint would indeed be in one’s namespace if the import-star had executed.

But we can’t run code that doesn’t compile, and in this case the compiler doesn’t have enough information to determine what bytecode to emit. There are different opcodes for loading and storing each of global, free, and local variables. A variable’s status as global, free, or local must be determined at compile time and then stored in the symbol table.
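We don't have to take the symbol table on faith – the symtable module exposes it. Here's a quick sketch of inspecting a nested-function case like the ones above:

>>> import symtable
>>> src = '''
... def one():
...     x = 1
...     def two():
...         print x
... '''
>>> mod = symtable.symtable(src, '<string>', 'exec')
>>> one = mod.get_children()[0]   # the table for one
>>> two = one.get_children()[0]   # the table for two, nested inside
>>> one.lookup('x').is_local()
True
>>> two.lookup('x').is_free()
True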

To investigate this, let’s look at minor variations on this code snippet and disassemble them.

>>> import dis
>>> code_type = type((lambda: 1).__code__) # just a handle on the code type, which isn't exposed as a builtin
>>> # A helper function to disassemble nested functions
>>> def recur_dis(fn):
...     print(fn.__code__)
...     dis.dis(fn)
...     for const in fn.__code__.co_consts:
...         if isinstance(const, code_type):
...             print
...             print
...             print(const)
...             dis.dis(const)

First, when x is local, the compiler emits STORE_FAST in the assignment statement and LOAD_FAST to load it, marked with arrows below.

>>> def one():
...     def two():
...         x = 1
...         print x
...     two()
>>> recur_dis(one)
<code object one at 0x10e246730, file "<stdin>", line 1>
  2           0 LOAD_CONST               1 (<code object two at 0x10e246030, file "<stdin>", line 2>)
              3 MAKE_FUNCTION            0
              6 STORE_FAST               0 (two)

  5           9 LOAD_FAST                0 (two)
             12 CALL_FUNCTION            0
             15 POP_TOP
             16 LOAD_CONST               0 (None)
             19 RETURN_VALUE

<code object two at 0x10e246030, file "<stdin>", line 2>
  3           0 LOAD_CONST               1 (1)
              3 STORE_FAST               0 (x)          <---- STORE_FAST

  4           6 LOAD_FAST                0 (x)          <----- LOAD_FAST
              9 PRINT_ITEM
             10 PRINT_NEWLINE
             11 LOAD_CONST               0 (None)
             14 RETURN_VALUE

When x is global, the compiler emits LOAD_GLOBAL to load it. The assignment compiles to STORE_NAME, but it’s not pictured here because it happens at module level, outside the function, and thus isn’t disassembled.

>>> x = 1
>>> def one():
...     def two():
...         print x
...     two()
...
>>> recur_dis(one)
<code object one at 0x10e246730, file "<stdin>", line 1>
  2           0 LOAD_CONST               1 (<code object two at 0x10e2464b0, file "<stdin>", line 2>)
              3 MAKE_FUNCTION            0
              6 STORE_FAST               0 (two)

  4           9 LOAD_FAST                0 (two)
             12 CALL_FUNCTION            0
             15 POP_TOP
             16 LOAD_CONST               0 (None)
             19 RETURN_VALUE

<code object two at 0x10e2464b0, file "<stdin>", line 2>
  3           0 LOAD_GLOBAL              0 (x)          <----- LOAD_GLOBAL
              3 PRINT_ITEM
              4 PRINT_NEWLINE
              5 LOAD_CONST               0 (None)
              8 RETURN_VALUE

Finally, when x is bound in one and free in two, the compiler notices that we’ll need a closure, and emits the opcodes STORE_DEREF, LOAD_CLOSURE, and MAKE_CLOSURE in one, and LOAD_DEREF in two.

>>> def one():
...     x = 1
...     def two():
...        print x
...     two()
...
>>> recur_dis(one)
<code object one at 0x10e246e30, file "<stdin>", line 1>
  2           0 LOAD_CONST               1 (1)
              3 STORE_DEREF              0 (x)          <----- STORE_DEREF

  3           6 LOAD_CLOSURE             0 (x)          <----- LOAD_CLOSURE
              9 BUILD_TUPLE              1
             12 LOAD_CONST               2 (<code object two at 0x10e246d30, file "<stdin>", line 3>)
             15 MAKE_CLOSURE             0
             18 STORE_FAST               0 (two)

  5          21 LOAD_FAST                0 (two)
             24 CALL_FUNCTION            0
             27 POP_TOP
             28 LOAD_CONST               0 (None)
             31 RETURN_VALUE

<code object two at 0x10e246d30, file "<stdin>", line 3>
  4           0 LOAD_DEREF               0 (x)          <----- LOAD_DEREF
              3 PRINT_ITEM
              4 PRINT_NEWLINE
              5 LOAD_CONST               0 (None)
              8 RETURN_VALUE

Let’s now return to a case that throws a syntax error.

>>> def one():
...      from random import *
...      def two():
...          print x
...      two()
...
<stdin>:1: SyntaxWarning: import * only allowed at module level
  File "<stdin>", line 2
SyntaxError: import * is not allowed in function 'one' because it contains a nested function with free variables

I’d love to show what the disassembled bytecode for this one looks like, but we can’t do that because there is no bytecode! We got a compile-time error, so there’s nothing here.

Further reading

Everything I know about symbol tables I learned from Eli Bendersky’s blog. I’ve skipped some complexity in the implementation that Eli covers.

Acking through the source code of CPython for the text of the error message leads us right to symtable.c, which is exactly where we’d expect this message to be emitted. The function check_unoptimized shows where the syntax error gets thrown (and shows another illegal construct, too – but we’ll leave that one as an exercise for the reader).

p.s. In Python 3, import * anywhere other than a module is just an unqualified syntax error – none of this messing around with the symbol table.
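A sketch of the Python 3 behavior (the exact message wording varies a little across versions):

>>> def nope():
...     from random import *
...
  File "<stdin>", line 2
SyntaxError: import * only allowed at module level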

PyCon Prep: Import Is a Keyword

Last week I had a ton of fun working with Amy Hanlon on her Harry Potter themed fork of Python, called Nagini. Nagini is full of magic and surprises. It implements the things you’d hope for out of a Harry Potter Python, like making quit into avada_kedavra, and many analogous jokes.

Amy also had the idea to replace import with accio! Replacing import is a much harder problem than renaming a builtin. Python doesn’t prevent you from overwriting builtins, whereas to change keywords you have to edit the grammar and recompile Python. You should go read Amy’s post on making this work.

This brings us to an interesting question: why is import a keyword, anyway? There’s a function, __import__, that does (mostly) the same thing:

>>> __import__('random')
<module 'random' from '/path/to/random.pyc'>

The function form requires the programmer to assign the return value – the module – to a name, but once we’ve done that it works just like a normal module:

>>> random = __import__('random')
>>> random.random()
0.32574174955668145

The __import__ function can handle all the forms of import, including from foo import bar and from baz import * (although it never modifies the calling namespace). There’s no technical reason why __import__ couldn’t be the regular way to do imports.1
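For example, a plain __import__ of a dotted name returns the top-level package; the fromlist argument is what makes the from form work. A quick sketch using the standard library (path elided, output shown for a Unix system):

>>> __import__('os.path')              # dotted name: returns the top-level package
<module 'os' from '...'>
>>> path = __import__('os.path', fromlist=['join'])  # fromlist: returns os.path itself
>>> join = path.join                   # we still bind the name ourselves
>>> join('a', 'b')
'a/b'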

As far as I can tell, the main argument against an import function is aesthetic. Compare:

import foo
from foo import bar
import longmodulename as short

foo = __import__('foo')
bar = __import__('foo').bar
short = __import__('longmodulename')

The first way certainly feels much cleaner and more readable.2

Part of my goal in my upcoming PyCon talk is to invite Pythonistas to consider decisions they might not have thought about before. import is a great vehicle for this, because everyone learns it very early on in their programming development, but most people don’t ever think about it again. Here’s another variation on that theme: import doesn’t have to be a keyword!


  1. I think all keywords could be expressed as functions, except those used for flow control (which I loosely define as keywords that generate any JUMP instructions when compiled). For example, between Python 2 and 3, two keywords did become functions – print and exec.

  2. I realize this is a slightly circular argument – if the function strategy were the regular way to import, it probably wouldn’t be so ugly.

Reading EBNF

One of the fun parts of pairing with Amy on Nagini was modifying the grammar of Python. You should try this – it’s easier than you think! Eli Bendersky has a great post with step-by-step instructions.1

Modifying Python’s grammar starts in the Grammar/Grammar file. I’ve recently learned how to read this (which, for me, mostly means learning how to pronounce the punctuation), so I want to walk through the import example in some detail. The syntax here is Extended Backus-Naur Form, or EBNF. You read it like a tree, and your primary verb is “consists of”:

  • import_stmt consists of one of two forms, import_name or import_from.
  • import_name consists of the literal word import followed by dotted_as_names.
  • dotted_as_names consists of a dotted_as_name (note the singular), optionally followed by one or more pairs of a comma and another dotted_as_name.
  • dotted_as_name consists of a dotted_name, optionally followed by the literal word ‘as’ and a NAME.
  • Finally, dotted_name consists of a NAME, maybe followed by pairs of a dot and another NAME.

You can walk the other branches in a similar way.

import_stmt: import_name | import_from
import_name: 'import' dotted_as_names
# note below: the ('.' | '...') is necessary because '...' is tokenized as ELLIPSIS
import_from: ('from' (('.' | '...')* dotted_name | ('.' | '...')+)
              'import' ('*' | '(' import_as_names ')' | import_as_names))
import_as_name: NAME ['as' NAME]
dotted_as_name: dotted_name ['as' NAME]
import_as_names: import_as_name (',' import_as_name)* [',']
dotted_as_names: dotted_as_name (',' dotted_as_name)*
dotted_name: NAME ('.' NAME)*

To accio-ify Python, we had to replace the occurrences of 'import' with 'accio'. There are only two – we’re interested only in the literal string import, not all the other names. import_as_name and so on are just nodes in the tree, and matter only to the parser and compiler.

Every other keyword and symbol that has special meaning to the Python parser also appears in Grammar as a string.

Perusing the grammar is a (goofy) way to learn about corner cases of Python syntax, too! For example, did you know that with can take more than one context manager? It’s right there in the grammar:

with_stmt: 'with' with_item (',' with_item)*  ':' suite
with_item: test ['as' expr]

>>> with open('foo.txt','w') as f, open('bar.txt','w') as g:
...     f.write('foo')
...     g.write('bar')
...
>>> f
<closed file 'foo.txt', mode 'w' at 0x10bf71270>
>>> g
<closed file 'bar.txt', mode 'w' at 0x10bf71810>

Now go ahead and add your favorite keyword into Python!


  1. Like Eli, I’m not advocating for Python’s actual grammar to change – it’s just a fun exercise.

PyCon Prep: `require` in Ruby

I’m talking about import at PyCon in April. In the talk, we’ll imagine that there is no import and will reinvent it from scratch. I hope this will give everyone (including me!) a deeper understanding of the choices import makes and the ways it could have been different. Ideally, the structure will be a couple of sections of the form “We could have made [decisions]. That would mean [effects]. Surprise – that’s how it works in [language]!”1

This is the first of (probably) several posts with notes of things I’m learning as I prepare my talk. Feedback is welcome.

Today I’m looking into Ruby’s require and require_relative2 to see if aspects of them would be interesting to Python programmers. So far, here’s what I think is most relevant:

  • Unlike Python, require won’t load all objects in the required file. There’s a concept of local versus global variables in the file scope that doesn’t exist in Python.

  • Unlike Python, one file does not map to one module. Modules are created by using the keyword module.

  • Unlike Python, namespace collisions are completely possible. Consider the following simple files:

one.rb
puts "one!"

def foo
  :hello
end
two.rb
puts "two!"

def foo
  :world
end
main.rb
require_relative 'one'
require_relative 'two'

puts foo

And the output from running main.rb:

output
one!
two!
world
  • Like Python’s import, require will only load a file once. This can interact interestingly with namespace collisions – to take a contrived example:
main.rb
require_relative 'one'
require_relative 'two'
require_relative 'one'

puts foo

Because one.rb isn’t reloaded, foo still returns :world:

output
one!
two!
world
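For contrast, here’s roughly how the same setup plays out in Python, sketched with hypothetical modules one.py and two.py that each define their own foo:

import one, two   # hypothetical modules, each defining a function foo

print one.foo()   # each foo lives in its own module's namespace,
print two.foo()   # so the two definitions never collide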

Questions for further investigation / thought

My talk should not convince people that Python is Right and other languages are Wrong. I’m trying to overcome my bias towards the system I’m most used to. (I think I’ve written roughly equal amounts of Python and Ruby, but the vast majority of the Ruby I’ve written is Rails, where all the require-ing and namespacing happens by magic.) Here are some questions I’d like to research more.

  1. Python’s namespacing feels much better to me, although I’m sure that’s partly because I’m used to it. What’s the advantage to doing namespacing this way?

  2. Why have both require and require_relative? Why not have require check the relative path as well before raising a LoadError?

  3. What’s the advantage of uncoupling a module from a file?


  1. I asked on twitter for suggestions of languages that make interesting decisions about import equivalents. So far the suggestions are R, Go, Rust, Ruby, JavaScript, and Clojure. If you have others, let me know.

  2. As far as I can tell, the only difference between require and require_relative is the load path searched.

Introduction to the Python Interpreter, Part 4: It’s Dynamic!

[Edit: A significantly expanded version of this series appears as a chapter in The Architecture of Open Source Applications, volume 4, as A Python Interpreter Written in Python.]

This is Part 4 in a series on the Python interpreter. Read Part 1, Part 2, and Part 3. If you’re enjoying this series, consider applying to Hacker School, where I work as a facilitator.

One of the things I was confused about when I started digging into python internals was how python could be “dynamic” if it was also “compiled.” Often, in casual conversation, those two words are used as antonyms – there are “dynamic languages,”1 like Python, Ruby, and JavaScript, and “compiled languages,” like C, Java, and Haskell.

Most of the time, when people talk about a “compiled” language, they mean one that compiles down to native x86/ARM/etc instructions2 – instructions for an actual machine made of metal. An “interpreted” language either doesn’t have any compilation at all3, or compiles to an intermediate representation, like bytecode. Bytecode is instructions for a virtual machine, not a piece of hardware. Python falls into this latter category: the Python compiler’s job is to generate bytecode for the Python interpreter.4

The Python interpreter’s job is to make sense of the bytecode via the virtual machine, which turns out to be a lot of work. We’ll dig into the virtual machine in Part 5.

So far our discussion of compiling versus interpreting has been abstract. These ideas become clearer with an example.

>>> def modulus(x, y):
...     return x % y
...
>>> [ord(b) for b in modulus.func_code.co_code]
[124, 0, 0, 124, 1, 0, 22, 83]
>>> import dis
>>> dis.dis(modulus.func_code)
  2           0 LOAD_FAST                0 (x)
              3 LOAD_FAST                1 (y)
              6 BINARY_MODULO
              7 RETURN_VALUE

Here’s a function, its bytecode, and its bytecode run through the disassembler. By the time we get the prompt back after the function definition, the modulus function has been compiled and a code object generated. That code object will never be modified.

This seems pretty easy to reason about. Unsurprisingly, using the modulo operator (%) causes the compiler to emit the instruction BINARY_MODULO. It looks like this function will be useful if we need to calculate a remainder.

>>> modulus(15,4)
3

So far, so good. But what if we don’t pass it numbers?

>>> modulus("hello %s", "world")
'hello world'

Uh-oh, what happened there? You’ve probably seen this before, but it usually looks like this:

>>> print "hello %s" % "world"
hello world

Somehow, when BINARY_MODULO is faced with two strings, it does string interpolation instead of taking a remainder. This situation is a great example of dynamic typing. When the compiler built our code object for modulus, it had no idea whether x and y would be strings, numbers, or something else entirely. It just emitted some instructions: load one name, load another, BINARY_MODULO the two objects, and return the result. It’s the interpreter’s job to figure out what BINARY_MODULO actually means.

I’d like to reflect on the depth of our ignorance for a moment. Our function modulus can calculate remainders, or it can do string formatting … what else? If we define a custom object that responds to __mod__, then we can do anything.

>>> class Surprise(object):
...     def __init__(self, num):
...         self.num = num
...     def __mod__(self, other):
...         return self.num + other.num
...
>>> seven = Surprise(7)
>>> four = Surprise(4)
>>> modulus(seven, four)
11
>>> modulus(7,4)
3
>>> modulus("hello %s", "world")
'hello world'

The same function modulus, with the same bytecode, has wildly different effects when passed different kinds of objects. It’s also possible for modulus to raise an error – for example, a TypeError if we called it on objects that didn’t implement __mod__. Heck, we could even write a custom object that raises a SystemExit when __mod__ is invoked. Our __mod__ method could write to a file, or change a global variable, or delete another attribute of the object. We have near-total freedom.

This ignorance is one of the reasons that it’s hard to optimize Python: you don’t know when you’re compiling the code object and generating the bytecode what it’s going to end up doing. The compiler has no idea what’s going to happen. As Russell Power and Alex Rubinsteyn wrote in “How fast can we make interpreted Python?”, “In the general absence of type information, almost every instruction must be treated as INVOKE_ARBITRARY_METHOD.”

While a general definition of “compiling” and “interpreting” can be difficult to nail down, in the context of Python it’s fairly straightforward. Compiling is generating the code objects, including the bytecode. Interpreting is making sense of the bytecode in order to actually make things happen. One of the ways in which Python is “dynamic” is that the same bytecode doesn’t always have the same effect. More generally, in Python the compiler does relatively little work, and the interpreter relatively more.

In Part 5, we’ll look at the actual virtual machine and interpreter.


  1. You sometimes hear “interpreted language” instead of “dynamic language,” which is usually, mostly, synonymous.

  2. Thanks to David Nolen for this definition. The lines between “parsing,” “compiling,” and “interpreting” are not always clear.

  3. Some languages that are usually not compiled at all include R, Scheme, and binary, depending on the implementation and your definition of “compile.”

  4. As always in this series, I’m talking about CPython and Python 2.7, although most of this content is true across implementations.

Introduction to the Python Interpreter, Part 3: Understanding Bytecode

[Edit: A significantly expanded version of this series appears as a chapter in The Architecture of Open Source Applications, volume 4, as A Python Interpreter Written in Python.]

This is Part 3 in a series on the Python interpreter. Part 1 here, Part 2 here. If you’re enjoying this series, consider applying to Hacker School, where I work as a facilitator.

Bytecode

When we left our heroes, they had come across some odd-looking output:

>>> foo.func_code.co_code
'd\x01\x00}\x01\x00|\x01\x00|\x00\x00\x17S'

This is python bytecode.

You recall from Part 2 that “python bytecode” and “a python code object” are not the same thing: the bytecode is an attribute of the code object, among many other attributes. Bytecode is found in the co_code attribute of the code object, and contains instructions for the interpreter.

So what is bytecode? Well, it’s just a series of bytes. They look wacky when we print them because some bytes are printable and others aren’t, so let’s take the ord of each byte to see that they’re just numbers.

>>> [ord(b) for b in foo.func_code.co_code]
[100, 1, 0, 125, 1, 0, 124, 1, 0, 124, 0, 0, 23, 83]

Here are the bytes that make up python bytecode. The interpreter will loop through each byte, look up what it should do for each one, and then do that thing. Notice that the bytecode itself doesn’t include any python objects, or references to objects, or anything like that.

One way to understand python bytecode would be to find the CPython interpreter file (it’s ceval.c), and flip through it looking up what 100 means, then 1, then 0, and so on. We’ll do this later in the series! For now, there’s a simpler way: the dis module.

Disassembling bytecode

Disassembling bytecode means taking this series of bytes and printing out something we humans can understand. It’s not a step in python execution; the dis module just helps us understand an intermediate state of python internals. I can’t think of a reason why you’d ever want to use dis in production code – it’s for humans, not for machines.

Today, however, taking some bytecode and making it human-readable is exactly what we’re trying to do, so dis is a great tool. We’ll use the function dis.dis to analyze the code object of our function foo.

>>> def foo(a):
...     x = 3
...     return x + a
...
>>> import dis
>>> dis.dis(foo.func_code)
  2           0 LOAD_CONST               1 (3)
              3 STORE_FAST               1 (x)

  3           6 LOAD_FAST                1 (x)
              9 LOAD_FAST                0 (a)
             12 BINARY_ADD
             13 RETURN_VALUE

(You usually see this called as dis.dis(foo), directly on the function object. That’s just a convenience: dis is really analyzing the code object. If it’s passed a function, it just gets its code object.)

The numbers in the left-hand column are line numbers in the original source code. The second column is the offset into the bytecode: LOAD_CONST appears at position 0, STORE_FAST at position 3, and so on. The middle column shows the names of the instructions. These names are just for our (human) benefit – the interpreter doesn’t need them.

The last two columns give details about the instruction’s argument, if there is one. The fourth column shows the argument itself, which represents an index into other attributes of the code object. In the example, LOAD_CONST’s argument is an index into the list co_consts, and STORE_FAST’s argument is an index into co_varnames. Finally, in the fifth column, dis has looked up the constants or names in the place the fourth column specified and told us what it found there. We can easily verify this:

>>> foo.func_code.co_consts[1]
3
>>> foo.func_code.co_varnames[1]
'x'

This also explains why the second instruction, STORE_FAST, is found at bytecode position 3. If a bytecode has an argument, the next two bytes are that argument. It’s the interpreter’s job to handle this correctly.
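We can mimic that job in a few lines. Here’s a minimal sketch of the decoding loop for Python 2 bytecode (ignoring the rare EXTENDED_ARG opcode), run on the same foo:

>>> import opcode
>>> def crude_dis(code_obj):
...     code = code_obj.co_code
...     i = 0
...     while i < len(code):
...         op = ord(code[i])               # one byte names the instruction
...         i += 1
...         if op >= opcode.HAVE_ARGUMENT:  # opcodes from 90 up take an argument
...             arg = ord(code[i]) | (ord(code[i+1]) << 8)  # two bytes, little-endian
...             i += 2
...             print opcode.opname[op], arg
...         else:
...             print opcode.opname[op]
...
>>> crude_dis(foo.func_code)
LOAD_CONST 1
STORE_FAST 1
LOAD_FAST 1
LOAD_FAST 0
BINARY_ADD
RETURN_VALUE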

(You may be surprised that BINARY_ADD doesn’t have arguments. We’ll come back to this in a future installment, when we get to the interpreter itself.)

People often say that dis is a disassembler of python bytecode. This is true enough – the dis module’s docs say it – but dis knows about more than just the bytecode, too: it uses the whole code object to give us an understandable printout. The middle three columns show information actually encoded in the bytecode, while the first and the last columns show other information. Again, the bytecode itself is really limited: it’s just a series of numbers, and things like names and constants are not a part of it.

How does the dis module get from bytes like 100 to names like LOAD_CONST and back? Try to think of a way you’d do it. If you thought “Well, you could have a list that has the byte names in the right order,” or you thought, “I guess you could have a dictionary where the names are the keys and the byte values are the values,” then congratulations! That’s exactly what’s going on. The file opcode.py defines the list and the dictionary. It’s full of lines like these (def_op inserts the mapping in both the list and the dictionary):

def_op('LOAD_CONST', 100)       # Index in const list
def_op('BUILD_TUPLE', 102)      # Number of tuple items
def_op('BUILD_LIST', 103)       # Number of list items
def_op('BUILD_SET', 104)        # Number of set items

There’s even a friendly comment telling us what each byte’s argument means.
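We can poke at both structures from the REPL:

>>> import opcode
>>> opcode.opname[100]          # the list: byte value to name
'LOAD_CONST'
>>> opcode.opmap['LOAD_CONST']  # the dictionary: name to byte value
100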

Ok, now we understand what python bytecode is (and isn’t), and how to use dis to make sense of it. In Part 4, we’ll look at another example to see how Python can compile down to bytecode but still be a dynamic language.

Introduction to the Python Interpreter, Part 2: Code Objects

[Edit: A significantly expanded version of this series appears as a chapter in The Architecture of Open Source Applications, volume 4, as A Python Interpreter Written in Python.]

This is part of a series on the python interpreter. Part 1 here.

When we left our heroes, they were examining a simple function object. Let’s now dive a level deeper, and look at this function’s code object.

>>> def foo(a):
...     x = 3
...     return x + a
...
>>> foo
<function foo at 0x107ef7aa0>
>>> foo.func_code
<code object foo at 0x107eeccb0, file "<stdin>", line 1>

As you can see in the code above, the code object is an attribute of the function object. (There are lots of other attributes on the function object, too. They’re mostly not interesting because foo is so simple.)

A code object is generated by the Python compiler and interpreted by the interpreter. It contains information that the interpreter needs to do its job. Let’s look at the attributes of the code object.

>>> dir(foo.func_code)
['__class__', '__cmp__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__',
'__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__ne__', '__new__',
'__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__',
'__subclasshook__', 'co_argcount', 'co_cellvars', 'co_code', 'co_consts', 'co_filename',
'co_firstlineno', 'co_flags', 'co_freevars', 'co_lnotab', 'co_name', 'co_names', 'co_nlocals',
'co_stacksize', 'co_varnames']

There’s a bunch of stuff going on here, much of which we’re not going to worry about today. Let’s take a look at three attributes that are interesting to us for our code object on foo.

>>> foo.func_code.co_varnames
('a', 'x')
>>> foo.func_code.co_consts
(None, 3)
>>> foo.func_code.co_argcount
1

Here are some intelligible-looking things: the names of the variables and the constants that our function knows about and the number of arguments the function takes. But so far, we haven’t seen anything that looks like instructions for how to execute the code object. These instructions are called bytecode. Bytecode is an attribute of the code object:

>>> foo.func_code.co_code
'd\x01\x00}\x01\x00|\x01\x00|\x00\x00\x17S'

So much for our intelligible-looking things. What’s going on here? We’ll dive in to bytecode in Part 3.

Introduction to the Python Interpreter, Part 1: Function Objects

[Edit: A significantly expanded version of this series appears as a chapter in The Architecture of Open Source Applications, volume 4, as A Python Interpreter Written in Python.]

Over the last three months, I’ve spent a lot of time working with Ned Batchelder on byterun, a python bytecode interpreter written in python. Working on byterun has been tremendously educational and a lot of fun for me. At the end of this series, I’m going to attempt to convince you that it would be interesting and fun for you to play with byterun, too. But before we do that, we need a bit of a warm-up: an overview of how python’s internals work, so that we can understand what an interpreter is, what it does, and what it doesn’t do.

This series assumes that you’re in a similar position to where I was three months ago: you know python, but you don’t know anything about the internals.

One quick note: I’m going to work in and talk about Python 2.7 in this post. The interpreter in Python 3 is mostly pretty similar. There are also some syntax and naming differences, which I’m going to ignore, but everything we do here is possible in Python 3 as well.

How does it python?

We’ll start out with a really (really) high-level view of python’s internals. What happens when you execute a line of code in your python REPL?

~ $ python
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a = "hello"

There are four steps that python takes when you hit return: lexing, parsing, compiling, and interpreting. Lexing is breaking the line of code you just typed into tokens. The parser takes those tokens and generates a structure that shows their relationship to each other (in this case, an Abstract Syntax Tree). The compiler then takes the AST and turns it into one (or more) code objects. Finally, the interpreter takes each code object and executes the code it represents.
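We can see the artifacts of the last three steps from the REPL. Here’s a quick sketch (object address elided):

>>> import ast
>>> ast.dump(ast.parse('a = "hello"'))          # the parser's output: an AST
"Module(body=[Assign(targets=[Name(id='a', ctx=Store())], value=Str(s='hello'))])"
>>> code = compile('a = "hello"', '<string>', 'exec')  # the compiler's output
>>> code
<code object <module> at 0x..., file "<string>", line 1>
>>> exec code                                   # the interpreter executes it
>>> a
'hello'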

I’m not going to talk about lexing, parsing, or compiling at all today, mainly because I don’t know anything about these steps yet. Instead, we’ll suppose that all that went just fine, and we’ll have a proper python code object for the interpreter to interpret.

Before we get to code objects, let me clear up some common confusion. In this series, we’re going to talk about function objects, code objects, and bytecode. They’re all different things. Let’s start with function objects. We don’t really have to understand function objects to get to the interpreter, but I want to stress that function objects and code objects are not the same – and besides, function objects are cool.

Function objects

You might have heard of “function objects.” These are the things people are talking about when they say things like “Functions are first-class objects,” or “Python has first-class functions.” Let’s take a look at one.

>>> def foo(a):
...     x = 3
...     return x + a
...
>>> foo
<function foo at 0x107ef7aa0>

“Functions are first-class objects,” means that functions are objects, like a list is an object or an instance of MyObject is an object. Since foo is an object, we can talk about it without invoking it (that is, there’s a difference between foo and foo()). We can pass foo into another function as an argument, or we could bind it to a new name (other_function = foo). With first-class functions, all sorts of possibilities are open to us!
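For instance, here’s a tiny sketch that passes foo around without ever writing foo():

>>> other_function = foo         # bind the same function object to a new name
>>> other_function(1)
4
>>> def apply_twice(fn, arg):    # a function travels as an argument like any object
...     return fn(fn(arg))
...
>>> apply_twice(foo, 1)          # foo(foo(1)) == foo(4)
7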

In Part 2, we’ll dive down a level and look at the code object.

Python Puzzle Solutions

I really enjoyed seeing all the clever solutions to the python puzzle I posted. You’re all very creative! Here’s a discussion of the solutions I’ve seen, plus some clarifications. All spoilers are below the fold.

First, clarifications. (These weren’t always clear in the problem statement, particularly if you got the problem off of twitter, so award yourself full marks as desired.)

Order doesn’t matter

“Order doesn’t matter” means that the three-line version always returns False, and the semicolon version always returns True.

You control only the contents of the lines

Several people, including Pepijn De Vos, David Wolever, and diarmuidbourke suggested something like the following:

>>> """a; b; c""" == 'a; b; c'
True
>>> """a
... b
... c""" == 'a; b; c'
False

I’m being pedantic here, but I rule this cheating, since (a) each line has to be a valid python expression or statement, and a multi-line string literal is only one expression, and (b) the string """a; b; c""" is not the same as the string """a\nb\nc""".

Solutions appear below the fold.

A Python Puzzle

A couple of Hacker Schoolers were discussing an interesting corner of python today. We discovered a nice bit of trivia: there exist three lines of python code that display the following behavior:

>>> LINE_A
>>> LINE_B
>>> LINE_C
False
>>> LINE_A; LINE_B; LINE_C
True
>>> def my_function():
...     LINE_A
...     LINE_B
...     LINE_C
>>> my_function()
True

What are the lines?

Some ground rules:

  • Introspection of any kind is cheating (e.g. noting the line number).
  • No dunder (__foo__) methods allowed.
  • Each line is a valid python expression.
  • You can’t rely on order: while the lines will always execute A –> B –> C, a complete solution behaves identically if e.g. the semicolon version happens before the separate-line version.
  • No cheating with the function: e.g. you can’t add a return unless you add it everywhere.
  • Edit: And nothing stateful.

For bonus points, code golf! My solution to this is 19 characters long, not counting whitespace.