One of the fun parts of pairing with Amy on Nagini was modifying the grammar of Python. You should try this – it’s easier than you think! Eli Bendersky has a great post with step-by-step instructions.1
Modifying Python’s grammar starts in the Grammar/Grammar file. I’ve recently learned how to read this (which, for me, mostly means learning how to pronounce the punctuation), so I want to walk through the import example in some detail. The syntax here is Extended Backus-Naur Form, or EBNF. You read it like a tree, and your primary verb is “consists of”:
import_stmt consists of one of two forms, import_name or import_from.
import_name consists of the literal word import followed by dotted_as_names.
dotted_as_names consists of a dotted_as_name (note the singular), optionally followed by one or more pairs of a comma and another dotted_as_name.
dotted_as_name consists of a dotted_name, optionally followed by the literal word ‘as’ and a NAME.
Finally, dotted_name consists of a NAME, maybe followed by pairs of a dot and another NAME.
You can walk the other branches in a similar way.
import_stmt: import_name | import_from
import_name: 'import' dotted_as_names
# note below: the ('.' | '...') is necessary because '...' is tokenized as ELLIPSIS
import_from: ('from' (('.' | '...')* dotted_name | ('.' | '...')+)
'import' ('*' | '(' import_as_names ')' | import_as_names))
import_as_name: NAME ['as' NAME]
dotted_as_name: dotted_name ['as' NAME]
import_as_names: import_as_name (',' import_as_name)* [',']
dotted_as_names: dotted_as_name (',' dotted_as_name)*
dotted_name: NAME ('.' NAME)*
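One way to practice the "consists of" reading is to transcribe a few of these rules by hand into a tiny recursive-descent recognizer. The sketch below is only an illustration: the toy tokenizer, the `Parser` class, and `parse_import_name` are names made up for this example, and CPython's real parser is generated from the Grammar file rather than written like this.

```python
import re

KEYWORDS = ("import", "as")

def tokenize(source):
    """Toy lexer: names, dots, and commas only. Not CPython's tokenizer."""
    tokens = re.findall(r"[A-Za-z_]\w*|[.,]", source)
    # Reject anything the toy lexer silently skipped (digits first, '!', ...).
    if "".join(tokens) != "".join(source.split()):
        raise SyntaxError("unexpected character in %r" % (source,))
    return tokens

class Parser(object):
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, literal):
        if self.peek() != literal:
            raise SyntaxError("expected %r, got %r" % (literal, self.peek()))
        self.pos += 1

    def name(self):
        tok = self.peek()
        if tok is None or tok in KEYWORDS or tok in (".", ","):
            raise SyntaxError("expected a NAME, got %r" % (tok,))
        self.pos += 1
        return tok

    # dotted_name: NAME ('.' NAME)*
    def dotted_name(self):
        parts = [self.name()]
        while self.peek() == ".":
            self.expect(".")
            parts.append(self.name())
        return ".".join(parts)

    # dotted_as_name: dotted_name ['as' NAME]
    def dotted_as_name(self):
        module, alias = self.dotted_name(), None
        if self.peek() == "as":
            self.expect("as")
            alias = self.name()
        return (module, alias)

    # dotted_as_names: dotted_as_name (',' dotted_as_name)*
    def dotted_as_names(self):
        names = [self.dotted_as_name()]
        while self.peek() == ",":
            self.expect(",")
            names.append(self.dotted_as_name())
        return names

    # import_name: 'import' dotted_as_names
    def import_name(self):
        self.expect("import")
        names = self.dotted_as_names()
        if self.peek() is not None:
            raise SyntaxError("unexpected token %r" % (self.peek(),))
        return names

def parse_import_name(source):
    return Parser(tokenize(source)).import_name()

print(parse_import_name("import os.path as p, sys"))
# prints [('os.path', 'p'), ('sys', None)]
```

Each method corresponds to one grammar rule, and each rule's optional or repeated part becomes an `if` or a `while`.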
To accio-ify Python, we had to replace the occurrences of 'import' with 'accio'. There are only two – we were only interested in the literal string import, not all the other names. import_as_name and so on are just nodes in the tree, and only matter to the parser and compiler.
Every other keyword and symbol that has special meaning to the Python parser also appears in Grammar as a string.
Perusing the grammar is a (goofy) way to learn about corner cases of Python syntax, too! For example, did you know that with can take more than one context manager? It’s right there in the grammar:
with_stmt: 'with' with_item (',' with_item)* ':' suite
with_item: test ['as' expr]
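Here is a small sketch of what that rule allows in practice. The `tag` context manager and the `events` list are hypothetical helpers invented for this example; they only record the order in which things happen.

```python
from contextlib import contextmanager

events = []

@contextmanager
def tag(name):
    """Record when the block is entered and exited."""
    events.append("enter " + name)
    try:
        yield name
    finally:
        events.append("exit " + name)

# Two with_items separated by a comma, as the grammar permits.
with tag("a") as first, tag("b") as second:
    events.append("body %s %s" % (first, second))

print(events)
# prints ['enter a', 'enter b', 'body a b', 'exit b', 'exit a']
```

The managers are entered left to right and exited in reverse, just as if the two with statements had been nested.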
I’m talking about import at PyCon in April. In the talk, we’ll imagine that there is no import and will reinvent it from scratch. I hope this will give everyone (including me!) a deeper understanding of the choices import makes and the ways it could have been different. Ideally, the structure will be a couple of sections of the form “We could have made [decisions]. That would mean [effects]. Surprise – that’s how it works in [language]!”1
This is the first of (probably) several posts with notes of things I’m learning as I prepare my talk. Feedback is welcome.
Today I’m looking into Ruby’s require and require_relative2 to see if aspects of them would be interesting to Python programmers. So far, here’s what I think is most relevant:
Unlike Python, require won’t load all objects in the required file. There’s a concept of local versus global variables in the file scope that doesn’t exist in Python.
Unlike Python, one file does not map to one module. Modules are created by using the keyword module.
Unlike Python, namespace collisions are completely possible. Consider the following simple files:
And the output from running main.rb:
Like Python’s import, require will only load a file once. This can interact interestingly with namespace collisions – to take a contrived example:
Because one.rb isn’t reloaded, foo is still 'world':
Questions for further investigation / thought
My talk should not convince people that Python is Right and other languages are Wrong. I’m trying to overcome my bias towards the system I’m most used to. (I think I’ve written roughly equal amounts of Python and Ruby, but the vast majority of the Ruby I’ve written is Rails, where all the requireing and namespacing happens by magic.) Here are some questions I’d like to research more.
Python’s namespacing feels much better to me, although I’m sure that’s partly because I’m used to it. What’s the advantage to doing namespacing this way?
Why have both require and require_relative? Why not have require check the relative path as well before raising a LoadError?
What’s the advantage of uncoupling a module from a file?
As far as I can tell, the only difference between require and require_relative is the load path searched.↩
This is Part 4 in a series on the Python interpreter. Read Part 1, Part 2, and Part 3. If you’re enjoying this series, consider applying to Hacker School, where I work as a facilitator.
Most of the time, when people talk about a “compiled” language, they mean one that compiles down to native x86/ARM/etc instructions2 – instructions for an actual machine made of metal. An “interpreted” language either doesn’t have any compilation at all3, or compiles to an intermediate representation, like bytecode. Bytecode is instructions for a virtual machine, not a piece of hardware. Python falls into this latter category: the Python compiler’s job is to generate bytecode for the Python interpreter.4
The Python interpreter’s job is to make sense of the bytecode via the virtual machine, which turns out to be a lot of work. We’ll dig into the virtual machine in Part 5.
So far our discussion of compiling versus interpreting has been abstract. These ideas become clearer with an example.
Here’s a function, its bytecode, and its bytecode run through the disassembler. By the time we get the prompt back after the function definition, the modulus function has been compiled and a code object generated. That code object will never be modified.
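A minimal sketch of such a session, assuming the function simply returns `x % y`. The exact bytes and instruction names depend on the interpreter version: this series targets CPython 2.7, and newer releases may print a differently named instruction for `%`.

```python
import dis

def modulus(x, y):
    return x % y

# The code object was built when the def statement ran.
code = modulus.__code__          # spelled func_code in older Python 2 code
print(repr(code.co_code))        # the raw bytecode
dis.dis(modulus)                 # the same bytecode, made human-readable
print(modulus(15, 4))            # prints 3
```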
This seems pretty easy to reason about. Unsurprisingly, typing a modulus (%) causes the compiler to emit the instruction BINARY_MODULO. It looks like this function will be useful if we need to calculate a remainder.
So far, so good. But what if we don’t pass it numbers?
>>> modulus("hello %s", "world")
'hello world'
Uh-oh, what happened there? You’ve probably seen this before, but it usually looks like this:
Somehow, when BINARY_MODULO is faced with two strings, it does string interpolation instead of taking a remainder. This situation is a great example of dynamic typing. When the compiler built our code object for modulus, it had no idea whether x and y would be strings, numbers, or something else entirely. It just emitted some instructions: load one name, load another, BINARY_MODULO the two objects, and return the result. It’s the interpreter’s job to figure out what BINARY_MODULO actually means.
I’d like to reflect on the depth of our ignorance for a moment. Our function modulus can calculate remainders, or it can do string formatting … what else? If we define a custom object that responds to __mod__, then we can do anything.
The same function modulus, with the same bytecode, has wildly different effects when passed different kinds of objects. It’s also possible for modulus to raise an error – for example, a TypeError if we called it on objects that didn’t implement __mod__. Heck, we could even write a custom object that raises a SystemExit when __mod__ is invoked. Our __mod__ function could have written to a file, or changed a global variable, or deleted another attribute of the object. We have near-total freedom.
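To make this concrete, here is a sketch with a hypothetical class, `Surprise`, whose `__mod__` has a side effect and returns something unrelated to remainders. It also shows the TypeError case with plain `object()` instances. `modulus` is assumed to be the same one-line function as before.

```python
def modulus(x, y):
    return x % y

class Surprise(object):
    """Hypothetical class: % appends to a log and returns a string."""
    def __init__(self):
        self.log = []

    def __mod__(self, other):
        self.log.append(other)       # side effect the compiler cannot foresee
        return "anything we like"

s = Surprise()
print(modulus(s, 42))                # prints anything we like
print(s.log)                         # prints [42]

try:
    modulus(object(), object())      # plain objects do not implement __mod__
except TypeError as exc:
    print("TypeError: %s" % exc)
```

The bytecode of `modulus` is the same in all three calls; only the objects differ.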
This ignorance is one of the reasons that it’s hard to optimize Python: you don’t know when you’re compiling the code object and generating the bytecode what it’s going to end up doing. The compiler has no idea what’s going to happen. As Russell Power and Alex Rubinsteyn wrote in “How fast can we make interpreted Python?”, “In the general absence of type information, almost every instruction must be treated as INVOKE_ARBITRARY_METHOD.”
While a general definition of “compiling” and “interpreting” can be difficult to nail down, in the context of Python it’s fairly straightforward. Compiling is generating the code objects, including the bytecode. Interpreting is making sense of the bytecode in order to actually make things happen. One of the ways in which Python is “dynamic” is that the same bytecode doesn’t always have the same effect. More generally, in Python the compiler does relatively little work, and the interpreter relatively more.
In Part 5, we’ll look at the actual virtual machine and interpreter.
You sometimes hear “interpreted language” instead of “dynamic language,” which is usually, mostly, synonymous.↩
Thanks to David Nolen for this definition. The lines between “parsing,” “compiling,” and “interpreting” are not always clear. ↩
Some languages that are usually not compiled at all include R, Scheme, and binary, depending on the implementation and your definition of “compile.”↩
As always in this series, I’m talking about CPython and Python 2.7, although most of this content is true across implementations.↩
You recall from Part 2 that “python bytecode” and “a python code object” are not the same thing: the bytecode is an attribute of the code object, among many other attributes. Bytecode is found in the co_code attribute of the code object, and contains instructions for the interpreter.
So what is bytecode? Well, it’s just a series of bytes. They look wacky when we print them because some bytes are printable and others aren’t, so let’s take the ord of each byte to see that they’re just numbers.
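A sketch of that step, assuming `foo` is a small function that stores 2 and 3 in two locals and returns their sum. `bytearray` is used so the same lines work whether `co_code` is a `str` (Python 2.7, where `[ord(b) for b in ...]` does the same job) or `bytes` (Python 3). The numbers you see depend on your interpreter version.

```python
def foo():
    a = 2
    b = 3
    return a + b

raw = foo.__code__.co_code           # foo.func_code.co_code in older Python 2 code
numbers = list(bytearray(raw))       # one integer in 0..255 per byte
print(numbers)
```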
Here are the bytes that make up python bytecode. The interpreter will loop through each byte, look up what it should do for each one, and then do that thing. Notice that the bytecode itself doesn’t include any python objects, or references to objects, or anything like that.
One way to understand python bytecode would be to find the CPython interpreter file (it’s ceval.c), and flip through it looking up what 100 means, then 1, then 0, and so on. We’ll do this later in the series! For now, there’s a simpler way: the dis module.
Disassembling bytecode means taking this series of bytes and printing out something we humans can understand. It’s not a step in python execution; the dis module just helps us understand an intermediate state of python internals. I can’t think of a reason why you’d ever want to use dis in production code – it’s for humans, not for machines.
Today, however, taking some bytecode and making it human-readable is exactly what we’re trying to do, so dis is a great tool. We’ll use the function dis.dis to analyze the code object of our function foo.
(You usually see this called as dis.dis(foo), directly on the function object. That’s just a convenience: dis is really analyzing the code object. If it’s passed a function, it just gets its code object.)
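A quick way to check that claim is to capture both printouts and compare them. This sketch assumes Python 3.4 or later, where `dis.dis` accepts a `file` argument; the `disassembly` helper is made up for the example, and `foo` is the assumed two-locals function.

```python
import dis
import io

def foo():
    a = 2
    b = 3
    return a + b

def disassembly(obj):
    """Return what dis.dis would print for obj, as a string."""
    buf = io.StringIO()
    dis.dis(obj, file=buf)
    return buf.getvalue()

from_function = disassembly(foo)
from_code = disassembly(foo.__code__)
print(from_function == from_code)    # prints True
print(from_code)
```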
The numbers in the left-hand column are line numbers in the original source code. The second column is the offset into the bytecode: LOAD_CONST appears at position 0, STORE_FAST at position 3, and so on. The middle column shows the names of bytes. These names are just for our (human) benefit – the interpreter doesn’t need the names.
The last two columns give details about the instruction’s argument, if there is an argument. The fourth column shows the argument itself, which represents an index into other attributes of the code object. In the example, LOAD_CONST’s argument is an index into the list co_consts, and STORE_FAST’s argument is an index into co_varnames. Finally, in the fifth column, dis has looked up the constants or names in the place the fourth column specified and told us what it found there. We can easily verify this:
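A sketch of that check, again assuming `foo` stores 2 and 3 in locals `a` and `b`. The contents of `co_consts` vary between interpreter versions, so only the local names are shown with an expected result.

```python
def foo():
    a = 2
    b = 3
    return a + b

code = foo.__code__
print(code.co_consts)                # constants the function refers to
print(code.co_varnames)              # prints ('a', 'b')

# An argument of 0 to STORE_FAST means "the local at index 0", and so on.
for index, name in enumerate(code.co_varnames):
    print(index, name)
```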
This also explains why the second instruction, STORE_FAST, is found at bytecode position 3. If a bytecode has an argument, the next two bytes are that argument. It’s the interpreter’s job to handle this correctly.
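The offsets can be reproduced with a few lines of Python. The byte values below are hard-coded, and they are an assumption about what CPython 2.7 emits for a function that does `a = 2; b = 3; return a + b`. Later versions use a different layout. The `walk` helper and the small `NAMES` table are made up for this sketch.

```python
HAVE_ARGUMENT = 90   # in Python 2.7, opcodes >= 90 carry a two-byte argument

# Assumed 2.7 bytecode for: a = 2; b = 3; return a + b
BYTECODE = [100, 1, 0, 125, 0, 0, 100, 2, 0, 125, 1, 0,
            124, 0, 0, 124, 1, 0, 23, 83]

NAMES = {100: "LOAD_CONST", 125: "STORE_FAST", 124: "LOAD_FAST",
         23: "BINARY_ADD", 83: "RETURN_VALUE"}

def walk(bytecode):
    """Yield (offset, name, argument or None) for 2.7-style bytecode."""
    i = 0
    while i < len(bytecode):
        op = bytecode[i]
        if op >= HAVE_ARGUMENT:
            # The argument is the next two bytes, low byte first.
            arg = bytecode[i + 1] + (bytecode[i + 2] << 8)
            yield i, NAMES[op], arg
            i += 3
        else:
            yield i, NAMES[op], None
            i += 1

for offset, name, arg in walk(BYTECODE):
    print(offset, name, arg)
```

The first instruction occupies bytes 0 through 2, which is why STORE_FAST starts at offset 3.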
(You may be surprised that BINARY_ADD doesn’t have arguments. We’ll come back to this in a future installment, when we get to the interpreter itself.)
People often say that dis is a disassembler of python bytecode. This is true enough – the dis module’s docs say it – but dis knows about more than just the bytecode, too: it uses the whole code object to give us an understandable printout. The middle three columns show information actually encoded in the bytecode, while the first and the last columns show other information. Again, the bytecode itself is really limited: it’s just a series of numbers, and things like names and constants are not a part of it.
How does the dis module get from bytes like 100 to names like LOAD_CONST and back? Try to think of a way you’d do it. If you thought “Well, you could have a list that has the byte names in the right order,” or you thought, “I guess you could have a dictionary where the names are the keys and the byte values are the values,” then congratulations! That’s exactly what’s going on. The file opcode.py defines the list and the dictionary. It’s full of lines like these (def_op inserts the mapping in both the list and the dictionary):
def_op('LOAD_CONST', 100)       # Index in const list
def_op('BUILD_TUPLE', 102)      # Number of tuple items
def_op('BUILD_LIST', 103)       # Number of list items
def_op('BUILD_SET', 104)        # Number of set items
There’s even a friendly comment telling us what each byte’s argument means.
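A simplified sketch of how such a `def_op` could maintain both structures. This is a stand-in written for illustration, not the actual source of opcode.py, and it uses the Python 2.7 opcode numbers quoted above.

```python
# List indexed by byte value; unknown bytes get a placeholder name.
opname = ["<%d>" % op for op in range(256)]
# Dictionary from name back to byte value.
opmap = {}

def def_op(name, op):
    """Register one opcode in both directions."""
    opname[op] = name
    opmap[name] = op

def_op('LOAD_CONST', 100)    # Index in const list
def_op('BUILD_TUPLE', 102)   # Number of tuple items
def_op('BUILD_LIST', 103)    # Number of list items
def_op('BUILD_SET', 104)     # Number of set items

print(opname[100])           # prints LOAD_CONST
print(opmap['BUILD_LIST'])   # prints 103
print(opname[101])           # prints <101>
```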
Ok, now we understand what python bytecode is (and isn’t), and how to use dis to make sense of it. In Part 4, we’ll look at another example to see how Python can compile down to bytecode but still be a dynamic language.
As you can see in the code above, the code object is an attribute of the function object. (There are lots of other attributes on the function object, too. They’re mostly not interesting because foo is so simple.)
A code object is generated by the Python compiler and interpreted by the interpreter. It contains information that this interpreter needs to do its job. Let’s look at the attributes of the code object.
Here are some intelligible-looking things: the names of the variables and the constants that our function knows about and the number of arguments the function takes. But so far, we haven’t seen anything that looks like instructions for how to execute the code object. These instructions are called bytecode. Bytecode is an attribute of the code object:
Over the last three months, I’ve spent a lot of time working with Ned Batchelder on byterun, a python bytecode interpreter written in python. Working on byterun has been tremendously educational and a lot of fun for me. At the end of this series, I’m going to attempt to convince you that it would be interesting and fun for you to play with byterun, too. But before we do that, we need a bit of a warm-up: an overview of how python’s internals work, so that we can understand what an interpreter is, what it does, and what it doesn’t do.
This series assumes that you’re in a similar position to where I was three months ago: you know python, but you don’t know anything about the internals.
One quick note: I’m going to work in and talk about Python 2.7 in this post. The interpreter in Python 3 is mostly pretty similar. There are also some syntax and naming differences, which I’m going to ignore, but everything we do here is possible in Python 3 as well.
How does it python?
We’ll start out with a really (really) high-level view of python’s internals. What happens when you execute a line of code in your python REPL?
There are four steps that python takes when you hit return: lexing, parsing, compiling, and interpreting. Lexing is breaking the line of code you just typed into tokens. The parser takes those tokens and generates a structure that shows their relationship to each other (in this case, an Abstract Syntax Tree). The compiler then takes the AST and turns it into one (or more) code objects. Finally, the interpreter takes each code object and executes the code it represents.
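The standard library exposes each of these stages, so the pipeline can be walked by hand. This is a sketch written for Python 3 (the series itself uses 2.7), and the REPL does not literally call these modules. They are just convenient windows onto the same steps.

```python
import ast
import io
import tokenize

source = "x = 1 + 2\n"

# 1. Lexing: source text -> tokens
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# 2. Parsing: source text -> abstract syntax tree
tree = ast.parse(source)

# 3. Compiling: AST -> code object
code = compile(tree, "<sketch>", "exec")

# 4. Interpreting: run the code object in a namespace
namespace = {}
exec(code, namespace)

print([tok.string for tok in tokens])
print(ast.dump(tree))
print(code)
print(namespace["x"])        # prints 3
```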
I’m not going to talk about lexing, parsing, or compiling at all today, mainly because I don’t know anything about these steps yet. Instead, we’ll suppose that all that went just fine, and we’ll have a proper python code object for the interpreter to interpret.
Before we get to code objects, let me clear up some common confusion. In this series, we’re going to talk about function objects, code objects, and bytecode. They’re all different things. Let’s start with function objects. We don’t really have to understand function objects to get to the interpreter, but I want to stress that function objects and code objects are not the same – and besides, function objects are cool.
You might have heard of “function objects.” These are the things people are talking about when they say things like “Functions are first-class objects,” or “Python has first-class functions.” Let’s take a look at one.
“Functions are first-class objects” means that functions are objects, like a list is an object or an instance of MyObject is an object. Since foo is an object, we can talk about it without invoking it (that is, there’s a difference between foo and foo()). We can pass foo into another function as an argument, or we could bind it to a new name (other_function = foo). With first-class functions, all sorts of possibilities are open to us!
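A short sketch of those possibilities. The body of `foo` and the helper `apply_twice` are illustrative choices made for this example.

```python
def foo(a):
    x = 3
    return x + a

def apply_twice(func, value):
    """Take a function as an argument and call it two times."""
    return func(func(value))

other_function = foo                 # bind the same object to a new name

print(foo)                           # the function object itself, not a call
print(foo(5))                        # prints 8
print(other_function is foo)         # prints True
print(apply_twice(foo, 1))           # prints 7
```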
In Part 2, we’ll dive down a level and look at the code object.
I really enjoyed seeing all the clever solutions to the python puzzle I posted. You’re all very creative! Here’s a discussion of the solutions I’ve seen, plus some clarifications. All spoilers are below the fold.
First, clarifications. (These weren’t always clear in the problem statement, particularly if you got the problem off of twitter, so award yourself full marks as desired.)
Order doesn’t matter
“Order doesn’t matter” means that the three-line version always returns False, and the semicolon version always returns True.
I’m being pedantic here, but I rule this cheating, since (a) each line has to be a valid python expression or statement, and a multi-line string literal is only one expression, and (b) the string """a; b; c""" is not the same as the string """a\nb\nc""".
I’ve managed to encounter three different bugs with the same obscure source in the last week. I think Hacker School might be cursed. Here’s a blog post attempting to rid us of the curse.
Bug 1: Flask app on Heroku can’t find images and stylesheets
The first bug was a Flask app deployed to Heroku. It worked fine locally, but when deployed, none of the images or stylesheets rendered.
Bug 2: A project fails to build
A Hacker Schooler cloned into a project and tried to build it with python setup.py install. The build failed with the error Supposed package directory '[project]' exists but is not a directory.
Bug 3: Heroku-deployed app crashes
I deployed a new feature to the Hacker School site (which is a Rails app), and crashed the application. Again, everything worked fine locally on my machine and my colleague’s machines.
The solution and explanations are below the fold. If you’d like to try to guess, you can ask me debugging questions on twitter (@akaptur), and I’ll respond within 24 hours until Friday, October 18th, 2013. If you don’t like guessing, or your own bugs are plenty for you, you can click through now.
Last week at Hacker School I did a quick presentation on python bytecode and the dis module. The disassembler is a very powerful tool with a gentle learning curve – that is, you can get a fair amount out of it without really knowing much about what’s going on. This post is a quick introduction to how and why you should use it.
Bytecode is the internal representation of a python program in the compiler. Here, we’ll be looking at bytecode from cpython, the default compiler. If you don’t know what compiler you’re using, it’s probably cpython.
How do I get bytecode?
You already have it! Bytecode is what’s contained in those .pyc files you see when you import a module. It’s also created on the fly by running any python code.
Ok, so you have some bytecode, and you want to understand it. Let’s look at it without using the dis module first.
Now this starts to make some sense. dis takes each byte, finds the opcode that corresponds to it in opcode.py, and prints it as a nice, readable constant. If we look at opcode.py we see that LOAD_CONST is 100, STORE_FAST is 125, etc. dis also shows the line numbers on the left and the values or names on the right. So without ever having seen something like this before, we have an idea what’s going on: we first load a constant, 2, then somehow store it as a. Then we repeat this with 3 and b. We load a and b back up, do BINARY_ADD, which presumably adds the numbers, and then do RETURN_VALUE.
Examining the bytecode can sometimes increase your understanding of python code. Here is one example.
elif is identical in bytecode to else ... if. Take a look:
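One way to run the comparison yourself, sketched for Python 3.4 or later (where `dis.get_instructions` exists). The two functions and the `opnames` helper are made up for this example. The sketch compares instruction names only, and the names themselves vary by interpreter version.

```python
import dis

def with_elif(x):
    if x == 1:
        return "one"
    elif x == 2:
        return "two"
    else:
        return "other"

def with_nested_if(x):
    if x == 1:
        return "one"
    else:
        if x == 2:
            return "two"
        else:
            return "other"

def opnames(func):
    """List the instruction names in func's bytecode, in order."""
    return [instr.opname for instr in dis.get_instructions(func)]

print(opnames(with_elif))
print(opnames(with_nested_if))
print(opnames(with_elif) == opnames(with_nested_if))
```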