In 2001, my favorite programming language was Python. In 2008, my favorite programming language was Scheme. In 2014, my favorite programming language is x64 assembly. For some reason, that progression tends to surprise people. Come on a journey with me.
In this article, we’re going to consider a very simple toy problem: recursively summing up a list of numbers1.
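The code listing for sum_list isn’t reproduced here; a straightforward recursive version (a sketch, not necessarily the original) might look like this:

```python
def sum_list(list):
    # An empty list sums to 0; otherwise, add the first element
    # to the sum of the rest. The recursive call is NOT in tail
    # position: we still have an addition to do after it returns.
    if not list:
        return 0
    return list[0] + sum_list(list[1:])
```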
>>> sum_list(range(101))
5050
Young Carl Gauss would be proud.
>>> sum_list(range(1001))
RuntimeError: maximum recursion depth exceeded
Young programmers often learn from this type of experience that recursion sucks. (Or, as a modern young programmer might say, it doesn’t scale.) If they Google around a bit, they might find the following “solution”:
>>> import sys
>>> sys.setrecursionlimit(1500)
>>> sum_list(range(1001))
500500
If they have a good computer science teacher, though, they’ll learn that the real solution is to use something called tail recursion. This is a somewhat mysterious, seemingly arbitrary concept. If the result of your recursive call gets returned immediately, without any intervening expressions, then somehow it “doesn’t count” toward the equally arbitrary recursion depth limit. Our example above isn’t tail-recursive because we add the first element to the result of sum_list(list[1:]) before returning it. In order to make sum_list tail-recursive, we have to add an accumulator variable, which represents the sum of those numbers we’ve looked at already. We’ll call this version sum_sublist, and wrap it in a new sum_list function which calls sum_sublist with the initial accumulator 0 (initially, we haven’t looked at any numbers yet, so the sum of them is 0).
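As a sketch (again, the original listing isn’t shown here), the accumulator version might look like:

```python
def sum_list(list):
    # sum_sublist threads the running total through `accum`, so the
    # recursive call is the very last thing evaluated (a tail call).
    def sum_sublist(accum, list):
        if not list:
            return accum
        return sum_sublist(accum + list[0], list[1:])
    return sum_sublist(0, list)  # we haven't seen any numbers yet
```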
>>> sum_list(range(101))
5050
So far, so good.
>>> sum_list(range(1001))
RuntimeError: maximum recursion depth exceeded
On Wednesday, April 22, 2009, Guido van Rossum wrote:

> A side remark about not supporting tail recursion elimination (TRE) immediately sparked several comments about what a pity it is that Python doesn’t do this, including links to recent blog entries by others trying to “prove” that TRE can be added to Python easily. So let me defend my position (which is that I don’t want TRE in the language). If you want a short answer, it’s simply unpythonic. Here’s the long answer:
> Third, I don’t believe in recursion as the basis of all programming. This is a fundamental belief of certain computer scientists, especially those who love Scheme…
> Still, if someone was determined to add TRE to CPython, they could modify the compiler roughly as follows…
In other words, the only reason this doesn’t work is that Guido van Rossum2 prefers it that way. Guido, I respect your right to your opinion, but the reader and I are switching to Scheme.
Here’s a line-by-line translation:
guile> (sum_list (iota 1001))
500500
Phew! Let’s make sure that we aren’t just getting lucky with a bigger recursion limit:
guile> (sum_list (iota 10000001))
50000005000000
Well, isn’t that neat? If we go much bigger, it’ll take a long time, but as long as the output fits into memory, we’ll get the right answer3.
In our last two versions of sum_list, we defined a helper function (sum_sublist), and the rest of the body of sum_list was just a single invocation of that helper function. This is an inelegant pattern4, which Scheme has a construct to address.
Named let creates a function and invokes it (with the provided initial values) in one step. It is decidedly my favorite control structure of all time. You can have your while loops and your for loops, and your until loops too5. I’ll take named let any day, because it provides the abstraction barrier of recursion without compromising the conciseness and efficiency of iteration. In case you’re not sufficiently impressed, I discuss the delightful properties of using recursion instead of non-recursive loops below.
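For contrast, here is a sketch of the same loop without named let, written with a Python-style while loop and explicit mutation:

```python
def sum_list(list):
    # The same computation expressed as mutable-state iteration:
    # `accum` and `list` are reassigned on every pass of the loop.
    accum = 0
    while list:
        accum += list[0]
        list = list[1:]
    return accum
```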
> sum_list(from(1,100))
5050
> sum_list(from(1,10000000))
50000005000000
In fact, if I weren’t so comfortable with named let, I doubt I’d be an effective assembly coder, because assembly doesn’t really have any other iteration constructs7. But I don’t miss them. What would they look like, anyway?
In the next installment of Python to Scheme to Assembly, we will look at
In this addendum, we’re going to look at the assembly for iteration, non-tail recursion, and tail recursion, as emitted by gcc, and get to the bottom of what the difference is anyway.
At the top of each C file here, we have the following:
If I were solving this problem in the context of a C program, this is how I would do it.
Here’s the generated assembly, translated to nasm syntax and commented.
This is almost identical to the assembly that I wrote, except that it clobbers one of its inputs (which is perfectly allowed by the C calling convention8), it uses xor instead of mov to load 0 (a solid optimization9), it uses rep ret (less compact and no benefit on Intel chips), and it shuffles the instructions around such that two tests are needed (almost certainly not helpful with modern branch prediction and loop detection). I haven’t run benchmarks on this, but my guess is that it would come out about even. (Both versions are eight instructions long.) I also think the shuffling makes this “iterative” version more opaque and difficult to reason about (not least because of the duplicated test) than my “named let”-style code.
gcc -O3 can almost completely convert this version to iteration, so let’s look at the generated assembly from gcc -O1 to get a better sense of what it might look like in a language implementation for which the necessary optimizations are too complex to be made automatically.
We can see immediately that some new instructions (such as call) have been introduced. These are all stack manipulation instructions10. If we carefully pretend to be the CPU running this program, we can see that it pushes the address of every number in the linked list, and then dereferences and adds them up as it pops them from the stack. This is not good; if we wanted our entire data structure to be replicated on the stack, we would have passed it by value11! It’s generally the amount of memory set aside for the stack that we’ve actually run out of in the case of a recursion depth exceeded error.
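To make that concrete, here is a Python sketch of what the un-eliminated recursion is effectively doing with the hardware stack (modeled here as an explicit list):

```python
def sum_list(list):
    # Phase 1: the chain of calls pushes one element per "frame".
    stack = []
    while list:
        stack.append(list[0])
        list = list[1:]
    # Phase 2: the chain of returns pops and accumulates.
    total = 0
    while stack:
        total += stack.pop()
    return total
```

The stack grows in proportion to the length of the list, which is exactly why long enough lists exhaust it.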
What about translating the tail-recursive version into C? Like Scheme and Python, gcc supports nested function definitions (as a GNU extension to C), so this is no problem:
gcc -O1 gives us (translated and commented as before):
In this mode, the tail call is not being eliminated – although we’re no longer pushing rbx, we’re still pushing rip to the stack with every call, and eventually we’ll run out of stack that way. The only way to get around this is to replace each call with a jmp: since we’re just going to take the return value of the next recursive invocation and then immediately ret back to the previous caller on the stack, there’s no point in even inserting our own address on the stack (as call does); we can just set up the next guy to pass the return value straight back to the previous guy, and quietly disappear.

gcc -O3 does this. In fact, somewhat surprisingly, it generates exactly the same assembly, line for line, for this version as for the purely iterative version above. That’s “tail call optimization” (TCO) or “tail recursion elimination” (TRE) in its most aggressive form: it literally just gets rid of all calls and recursions and replaces them with an equivalent iteration (complete with duplicated test).
The upshot of all this is that not only does Scheme’s “named let” recursion form translate neatly into assembly, it provides – penalty-free – a better abstraction than either iteration (while-loop imitation) or stack-driven recursion, the two options gcc appears to pick from when dealing with various ways to code a list traversal.
Actually, the real point I’m trying to make here is that, unlike in C, I can naturally do named let directly in assembly, and that’s one of the many reasons working in assembly makes me happy.
Appendix: What’s so great about recursion, anyway?
For me, the most important point in favor of a recursive representation of loops is that I find it easier to reason about correctness that way.
Any function we define ought to implement some ideal mathematical function that maps inputs to outputs12. If our code truly does implement that ideal function, we say that the code is correct. Generally, we can break down the body of a function as a composition of smaller functions; even in imperative languages, we can think of every statement as pulling in a state of the world, making well-defined changes, and passing the new state of the world into the next statement13. At each step, we ask ourselves, “are the outputs of this function going to be what I want them to be?” For loops, though, this gets tricky.
What recursion does for us as aspiring writers of correct functions is automatic translation of the loop verification problem into the much nicer problem of function verification. Intuitively, you can simply assume that all invocations of a recursive function within its own body are going to Do The Right Thing, ensure that the function as a whole Does The Right Thing under that assumption, and then conclude that the function Does The Right Thing in general. If this sounds like circular reasoning, it does14; but it turns out to be valid anyway.
There are many ways to justify this procedure formally, all of which are truly mind-bending15. But once you’ve justified this procedure once, you never have to do it again (unlike ad-hoc reasoning about loops). I’ve determined that the most elegant way to explain it is by expanding our named let example into a non-recursive function, which just happens to accept as a parameter a correct16 version of itself.
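The expansion in the original is in Scheme; rendered in Python (as a sketch), it looks something like this:

```python
def sum_sublist_nonrec(f, accum, list):
    # Genuinely non-recursive: instead of calling itself, it calls
    # whatever (assumed-correct) function `f` it was handed.
    if not list:
        return accum
    return f(f, accum + list[0], list[1:])

def sum_list(list):
    # Pass sum_sublist_nonrec a "correct version of itself".
    return sum_sublist_nonrec(sum_sublist_nonrec, 0, list)
```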
sum_sublist_nonrec is an honest-to-goodness non-recursive function, and we can check that it is correct. Given a correct function f_correct (which takes as inputs a correct version of itself, a number, and a list, and correctly returns the sum of all the elements in the list plus the number), a number, and a list, does sum_sublist_nonrec correctly return the sum of all elements in the list plus the number? Why yes, it does. (Constructing a formal proof tree for this claim is left as an exercise for the self-punishing reader.) Note that since f_correct is assumed to already be correct, the correct version of it is still just f_correct, so we can safely pass it to itself without violating our assumptions or introducing new ones. So, sum_sublist_nonrec is correct.
Now let’s consider the correctness of sum_list. It’s supposed to add up all the numbers in list. What it actually does is to apply the (correct) function sum_sublist_nonrec, passing in a correct version of itself (check! it’s already correct), a number to add the sum of the list to (check! adding zero to the sum of the list won’t change it), and the list (check! that’s what we’re supposed to sum up).
We’ve just proved our program correct! The magic of named let is that it generates this clumsy form with a bunch of f_corrects from a compact and elegant form. In so doing, it lets us get away with much less formal reasoning while still having the confidence that it can be converted into something like what we just slogged through. Rest assured that no matter what you do with named let, no matter how complicated the construct you create, this “assume it does the right thing” technique still applies!
With one tiny caveat. We haven’t proved that the program terminates. If this technique proved termination, then we could just write a do-the-right-thing that does nothing but immediately call itself, and it would be totally correct, no matter what thing we want it to do.
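In Python, such a function (the do-the-right-thing discussed below) would be something like this sketch, which, fittingly, doesn’t even loop forever in CPython: it hits the recursion limit, which is where we came in.

```python
def do_the_right_thing():
    # Immediately "does the right thing" by calling itself;
    # it never returns, so it never returns a wrong answer.
    return do_the_right_thing()
```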
Technically, everywhere I’ve said “correct”, what I mean is partially correct: if it terminates, then the output is correct. (Equivalently, it definitely won’t return something incorrect.)
do-the-right-thing is, in fact, partially correct: it never returns at all, so it won’t give you any incorrect outputs!
Termination proofs of recursive functions can usually be handled by structural induction on possible inputs: you establish that it terminates for minimal elements (e.g. the empty list) and that termination for any non-minimal element is dependent only on termination for some set of smaller elements (e.g. the tail of the list). The structure that you need in order to think about termination this way is also much clearer with recursion than with iteration constructs.
Although at least it’s not as inelegant as defining the helper function outside the body of the actual function, thereby polluting the global namespace. Take advantage of nested functions!↩
If you’re curious how this works, click here. But I haven’t settled on an ASM REPL solution I’m happy with – this is just a one-off hack. A more legitimate ASM REPL may be the subject of a future blog post.↩
rep prefixes, which can iterate certain single instructions. I think it’s fair to say those don’t really count.↩
I find calling conventions distasteful in general. The calling convention is like a shadow API (in fact, it’s often referred to as the ABI, for application binary interface) that nobody has any control over (except the people at AMD, Intel, and Microsoft who are in a position to decide on such things) and that applies to every function, every component on every computer everywhere. What if we let people define their ABI as part of their API? Would the world come crashing down? I doubt it. You can already cause quite a bit of trouble by misusing APIs; really, both API and ABI usage ought to be formally verified, and as such ought to have much more room for flexibility than they do now. </soapbox>↩
I would have applied this xor optimization too if I weren’t trying to literally translate Scheme code as an illustration.↩
“The stack” is not merely a region of memory managed by the OS (like “the heap”, its common counterpart). The stack is a hardware-accelerated mechanism deeply embedded in the CPU. There is a hardware register rsp (a.k.a. the stack pointer). A push instruction decrements rsp (usually by 8 at a time, in 64-bit mode, since pointers are expressed as numbers of 8-bit bytes, and 64/8=8) and then stores a value to [rsp]. A pop instruction retrieves a value from [rsp] and then increments rsp. A call instruction pushes the current value of rip (a.k.a. the instruction pointer, or the program counter), and then executes an unconditional jump (jmp). Finally, a ret instruction pops from the stack into rip, returning to wherever the matching call was.↩
You may point out here that C doesn’t actually let you pass entire linked lists by value. Maybe that’s because it’s a bad idea.↩
If your function cannot be fully specified by an abstract mapping from inputs to outputs, then it is nondeterministic, which is a fancy word for “unpredictable”: there must exist some circumstances under which you cannot predict the behavior of the function, even knowing every input. Intuitively, I’m sure you can see how unpredictable software is a nightmare to debug. Controlling nondeterminism is an active field of computer science research, which is not the subject of this article. However, I hope you are at least convinced that nondeterminism is something you should avoid if possible, and that therefore you should try to design every function in your program as a proper mathematical function.
Note that I’m not talking about “purity” here – it’s fine for “outputs” to include side effects as of function exit, and for “inputs” to include states of the external world as of function entry. What’s important is that the state at function exit of anything the function modifies be uniquely determined by the state at function entry of anything that can affect its execution.↩
Unless we’re dealing with hairy scope issues like hoisting, in which case you should get rid of those first.↩
Pun intended. The sentence within which this footnote is referenced isn’t circular reasoning; it’s a tautology. Therefore, it’s an example of something that sounds like circular reasoning but is valid anyway. Of course, you shouldn’t take the existence of this cute example as evidence that the circular-sounding reasoning preceding it is not, in fact, circular. (That would be a fallacy of inappropriate generalization, which neither is nor sounds like circular reasoning.)↩
Trying to explain it for the purposes of this blog post – while making sure that I’m not missing something – took me over four hours.↩
Technically, I mean “partially correct”. This will be addressed in due time. Be patient, pedantic reader. This argument is hard enough to understand already.↩