Last night, a fellow Hacker Schooler challenged me to a running-time contest on the classic eight queens puzzle. Naturally, I pulled up my trusty Intel® 64 manual and got to work. It turned out to be even faster than I expected, churning out pretty-printed output in 15ms, which is totally dominated by the time it takes the terminal to display it (it takes only 2ms if redirected to a file).
Update: Very slightly more scientific testing, spurred by curious Hacker News commenters, indicates that, without pretty-printing and other overhead, the solving time is actually closer to 11.2µs – about a factor of 7 speedup over commenter bluecalm’s C implementation.
My solution method is heavily inspired by this paper (which, appropriately enough, concerns a beautifully insane programming language called MCPL, combining features from ML, C, and Prolog). This paper contributes two key insights about solving the 8-queens problem:
Conceptually, we can model the solution space as the leaves of a tree, where each internal node of the tree corresponds to a partial board (with the number of queens equal to the tree depth), and each parent-child link represents adding another queen at the row number corresponding to the depth of the child. Since there can only be one queen per row in a correct solution, this tree is a superset of the actual solution set.
Instead of actually constructing the tree, we can simply keep track of the current traversal state. In particular, this means we keep track of the currently occupied columns, the occupied leftward going diagonals, and the occupied rightward going diagonals, as they intersect the current row. (Each of these three state variables is eight bits of information.) In addition, we can keep track of the past traversal history of each level using
If any of this is unclear, check out the paper, which has a beautiful diagram that there is no need for me to attempt replicating.
I’m going to go through the first1 version of the code, which doesn’t produce the pretty boards but has most of the clever tricks. (Ironically, adding “pretty printing” made my code uglier. Maybe it’s just that I was up too late working on it.)
The heart of this algorithm is the sequence that updates the state variables as we move from one layer into the next. This whole program is small enough that it’s still practical to just set aside registers to represent most variables; in particular,
rdx represents where it’s okay to place a queen at the current layer (e.g. it starts out as
xmm1 (one of those fancy 128-bit registers that supports fancy new operations) stores the “occupied left diagonals”, “occupied right diagonals”, and “occupied columns” states, in that order (with “occupied columns” being the least significant word2).
xmm4 are just being used as scratch space. Finally,
xmm7 is a constant
To spare you the effort of searching through the Intel® 64 manual yourself, here are brief descriptions of all the fancy instructions I’m about to use.
vpsllw: Vector/Packed Shift Left (Logical) Words. Separately shifts left every word of the second argument by the number of bits represented as the third argument, and store the result to the first argument.
vpsrlw: Vector/Packed Shift Right (Logical) Words. Separately shifts right every word of the second argument by the number of bits represented as the third argument, and store the result to the first argument.
pblendw: Packed Blend Words. Using the third argument as a mask, selectively copy words from the second argument to the first argument.
vpsrldq: Vector/Packed Shift Right (Logical) Double Quadword. Shifts the entire second argument by the number of bytes specified in the third argument, and stores the result to the first argument.
por: Parallel OR. Bitwise ORs the first and second argument and assigns the result to the first argument.
vpandn: Vector/Parallel AND NOT. Inverts the second argument, ANDs the result with the third argument, and assigns the result of that to the first argument.
movq: Move Quadword. The standard way to move data between
xmmregisters and normal registers.
If you’re accustomed to C, you might think of this as functionally equivalent to something like
xmm1 <<= 1; xmm1 >>=13. We want the word in position 1 to shift right and the word in position 2 to shift left, while the word in position 0 (occupied columns) stays put.
Now, we want to combine the information about which squares in the next layer are under attack. It doesn’t matter from which direction – we want to make sure not to put a queen there. So, we shift right 2 words (= 4 bytes) and right 1 word (= 2 bytes) and OR them all together (accumulating into a scratch register so we don’t clobber our state).
But that still contains some stuff in the upper bytes. We only want the lower byte. And we also want
1 bits where queens should be allowed, rather than where they’re under attack. We can solve both problems with one
vpandn instruction, which will flip all the bits, but mask out everything except the first byte (since
So, now that we’re iterating, what happens next?
bsf: Bit Scan Forward. Finds the least significant
1bit in the second argument and stores the index of that bit into the first argument. If there is no
1bit the second argument, the value of the first argument is undefined, and the zero flag (
ZF) is set.
btc: Bit Clear. Clears the bit in the first argument with index given by the second argument.
je: Jump If Equal. Pretty self-explanatory, when used in conjunction with
jz: Jump If Zero. Jumps to the specified address/label if the zero flag (ZF) is set.
push: Push To Stack. Stores its single argument to the memory location pointed by
rsp, and decrements
rsp(usually by eight at a time, i.e.,
rsp <- rsp-8).
shl: Logical Shift Left for non-
First we try scanning for an available position on this row – one that isn’t under attack from already-placed queens, and that also hasn’t already been visited. If there is none, then we have no choice but to
backtrack (a little piece of code which is coming up soon). Assuming we find an available position, we first mark it as visited/unavailable. We then check if this is the last level that needs to be taken care of, by looking at the stack pointer. Since the stack gets deeper by 16 bytes with every level, this test4 is easily set up at program initialization. If the test is true, then we’ve discovered a solution, or “win state” – so we go ahead to the “win” code.
If we’ve neither succeeded nor failed, it means we just have to go another level down in the tree. In order to have an efficient backtracking capability, we store our state variables on the stack, so they can be restored when everything fails deeper down in the tree. Finally, we update our model of which squares are in danger by adding the queen we’re currently placing as a column-occupier and diagonal-occupier (modifying all three state variables at once with the magic of
por). Note here that
cl is just a name for the least significant byte of the
rcx register, which houses the horizontal position of the new queen.
First, we have another stack-pointer test - if we’ve tried to backtrack past the start of the program, then we know we’ve exhausted all possibilities and just go to
done. Assuming that’s not at issue, we simply restore the
xmm1 variables (using
r10 as scratch storage since one can’t directly pop
xmm registers). Then we just jump back into our loop, with a new state ready to go!
Somewhat surprisingly, the first version actually worked.↩
A word is two bytes. Why did I use words and not just bytes? The answer is that some of the fancy instructions we want to use don’t allow us to work with data elements any smaller than words.↩
But it’s all taking place in the register file – no memory accesses here!↩
That is to say, the value of
Requires a recent (Sandy Bridge or later) Intel CPU.↩