# Overkilling the 8-queens problem

Last night, a fellow Hacker Schooler challenged me to a running-time contest on the classic eight queens puzzle. Naturally, I pulled up my trusty Intel® 64 manual and got to work. It turned out to be even faster than I expected, churning out pretty-printed output in 15ms, which is totally dominated by the time it takes the terminal to display it (it takes only 2ms if redirected to a file).

Update: Very slightly more scientific testing, spurred by curious Hacker News commenters, indicates that, without pretty-printing and other overhead, the solving time is actually closer to 11.2µs – about a factor of 7 speedup over commenter bluecalm’s C implementation.

## The Approach

My solution method is heavily inspired by this paper (which, appropriately enough, concerns a beautifully insane programming language called MCPL, combining features from ML, C, and Prolog). This paper contributes two key insights about solving the 8-queens problem:

• Conceptually, we can model the solution space as the leaves of a tree, where each internal node of the tree corresponds to a partial board (with the number of queens equal to the tree depth), and each parent-child link represents adding another queen at the row number corresponding to the depth of the child. Since there can only be one queen per row in a correct solution, this tree is a superset of the actual solution set.

• Instead of actually constructing the tree, we can simply keep track of the current traversal state. In particular, this means we keep track of the currently occupied columns, the occupied leftward going diagonals, and the occupied rightward going diagonals, as they intersect the current row. (Each of these three state variables is eight bits of information.) In addition, we can keep track of the past traversal history of each level using a the stack.

If any of this is unclear, check out the paper, which has a beautiful diagram that there is no need for me to attempt replicating.

## The Code

I’m going to go through the first1 version of the code, which doesn’t produce the pretty boards but has most of the clever tricks. (Ironically, adding “pretty printing” made my code uglier. Maybe it’s just that I was up too late working on it.)

The heart of this algorithm is the sequence that updates the state variables as we move from one layer into the next. This whole program is small enough that it’s still practical to just set aside registers to represent most variables; in particular, rdx represents where it’s okay to place a queen at the current layer (e.g. it starts out as 0b11111111), and xmm1 (one of those fancy 128-bit registers that supports fancy new operations) stores the “occupied left diagonals”, “occupied right diagonals”, and “occupied columns” states, in that order (with “occupied columns” being the least significant word2). xmm2, xmm3, and xmm4 are just being used as scratch space. Finally, xmm7 is a constant 0xff.

### Instruction Dictionary

To spare you the effort of searching through the Intel® 64 manual yourself, here are brief descriptions of all the fancy instructions I’m about to use.

• vpsllw: Vector/Packed Shift Left (Logical) Words. Separately shifts left every word of the second argument by the number of bits represented as the third argument, and store the result to the first argument.
• vpsrlw: Vector/Packed Shift Right (Logical) Words. Separately shifts right every word of the second argument by the number of bits represented as the third argument, and store the result to the first argument.
• pblendw: Packed Blend Words. Using the third argument as a mask, selectively copy words from the second argument to the first argument.
• vpsrldq: Vector/Packed Shift Right (Logical) Double Quadword. Shifts the entire second argument by the number of bytes specified in the third argument, and stores the result to the first argument.
• por: Parallel OR. Bitwise ORs the first and second argument and assigns the result to the first argument.
• vpandn: Vector/Parallel AND NOT. Inverts the second argument, ANDs the result with the third argument, and assigns the result of that to the first argument.
• movq: Move Quadword. The standard way to move data between xmm registers and normal registers.
Now, let’s take this a few lines at a time.

If you’re accustomed to C, you might think of this as functionally equivalent to something like xmm1[2] <<= 1; xmm1[1] >>=13. We want the word in position 1 to shift right and the word in position 2 to shift left, while the word in position 0 (occupied columns) stays put.

Now, we want to combine the information about which squares in the next layer are under attack. It doesn’t matter from which direction – we want to make sure not to put a queen there. So, we shift right 2 words (= 4 bytes) and right 1 word (= 2 bytes) and OR them all together (accumulating into a scratch register so we don’t clobber our state).

But that still contains some stuff in the upper bytes. We only want the lower byte. And we also want 1 bits where queens should be allowed, rather than where they’re under attack. We can solve both problems with one vpandn instruction, which will flip all the bits, but mask out everything except the first byte (since xmm7=0xff).

So, now that we’re iterating, what happens next?

### Instruction Dictionary

• bsf: Bit Scan Forward. Finds the least significant 1 bit in the second argument and stores the index of that bit into the first argument. If there is no 1 bit the second argument, the value of the first argument is undefined, and the zero flag (ZF) is set.
• btc: Bit Clear. Clears the bit in the first argument with index given by the second argument.
• je: Jump If Equal. Pretty self-explanatory, when used in conjunction with cmp (Compare).
• jz: Jump If Zero. Jumps to the specified address/label if the zero flag (ZF) is set.
• push: Push To Stack. Stores its single argument to the memory location pointed by rsp, and decrements rsp (usually by eight at a time, i.e., rsp <- rsp-8).
• shl: Logical Shift Left for non-xmm registers.

First we try scanning for an available position on this row – one that isn’t under attack from already-placed queens, and that also hasn’t already been visited. If there is none, then we have no choice but to backtrack (a little piece of code which is coming up soon). Assuming we find an available position, we first mark it as visited/unavailable. We then check if this is the last level that needs to be taken care of, by looking at the stack pointer. Since the stack gets deeper by 16 bytes with every level, this test4 is easily set up at program initialization. If the test is true, then we’ve discovered a solution, or “win state” – so we go ahead to the “win” code.

If we’ve neither succeeded nor failed, it means we just have to go another level down in the tree. In order to have an efficient backtracking capability, we store our state variables on the stack, so they can be restored when everything fails deeper down in the tree. Finally, we update our model of which squares are in danger by adding the queen we’re currently placing as a column-occupier and diagonal-occupier (modifying all three state variables at once with the magic of por). Note here that cl is just a name for the least significant byte of the rcx register, which houses the horizontal position of the new queen.

What if we have to backtrack?

First, we have another stack-pointer test - if we’ve tried to backtrack past the start of the program, then we know we’ve exhausted all possibilities and just go to done. Assuming that’s not at issue, we simply restore the rdx and xmm1 variables (using r10 as scratch storage since one can’t directly pop xmm registers). Then we just jump back into our loop, with a new state ready to go!

Now we’re ready to look at the whole solution in context:

If you’re curious to investigate further, run the code yourself5 and/or check out the more complicated, pretty-printing version.

1. Somewhat surprisingly, the first version actually worked.

2. A word is two bytes. Why did I use words and not just bytes? The answer is that some of the fancy instructions we want to use don’t allow us to work with data elements any smaller than words.

3. But it’s all taking place in the register file – no memory accesses here!

4. That is to say, the value of r14.

5. Requires a recent (Sandy Bridge or later) Intel CPU.