Experience collecting

That’s what I seem to be doing.

Why do I do something? Because I’ve not done it before.

Take foods for instance. I’ll try almost anything. Twice at least. The first time the chef might not’ve prepared it right. Or maybe my palate wasn’t ready for it yet.

Now try to apply that to life. To everything.

That’s what I do.

Some things might turn out to be stupid afterward. Some times things break. Some times I push myself past the breaking point.

All of those things make me what I am.

A life is a collection of experiences.

I’m stocking the curio with as many varied things as I can.

Collecting.

Everything.

Life – Part 4 – Final tweaks

Gabe and I got VLife pretty optimized back in the day. We took the tile route to a even bigger extreme and gave tiles sub-tiles as well. The key is to do as little processing as you can get away with; any loop that you don’t have to loop over is less processing.

Of course you can go too far. If you make the sub tiles too small you wind up with too much overhead. If you makes the tiles themselves too small you wind up with overhead on the memory allocation/freeing side of the world. The key is to do benchmarks to get a sense of progress.

In some relatively recent conversations with a former coworker we came up with an even better way of representing the cells to same memory and potentially speed things up even more: denser cell packing!

In the previous description we were spending five bits to store each cell. One bit for the “liveness” of the cell and four for the neighbors since you can have up to eight live neighbors.

This is good, but we can get better. Storing five bits is problematic because it’s not a power of two; it’s hard to pack things like that. I suppose you could take a 16 bit word and get three cells, but it’s more of a stretch. That’s not even getting to the additional overhead of the bit shifting to process things.

We need a better way.

Rob came up with the better way through an elegant hack!

Let’s go down the path of storing a cell in only four bits and take one bit for the liveness. You can’t get away from that. That leaves three bits for neighbors. How can you store the number of neighbors in three bits? You can only count up to seven like that.

My initial thought was to use seven (all bits set) as a semaphore to trigger the exception case of “count the neighbors the old fashioned way” since seven or eight neighbors happens so infrequently. But Rob got me one better than that.

Implicit in storing the cell in four bits is that you store two cells per byte. The sibling cell’s liveness bit is in the same byte! The count only has to store the number of live cells other than your sibling. This gets the max down to seven. Problem solved!

Of course the next question is “how do you process this efficiently?”

Easy. A lookup table!

With only one byte to represent two cells the lookup table is only 256 (2^8) rows long. Given the previous state of the cell-pair, all of the information to do the processing is contained in that one byte: the liveness of both cells and enough information to determine the total neighbor count of each!

I’ve not yet implemented this, but it’s cool to advance the state of the art like this.

Life – Part 3 – VLife

This continues the series from yesterday.

While XLife introduced some optimization around how it computes the next board, you can make this run even faster if you look at the typical workload of this cellular automata.

Looking at the characteristics of each generation you’ll quickly see that the vast majority of cells don’t change state. A few things follow from this observation:

  1. If a cell doesn’t change state it doesn’t affect it’s neighbor’s live cell count
  2. The number of cell-state changes can be represented in a more concise format than a whole other board
  3. Cells that don’t change can’t affect their neighbors
    1. The maximum speed that information can propagate is one-cell per generation

Let’s take these one at a time.

Instead of evaluating each cell by counting it’s neighbors, why not store the neighbor count in the cell and update that when it’s needed (essentially this plays the same loop but after (and only if) the cell changes state. A vast majority of the looping can be eliminated by this simple step. If, for instance, 2% of the cells change state in a generation then you’ve saved 98% of the looping around the neighbors! The looping still has to happen to increment or decrement the neighbor count, but only if the cell changes state. Now when evaluating a cell all you need to consider is found in the cell itself: the state and the number of live neighbors!

Of course this carries with it the penalty of storing additional information in each cell, but we’ll get to that later. (In the fully ideal world you’d only need to store one bit per cell, but that’s the trade-off we’re making here)

This moves into #2: board maintenance. Since we’re now keeping track of a bit more info than before it would be more expensive to copy all of it. But we can use the same property of Life to our advantage here too. Instead of copying the board, all we need to do is keep track of which cells changed states in the generation. On the first pass you evaluate all the cells and keep a list of which ones changed, then on pass two you update the cells and their neighbor counts. This saves the overhead of trying to keep two boards!

Finally: #3. If a cell didn’t change then it couldn’t affect it’s neighbors. We can use this to our advantage. Let’s divide up the board into sectors and mark then “clean” in phase 1 when we’re evaluating the board. On phase 2, any sector that had something change let’s mark as dirty (or if it’s on a sector edge, mark it’s neighbor dirty too). Next generation one phase 1 all you have to do is evaluate the dirty sectors! If nothing changed last generation, then you know nothing will change next time too.

Tomorrow will have the last update: even more tweaks to the algorithm to make it faster and more efficient!

Life – Part 2 – XLife

Continuing from yesterday lets look at how this simple algorithm can be made faster!

The first innovation came from XLife that shipped with the X11 X-Windows system on Unix. (personally I didn’t find out about this until later, but historically this came before)

The problem with the conventional type of loop (ignore the wrap problem, this is only an illustration):

for (int x = 0; x < max_x; x++) {
  for (int y = 0; y < max_y; y++) {
    int neighbors;
    neighbors =
      board[y-1][x-1] +
      board[y-1][x] +
      board[y-1][x+1] +
      board[y][x-1] +
      board[y][x+1] +
      board[y+1][x-1] +
      board[y+1][x] +
      board[y+1][x+1]
    DoProcessing(...);
  }
}

The problem is that there is a lot of dereferencing going on all over the place. For each cell you need to do eight double-indexed dereferences. This leads to a a lot of memory traffic on the system bus.

Now lets assume we have each cell stored in a byte and you have a double-whammy: modern processors are bad (cycle-wise) at accessing memory on a non-word boundary.

But, you can play some interesting games if you don’t mind getting closer to the bare metal of the processor! What if instead of treating each cell as a byte, you treat it as a native machine word? For modern processors that would be either 32 or 64 bits.

uint64 top = board[y-1][x]
uint64 middle = board[y][x]
uint64 bottom = board[y+1][x]

uint64 neighbors = top<<8 + top + top >>8 +
                   middle<<8 + middle>>8 +
                   bottom<<8 + bottom + bottom>>8;

This will get you the neighbor counts for the middle six cells. Of course a cell can’t have more than 8 live neighbors so you can take it a step further by storing the cells one per nibble and replace the “8″s in the shifts up there with “4″s, now you’re finding the neighbor counts for 14 cells at a time! Now the looping over the result can be handled all over the registers instead of memory.

Of course this doesn’t handle the cells at the edge, but that can be done cheaply with some messy code (left as an exercise for the reader!).

This code will run substantially faster than the simple code up top. It’s better, but it can get better still by looking at how life typically gets played instead of treating every cell the same.

Tomorrow: VLife – our own creation!

Life – Part 1

Firstly, this isn’t about the life we live, but rather Conway’s Game of Life. (Go ahead and read that, there’s no real reason for me to retype all of it here — and there’s some cool graphics that show it all playing out)

The rules are simple. It’s played on a grid of “cells,” each of which can be alive or dead.

  • If a dead cell has exactly three live neighbors then it comes to life
  • If a live cell has two or three live neighbors it stays alive.
    • Too many and it dies of overpopulation
    • Too few and it dies of loneliness

Each generation of cells simply applies these rules to generate the next generation.

A simple way of implementing this is to double-buffer the boards. One of these buffers stores to the current board and you generate the next generation on the other one. The reason you need two in this case is because you can’t modify the board you’re working on. (think what would happen if you kill off a cell before you evaluate its neighbor) When you’re done constructing the new generation, you simply repeat the process with the boards’ roles reversed.

This works, but there’s plenty of room for improvement.

The obvious problem with this is that you need twice the memory to process all of this. This isn’t an overwhelming problem when you’re looking at a glider, but when you try to make a full computer out of it… well it starts getting a bit hairy.

The other parts you can optimize aren’t so readily apparent. We’ll get there in the next couple of days.

Tomorrow: How XLife added a layer of speed to the process!