Navigate back to the homepage

# Concurrency is not Parallelism

Mike Christensen
October 27th, 2020 · 4 min read

In my previous post I discussed how the simple rules of the Boids algorithm, applied to independent actors aware only of their local environments, produces beautiful, realistic flocking behaviour. While writing I was reminded of another of my favourite examples of emergent behaviour: Conway’s Game of Life. For fun, I decided to implement it in Go - and found that it demonstrates well that old Go proverb:

Concurrency is not parallelism.

## Conway’s Game of Life

In 1970, John Conway invented his “Game of Life” (GoL): a simple cellular automaton, from which emerges a curious life-like behaviour, almost as if one was looking through a microscope into a Petri dish.

The GoL universe is a matrix of cells, each of which is either alive (depicted black) or dead (depicted white). Each cell transitions between these states according to the following rules:

1. Any live cell with two or three live neighbours survives.
2. Any dead cell with three live neighbours becomes a live cell.
3. All other live cells die in the next generation. Similarly, all other dead cells stay dead.

A neighbour refers to any of the immediately adjacent cells.

## Implementing GoL in Go

When approaching a new problem, always implement a simple solution first. As Donald Knuth famously said:

“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.” - Donald Knuth

With this in mind, let’s implement a super simple GoL algorithm in Go!

### Modelling the World Grid

First we need a model of the world:

`.css-1chxjt6{position:absolute;right:22px;top:24px;padding:8px 12px 7px;border-radius:5px;color:#6f7177;-webkit-transition:background 0.3s ease;transition:background 0.3s ease;}.css-1chxjt6:hover{background:rgba(255,255,255,0.07);}.css-1chxjt6[data-a11y="true"]:focus::after{content:"";position:absolute;left:-2%;top:-2%;width:104%;height:104%;border:2px solid var(--theme-ui-colors-accent,#6166DC);border-radius:5px;background:rgba(255,255,255,0.01);}@media (max-width:45.9375em){.css-1chxjt6{display:none;}}1// World describes the world grid.2type World struct {3    Width, Height int4    // Grid describes the set of cells in the world and their alive/dead state.5    // Note that the first row in the array corresponds to the bottom of the screen,6    // and the first element of a row corresponds to the left side of the screen.7    Grid [][]bool8}`

We can instantiate a `World` like so:

`1// NewWorld returns a World of the given dimensions with a randomly initialised grid.2func NewWorld(width, height int) *World {3    world := &World{4        Width:  width,5        Height: height,6    }7    world.Grid = make([][]bool, height)8    for y := range world.Grid {9        world.Grid[y] = make([]bool, width)10        for x := range world.Grid[y] {11            world.Grid[y][x] = rand.Float64() < 0.212        }13    }14    return world15}`

### Counting Neighbours

In order to implement Conway’s rules, we will need to be able to count the number of alive neighbours to a given cell:

`1// WrapCoords accepts an (x, y) coordinate and returns a coordinate which is2// within the bounds of the World, using a toroidal wrapping.3func (w *World) WrapCoords(x, y int) (int, int) {4    if x < 0 {5        x = w.Width - 16    }7    if x > w.Width-1 {8        x = 09    }10    if y < 0 {11        y = w.Height - 112    }13    if y > w.Height-1 {14        y = 015    }16    return x, y17}1819// CountLiveNeighbours returns the number of live neighbours20// around the given cell position.21func (w *World) CountLiveNeighbours(x, y int) int {22    count := 023    for j := -1; j <= 1; j++ {24        for i := -1; i <= 1; i++ {25            if i == 0 && j == 0 {26                continue27            }28            _x, _y := w.WrapCoords(x+i, y+j)29            if w.Grid[_y][_x] {30                count++31            }32        }33    }34    return count35}`

This simple implementation wraps the world grid such that flying off the right hand side brings you back to the left, and flying off the top brings you to the bottom - i.e. a torus:

### Conway’s Rules

Now that we can count neighbours, we can implement Conway’s rules:

`1// Update applies the rules to the world.2func (w *World) Update() {3    // Create a deep copy of the world grid4    nextGrid := make([][]bool, len(w.Grid))5    for y := range w.Grid {6        nextGrid[y] = make([]bool, len(w.Grid[y]))7        copy(nextGrid[y], w.Grid[y])8    }910    // Apply Conway's rules11    for y := range w.Grid {12        for x := range w.Grid[y] {13            n := w.CountLiveNeighbours(x, y)14            if w.Grid[y][x] {15                // Any live cell with fewer than two or more than three live neighbours dies16                if n < 2 || n > 3 {17                    nextGrid[y][x] = false18                }19            } else {20                // Any dead cell with exactly three live neighbours becomes a live cell21                if n == 3 {22                    nextGrid[y][x] = true23                }24            }25        }26    }2728    // Update the world grid29    w.Grid = nextGrid30}`

Note how we create a copy of the world grid so that we don’t read the updated values of neighbouring cells.

### Rendering to the Screen

Finally, it would be great to render the game to the screen. For this I used the excellent `pixel` library from @faiface. We can create a window and game loop in the same way as described in my previous post:

`1const (2    // WindowWidth is the width of the window in pixels3    WindowWidth = 12004    // WindowHeight is the height of the window in pixels5    WindowHeight = 8006    // WorldWidth is the width of the world in # cells.7    WorldWidth = 12008    // WorldHeight is the height of the world in # cells.9    WorldHeight = 80010)1112func createWindow(title string, width, height float64) (*pixelgl.Window, error) {13    window, err := pixelgl.NewWindow(pixelgl.WindowConfig{14        Title:  title,15        Bounds: pixel.R(0, 0, width, height),16        VSync:  true,17    })18    if err != nil {19        return nil, errors.Wrap(err, "new window failed")20    }21    window.SetSmooth(true)22    return window, nil23}2425// Run runs the game.26func Run() {27    window, err := createWindow("GoLife", WindowWidth, WindowHeight)28    if err != nil {29        panic(errors.Wrap(err, "create window failed"))30    }31    frames := 032    second := time.Tick(time.Second)33    world := life.NewWorld(WorldWidth, WorldHeight)34    drawer := draw.NewDrawer()35    for !window.Closed() {36        drawer.DrawFrame(window, world)37        world.Update()38        window.Update()39        frames++40        select {41        case <-second:42            window.SetTitle(fmt.Sprintf("%s | FPS: %d", "GoLife", frames))43            frames = 044        default:45        }46    }47}4849func init() {50    rand.Seed(time.Now().Unix())51}`

Then, we can implement `DrawFrame` to draw the world grid on the window. We can achieve this by iterating over the world grid and rendering black rectangles at the coordinates of alive cells, applying a scaling factor to convert the grid coordinates to window coordinates:

`1// DrawFrame draws the world on the window.2func (d *Drawer) DrawFrame(window *pixelgl.Window, world *life.World) error {3    window.Clear(colornames.Aliceblue)4    d.Batch.Clear()56    imd := imdraw.New(nil)7    imd.Color = pixel.RGB(0, 0, 0)8    scale := pixel.V(9        window.Bounds().Max.Y/float64(len(world.Grid)),10        window.Bounds().Max.X/float64(len(world.Grid[0])),11    )12    for y := range world.Grid {13        for x := range world.Grid[y] {14            if !world.Grid[y][x] {15                continue16            }17            imd.Push(pixel.V(float64(x)*scale.X, float64(y)*scale.Y))18            imd.Push(pixel.V(float64(x+1)*scale.X, float64(y+1)*scale.Y))19            imd.Rectangle(0)20        }21    }22    imd.Draw(d.Batch)2324    d.Batch.Draw(window)25    return nil26}`

Here’s the result with a 1200x800 world grid:

## Faster!

Great, so our basic implementation works - but can we make it faster?

### Benchmarking

First up, time for a speed test: let’s add some benchmarks. For the purposes of this post, we’ll ignore the rendering code and benchmark the core GoL update logic. If you are unfamiliar with writing benchmarks in the Go, I recommend this excellent post from Dave Cheney.

`1func benchmarkUpdate(sideLength int, b *testing.B) {2    world := life.NewWorld(sideLength, sideLength)3    for n := 0; n < b.N; n++ {4        world.Update()5    }6}78func BenchmarkUpdate30(b *testing.B) { benchmarkUpdate(30, b) }9func BenchmarkUpdate35(b *testing.B) { benchmarkUpdate(35, b) }10func BenchmarkUpdate40(b *testing.B) { benchmarkUpdate(40, b) }11func BenchmarkUpdate45(b *testing.B) { benchmarkUpdate(45, b) }`

Here we are measuring the time taken to update different grid sizes. Running on my machine (with a 2m `-benchtime` to improve statistical significance):

`1❯ go test -timeout=1h -benchtime=2m -bench=. ./pkg/life/life_test.go2goos: darwin3goarch: amd644BenchmarkUpdate30-4      2718184             53135 ns/op5BenchmarkUpdate35-4      2132811             69771 ns/op6BenchmarkUpdate40-4      1558998            101462 ns/op7BenchmarkUpdate45-4      1000000            121130 ns/op8BenchmarkUpdate50-4       319368            141794 ns/op9BenchmarkUpdate55-4       226026            162587 ns/op10BenchmarkUpdate60-4       190359            203564 ns/op11BenchmarkUpdate65-4       136521            248646 ns/op12PASS13ok      command-line-arguments  788.950s`

As expected, we can see that runtime grows linearly with the number of cells in the grid (the square of the height of the grid):

In order to compare the program performance as we optimise the code, let’s obtain a benchmark on a large grid of 10000x10000 cells:

`1# func BenchmarkUpdate10000(b *testing.B) { benchmarkUpdate(10000, b) }23❯ go test -benchtime=1m30s -bench=. ./pkg/life/life_test.go4BenchmarkUpdate10000-4                19        5031633976 ns/op`

So, about 5 seconds on my machine to update the whole grid.

### Partitioning the Grid

Can we speed up the program by partitioning the grid and computing each section of the grid concurrently (in their own goroutines)?

We can achieve this by executing the outer update loop over a specific range of rows:

`1// updatePartition applies the rules to a vertically-partitioned portion of the world.2func (w *World) updatePartition(wg *sync.WaitGroup, yStart, yEnd int, nextGrid [][]bool) {3    defer wg.Done()4    for y := range w.Grid[yStart:yEnd] {5        y = y + yStart6        for x := range w.Grid[y] {7            n := w.CountLiveNeighbours(x, y)8            if w.Grid[y][x] {9                // Any live cell with fewer than two or more than three live neighbours dies10                if n < 2 || n > 3 {11                    nextGrid[y][x] = false12                }13            } else {14                // Any dead cell with exactly three live neighbours becomes a live cell15                if n == 3 {16                    nextGrid[y][x] = true17                }18            }19        }20    }21}`

Then we can call `updatePartition` concurrently for the desired number of partitions (note that the number of partitions must be a factor of the world height):

`1// Update applies the rules to the world.2func (w *World) Update(partitions int) {3    // Create a deep copy of the world grid4    nextGrid := make([][]bool, len(w.Grid))5    for y := range w.Grid {6        nextGrid[y] = make([]bool, len(w.Grid[y]))7        copy(nextGrid[y], w.Grid[y])8    }910    // Apply Conway's rules11    var wg sync.WaitGroup12    for p := 0; p < partitions; p++ {13        wg.Add(1)14        go w.updatePartition(&wg, p*w.Height/partitions, (p+1)*w.Height/partitions, nextGrid)15    }16    wg.Wait()1718    // Update the world grid19    w.Grid = nextGrid20}`

As an aside, you may notice that our goroutines may concurrently read from the same cell. This can occur at the boundaries between partitions where a neighbouring cell is found in the adjacent partition. This is in fact not a race condition, since no writes occur on `w.Grid`; they only occur on the deep copy, `nextGrid`. Note that if we wanted to remove the synchronisation (achieved with the `WaitGroup`) that is currently required with each call to `Update` - that is, to allow goroutines to race ahead to the next generation - then we would instead need to use channels to communicate the state of the partition boundaries:

Don’t communicate by sharing memory, share memory by communicating.

Here are the results of running the 10000x10000 benchmark with 2 partitions on my machine:

`1❯ go test -benchtime=1m30s -bench=. ./pkg/life/life_test.go2BenchmarkUpdate10000-4                26        4110649931 ns/op`

That’s down from ~5s to ~4s, a 20% improvement. We can see further improvement with 4 partitions:

`1❯ go test -benchtime=1m30s -bench=. ./pkg/life/life_test.go2BenchmarkUpdate10000-4                33        3073360188 ns/op`

However, increasing the number of partitions beyond this doesn’t improve the runtime of our program!

`1❯ go test -benchtime=1m30s -bench=. ./pkg/life/life_test.go2BenchmarkUpdate10000-4                33        3051534928 ns/op3❯ go test -benchtime=1m30s -bench=. ./pkg/life/life_test.go4BenchmarkUpdate10000-4                27        3930115474 ns/op`

## Concurrency is Not Parallelism

Why doesn’t adding more than 4 goroutines lead to a speed up? There’s a clue in the benchmark output:

`1BenchmarkUpdate10000-4`

That `-4` postfix indicates the value of `GOMAXPROCS` during that run - that is, the number of OS threads available to our goroutines. My machine has 4 virtual cores, and by default (since Go 1.5) `GOMAXPROCS` matches this value.

When we split the grid into 4 partitions, the Go runtime was able to schedule the goroutines into separate OS threads, which are executed in parallel across each vCPU. With more partitions, execution of each goroutine continues to be concurrent, but not in fact parallel.

### What if N > `GOMAXPROCS`?

You may have heard goroutines describes as “lightweight threads” before. One of the benefits of this lightweight footprint is supposed to be that you can spawn a lot of them - but if spawning more than `GOMAXPROCS` goroutines doesn’t speed up our program, what is the point? The answer lies with the Go scheduler and the type of work that our goroutines are performing.

Note that update algorithm for Conway’s GoL is busy. There are no I/O or otherwise blocking calls and the algorithm makes maximal use of CPU time. Imagine instead our goroutines were executing code that made blocking calls. Then, the scheduler could allow other goroutines to use the CPU while the caller is blocked.

So how do you determine whether to add concurrency to your program? It’s important to profile your code to understand where adding concurrency will yield benefits and where it will not.

If you want to play around with this yourself, you can check out the source code. If you’ve got comments, questions or feedback you can hit me up on Twitter ✌️

### Flocking Gophers: The Boids Algorithm

Emergent flocking behaviour and predator evasion in Go.

September 6th, 2020 · 3 min read

### Integration Testing in Go

How to test code works correctly together.

August 27th, 2020 · 3 min read