Benchmarking Go programs, part 2

There were a couple of things that didn’t make sense to me initially in terms of optimisation of Go code. I’m going to try to list some of them that I found more interesting and you should be aware of this if you’re really working hard to optimize your Go program for performance or memory usage.

Escape analysis

The term “Escape analysis” is referring to allocation of a variable on the heap.

The basic rule is that if a reference to a variable is returned from the function where it is declared, it “escapes” - it can be referenced after the function returns, so it must be heap-allocated.

For a more in-depth reading of how this works in Go, please check out this excellent article by Alan Gardner. Meanwhile, we’ll move towards the example where this came in play, the following code snippet from this reddit thread.

The example was a simple recursion example:

func recursive1(count int64) int64 {
	if count <= 0 {
		return 0
	}
	return recursive1(count-1) + 1
}

You’ll notice that this code doesn’t use references. As Alan says in his article: “In general, code without references uses stack allocation, trivially.”. This means that while the values are not getting allocated on the heap, the function itself does incur a penalty for these allocations. But how much?

Let’s consider returning the values from this function by reference:

func recursive2(result *int64, count int32) {
	if count <= 0 {
		return
	}
	*result++
	recursive2(result, count-1)
}

The functions call graph hasn’t been altered, there’s still count recursive function calls being made, and in each function call, the result is incremented once. The difference is only that the result variable was pre-allocated in the benchmark, getting rid of all the stack allocations. When running the benchmark on these functions, these are the results:

BenchmarkR2_1-4        500000     3120 ns/op
BenchmarkR2_2-4        200000     6211 ns/op
BenchmarkR2_5-4        100000    15675 ns/op
BenchmarkR2_10-4        50000    31070 ns/op
BenchmarkR1_1-4        300000     4130 ns/op
BenchmarkR1_2-4        200000     8296 ns/op
BenchmarkR1_5-4        100000    20599 ns/op
BenchmarkR1_10-4        30000    41487 ns/op

A significant improvement was made over the initial implementation: from 4130ns/op to 3120ns/op, which translates into a speed gain of 32%! While this example shows the implications of a lot of function calls, there are additional considerations as well - the compiler might optimize the above code eventually, and whatever the compiler can’t optimize, you can do a better job by hand. In this very trivial case, we’re literally optimizing away only the function return overhead. If you can eliminate the function calls altogether, the results will be staggering.

"Returning values by reference may improve the speed of your Go app by 30%+ #golang" via @TitPetric

Click to Tweet

The cost of defer

In this reddit thread I questioned if it’s reasonable to use defer instead of “unrolling” the function to it’s full size. This stemmed from my misunderstanding of what defer actually does, and this conclusion surprised me somewhat:

type Defer struct {
	sync.Mutex
	value int
}

func (r *Defer) defer1() {
	r.Lock()
	defer r.Unlock()
	r.value++
}

func (r *Defer) defer2() {
	r.Lock()
	r.value++
	r.Unlock()
}

Would you believe me if I told you that defer2 is once faster than defer1? Maybe you’ll believe the benchmarks:

BenchmarkDefer2_1-4     50000000      25 ns/op
BenchmarkDefer2_2-4     30000000      49 ns/op
BenchmarkDefer2_5-4     10000000     123 ns/op
BenchmarkDefer2_10-4     5000000     244 ns/op
BenchmarkDefer1_1-4     30000000      51 ns/op
BenchmarkDefer1_2-4     20000000     102 ns/op
BenchmarkDefer1_5-4      5000000     253 ns/op
BenchmarkDefer1_10-4     3000000     511 ns/op

The reason for this result is that defer allocates memory. Far be it from me to break down exactly how this is being done, but there are significant optimisations made for Go 1.8 and upcoming 1.9. Due to these optimisations, you shouldn’t disqualify using defer even for more cosmetic use cases - making your code more readable. But if you’re doing something with high lock contention / function call frequency and your code can’t panic - consider skipping defer and take advantage of this speed boost.

"Dealing with high frequency function calls, skip defer and get a significant speed boost #golang" via @TitPetric

Click to Tweet

The cost of timers/sleep

I initially started programming in single threaded languages, namely pascal, c++, x86 assembler and on the far end of the timeline, PHP. Basically all of these had a function like usleep, which would take a number of microseconds or some time duration and halt the program execution for that number of seconds.

You could say that time.Sleep is the Go equivalent of this.

I was optimizing a relatively straightforward generator/consumer app and was doing benchmarks and profile runs in order to be stumped - why was my app producing such high CPU usage, while benchmarks indicated I could generate several million samples within a second. The heavily upvoted reddit thread I posted about this is very insightful and you should definitely read through the comments.

The reason for this was in the fine print - what time.Sleep is actually doing:

Sleep pauses the current goroutine for at least the duration d.

But a goroutine is not a thread, which has further implications. The usleep being used by those single-threaded languages is basically a system call which halts the complete process for a specified amount of time. The scheduling for that is completely contained in the kernel. Meanwhile if we would use the same implementation of this function in Go, it would pause all goroutines running on the same (possibly single) thread.

There exists a thing which is called the “Go Scheduler”. There are a few in-depth explanations of what it is and how it works, but let’s concentrate on the most simple explanation: the Go scheduler is what runs your main function and creates all your goroutines and allocates time between them so they can each perform their work effectively.

So what exactly happens when you call time.Sleep?

// timeSleep puts the current goroutine to sleep for at least ns nanoseconds.
//go:linkname timeSleep time.Sleep
func timeSleep(ns int64) {
	if ns <= 0 {
		return
	}
	t := new(timer)
	t.when = nanotime() + ns
	t.f = goroutineReady
	t.arg = getg()
	lock(&timers.lock)
	addtimerLocked(t)
	goparkunlock(&timers.lock, "sleep", traceEvGoSleep, 2)
}
  1. Creates a new timer,
  2. Sets a trigger to the timer to time + duration,
  3. Locks the timer and registers a callback when to unlock

So, basically, putting a goroutine to sleep involves a lock (a futex lock - fast user space mutex), a timer and a callback which would tell the scheduler that our goroutine is again ready for getting some sweet CPU time. If you’re looking at the calls like time.After() or time.NewTicker(), you’ll see that timerproc does a whole lot more, including channel overhead for sending events. And if you’re sleeping for 4000 times / second (a timer at 250 microseconds ticks) this will negatively impact CPU usage and over-invoke the scheduler, and the scheduler will do a high number of context switches between goroutines.

So what can you do? Put your hand off the throttle. Try to think in a more serial way in such cases. It’s going to be less CPU expensive producing 100 samples every 25 milliseconds than 1 sample every 250 microseconds. Use time.Sleep in a loop instead of using time.NewTicker() if you can. Try to minimize the impact on the Go scheduler.

"Minimizing impact on the Go scheduler is an effective way to reduce CPU usage #golang" via @TitPetric

Click to Tweet

What are the implications of optimizing timers like this?

Going down from 80% CPU usage to 8% CPU usage doesn’t really increase your volume, as the output is based on time. But it does give you space to do more things in addition to generating the main output. This is a classic example of “do more with given resources”.

You can also check out the complete code samples on this github repo. While the above gives you an idea of how benchmarking works in Go, and how to usefully implement it, I’ll follow up with a more complex example. A good way to stay updated is to enter your e-mail below, or follow me on Twitter.

This is the second article in the series, read also part 1: Benchmarking Go programs.

About the author

I'm the author of API Foundations in Go and my latest book, 12 Factor Apps with Docker and Go. You should read both books if you want to learn more about Docker and Go. I write about technology, Docker, Go, Node, PHP and databases most of the time. I am always interested in implementing best practices in whatever I do. You can reach out to me on Twitter.

I'm available for consulting / development jobs. Fixing bottlenecks and scaling services to cope with high traffic is my thing. I specialize in back-end development with a focus on providing APIs for consumption in Angular, React, Vue.js and other front-end technology stacks.