Benchmarking Go programs

Benchmarking is an important step in writing software. This may extend beyond the scope of your Go program, but attention should also be given to the Go program itself. If you’re writing APIs or if you’re implementing a parser for a specific file format - benchmarking will tell you how performant your implementation is.

Writing tests in Go is relatively simple - you create a file with the suffix _test.go, and when you run go test it will run the tests contained within. But there’s a specific feature to Go, benchmarks. You write the benchmarks in the same way as you would write tests, but change the function signature from this:

func TestScannerTextBlock(t *testing.T) {

to this:

func BenchmarkScannerTextBlock(b *testing.B) {

As you see, the Testing functions have a “Test” prefix for the function name. In comparison, the Benchmark functions have a “Benchmark” prefix. What changes is also the argument from *testing.T to *testing.B. The types provide functionality to manage tests and benchmarks.

A feature complete example of a benchmark would be this:

func BenchmarkStringJoin1(b *testing.B) {
	input := []string{"Hello", "World"}
	for i := 0; i < b.N; i++ {
		result := strings.Join(input, " ")
		if result != "Hello World" {
			b.Error("Unexpected result: " + result)

And you could run your benchmark like this:

# go test -run=xxx -bench=. -cpuprofile profile_cpu.out

The above command will skip tests (unless you’ve named them ‘xxx’), run all benchmarks, and create a CPU profile for the benchmarking run.

"Benchmarking with Go: It's just as simple as writing tests #golang" via @TitPetric

Click to Tweet

There are a few things to note in that example:

1. Testing in the benchmark

You can write a sanity check test inside the benchmark, which can error out of the benchmark if the results might be invalid. While this falls into the testing territory, there might be errors that can be caught this way from objects that maintain a state that changes over many requests / function calls.

Ir you detect an error, just call the b.Error function and the benchmarks will stop.

2. Benchmark intensity

The usual story when benchmarking code in other programming languages would be that you would log the time taken for a loop, where you would set a loop limit by hand. Go benchmarks are smarter in that respect - they provide a b.N integer that tells the benchmark how many loops it should do, so it roughly takes one second (by default).

  -test.benchtime duration
        approximate run time for each benchmark (default 1s)

3. Reporting Allocations

When running a benchmark, Go can log and report allocations which you make. This logging and reporting is triggered with a call to b.ReportAllocs(). The allocation information may signify a portion of your code that can be optimized in a way so it takes less memory. After running the benchmark, some information regarding B/op and allocs/op will be printed. The complete output might be similar to this:

BenchmarkStringJoin1-4   300000   4351 ns/op   32 B/op   2 allocs/op

This tells us that each iteration of the benchmark allocated on average 32 bytes, with two allocations. With this information we can set out to improve the test, hopefully allocating less memory.

func BenchmarkStringJoin2(b *testing.B) {
	input := []string{"Hello", "World"}
	join := func(strs []string, delim string) string {
		if len(strs) == 2 {
			return strs[0] + delim + strs[1];
		return "";
	for i := 0; i < b.N; i++ {
		result := join(input, " ")
		if result != "Hello World" {
			b.Error("Unexpected result: " + result)

The new benchmark produces the following results:

BenchmarkStringJoin2-4   500000   2440 ns/op   16 B/op   1 allocs/op

So, we managed to halve the allocations and speed up the function at the same time. Looking at the implementation of strings.Join, I think that maybe the optimisation comes from the compiler itself. The closure join which we introduced must be partially optimized and only the allocation for result must remain. The compiler should produce something like this:

func BenchmarkStringJoin3(b *testing.B) {
	input := []string{"Hello", "World"}
	for i := 0; i < b.N; i++ {
		result := input[0] + " " + input[1];
		if result != "Hello World" {
			b.Error("Unexpected result: " + result)

Where the result would count as one allocation. But in fact, the compiler optimizes the above code even further:

BenchmarkStringJoin3-4   30000000   37.6 ns/op   0 B/op   0 allocs/op

This makes it a bit difficult to point at a specific allocation in the code, and pin point the difference. For example, above you could say that declaring the input variable counts as an allocation - but if you move it into the for loop, the results will vary. The allocs/op for the Join1 and Join2 tests would stay the same, but the Test2, will increase in memory usage:

BenchmarkStringJoin2-4    500000   2440 ns/op   16 B/op   1 allocs/op
BenchmarkStringJoin2B-4   300000   5004 ns/op   48 B/op   2 allocs/op

Compiler optimizations are great, but it’s a bit tricky to deduce where the allocations might be coming from. In fact, with added optimisations in the compiler itself, the above benchmarks might improve with time, without changing a single line of source code.

"With compiler optimizations it's tricky to deduce where allocations might be coming from. #golang" via @TitPetric

Click to Tweet

For those not faint of heart, you can run the following command and take a look at what the compiler produces - pure unadulterated assembly.

# go tool objdump app.test > app.asm

4. CPU profile with pprof

While you have some basic information about the time it took for each function to be executed, you might want to drill down when benchmarking more complex functions. This way you might figure out, which function being called takes the most time.

The benchmark command above produces an app.test executable, along with a profile_cpu.out. You can use the pprof tool from the go toolchain to analyze this output:

# go tool pprof app.test profile_cpu.out
Entering interactive mode (type "help" for commands)
(pprof) top10
8220ms of 10360ms total (79.34%)
Dropped 63 nodes (cum <= 51.80ms)
Showing top 10 nodes out of 54 (cum >= 160ms)
      flat  flat%   sum%        cum   cum%
    2410ms 23.26% 23.26%     4960ms 47.88%  runtime.concatstrings
    2180ms 21.04% 44.31%     2680ms 25.87%  runtime.mallocgc
    1200ms 11.58% 55.89%     1200ms 11.58%  runtime.memmove
     530ms  5.12% 61.00%      530ms  5.12%  runtime.memeqbody
     530ms  5.12% 66.12%     2540ms 24.52%  runtime.rawstringtmp
     470ms  4.54% 70.66%     2420ms 23.36%  strings.Join
     390ms  3.76% 74.42%     2330ms 22.49%  app.BenchmarkStringJoin3B
     180ms  1.74% 76.16%     1970ms 19.02%  runtime.rawstring
     170ms  1.64% 77.80%     5130ms 49.52%  runtime.concatstring3
     160ms  1.54% 79.34%      160ms  1.54%  runtime.eqstring

This doesn’t exactly show the complete call graph relationships, so you would know what functions strings.Join is calling for example. Or why only BenchmarkStringJoin3B shows up in the top 10. One way to get that is to generate it as a PDF or SVG with the same pprof tool (you will need dot installed):

# go tool pprof -svg profile_cpu.out > profile_cpu.svg
# go tool pprof -pdf profile_cpu.out > profile_cpu.pdf

This will produce a callgraph PDF like this screenshot below:

A zoomed in callgraph section

As we declared several benchmarks, the output is slightly inconvenient if you’re interested only in one. You can run a specific benchmark and only get the output of that run. As Join2B was the worst benchmark, we want to drill into that one specifically:

go test -run=xxx -bench=BenchmarkStringJoin2B$ -cpuprofile profile_2b.out
go test -run=xxx -bench=BenchmarkStringJoin2$ -cpuprofile profile_2.out
go tool pprof -svg profile_2b.out > profile_2b.svg
go tool pprof -svg profile_2.out > profile_2.svg

A partial call graph with a single benchmark

As the screenshot shows, the 2B benchmark is slower due to extra allocations in the benchmark function. In comparison, the benchmark with the allocation outside the loop allocates memory under runtime.concatstrings. It might still not be a great example of drilling down into the benchmark, it might be better to use flame graphs.

A flame graph gives you the option to drill down top level functions by depth. This means that you can get a full flame graph like this and then drill down.

"A flame graph is a good way to drill down your benchmarks, finding your bottlenecks #golang" via @TitPetric

Click to Tweet

A full run flame graph

In order to drill down, you just click on the individual functions on the bottom, that are interesting to you. This will change the graph to display only descendant functions that have been invoked inside.

A flame graph zoomed in to a single benchmark

Generating the graph is as simple as generating the svg/pdf with pprof, but it does require some extra tooling, namely uber/go-torch, which uses a FlameGraph library by Brendan Gregg. A short bash script that checks out all the dependencies and generates the flame graph is bundled here:

# install flamegraph scripts
if [ ! -d "/opt/flamegraph" ]; then
	echo "Installing flamegraph (git clone)"
	git clone --depth=1 /opt/flamegraph

# install go-torch using docker
if [ ! -f "bin/go-torch" ]; then
	echo "Installing go-torch via docker"
	docker run --net=party --rm=true -it -v $(pwd)/bin:/go/bin golang go get
	# or if you have go installed locally: go get

bin/go-torch -b profile_cpu.out -f profile_cpu.torch.svg

You can also check out the complete code samples on this github repo. While the above gives you an idea of how benchmarking works in Go, and how to usefully implement it, I’ll follow up with a more complex example. A good way to stay updated is to enter your e-mail below, or follow me on Twitter.

While I have you here...

It would be great if you buy one of my books:

I promise you'll learn a lot more if you buy one. Buying a copy supports me writing more about similar topics. Say thank you and buy my books.

Feel free to send me an email if you want to book my time for consultancy/freelance services. I'm great at APIs, Go, Docker, VueJS and scaling services, among many other things.