Make Your Go Programs Lightning Fast With Profiling

Go is often used for writing distributed systems, advanced data stores, and microservices. Performance is key in these domains.

In this tutorial, you'll learn how to profile your programs to make them lightning fast (utilize the CPU better) or feather light (use less memory). I will cover CPU and memory profiling, using the pprof (the Go profiler), visualizing the profiles, and even flame graphs.

Profiling is measuring the performance of your program in various dimensions. Go comes with great support for profiling and can profile the following dimensions out of the box:

a sampling of CPU time per function AND instruction
a sampling of all heap allocations
stack traces of all current goroutines
stack traces that led to the creation of new OS threads
stack traces that led to blocking on synchronization primitives
stack traces of holders of contended mutexes

You can even create custom profiles if you want. Go profiling involves creating a profile file and then analyzing it using the pprof go tool.

How to Create Profile Files

There are several ways to create a profile file.

Using "go test" to Generate Profile Files

The easiest way is to use go test. It has several flags that allow you to create profile files. Here is how to generate both a CPU profile file and a memory profile file for the test in the current directory: go test -cpuprofile cpu.prof -memprofile mem.prof -bench .

Download Live Profile Data From a Long-Running Service

If you want to profile a long-running web service, you can use the built-in HTTP interface for providing profile data. Add somewhere the following import statement:

import _ "net/http/pprof"

Now, you can download live profile data from the /debug/pprof/ URL. More info is available in the net/http/pprof package documentation.

Profiling in Code

You can also add direct profiling into your code for complete control. First you need to import runtime/pprof. CPU profiling is controlled by two calls:

pprof.StartCPUProfile()
pprof.StopCPUProfile()

Memory profiling is done by calling runtime.GC() followed by pprof.WriteHeapProfile().

All the profiling functions accept a file handle that you are responsible for opening and closing appropriately.

The Sample Program

To see the profiler in action, I'll use a program that solves Project Euler's Problem 8. The problem is: given a 1,000-digit number, find the 13 adjacent digits within this number that have the largest product.

Here is a trivial solution that iterates over all the sequences of 13 digits, and for each such sequence it multiplies all 13 digits and returns the result. The largest result is stored and finally returned:

package trivial

import (
    "strings"
)

func calcProduct(series string) int64 {
	digits := make([]int64, len(series))
	for i, c := range series {
		digits[i] = int64(c) - 48
	}

	product := int64(1)
	for i := 0; i < len(digits); i++ {
		product *= digits[i]
	}

	return product
}

func FindLargestProduct(text string) int64 {
	text = strings.Replace(text, "\n", "", -1)
	largestProduct := int64(0)
	for i := 0; i < len(text); i++ {
		end := i + 13
		if end > len(text) {
			end = len(text)
		}
		series := text[i:end]
		result := calcProduct(series)
		if result > largestProduct {
			largestProduct = result
		}
	}

	return largestProduct
}

Later, after profiling, we will see some ways to improve the performance with another solution.

CPU Profiling

Let's profile the CPU of our program. I'll use the go test method using this test:

1	import (
2	"testing"
3	)
4
5	const text = `
6	73167176531330624919225119674426574742355349194934
7	96983520312774506326239578318016984801869478851843
8	85861560789112949495459501737958331952853208805511
9	12540698747158523863050715693290963295227443043557
10	66896648950445244523161731856403098711121722383113
11	62229893423380308135336276614282806444486645238749
12	30358907296290491560440772390713810515859307960866
13	70172427121883998797908792274921901699720888093776
14	65727333001053367881220235421809751254540594752243
15	52584907711670556013604839586446706324415722155397
16	53697817977846174064955149290862569321978468622482
17	83972241375657056057490261407972968652414535100474
18	82166370484403199890008895243450658541227588666881
19	16427171479924442928230863465674813919123162824586
20	17866458359124566529476545682848912883142607690042
21	24219022671055626321111109370544217506941658960408
22	07198403850962455444362981230987879927244284909188
23	84580156166097919133875499200524063689912560717606
24	05886116467109405077541002256983155200055935729725
25	71636269561882670428252483600823257530420752963450
26	`
27
28	func TestFindLargestProduct(t *testing.T) {
29	for i := 0; i < 100000; i++ {
30	res := FindLargestProduct(text)
31	expected := int64(23514624000)
32	if res != expected {
33	t.Errorf("Wrong!")
34	}
35	}
36	}

Note that I run the test 100,000 times because the go profiler is a sampling profiler that needs the code to actually spend some significant time (several milliseconds cumulative) on each line of code. Here is the command to prepare the profile:

1	go test -cpuprofile cpu.prof -bench .
2	ok _/github.com/the-gigi/project-euler/8/go/trivial 13.243s

It took a little over 13 seconds (for 100,000 iterations). Now, to view the profile, use the pprof go tool to get into the interactive prompt. There are many commands and options. The most basic command is topN; with the -cum option it shows the top N functions that took the most cumulative time to execute (so a function that takes very little time to execute, but is called many times, can be at the top). This is usually what I start with.

> go tool pprof cpu.prof
Type: cpu
Time: Oct 23, 2017 at 8:05am (PDT)
Duration: 13.22s, Total samples = 13.10s (99.06%)
Entering interactive mode (type "help" for commands)
(pprof) top5 -cum
Showing nodes accounting for 1.23s, 9.39% of 13.10s total
Dropped 76 nodes (cum <= 0.07s)
Showing top 5 nodes out of 53
 flat  flat%   sum%   cum   cum%
0.07s  0.53%  0.53%   10.64s 81.22%  FindLargestProduct
0      0%     0.53%   10.64s 81.22%  TestFindLargestProduct
0      0%     0.53%   10.64s 81.22%  testing.tRunner
1.07s  8.17%  8.70%   10.54s 80.46%  trivial.calcProduct
0.09s  0.69%  9.39%   9.47s  72.29%  runtime.makeslice

Let's understand the output. Each row represents a function. I elided the path to each function due to space constraints, but it will show in the real output as the last column.

Flat means the time (or percentage) spent within the function, and Cum stands for cumulative—the time spent within the function and all the functions it calls. In this case, testing.tRunner actually calls TestFindLargestProduct(), which calls FindLargestProduct(), but since virtually no time is spent there, the sampling profiler counts their flat time as 0.

Memory Profiling

Memory profiling is similar, except that you create a memory profile:

go test -memprofile mem.prof -bench .

PASS

ok      _/github.com/the-gigi/project-euler/8/go/trivial

You can analyze your memory usage using the same tool.

Using pprof to Optimize Your Program's Speed

Let's see what we can do to solve the problem faster. Looking at the profile, we see that calcProduct() takes 8.17% of the flat runtime, but makeSlice(), which is called from calcProduct(), is taking 72% (cumulative because it calls other functions). This gives a pretty good indication of what we need to optimize. What does the code do? For each sequence of 13 adjacent numbers, it allocates a slice:

1	func calcProduct(series string) int64 {
2	digits := make([]int64, len(series))
3	...

That's almost 1,000 times per run, and we run 100,000 times. Memory allocations are slow. In this case, there is really no need to allocate a new slice every time. Actually, there is no need to allocate any slice at all. We can just scan the input array.

The following code snippet shows how to compute the running product by simply dividing by the first digit of the previous sequence and multiplying by the cur digit.

if cur == 1 {
	currProduct /= old
	continue
}

if old == 1 {
	currProduct *= cur
} else {
	currProduct = currProduct / old * cur
}

if currProduct > largestProduct {
	largestProduct = currProduct
}

Here is a short list of some of the algorithmic optimizations:

Computing a running product. Suppose we computed the product at index N... N+13 and call it P(N). Now we need to compute the product at index N+1..N+13. P(N+1) equals to P(N) except that the first number at index N is gone and we need to take into account the new number at index N+14T. This can be done by dividing the previous product by its first number and multiplying by the new number.
Not computing any sequence of 13 numbers that contain 0 (the product will always be zero).
Avoiding division or multiplication by 1.

The complete program is here. There is some thorny logic to work around the zeros, but other than that it is pretty straightforward. The main thing is that we only allocate one array of 1000 bytes at the beginning, and we pass it by pointer (so no copy) to the findLargestProductInSeries() function with a range of indexes.

package scan


func findLargestProductInSeries(digits *[1000]byte, 
                                start, end int) int64 {
    if (end - start)  < 13 {
        return -1
    }

    largestProduct := int64((*digits)[start])
    for i := 1; i < 13 ; i++ {
        d := int64((*digits)[start + i])
        if d == 1 {
            continue
        }
        largestProduct *= d
    }

    currProduct := largestProduct
    for ii := start + 13; ii < end; ii++ {
        old := int64((*digits)[ii-13])
        cur := int64((*digits)[ii])
        if old == cur {
            continue
        }

        if cur == 1 {
            currProduct /= old
            continue
        }

        if old == 1 {
            currProduct *= cur
        } else {
            currProduct = currProduct / old * cur
        }

        if currProduct > largestProduct {
            largestProduct = currProduct
        }
    }

    return largestProduct
}


func FindLargestProduct(text string) int64 {
    var digits [1000]byte
    digIndex := 0
    for _, c := range text {
        if c == 10 {
            continue
        }
        digits[digIndex] = byte(c) - 48
        digIndex++
    }

    start := -1
    end := -1
    findStart := true
    var largestProduct int64
    for ii := 0; ii < len(digits) - 13; ii++ {
        if findStart {
            if digits[ii] == 0 {
                continue
            } else {
                start = ii
                findStart = false
            }
        }

        if digits[ii] == 0 {
            end = ii
            result := findLargestProductInSeries(&digits, 
                                                 start, 
                                                 end)
            if result > largestProduct {
                largestProduct = result
            }
            findStart = true
        }
    }

    return largestProduct
}

The test is the same. Let's see how we did with the profile:

1	> go test -cpuprofile cpu.prof -bench .
2	PASS
3	ok _/github.com/the-gigi/project-euler/8/go/scan 0.816s

Right off the bat, we can see that the runtime dropped from more than 13 seconds to less than one second. That's pretty good. Time to peek inside. Let's use just top10, which sorts by flat time.

(pprof) top10
Showing nodes accounting for 560ms, 100% of 560ms total
flat  flat%   sum%     cum   cum%
290ms 51.79%  51.79%   290ms 51.79%  findLargestProductInSeries
250ms 44.64%  96.43%   540ms 96.43%  FindLargestProduct 
20ms  3.57%   100%     20ms  3.57%   runtime.usleep 
0     0%      100%     540ms 96.43%  TestFindLargestProduct 
0     0%      100%     20ms  3.57%   runtime.mstart 
0     0%      100%     20ms  3.57%   runtime.mstart1 
0     0%      100%     20ms  3.57%   runtime.sysmon 
0     0%      100%     540ms 96.43%  testing.tRunner 

This is great. Pretty much the entire runtime is spent inside our code. No memory allocations at all. We can dive deeper and look at the statement level with the list command:

1	(pprof) list FindLargestProduct
2	Total: 560ms
3	ROUTINE ======================== scan.FindLargestProduct
4	250ms 540ms (flat, cum) 96.43% of Total
5	. . 44:
6	. . 45:
7	. . 46:func FindLargestProduct(t string) int64 {
8	. . 47: var digits [1000]byte
9	. . 48: digIndex := 0
10	70ms 70ms 49: for _, c := range text {
11	. . 50: if c == 10 {
12	. . 51: continue
13	. . 52: }
14	. . 53: digits[digIndex] = byte(c) - 48
15	10ms 10ms 54: digIndex++
16	. . 55: }
17	. . 56:
18	. . 57: start := -1
19	. . 58: end := -1
20	. . 59: findStart := true
21	. . 60: var largestProduct int64
22	. . 61: for ii := 0; ii < len(digits)-13; ii++ {
23	10ms 10ms 62: if findStart {
24	. . 63: if digits[ii] == 0 {
25	. . 64: continue
26	. . 65: } else {
27	. . 66: start = ii
28	. . 67: findStart = false
29	. . 68: }
30	. . 69: }
31	. . 70:
32	70ms 70ms 71: if digits[ii] == 0 {
33	. . 72: end = ii
34	20ms 310ms 73: result := f(&digits,start,end)
35	70ms 70ms 74: if result > largestProduct {
36	. . 75: largestProduct = result
37	. . 76: }
38	. . 77: findStart = true
39	. . 78: }
40	. . 79: }

This is pretty amazing. You get a statement by statement timing of all the important points. Note that the call on line 73 to function f() is actually a call to findLargestProductInSeries(), which I renamed in the profile due to space limitations. This call takes 20 ms. Maybe, by embedding the function code in place, we can save the function call (including allocating stack and copying arguments) and save those 20 ms. There may be other worthwhile optimizations that this view can help pinpoint.

Visualization

Looking at these text profiles can be difficult for large programs. Go gives you a lot of visualization options. You'll need to install Graphviz for the next section.

The pprof tool can generate output in many formats. One of the easiest ways (svg output) is simply to type 'web' from the pprof interactive prompt, and your browser will display a nice graph with the hot path marked in pink.

Flame Graphs

The built-in graphs are nice and helpful, but with large programs, even these graphs can be difficult to explore. One of the most popular tools for visualizing performance results is the flame graph. The pprof tool doesn't support it out of the box yet, but you can play with flame graphs already using Uber's go-torch tool. There is ongoing work to add built-in support for flame graphs to pprof.

Conclusion

Go is a systems programming language that is used to build high-performance distributed systems and data stores. Go comes with excellent support that keeps getting better for profiling your programs, analyzing their performance, and visualizing the results.

There is a lot of emphasis by the Go team and the community on improving the tooling around performance. The full source code with three different algorithms can be found on GitHub.