Benchmarking Data-Oriented Design

jollygood's blog

#data-oriented-design

Today we're going to have a look at the performance effects of data-oriented design. A common pattern in data-oriented design is the struct of arrays. Let's benchmark it against the more traditional array of structs to see the difference in performance.

# Short intro to data-oriented design

Data-oriented design focuses on data layout to utilize hardware efficiently, typically optimizing memory operations and CPU cache utilization.

To improve performance, we want many cache hits and as few cache misses as possible.

There is more to it, but that's out of scope for this post. I recommend watching the talks on data-oriented design by Andrew Kelley [1] and Mike Acton [2] if you want to learn more.

# Example: user data

Say we have a system where we store some user data. We'll benchmark a loop over all users, listing user email addresses that haven't been confirmed yet.

# Baseline: array of structs

Let's first make a baseline design by organizing user data as an array of structs:

	var users []user

	type user struct {
		ID                 int
		IsActive           bool
		Name               string
		Username           string
		PasswordHash       []byte
		MustChangePassword bool
		Email              string
		EmailConfirmed     bool
	}

	func getUnconfirmedEmailAddresses(users []user) []string {
		var unconfirmed []string
		for _, u := range users {
			if !u.EmailConfirmed {
				unconfirmed = append(unconfirmed, u.Email)
			}
		}
		return unconfirmed
	}

Now let's put on our data-oriented design glasses and think about the code above.

The for loop reads a user struct on every iteration in order to access the EmailConfirmed field (1 byte) and, occasionally, the Email field (a 16-byte string header). A user struct is 104 bytes, so, depending on alignment and compiler optimizations, we touch one or two cache lines (64 or 128 bytes) per iteration. That means one or two cache misses for every iteration, and most bytes of each cache line go to waste.
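
If you want to check the 104-byte figure yourself, Go's unsafe package can report struct sizes and field offsets. This little program is my own addition (it's not part of the post's benchmark code) and assumes a 64-bit platform:

	package main

	import (
		"fmt"
		"unsafe"
	)

	type user struct {
		ID                 int
		IsActive           bool
		Name               string
		Username           string
		PasswordHash       []byte
		MustChangePassword bool
		Email              string
		EmailConfirmed     bool
	}

	func main() {
		var u user
		// On a 64-bit platform: 8 (int) + 1+7 (bool + padding) + 16 + 16 (two string
		// headers) + 24 (slice header) + 1+7 (bool + padding) + 16 (string header)
		// + 1+7 (bool + padding) = 104 bytes.
		fmt.Println("sizeof(user):", unsafe.Sizeof(u))
		fmt.Println("offsetof(Email):", unsafe.Offsetof(u.Email))
		fmt.Println("offsetof(EmailConfirmed):", unsafe.Offsetof(u.EmailConfirmed))
	}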

Let's assume that the majority of users have already confirmed their email address, so it's very rare that we actually have to read the Email field for a user. Hmmm... What if we could read a cache line full of those EmailConfirmed booleans?

This leads us to using a struct of arrays.

# Data-oriented design: struct of arrays

In the baseline we collected all data for a specific user into a struct. But as we noted above, we could utilize cache lines more efficiently if we had an array of EmailConfirmed booleans for all users. Let's pivot our data structure so that there is one array per user data field, where each array contains that field's data for all users:

	var users userData

	type userData struct {
		ID                 []int
		IsActive           []bool
		Name               []string
		Username           []string
		PasswordHash       [][]byte
		MustChangePassword []bool
		Email              []string
		EmailConfirmed     []bool
	}

	func (u userData) getUnconfirmedEmailAddresses() []string {
		var unconfirmed []string
		for i, confirmed := range u.EmailConfirmed {
			if !confirmed {
				unconfirmed = append(unconfirmed, u.Email[i])
			}
		}
		return unconfirmed
	}

Now we're looping over an array of booleans, 1 byte per element, i.e. a cache line contains 64 elements. This means we get a cache hit 63 times out of 64. There is a chance that the occasional Email read evicts EmailConfirmed data from the cache, but by analyzing our data we're pretty confident that there are so few unconfirmed email addresses that this is not a problem.
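
As a side note: if you already have data in the baseline layout, pivoting it into the struct-of-arrays layout is a mechanical transformation. The helper below is my own sketch, not part of the post's benchmark code, and reuses the user and userData types defined above:

	// pivot converts the array-of-structs layout into the struct-of-arrays layout.
	func pivot(users []user) userData {
		d := userData{
			ID:                 make([]int, 0, len(users)),
			IsActive:           make([]bool, 0, len(users)),
			Name:               make([]string, 0, len(users)),
			Username:           make([]string, 0, len(users)),
			PasswordHash:       make([][]byte, 0, len(users)),
			MustChangePassword: make([]bool, 0, len(users)),
			Email:              make([]string, 0, len(users)),
			EmailConfirmed:     make([]bool, 0, len(users)),
		}
		for _, u := range users {
			d.ID = append(d.ID, u.ID)
			d.IsActive = append(d.IsActive, u.IsActive)
			d.Name = append(d.Name, u.Name)
			d.Username = append(d.Username, u.Username)
			d.PasswordHash = append(d.PasswordHash, u.PasswordHash)
			d.MustChangePassword = append(d.MustChangePassword, u.MustChangePassword)
			d.Email = append(d.Email, u.Email)
			d.EmailConfirmed = append(d.EmailConfirmed, u.EmailConfirmed)
		}
		return d
	}

In a real system you would probably build and maintain the data in the struct-of-arrays layout directly rather than converting on the fly.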

# Benchmark and analysis

The point of this whole exercise is to see if data-oriented design really works. So I wrote a benchmark [3], looping over a million users' data. (CPU caches are hard to measure exactly, so I figured that a large number of users would give a better signal-to-noise ratio.)
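
The benchmark source is linked at the end of the post. To give an idea of the setup, here is a minimal sketch of what the two benchmark functions could look like with Go's testing package. The function names, the data-generation helper, and the data distribution are my assumptions, not the author's code, and I've left out whatever produces the MB/s and MB/op columns in the output below (in the standard testing package that would be b.SetBytes and b.ReportMetric). The sketch reuses the types and functions defined above:

	package main

	import (
		"fmt"
		"testing"
	)

	// sink keeps the compiler from optimizing the benchmarked call away.
	var sink []string

	// makeTestData builds the same synthetic data set in both layouts.
	// The distribution (every 100th email unconfirmed) is an assumption.
	func makeTestData(n int) ([]user, userData) {
		aos := make([]user, n)
		soa := userData{
			Email:          make([]string, n),
			EmailConfirmed: make([]bool, n),
		}
		for i := 0; i < n; i++ {
			email := fmt.Sprintf("user%d@example.com", i)
			confirmed := i%100 != 0
			aos[i] = user{ID: i, Email: email, EmailConfirmed: confirmed}
			soa.Email[i] = email
			soa.EmailConfirmed[i] = confirmed
		}
		return aos, soa
	}

	func BenchmarkUnconfirmedEmailBaseline(b *testing.B) {
		aos, _ := makeTestData(1_000_000)
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			sink = getUnconfirmedEmailAddresses(aos)
		}
	}

	func BenchmarkUnconfirmedEmailDataOriented(b *testing.B) {
		_, soa := makeTestData(1_000_000)
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			sink = soa.getUnconfirmedEmailAddresses()
		}
	}

Both function names contain UnconfirmedEmail, so they match the -bench=UnconfirmedEmail filter used below.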

	$ go test -bench=UnconfirmedEmail data-oriented-design-benchmarks_test.go
	              Iterations   Time             Memory throughput   Data processed
	baseline             100   10673319 ns/op   9743.92 MB/s        99.18 MB/op
	data-oriented        582    2255190 ns/op    515.10 MB/s         1.108 MB/op

We can see that data-oriented design really improves performance: it's about 5 times faster! The baseline took about 10 ms to complete while the data-oriented version took about 2 ms. Also, the baseline processed 99 MB of user data per operation while the data-oriented version only had to process about 1 MB.

An interesting thing to note is that the memory throughput is an order of magnitude lower for the data-oriented design than for the baseline. An educated guess is that the baseline, streaming roughly 10 GB/s through the CPU, spends most of its cycles waiting on memory, while the data-oriented design touches so little data that its cycles go to actual computation instead. This indicates that memory speed probably is a bottleneck for the baseline but not so much for the data-oriented design, which was the whole point of using data-oriented design!

# Summary

In conclusion, data-oriented design works and it's pretty easy to write a benchmark to measure the performance improvements.

There is much more to data-oriented design than what we've done in this example. However, a good rule of thumb is to design the data layout with locality in mind. Pieces of data that are used near each other in time (e.g. in the same loop) should be located near each other in memory (e.g. as elements of an array).

See the link to the benchmark source code below and stay tuned for part 2!