Today we're going to look at the performance effects of data-oriented design. A common pattern in data-oriented design is the struct of arrays. Let's benchmark it against the more traditional array of structs to see the difference in performance.
Short intro to data-oriented design #
Data-oriented design focuses on data layout to utilize hardware efficiently, typically optimizing memory operations and CPU cache utilization.
- CPUs are fast; main memory access is slow.
- CPUs have caches, and memory access happens through cache lines.
- Cache lines are fixed-size blocks of memory, typically 64 bytes. This is the smallest amount of memory you can access in one operation.
To improve performance, we want many cache hits and as few cache misses as possible.
There is more to it, but that's out of scope for this post. I recommend watching the talks on data-oriented design by Andrew Kelley [1] and Mike Acton [2] if you want to learn more.
Example: user data #
Say we have a system where we store some user data. We'll benchmark a loop over all users, listing user email addresses that haven't been confirmed yet.
Baseline: array of structs #
Let's first make a baseline design by organizing user data as an array of structs:
```go
var users []user

type user struct {
	ID                 int
	IsActive           bool
	Name               string
	Username           string
	PasswordHash       []byte
	MustChangePassword bool
	Email              string
	EmailConfirmed     bool
}

func getUnconfirmedEmailAddresses(users []user) []string {
	var unconfirmed []string
	for _, u := range users {
		if !u.EmailConfirmed {
			unconfirmed = append(unconfirmed, u.Email)
		}
	}
	return unconfirmed
}
```
Now let's put on our data-oriented design glasses and think about the code above.
The for loop reads a user struct on every iteration in order to access the EmailConfirmed field (1 byte) and occasionally the Email field (16-byte string header).

The size of a user struct is 104 bytes, so depending on compiler optimizations, we read one or two cache lines (64 or 128 bytes) on each iteration. This means one or two cache misses for every iteration, and most bytes of each cache line go to waste.
Let's assume that the majority of users have already confirmed their email address, so it's very rare that we actually have to read the Email field for a user.

Hmmm... What if we could read a cache line full of those EmailConfirmed booleans?

This leads us to using a struct of arrays.
Data-oriented design: struct of arrays #
In the baseline we collected all data for a specific user into a struct. But as we noted above, we could utilize cache lines more efficiently if we had an array of EmailConfirmed booleans for all users.

Let's pivot our data structure to one array per user data field, where each array contains that field's data for all users:
```go
var users userData

type userData struct {
	ID                 []int
	IsActive           []bool
	Name               []string
	Username           []string
	PasswordHash       [][]byte
	MustChangePassword []bool
	Email              []string
	EmailConfirmed     []bool
}

func (u userData) getUnconfirmedEmailAddresses() []string {
	var unconfirmed []string
	for i, confirmed := range u.EmailConfirmed {
		if !confirmed {
			unconfirmed = append(unconfirmed, u.Email[i])
		}
	}
	return unconfirmed
}
```
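One practical cost of this layout is that the parallel slices must be kept in sync by hand: index i must refer to the same user in every slice. A hypothetical addUser helper (my own sketch, not part of the original code) makes that explicit by appending to every field at once:

```go
package main

import "fmt"

// userData as defined above: one slice per field, index i is one user.
type userData struct {
	ID                 []int
	IsActive           []bool
	Name               []string
	Username           []string
	PasswordHash       [][]byte
	MustChangePassword []bool
	Email              []string
	EmailConfirmed     []bool
}

// addUser appends one user's fields to every slice so all slices stay
// the same length; forgetting a field here would silently misalign
// every user added after this one.
func (u *userData) addUser(id int, name, username, email string) {
	u.ID = append(u.ID, id)
	u.IsActive = append(u.IsActive, true)
	u.Name = append(u.Name, name)
	u.Username = append(u.Username, username)
	u.PasswordHash = append(u.PasswordHash, nil)
	u.MustChangePassword = append(u.MustChangePassword, false)
	u.Email = append(u.Email, email)
	u.EmailConfirmed = append(u.EmailConfirmed, false)
}

func main() {
	var u userData
	u.addUser(1, "Ada", "ada", "ada@example.com")
	fmt.Println(len(u.Email), u.EmailConfirmed[0]) // 1 false
}
```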
Now we're looping over an array of booleans, 1 byte per element, so a cache line holds 64 elements. This means we get a cache hit 63 times out of 64.

There is a chance that the occasional Email read evicts EmailConfirmed data from the cache, but by analyzing our data we're pretty confident that there are so few unconfirmed email addresses that this is not a problem.
Benchmark and analysis #
The point of this whole exercise is to see if data-oriented design really works. So I wrote a benchmark [3], looping over a million users' data. (CPU caches are hard to measure exactly, so I figured that a large number of users would give a better signal-to-noise ratio.)
```
$ go test -bench=UnconfirmedEmail data-oriented-design-benchmarks_test.go

                Iterations   Time             Memory throughput   Data processed
baseline        100          10673319 ns/op   9743.92 MB/s        99.18 MB/op
data-oriented   582          2255190 ns/op    515.10 MB/s         1.108 MB/op
```
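For reference, the baseline side of such a benchmark could look roughly like this. This is my own sketch, not the linked source: makeUsers, the one-in-a-thousand unconfirmed rate, and using b.SetBytes for the throughput column are all assumptions. It would live in a _test.go file:

```go
package main

import (
	"fmt"
	"testing"
	"unsafe"
)

type user struct {
	ID                 int
	IsActive           bool
	Name               string
	Username           string
	PasswordHash       []byte
	MustChangePassword bool
	Email              string
	EmailConfirmed     bool
}

func getUnconfirmedEmailAddresses(users []user) []string {
	var unconfirmed []string
	for _, u := range users {
		if !u.EmailConfirmed {
			unconfirmed = append(unconfirmed, u.Email)
		}
	}
	return unconfirmed
}

// makeUsers builds n users where roughly 1 in 1000 has an
// unconfirmed email address (an assumed distribution).
func makeUsers(n int) []user {
	users := make([]user, n)
	for i := range users {
		users[i].Email = fmt.Sprintf("user%d@example.com", i)
		users[i].EmailConfirmed = i%1000 != 0
	}
	return users
}

func BenchmarkUnconfirmedEmailBaseline(b *testing.B) {
	users := makeUsers(1_000_000)
	// SetBytes makes `go test -bench` report MB/s: the bytes of
	// user structs the loop touches per benchmark iteration.
	b.SetBytes(int64(len(users)) * int64(unsafe.Sizeof(user{})))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		getUnconfirmedEmailAddresses(users)
	}
}
```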
We can see that data-oriented design really improves performance: it's almost 5 times faster! The baseline took about 10 ms per operation while the data-oriented version took about 2 ms. Also, the baseline processed 99 MB of user data per operation while the data-oriented design only had to process about 1 MB.
An interesting thing to note is that the memory throughput is an order of magnitude lower for the data-oriented design than for the baseline. An educated guess: the data-oriented design spends more of its CPU cycles on computation and fewer on waiting for memory operations. This indicates that memory speed is probably a bottleneck for the baseline but much less so for the data-oriented design, which was exactly the goal of using data-oriented design!
Summary #
In conclusion, data-oriented design works and it's pretty easy to write a benchmark to measure the performance improvements.
There is much more to data-oriented design than what we've done in this example. However, a good rule of thumb is to design the data layout with locality in mind. Pieces of data that are used near each other in time (e.g. in the same loop) should be located near each other in memory (e.g. as elements of an array).
See the link to the benchmark source code below and stay tuned for part 2!