Data locality is a key part of writing fast data science code.
The core idea is simple: your data starts out in RAM (or on disk), and to do anything useful with it, you need to move it to the CPU. That transfer is slow relative to the arithmetic itself, so you want to minimize how much data you move.
Say you have pairs of heights and weights, and you want to get the sum of all the weights. Let's start by generating 40 million pairs:
num_rows = 40000000
struct HeightAndWeight
    height::Int
    weight::Int
end
heights_and_weights = [HeightAndWeight(row[1], row[2]) for row in eachrow(rand(50:150, num_rows, 2))];
This gives us an array of HeightAndWeights like so:
40000000-element Array{HeightAndWeight,1}:
HeightAndWeight(138, 84)
HeightAndWeight(140, 136)
HeightAndWeight(87, 137)
HeightAndWeight(109, 143)
One approach is to iterate through each row and add the weight to a counter.
function my_sum(data)
    sum_weights = 0
    for row in data
        sum_weights += row.weight
    end
    return sum_weights
end
Benchmarking this, it takes about 52 ms
using BenchmarkTools
@btime(my_sum(heights_and_weights))
> 52.535 ms (1 allocation: 16 bytes)
What if our heights and weights were stored in separate arrays, and we passed in just the weights?
just_heights = [row.height for row in heights_and_weights]
just_weights = [row.weight for row in heights_and_weights]
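The post doesn't show my_sum2, but it's presumably the same loop written against a plain vector of weights; a minimal sketch:

function my_sum2(data)
    sum_weights = 0
    for weight in data
        sum_weights += weight
    end
    return sum_weights
end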
@btime(my_sum2(just_weights))
> 23.495 ms (1 allocation: 16 bytes)
This is about twice as fast, modulo noise! We're doing exactly the same number of additions, but we're transferring only half as much data, and that data transfer is where most of the time goes.
When we have an array of HeightAndWeight structs, it's very difficult to avoid passing in extra data. If the struct had even more fields, it would be even slower, despite the fact that we're adding up the same number of weights. This is a common situation with object-oriented programming or row-oriented tables.
The pattern of creating a collection of many HeightAndWeight objects is known as Array-of-Structs, while the pattern of having several parallel arrays is known as Struct-of-Arrays (https://en.wikipedia.org/wiki/AoS_and_SoA). Array-of-Structs is typically slower because of the increased data transfer, and the difference is most prominent in areas of your code where you're iterating over large collections of wide objects with many fields. Consider refactoring those to Struct-of-Arrays.
Note that this isn't an unconditional efficiency win. If you frequently do things with these objects that use both the height and the weight from a large random selection of objects (or all the objects in a random order), you will get better memory locality from the array of structs than from the struct of arrays. I don't know how common that situation is. I suspect there are situations where you actually want a struct of arrays of structs; e.g., when you use the coordinates of an object you probably use them all at once, but often you use the coordinates without using the hitpoint count (in a game) or the population (in a GIS) or whatever.
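Something like this hypothetical game-entity layout, say:

# Hybrid layout: coordinates are usually read together, so they stay bundled
# in a small struct, while hitpoints get their own column.
struct Point
    x::Float64
    y::Float64
end

struct Entities
    positions::Vector{Point}   # array of structs, for fields used together
    hitpoints::Vector{Int}     # separate column, for a field used on its own
end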
Very cool, thanks for writing this up. Hard-to-predict access in loops is an interesting case, and it makes sense that AoS would beat SoA there.
Yeah, SIMD is a significant point I forgot to mention: with a Struct-of-Arrays layout the values you're summing sit contiguously in memory, which makes it much easier for the compiler to vectorize the loop.
It's a fair amount of work to switch between SoA and AoS in most cases, which makes benchmarking hard!
StructArrays.jl makes this pretty doable in Julia, and Jonathan Blow talks about making it simple to switch between SoA and AoS in his programming language Jai. I would definitely like to see more languages making it easy to just try one and benchmark the results.
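For reference, a minimal StructArrays.jl sketch, reusing the heights_and_weights array from the post (I haven't re-run the timings):

using StructArrays

# Wrap the existing Array-of-Structs; each field is now stored in its own
# contiguous array, but indexing still returns HeightAndWeight values.
sa = StructArray(heights_and_weights)

# sa.weight is a plain Vector{Int}, so the summation only touches that column.
my_sum2(sa.weight)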