Comment Permalink

gjm4y50

Looks like it is marginally quicker in the first of those cases. Note that you're iterating over the objects linearly, which means that the processor's memory-access prediction hardware will have a nice easy job; and you're iterating over the whole thing without repeating any, which means that all the cache is buying you is efficient access to prefetched bits. After defining

function mss(data::Vector{HeightAndWeight})
    s = 0
    for i=1:40000000
        j=((i*i)%40000000)+1
        @inbounds s += data[j].weight * data[j].height
    end
    return s
end

function mss2(heights::Vector{Int},weights::Vector{Int})
    s = 0
    for i=1:40000000
        j=((i*i)%40000000)+1
        @inbounds s += weights[j] * heights[j]
    end
    return s
end

(mss for "my scattered sum"; the explicit type declarations, literal size and @inbounds made it run faster on my machine, and getting rid of overhead seems like a good idea for such comparisons; the squaring is just a simple way to get something not too predictable) I got the following timings:

julia> @btime(mss(heights_and_weights))
  814.056 ms (0 allocations: 0 bytes)
400185517392

julia> @btime(mss2(just_heights,just_weights))
  1.253 s (0 allocations: 0 bytes)
400185517392

so the array-of-structs turns out to work quite a lot better in this case. (Because we're doing half the number of hard-to-predict memory accesses.)

Note how large those timings are, by the way. If I just process every row in order, I get 47.9ms for array-of-structs and 42.6ms for struct-of-arrays. 40.3ms if I use zip as you did instead of writing array accesses explicitly, which is interesting; I'm not that surprised the compiler can eliminate the superficial inefficiencies of the zip-loop, but I'm surprised it ends up strictly better rather than exactly the same. Anyway, this is the other way around from your example: struct-of-arrays is faster for me in that situation. But when we process things in "random" order, it's 20-30x slower because we no longer get to make much use of the cache or memory-access prediction, and then array-of-structs wins handily.

... Actually, looking at the generated code, those aren't the only reasons. It's also that when iterating over the whole array linearly the compiler can generate nice SIMD code. Cute. [EDITED to add:] Of course this might also be part of the story when comparing SoA and AoS in the linear-iteration case; SoA may be more vectorization-friendly. Of course this is only relevant to some possible workloads; if you're doing something much more complicated than a long string of multiply-accumulates, SIMD may no longer be a factor and the relationship between the two approaches may change. The moral: Benchmark, benchmark, benchmark!

Reply

See in context

35 Moving Data Around is Slow

by SatvikBeri

22nd Mar 2021

2 min read

10

35

Data locality is a key part of writing fast data science code.

The core idea is simple: your data starts out in RAM (or disk), and to actually do anything useful with it, you need to move it to the CPU. This is actually pretty slow, so you want to minimize this transfer.

Say you have pairs of heights and weights, and you want to get the sum of all the weights. Let's start by generating 40 million pairs:

num_rows = 40000000
struct HeightAndWeight
    height::Int
    weight::Int
end
heights_and_weights = [HeightAndWeight(row[1], row[2]) for row in eachrow(rand(50:150, num_rows, 2))];

This gives us an array of HeightAndWeights like so:

40000000-element Array{HeightAndWeight,1}:
 HeightAndWeight(138, 84)
 HeightAndWeight(140, 136)
 HeightAndWeight(87, 137)
 HeightAndWeight(109, 143)

One approach is to iterate through each row and add the weight to a counter.

function my_sum(data)
    sum_weights = 0
    for row in data
        sum_weights += row.weight
    end
    return sum_weights
end

Benchmarking this, it takes about 52 ms

using BenchmarkTools
@btime(my_sum(heights_and_weights))
>  52.535 ms (1 allocation: 16 bytes)

What if our heights and weights were separate, and we just pass the weights?

just_heights = [row.height for row in heights_and_weights]
just_weights = [row.weight for row in heights_and_weights]
@btime(my_sum2(just_weights))
>   23.495 ms (1 allocation: 16 bytes)

This is about twice as fast, modulo noise! Despite the fact that we're doing exactly the same number of additions, we're transferring half the data, which takes the majority of the time.

When we have an array of HeightAndWeight structs, it's very difficult to avoid passing in extra data. If the struct had even more fields, it would be even slower, despite the fact that we're adding up the same number of weights. This is a common situation with object-oriented programming or row-oriented tables.

The pattern of creating a collection of many HeightAndWeight objects is known as Array-of-Structs, while the pattern of having several arrays is known as Struct-of-Arrays (https://en.wikipedia.org/wiki/AoS_and_SoA .) Array-of-Structs is typically slower because of the increased data transfer, and the difference is most prominent in areas of your code where you're iterating over large collections of wide objects with many fields. Consider refactoring those to Struct-of-Arrays.

Data ScienceProgrammingWorld Modeling

Frontpage

35

New Comment

10 comments, sorted by

top scoring

Click to highlight new comments since: Today at 7:36 PM

[-]Gunnar_Zarncke4y90

But beware of premature optimization.

Reply

[-]SatvikBeri4y170

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." – Donald Knuth

Reply

[-]Gunnar_Zarncke4y30

That would be my preferred quote too.

Reply

[-]gjm4y60

Note that this isn't an unconditional efficiency win. If you frequently do things with these objects that use both the height and the width from a large random selection of objects (or all the objects in a random order), you will get better memory locality from the array of structs than from the struct of arrays. I don't know how common that situation is. I suspect there are situations where actually you want a struct of arrays of structs; e.g., when you use the coordinates of an object you probably use them all at once, but often you use the coordinates without using the hitpoint-count (in a game) or the population (in a GIS) or whatever.

Reply

[-]SatvikBeri4y60

I see where that intuition comes from, and at first I thought that would be the case. But the machine is very good at iterating through pairs of arrays. Continuing the previous example:

function my_double_sum(data)
    sum_heights_and_weights = 0
    for row in data
        sum_heights_and_weights += row.weight + row.height
    end
    return sum_heights_and_weights
end
@btime(my_double_sum(heights_and_weights))
>   50.342 ms (1 allocation: 16 bytes)

function my_double_sum2(heights, weights)
    sum_heights_and_weights = 0
    for (height, weight) in zip(heights, weights)
        sum_heights_and_weights += height + weight
    end
    return sum_heights_and_weights
end
@btime(my_double_sum2(just_heights, just_weights))
>   51.060 ms (1 allocation: 16 bytes)

Reply

[-]gjm4y50

Looks like it is marginally quicker in the first of those cases. Note that you're iterating over the objects linearly, which means that the processor's memory-access prediction hardware will have a nice easy job; and you're iterating over the whole thing without repeating any, which means that all the cache is buying you is efficient access to prefetched bits. After defining

function mss(data::Vector{HeightAndWeight})
    s = 0
    for i=1:40000000
        j=((i*i)%40000000)+1
        @inbounds s += data[j].weight * data[j].height
    end
    return s
end

function mss2(heights::Vector{Int},weights::Vector{Int})
    s = 0
    for i=1:40000000
        j=((i*i)%40000000)+1
        @inbounds s += weights[j] * heights[j]
    end
    return s
end

(mss for "my scattered sum"; the explicit type declarations, literal size and @inbounds made it run faster on my machine, and getting rid of overhead seems like a good idea for such comparisons; the squaring is just a simple way to get something not too predictable) I got the following timings:

julia> @btime(mss(heights_and_weights))
  814.056 ms (0 allocations: 0 bytes)
400185517392

julia> @btime(mss2(just_heights,just_weights))
  1.253 s (0 allocations: 0 bytes)
400185517392

so the array-of-structs turns out to work quite a lot better in this case. (Because we're doing half the number of hard-to-predict memory accesses.)

Note how large those timings are, by the way. If I just process every row in order, I get 47.9ms for array-of-structs and 42.6ms for struct-of-arrays. 40.3ms if I use zip as you did instead of writing array accesses explicitly, which is interesting; I'm not that surprised the compiler can eliminate the superficial inefficiencies of the zip-loop, but I'm surprised it ends up strictly better rather than exactly the same. Anyway, this is the other way around from your example: struct-of-arrays is faster for me in that situation. But when we process things in "random" order, it's 20-30x slower because we no longer get to make much use of the cache or memory-access prediction, and then array-of-structs wins handily.

... Actually, looking at the generated code, those aren't the only reasons. It's also that when iterating over the whole array linearly the compiler can generate nice SIMD code. Cute. [EDITED to add:] Of course this might also be part of the story when comparing SoA and AoS in the linear-iteration case; SoA may be more vectorization-friendly. Of course this is only relevant to some possible workloads; if you're doing something much more complicated than a long string of multiply-accumulates, SIMD may no longer be a factor and the relationship between the two approaches may change. The moral: Benchmark, benchmark, benchmark!

Reply

[-]SatvikBeri4y40

Very cool, thanks for writing this up. Hard-to-predict access in loops is an interesting case, and it makes sense that AoS would beat SoA there.

Yeah, SIMD is a significant point I forgot to mention.

It's a fair amount of work to switch between SoA and AoS in most cases, which makes benchmarking hard! StructArrays.jl makes this pretty doable in Julia, and Jonathan Blow talks about making it simple to switch between SoA and AoS in his programming language Jai. I would definitely like to see more languages making it easy to just try one and benchmark the results.

Reply

[-]wolajacy4y30

AFAIK, popular data science tools (Spark, Pandas, etc.) already use columnar formats for data serialization and network-based communication: https://en.wikipedia.org/wiki/Apache_Arrow

Similiar idea for disk storage (which is again orders of magnitude slower, so the gains in certain situations might be even bigger): https://en.wikipedia.org/wiki/Apache_Parquet

Generally, if you're doing big data, there are actually more benefits from using this layout - data homogenity means much better compression and possibilities for smarter encodings.

Reply

[-]SatvikBeri4y30

Yup, these are all reasons to prefer column orientation over row orientation for analytics workloads. In my opinion data locality trumps everything but compression and fast transmission is definitely very nice.

Until recently, numpy and pandas were row oriented, and this was a major bottleneck. A lot of pandas's strange API is apparently due to working around row orientation. See e.g. this article by Wes McKinney, creator of pandas: https://wesmckinney.com/blog/apache-arrow-pandas-internals/#:~:text=Arrow's%20C%2B%2B%20implementation%20provides%20essential,optimized%20for%20analytical%20processing%20performance

Reply

[-]Taran4y30

Game programmers love this trick too, and for the same reasons: you're typically not doing elaborate computations to small amounts of data, you're doing simple computations (position updates, distance comparisons, &c) with large amounts of data. Paring away unnecessary memory transfers is a big part of how to make that kind of computation go fast.

Reply

Moderation Log