Two questions about pointers and performance

While implementing a 2D hierarchical hash grid (spatial hashing) data structure in Swift and trying to optimize this function (defined only for i in 0 ... 6):

func globalCellIndex(forGridIndex i: UInt8) -> UInt32 {
    let mask = ~(UInt32(0b11111111111111111) &>> (i &* 2))
    return UInt32(0b10101010101000000) & mask
}

I ran into the following two questions.


Further background details:

The hierarchical hash grid is optimized for speed and a specific use case, so it's hardcoded to work only with bounding circles that have their centers in 0 ..< 256 and radii in 0.25 ..< 32.

                                                      global
gridIndex  cellSize   resolution     radius range    cellIndex
     0          1      256 x 256     0.25 ..<  0.5        0
     1          2      128 x 128      0.5 ..<  1.0    65536
     2          4       64 x  64      1.0 ..<  2.0    81920
     3          8       32 x  32      2.0 ..<  4.0    86016
     4         16       16 x  16      4.0 ..<  8.0    87040
     5         32        8 x   8      8.0 ..< 16.0    87296
     6         64        4 x   4     16.0 ..< 32.0    87360

The globalCellIndex is the start index for a given grid, and is calculated as follows:

globalCellIndex(forGridIndex: 0) == 0
globalCellIndex(forGridIndex: 1) == 65536 == 256*256
globalCellIndex(forGridIndex: 2) == 81920 == 256*256 + 128*128
...
globalCellIndex(forGridIndex: 6) == 87360 == 256*256 + 128*128 + 64*64 + 32*32 + 16*16 + 8*8 + 4*4
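For reference, the same start indices can be computed naively by summing the cell counts of all finer grids, since grid i has (256 >> i) x (256 >> i) cells (this helper is mine, not part of the original code):

func globalCellIndexReference(forGridIndex i: UInt8) -> UInt32 {
    var index: UInt32 = 0
    for j in 0 ..< UInt32(i) {
        let cellsPerSide = UInt32(256) >> j   // 256, 128, 64, ...
        index += cellsPerSide * cellsPerSide
    }
    return index
}

// globalCellIndexReference(forGridIndex: 2) == 65536 + 16384 == 81920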

This function can be written like this:

func globalCellIndex(forGridIndex i: UInt8) -> UInt32 {
    let mask = ~(UInt32(0b11111111111111111) &>> (i &* 2))
    return UInt32(0b10101010101000000) & mask
}

The following makes it clear how it works:

0 ->     0 == 0b00000000000000000
1 -> 65536 == 0b10000000000000000
2 -> 81920 == 0b10100000000000000
3 -> 86016 == 0b10101000000000000
4 -> 87040 == 0b10101010000000000
5 -> 87296 == 0b10101010100000000
6 -> 87360 == 0b10101010101000000
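To spell out one case (my annotation, not part of the original post), here are the intermediate values for i == 3, where the shift amount is 3 &* 2 == 6:

// UInt32(0b11111111111111111)   == 0b00000000000000011111111111111111
// &>> 6                         == 0b00000000000000000000011111111111
// ~ (bitwise NOT; this is mask) == 0b11111111111111111111100000000000
// & 0b10101010101000000         == 0b00000000000000010101000000000000 == 86016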

Question 1

I noticed that the following alternative formulation of the function is a bit faster:

func globalCellIndexB(forGridIndex i: UInt8) -> UInt32 {
    let results = (UInt32(0), UInt32(65536), UInt32(81920), UInt32(86016),
                   UInt32(87040), UInt32(87296), UInt32(87360))
    return withUnsafeBytes(of: results) {
        $0.bindMemory(to: UInt32.self)[Int(i)]
    }
}

according to this benchmark program:
import Foundation

func globalCellIndex(forGridIndex i: UInt8) -> UInt32 {
    let mask = ~(UInt32(0b11111111111111111) &>> (i &* 2))
    return UInt32(0b10101010101000000) & mask
}

func globalCellIndexB(forGridIndex i: UInt8) -> UInt32 {
    let results = (UInt32(0), UInt32(65536), UInt32(81920), UInt32(86016),
                   UInt32(87040), UInt32(87296), UInt32(87360))
    return withUnsafeBytes(of: results) {
        $0.bindMemory(to: UInt32.self)[Int(i)]
    }
}

func test() {
    print("Generating 8MB of random gridIndices (bytes in range 0 ... 6) …")
    let count = 1024*1024*8
    let gridIndices: [UInt8] = (0 ..< count).map { _ in .random(in: 0 ... 6) }
    print("Done! Performing test, 5 times:")
    var checksum: UInt32 = 0
    for _ in 0 ..< 5 {
        let t0 = DispatchTime.now().uptimeNanoseconds
        for gi in gridIndices {
            checksum &+= globalCellIndex(forGridIndex: gi)
            //checksum &+= globalCellIndexB(forGridIndex: gi)
        }
        let t1 = DispatchTime.now().uptimeNanoseconds
        print(Double(t1-t0)/1e9, "seconds (\(checksum))")
    }
}
test()

Can anyone explain why this function is faster than the former, bit-manipulating one? I did look at them in Godbolt, but I lack the skills needed to make sense of their disassembly.


Question 2

Is there any difference I should know about between the following three ways of writing the read-data-from-pointer code? They are equivalent (all three result in identical code being generated) according to Godbolt:

func globalCellIndexB(forGridIndex i: UInt8) -> UInt32 {
    let results = (UInt32(0), UInt32(65536), UInt32(81920), UInt32(86016),
                   UInt32(87040), UInt32(87296), UInt32(87360))
    return withUnsafeBytes(of: results) {

        // Use exactly one of the following three alternatives:

        // Alternative 1:
        $0.bindMemory(to: UInt32.self)[Int(i)]

        // Alternative 2:
        // $0.baseAddress!.assumingMemoryBound(to: UInt32.self)[Int(i)]

        // Alternative 3:
        // $0.load(fromByteOffset: Int(i) &* 4, as: UInt32.self)

    }
}
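As a quick sanity check (this snippet is mine, not from the thread), all three alternatives do read the same element for every valid index:

let results = (UInt32(0), UInt32(65536), UInt32(81920), UInt32(86016),
               UInt32(87040), UInt32(87296), UInt32(87360))
withUnsafeBytes(of: results) { raw -> Void in
    for i in 0 ... 6 {
        let a = raw.bindMemory(to: UInt32.self)[i]                       // Alternative 1
        let b = raw.baseAddress!.assumingMemoryBound(to: UInt32.self)[i] // Alternative 2
        let c = raw.load(fromByteOffset: i * 4, as: UInt32.self)         // Alternative 3
        assert(a == b && b == c)
    }
}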

The first (bit-manipulating) version of the function can be rewritten, for example, like this:

func globalCellIndex(forGridIndex i: UInt8) -> UInt32 {
    return (UInt32(0x15540000) &>> (i &<< 1)) & UInt32(0x1ffff)
}

but it doesn't make any difference to its speed, so the second version (with the pointer-code) is still faster.
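In case the packed constants look opaque, here is a short breakdown (my annotation, not from the original post) of how that single-constant version encodes the same table:

// 0x15540000 == 0b10101010101 << 18: the accumulated "0b10101..." pattern parked
// in the high bits; 0x1ffff keeps the low 17 bits after the shift.
//
// i == 0:  0x15540000 &>> 0   == 0x15540000;  & 0x1ffff == 0x00000 ==     0
// i == 1:  0x15540000 &>> 2   == 0x05550000;  & 0x1ffff == 0x10000 == 65536
// i == 6:  0x15540000 &>> 12  == 0x00015540;  & 0x1ffff == 0x15540 == 87360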

How much are we talking about? Outside of the margin of error, I take it.

Depending on the particular context, the first (bit-manipulating) version can take up to twice as long as the second; in contexts like that of the benchmark program I included, the difference is smaller, but still outside the margin of error, as running the program shows.

Q2:

IIRC, each region of memory can be tagged (bound) with something like "Here lies UInt32 data", and you're supposed to use data only when it is tagged properly (and is initialized). So

  • bindMemory(to:): Takes layout-compatible memory and tags it (as UInt32 in this case).
  • assumingMemoryBound(to:): You have already tagged this region somewhere else and just want to use it.
  • load(fromByteOffset:): Copies the memory byte by byte. There is no assumption about the binding, only about alignment.

I don't know where this rule is strictly enforced, though. It seems to be very lenient on simple types like Int and UInt32.

Godbolt link.
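To make the distinction a bit more concrete, here is a small standalone sketch (mine, not from the thread) exercising the three APIs on manually allocated memory:

// 16 bytes of raw, uninitialized memory, suitably aligned for UInt32.
let raw = UnsafeMutableRawPointer.allocate(byteCount: 16,
                                           alignment: MemoryLayout<UInt32>.alignment)

// bindMemory(to:capacity:) binds the region to UInt32 and returns a typed pointer.
let bound = raw.bindMemory(to: UInt32.self, capacity: 4)
bound.initialize(repeating: 42, count: 4)

// assumingMemoryBound(to:) asserts that the region is *already* bound to UInt32.
let assumed = raw.assumingMemoryBound(to: UInt32.self)
print(assumed[2])                                     // 42

// load(fromByteOffset:as:) copies the bytes out regardless of the binding,
// as long as the offset is properly aligned for UInt32.
print(raw.load(fromByteOffset: 8, as: UInt32.self))   // 42

raw.deallocate()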

Let's look at the disassembly for the first function:

output.globalCellIndex(forGridIndex: Swift.UInt8) -> Swift.UInt32:
        push    rbp
        mov     rbp, rsp
        lea     ecx, [rdi + rdi]
        mov     eax, -131072
        sar     eax, cl
        and     eax, 87360
        pop     rbp
        ret

First I'll trim out the stack frame stuff, since that's irrelevant to your question, and I'll annotate what's going on:

// on function entry, i is stored in low 8 bits of rdi, result is returned in eax.
lea     ecx, [rdi + rdi] // ecx <-- 2*i
mov     eax, -131072     // eax <-- Int32(0b11111111111111100000000000000000)
sar     eax, cl          // eax &>>= ecx (i.e eax <-- mask)
and     eax, 87360       // result <-- eax & 0b10101010101000000

The critical path for this algorithm is LEA -> SAR -> AND. We can look up the latency and throughput of these operations for recent architectures in Intel's optimization manual:

instruction | latency | uOps | notes
    LEA     |    1    |   1  | Timing for a "simple LEA", which this is
    SAR     |    2    |   3  | Non-constant shifts are surprisingly expensive
            |         |      | on recent x86, for reasons that are outside
            |         |      | the scope of this note.
    AND     |    1    |   1  |  

The MOV instruction adds a single uOp, but it doesn't contribute to the latency chain, and if this function is heavily used and inlined (as I would expect it to be if its performance matters), then the MOV should be hoisted and its (already small) cost should be amortized. So, as a good approximation, this requires 5 uOps, and has a latency of 4 cycles.

Now let's look at the other implementation:

// This block of instructions is generating the results tuple.
movaps  xmm0, xmmword ptr [rip + .LCPI2_0]
movaps  xmmword ptr [rbp - 32], xmm0
movabs  rax, 374933465158656
mov     qword ptr [rbp - 16], rax
mov     dword ptr [rbp - 8], 87360

movzx   eax, dil                          // zero-extend i to fill register
mov     eax, dword ptr [rbp + 4*rax - 32] // load results[i]

At first glance, this looks a lot worse (and it's objectively bad codegen; there's no need for the compiler to manufacture results like it's doing, so that whole first block of five instructions is entirely superfluous). But, just like above, in a context where the performance matters, this would be inlined, and this set-up code will get hoisted, so that it does not contribute to the observed performance. So we're left with only the MOV and MOVZX instructions. These are both single uOp, so we have two uOps total (vs 5 for the previous algorithm).

Note that the MOVZX will also be eliminated by the compiler when this function gets inlined, assuming that the compiler can prove that i is already in the range of a UInt8 (likely!)

Latency is harder to talk about here, because the second instruction is a load, and the latency of loads isn't a totally well-defined concept, but assuming this code is heavily used, it will always hit the L1 cache, so "4 cycles" is a reasonable approximation.

Thus, compared to the first algorithm, this algorithm:

  • uses significantly fewer uOps (1 or 2 vs. 5)
  • has approximately the same latency (4-ish cycles vs. 4 cycles)

There's one more difference that's probably most important: it uses different uOps. The uOps used by the first algorithm all have to execute on the ALU ports (0, 1, 5 and 6), which means they're competing with all the surrounding arithmetic operations for execution resources. The load uOp used by the second algorithm executes on a different group of ports (2 or 3), so it does not contend with surrounding arithmetic operations (but it could contend with surrounding load operations).

In particular, in a tight timing loop, the first will contend with the loop-arithmetic itself, while the second will not, which easily explains 2x or more differences in a stand-alone timer.

Ultimately, the only way to get conclusive answers to this sort of question is to use the performance counters, but this is a simple enough case that I can be pretty confident in an answer given from first principles like this one.


Steve, bear in mind that in the test program both chunks of code get inlined and are eligible to be vectorised. The first does get vectorised; the second only gets unrolled.


Variable shifts and different-width integer types are a big weakness for x86 vectorization pre-AVX2, so this is not an algorithm that should be expected to vectorize well at baseline (in fact, it's exactly the sort of algorithm where compiler auto-vectorization is likely to make performance worse).


That's totally in line with what Godbolt suggests: plonking the test program into Godbolt shows a pretty astonishing multiplication in code size in order to do 8 elements at a time, and I suspect those vector ops all compete for the same ports.


Even more amusingly, the second algorithm should be more amenable to vectorization (since the lookup can be done entirely in-register), but LLVM declines to do that and actually insists on re-materializing the table on each pass, yet it's still faster.

The good news, @Jens, is that there's easily a 10-20x speedup to be had here if it's actually critical to your performance, either by convincing someone to fix the compiler or by writing a small C function using intrinsics that can be inlined.


Do we have compiler flags for this? C compilers support flags like -mavx2, but I’m not aware of anything similar for Swift.

We do, but I can't really recommend using them unequivocally: -Xcc -Xclang -Xcc -target-feature -Xcc -Xclang -Xcc +avx2.

([SR-11660] "Umbrella: function multiversioning and dispatch on CPU features" is an umbrella bug that discusses improving this situation.)


Thank you for the very informative explanation!

It would of course be nice if the compiler was improved, but I'll continue implementing the data structure and do some more profiling on both x86 and arm, before optimizing this or other parts of it further.


Yes, in a case like this (with simple types), the three variants result in identical code generation, so I assume I can just go with, e.g., the one that is textually shortest, and that the tagging/binding doesn't matter.

Technically speaking, that memory is already bound to UInt32, and so all 3 alternatives would apply equally well:

  • In alternative 1, you rebind the UInt32 memory to the layout-compatible type UInt32.
  • In alternative 2, you use memory that is already bound to UInt32.
  • In alternative 3, the alignment condition is already met.

Tagging @Andrew_Trick for an authoritative answer.

Just to see what would happen, I wrote the two functions in C (realizing that I've almost forgotten C):
#ifndef c_stuff_h
#define c_stuff_h

#include <stdint.h>

uint32_t globalCellIndexForGridIndexC(uint8_t i) {
    return (((uint32_t) 0x15540000) >> (i << 1)) & ((uint32_t) 0x1ffff);
}
uint32_t globalCellIndexForGridIndexD(uint8_t i) {
    static const uint32_t results[] = {
        0, 65536, 81920, 86016, 87040, 87296, 87360
    };
    return results[i];
}

#endif /* c_stuff_h */

and compared them against their Swift counterparts in the above benchmark program.

Here's the result (on my late 2013 MBP):

@inline(__always)
func globalCellIndex(forGridIndex i: UInt8) -> UInt32 {
    return (UInt32(0x15540000) &>> (i &<< 1)) & UInt32(0x1ffff)
}
0.011108873 seconds (663375808)
0.010979427 seconds (1326751616)
0.010437289 seconds (1990127424)
0.010497336 seconds (2653503232)
0.010778083 seconds (3316879040)

@inline(__always)
func globalCellIndexB(forGridIndex i: UInt8) -> UInt32 {
    let results = (UInt32(0), UInt32(65536), UInt32(81920), UInt32(86016),
                   UInt32(87040), UInt32(87296), UInt32(87360))
    return withUnsafeBytes(of: results) {
        $0.bindMemory(to: UInt32.self)[Int(i)]
    }
}
0.009007419 seconds (645195456)
0.00815917 seconds (1290390912)
0.00902409 seconds (1935586368)
0.008354092 seconds (2580781824)
0.00881582 seconds (3225977280)

uint32_t globalCellIndexForGridIndexC(uint8_t i) {
    return (((uint32_t) 0x15540000) >> (i << 1)) & ((uint32_t) 0x1ffff);
}
0.009720541 seconds (701568192)
0.008872579 seconds (1403136384)
0.009593372 seconds (2104704576)
0.00884601 seconds (2806272768)
0.008909735 seconds (3507840960)

uint32_t globalCellIndexForGridIndexD(uint8_t i) {
    static const uint32_t results[] = {
        0, 65536, 81920, 86016, 87040, 87296, 87360
    };
    return results[i];
}
0.003545362 seconds (599353984)
0.003476748 seconds (1198707968)
0.003460682 seconds (1798061952)
0.003350081 seconds (2397415936)
0.003334296 seconds (2996769920)

...
let t0 = DispatchTime.now().uptimeNanoseconds
for gi in gridIndices {
    checksum &+= UInt32(gi)                 // baseline: 0.001 s
    //checksum &+= globalCellIndex(forGridIndex: gi)  // 0.011 s (Swift bit manip)
    //checksum &+= globalCellIndexB(forGridIndex: gi) // 0.009 s (Swift tuple)
    //checksum &+= globalCellIndexForGridIndexC(gi)   // 0.009 s (  C   bit manip)
    //checksum &+= globalCellIndexForGridIndexD(gi)   // 0.003 s (  C   "tuple")
}
let t1 = DispatchTime.now().uptimeNanoseconds
...

So, while the bit-manipulation version got only slightly faster as a C function, the tuple/static-array-based version was quite a bit faster in C: subtracting the ~0.001 s loop baseline, (0.011 - 0.001)/(0.003 - 0.001) == 5 times faster, as per your analysis, and I guess it could be faster still by using intrinsics.

Right, this is because the codegen for the shift-version is basically already optimal, but the C version manages to elide creating the static array on each iteration. There's still another factor of 2 or 3 to pick up if you really want to go crazy =)


That would be crazy. : P

It’s probably worth filing bug reports in any case.
