Compiler optimisations for functional-style Collection algorithms

I'm exploring Swift's performance in some basic situations involving operations on Collections and Sequences. As part of that, I've been benchmarking the difference between programming styles (using simple, contrived algorithms), e.g.:

Functional

testData.next
    .filter { 0 == $0 % 2 }
    .map { $0.byteSwapped }
    .filter { ($0 & 0xff00) >> 8 < $0 & 0xff }
    .map { $0.leadingZeroBitCount }
    .filter { Int.bitWidth - 8 >= $0 }
    .reduce(into: 0, &+=)

Imperative

var result = 0

for value in testData.next {
    if 0 == value % 2 {
        let value = value.byteSwapped

        if (value & 0xff00) >> 8 < value & 0xff {
            let value = value.leadingZeroBitCount

            if Int.bitWidth - 8 >= value {
                result &+= value
            }
        }
    }
}

Unsurprisingly (to me), the latter is many times faster than the former, because in a sense I've "manually unrolled" the former. The former - even compiled with -O or -Ounchecked - actually allocates numerous intermediary collections, etc. The Swift compiler currently seems to implement it in a very literal, prescriptive sense.
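
Spelled out, the eager version is effectively doing this (the intermediate names are just for illustration):

let step1 = testData.next.filter { 0 == $0 % 2 }             // allocates a new [Int]
let step2 = step1.map { $0.byteSwapped }                     // allocates another [Int]
let step3 = step2.filter { ($0 & 0xff00) >> 8 < $0 & 0xff }  // and another
let step4 = step3.map { $0.leadingZeroBitCount }             // and another
let step5 = step4.filter { Int.bitWidth - 8 >= $0 }          // and another
let result = step5.reduce(into: 0, &+=)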

Now, if you make use of lazy sequences - just add .lazy after testData.next - then the compiler does in fact optimise it down to something very similar to the imperative version. Which frankly surprised me - I've never seen it do that before (perhaps all my prior cases were too complicated for the optimiser?). What surprises me even more is that the compiler's version performs better than my hand-unrolled version!
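
That is:

testData.next
    .lazy
    .filter { 0 == $0 % 2 }
    .map { $0.byteSwapped }
    .filter { ($0 & 0xff00) >> 8 < $0 & 0xff }
    .map { $0.leadingZeroBitCount }
    .filter { Int.bitWidth - 8 >= $0 }
    .reduce(into: 0, &+=)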

Amazing work compiler!

(although, why u no optimise my imperative version as good as your lazy functional version?!? :stuck_out_tongue_closed_eyes:)

The problem is, that innocuous lazy property has a lot of pitfalls (e.g. when it's not really lazy and in fact does duplicate work). So, is there some way to get the optimiser to do the same optimisations with the non-lazy version?
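
To illustrate the duplicate-work pitfall with a contrived sketch (names are illustrative): every traversal of a lazy wrapper re-runs the closures.

var evaluations = 0
let values = Array(1...1_000)
let lazilyMapped = values.lazy.map { (x: Int) -> Int in
    evaluations += 1                     // side effect purely to count closure invocations
    return x.byteSwapped
}

let total = lazilyMapped.reduce(0, &+)   // runs the transform once per element
let maximum = lazilyMapped.max()         // ...and runs it all over again
// 'evaluations' is now 2,000 rather than 1,000.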

I'm currently using Swift 5.9.

When I look at the disassembly of the non-lazy version, the compiler has already inlined basically all the relevant Array methods, making it pretty close to locally reasonable (few if any hidden side effects). So one would think that the optimiser can then - in principle - "see through" all the intermediary cruft and optimise it away.

5 Likes

Is testData an Array? What is the element type?

[Int] in this case.

I fill it with pseudo-random values (from a specific seed, for reproducibility). I also swap between two such arrays between test iterations, to reduce the benefits of caching (for Reasons: I'm more interested in "cold" performance than hot loops, at the moment).

None of that seems to make any difference to the optimiser's behaviour (it's all hidden behind the next property in any case, which is also intentional - I don't want the optimiser "seeing through" the test data generation as that wouldn't be representative of real-world applications).
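
For concreteness, the setup is along these lines (a minimal sketch - the real benchmark uses Xoshiro, but I've substituted SplitMix64 here because it's shorter; the names are illustrative):

struct SplitMix64: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }
}

var generator = SplitMix64(seed: 0xDEADBEEF)
let datasetA = (0..<1_000_000).map { _ in Int(bitPattern: UInt(truncatingIfNeeded: generator.next())) }
let datasetB = (0..<1_000_000).map { _ in Int(bitPattern: UInt(truncatingIfNeeded: generator.next())) }
// Alternate between datasetA and datasetB across test iterations, to reduce cache benefits.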

Sometimes the thing that prevents this is what counts as observable behavior. As a trivial example, this code always traps...

if x + 1 > x {
  var y = Int.max
  y += 1
}

...but whether it traps on line 1 or line 3 depends on whether x is .max already. So the compiler can't skip the step where it adds 1 to x even though the value isn't actually useful. (This is unlike C, where this is either a weird way to spell x != UINT_MAX or a really bad way to spell __builtin_assume(x < INT_MAX).)

Still, your code doesn't seem like it would run into that, since you don't have any potentially-trapping operations there except for allocation failures. So it might be possible to teach the compiler more about array allocations and make this kind of change. That said, lazy is the recommended way to do this, with a design that's intended to be optimizable, as long as you always "complete" the lazy chain with something that produces a non-lazy type explicitly. I'm not saying the pitfalls it has are a good thing, but as usual they're hard to fix without breaking compatibility for someone. :-(
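
For example (a sketch with hypothetical names), compare a chain that leaks a lazy type out with one that's been "completed":

// Not "completed": callers get a lazy wrapper back, so the closures re-run on
// every traversal and the source array stays captured.
func evens(in values: [Int]) -> some Sequence<Int> {
    values.lazy.filter { $0 % 2 == 0 }
}

// "Completed": the reduce forces a single pass and yields a plain, non-lazy Int.
func evensSum(of values: [Int]) -> Int {
    values.lazy.filter { $0 % 2 == 0 }.reduce(into: 0, &+=)
}

The reduce at the end of your chain already plays that role, which is part of why the lazy version can be fused so completely.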

1 Like

I wrote some formal benchmarks around this, and a blog post.

What really surprised me is that on Apple Silicon - well, the M2, at least - the performance of the lazy functional style is still much worse than the imperative style. Even though the same optimisations seem to be applied (as on x86-64), and the resulting machine code looks pretty similar (to the imperative version). I might examine it more closely at a later date, but for now it's a bit of a mystery.

2 Likes

Without having looked at this specific case, one fairly common cause of this is that pushing stack frames is relatively[1] more expensive on Apple Silicon due to pointer authentication. So a non-leaf function not being inlined/shrink-wrapped could be more of a slowdown than expected. We saw a lot of this in the very early days of pointer authentication before we turned on omitting leaf frames to mitigate it.


  1. I haven't looked into absolute numbers, but luckily they're not relevant for this discussion ↩︎

There are no function calls in the resulting (optimised) code, inside the loop:

loc_10008cd7c:
  ldr        x11, [x10], #0x8
  rev        x12, x11
  ubfx       x13, x12, #0x8, #0x8
  and        x14, x12, #0xff
  cmp        x12, #0x80
  ccmp       x13, x14, #0x2, hs
  clz        x12, x12
  add        x12, x12, x8
  csel       x12, x8, x12, hs
  cmp        x11, #0x0
  csel       x8, x8, x12, eq
  subs       x9, x9, #0x1
  b.ne       loc_10008cd7c

In comparison the imperative version is:

loc_10008d3a8:
  cmp        x9, x10
  b.eq       loc_10008d338

  b.hs       loc_10008d410

  ldr        x14, [x11, x9, lsl #3]
  cbnz       x14, loc_10008d3d0

loc_10008d3bc:
  cmp        x13, x9
  b.eq       loc_10008d338

  ldr        x14, [x12, x9, lsl #3]
  add        x9, x9, #0x1
  cbz        x14, loc_10008d3bc

loc_10008d3d0:
  add        x9, x9, #0x1
  rev        x14, x14
  ubfx       x15, x14, #0x8, #0x8
  cmp        x15, w14, uxtb
  b.hs       loc_10008d3a8

  cmp        x14, #0x80
  b.lo       loc_10008d3a8

  clz        x14, x14
  add        x8, x14, x8
  b          loc_10008d3a8

Oh, conditional select vs branch is a tricky one. The conventional wisdom is always "avoid branches", but it turns out branch predictors and instruction caches are really good these days.

As is often the case, Linus has a pretty good rant about it: CMOV (Linus Torvalds)
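
As a minimal sketch of the two shapes (hypothetical functions - and of course the optimiser is free to turn either into either form):

// Branchy: typically a compare plus a conditional branch.
func sumLargeBranchy(_ values: [Int], threshold: Int) -> Int {
    var sum = 0
    for value in values where value > threshold {
        sum &+= value
    }
    return sum
}

// "Branchless": the ternary is a natural candidate for csel (arm64) / cmov (x86-64).
func sumLargeBranchless(_ values: [Int], threshold: Int) -> Int {
    var sum = 0
    for value in values {
        sum &+= value > threshold ? value : 0
    }
    return sum
}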

6 Likes

Hey Wade, I saw that you removed the metrics for peak memory usage as the sampling was missing the peaks (the sample is taken every 10ms, so it's only suitable for longer tests - we should probably make the sampling interval configurable in the future).

What is a bit weird is that allocatedResidentMemory should give good results - that metric isn't sampled, but looks at how much memory the jemalloc statistics show as a delta before/after running the test (so any setup code should be excluded). I'll be out of office now for a few days, but I'd be interested to hear what you saw in more detail - we want those metrics to be robust…
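
For reference, the metrics in question are requested roughly like this (a sketch from memory - the exact Configuration spelling may differ between package-benchmark releases):

import Benchmark

let benchmarks = {
    // Sketch only - metric names are as discussed above.
    Benchmark("Functional, non-lazy",
              configuration: .init(metrics: [.wallClock,
                                             .peakMemoryResident,
                                             .allocatedResidentMemory])) { benchmark in
        let input = Array(1...100_000)   // stand-in for the real test data
        for _ in benchmark.scaledIterations {
            blackHole(input.filter { 0 == $0 % 2 }
                           .map { $0.byteSwapped }
                           .reduce(into: 0, &+=))
        }
    }
}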

Well, if you have predictable branches, yes. In other cases, branchless can still be worth it:

1 Like

No doubt, but (at least outside of randomized search), people's intuition about what is "predictable" runs headlong into the remarkable accuracy and memory of modern branch predictors. If you know for a fact that your data is really, truly, random, and you're only going to make a single pass through it, then branchless algorithms are almost always better. But if there's any structure at all, or if you end up iterating through the data a few times, branching usually wins.

The only time this really falls over is when eliminating branches unlocks other optimizations, such as vectorization.

4 Likes

Yes, from its description allocatedResidentMemory should work, but it clearly doesn't in these benchmarks. It reports the same thing as the sampling metrics - barely a blip of increased memory usage. Whereas if you simply look at Activity Monitor (or equivalent) you can see that memory balloons several-fold, and Instruments' Allocations instrument shows that too.

(it's also plainly obvious that it has to - the first filter barely removes any elements so it's basically guaranteed to allocate at least one array that's virtually the same size as the input array)

I didn't try debugging the metrics, but you should be able to reproduce easily by just reverting that commit and running them yourself.

I also found it weird that across dozens of runs the memory numbers were very consistent. You'd think it'd at least sometimes happen to sample RSS when memory use is actually high. Per Instruments it's not like the peaks are especially brief.

In the above, the allocatedResidentMemory et al reported only a few percent of extra memory use above the baseline, but you can see that even for a sparse random sample it really should have seen at least 2x memory use.

Also, the memory metrics count memory allocated outside the benchmark, such as the test datasets in this case. That greatly diminishes their utility since it obscures the actual memory behaviour of the code under test.

(I don't mean to be critical or judgemental - I think the Benchmarks library is fantastic overall - I'm just trying to thoroughly document what I've observed)

1 Like

But that's exactly the case in this benchmark (the input data is the output stream of the Xoshiro PRNG), and yet the predicated instructions are seemingly far worse than branching.

I do wonder if this is Apple Silicon specific… perhaps Apple's CPU designers just didn't put much effort into predicated instructions? e.g. there's no reason, in principle, why they can't be predicted and speculated on just like branches, but maybe they're not? I imagine they are relatively uncommon in "Mac"-level workloads.

Are you running repeatedly over the same set of data in your benchmark?

I'm not sure what this would even mean; a mac is a general-purpose computer, and the CPU is a general-purpose CPU. What's a "Mac-level" workload? In any event, conditional move on AS has broadly similar performance characteristics to x86 (the primary difference is that more stuff can be done conditionally on arm64, because there's mechanisms to fuse conditions--you can see this in action with the ccmp instructions in your example).

I don't have time to really dig into your example right now, but my quick reaction is that the cmov version is a fairly good size; if the branchy version can eliminate most/all of that work a decent portion of the time, it could easily come out ahead if there's any structure that lets the predictor do even a semi-decent job. Looking at performance counters would be informative. (If you really want a definitive answer, open an issue on the swift GitHub repo and tag me, attach the benchmark project source, and I'll get to it ... eventually, if someone else doesn't.)

Yes, but the performance deficit exists just the same in all dataset sizes, from a few bytes up to many gigabytes (and presumably beyond; I haven't tested). I doubt the branch predictor can remember branches taken millions of branches ago, or otherwise recognise "oh, this is that same stream of 4 GiB I saw a few minutes ago"… unless I'm vastly under-appreciating how clever modern branch prediction is?

It's possible Xoshiro isn't a good PRNG, but it seems pretty well-documented to perform well (even if not cryptographically strong).

I did try, actually, but I've never really understood Instruments' PMC support. The numbers I saw for some really well-understood events (e.g. instructions retired, retired memory loads, etc) seemed pretty obviously bogus. I miss Shark. :pensive:

Sure.

I was already pondering whether to do that, but hesitated because I'm not certain how representative and robust this particular benchmark is. Also on my todo list is exploring the problem space in other dimensions, to see whether the compiler see-saws between predication and branches depending on exactly what operations the filters & maps perform. I did try a couple of other random variations of the algorithm, and saw similar results, but not enough to draw a conclusion.

2 Likes

That should definitely not be the case - basically, the jemalloc stats are captured before/after each "outer" benchmark run. I'll have a look at what happens when I'm back in the office.

If you see anything strange like that, definitely file an issue if you can - we want the measurements to be correct and robust within the context that is reasonably possible, and this definitely should be.

To clarify, I'm talking about the peakMemoryResident metric in that case. What it shows - sampling problems notwithstanding - is technically correct, but it's not interesting in the benchmarks in question because what matters is additional RSS inflation, beyond what's necessary for the test dataset itself (and the test dataset is sometimes way bigger than the significant deltas in peak RSS under test, so it obscures the results).

i.e. I'd prefer it to subtract the RSS measured at the start of the benchmark, to 'normalise'. Though there may be added complexity there in validating that benchmarks don't share state that persists across executions. They might need to be run in separate processes, for that sort of measurement.

Also, for the full peak RSS story you need to track paging as well, in case the benchmark exceeds the available RAM.
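
In the meantime, one can approximate that delta by hand around the code under test - a rough sketch using Mach's task_info on macOS (it reads the current RSS, not the peak, so it still misses short-lived spikes):

import Darwin

// Current resident set size of this process, in bytes (macOS).
func residentBytes() -> UInt64 {
    var info = mach_task_basic_info()
    var count = mach_msg_type_number_t(MemoryLayout<mach_task_basic_info>.size
                                       / MemoryLayout<natural_t>.size)
    let kr = withUnsafeMutablePointer(to: &info) {
        $0.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            task_info(mach_task_self_, task_flavor_t(MACH_TASK_BASIC_INFO), $0, &count)
        }
    }
    return kr == KERN_SUCCESS ? info.resident_size : 0
}

let before = residentBytes()
// ... run the code under test here ...
let after = residentBytes()
print("RSS delta: \((after &- before) / 1_048_576) MiB")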

2 Likes

Right - I think perhaps such an RSS delta could be added as a separate metric, and I see your need (we typically don't have large in-memory data sets under test currently, so didn't have the need (yet)) - I opened Add new RSS delta metric that removes baseline RSS · Issue #198 · ordo-one/package-benchmark · GitHub for that, thanks for the feedback!

We already separate the individual benchmarks into separate processes when run (as RSS even without delta also was polluted across runs).

I still think the allocatedResidentMemory should have shown you usable and comparable results, will double check that.

1 Like

Alright, I had some time to go through this - there was a bug in the sampling (we forgot to sample resident/virtual and only took the last result). That has been fixed now, and I get reasonable results with your benchmark:

(it peaks at 1100MB now for the 'bad case').

The default sampling rate is 5ms.

allocatedResidentMemory is a bit more tricky, as the stats we get back from jemalloc seem non-intuitive; I'll need to circle back to that later and won't touch it for now (I did try with a baseline and just looking at the delta, but the output really wasn't useful).

So if you update dependencies to the release linked to above, peakMemoryResident should now give you usable results at least.

2 Likes

Thank you!

"Resident peak" does seem much more plausible now, from the numbers you just showed. "Allocated resident" seems pretty obviously broken still, but that seems to be what you're alluding to being "tricky" about jemalloc.

1 Like