Weird de-optimization in MapReduceClass benchmark

I've been working on setup overhead in benchmarks and stumbled across a strange case with the MapReduceClass benchmark, which involves Objective-C interop.

import Foundation

public func run_MapReduceClass(_ N: Int) {
  // Setup: 1000 NSDecimalNumber instances created inside the measured run.
  let numbers = (0..<1000).map { NSDecimalNumber(value: $0) }

  var c = 0
  for _ in 1...N*100 {
    let mapped = numbers.map { $0.intValue &+ 5 }
    c += mapped.reduce(0, &+)
  }
  CheckResults(c != 0) // CheckResults comes from the suite's TestsUtils
}

I have extracted the numbers initialization into the setUpFunction (which prepares the decimals in an implicitly unwrapped optional global), but this did more than remove the 600 μs setup overhead: the benchmark became about twice as fast. The speedup can be reproduced even if I leave the initialization in the run method but alias the variable:

public func run_MapReduceClassShort(_ N: Int) {
  decimals = (0..<1000).map { NSDecimalNumber(value: $0) } // stored in the global
  let numbers: [NSDecimalNumber] = decimals // local alias of the global
...
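
For reference, here is a minimal sketch of what the setUpFunction variant looks like; the function names are illustrative, not the actual code from the PR:

import Foundation

// Global storage as an implicitly unwrapped optional, so the run function
// can use it without optional unwrapping.
var decimals: [NSDecimalNumber]! = nil

func setUpDecimals() {
  decimals = (0..<1000).map { NSDecimalNumber(value: $0) }
}

func tearDownDecimals() {
  decimals = nil // the 1000 decimals are released here, outside the measurement
}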

Looking at the profile in Instruments, it looks like something is causing extra reference-counting traffic. objc_release dominates the trace in the original version:

It disappears when the variable is aliased (as demonstrated above):

So I'm not sure if this should be filed as a bug, or if demonstrating this behavior is the purpose of this benchmark… When I take the initialization out into the setUpFunction, I cannot reproduce the original timings, so I'll probably need to rename the benchmark so as not to mess up the long-term tracking, right?

For comparison, this is the trace from the cleaned-up version, MapReduceClassSetup:

All of this also holds for the smaller sister benchmark, MapReduceClassShort.
@Ben_Cohen, @Erik_Eckstein, please advise!

OK, after sleeping on this, it's clear that when I aliased the variable into a global, I extended the lifetime of that [NSDecimalNumber] beyond the runFunction, taking the deallocation out of the measurement. (It still surprises me that deallocating an array of 1000 decimals takes about the same amount of time as reducing it 100 times, some 6800 μs, while it takes only 600 μs to create. All on my old machine, of course.)
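
To spell out the mechanics, here is a simplified model of the two lifetimes; the helper and function names are mine, not from the benchmark:

import Foundation

func makeDecimals() -> [NSDecimalNumber] { // hypothetical helper
  return (0..<1000).map { NSDecimalNumber(value: $0) }
}

var aliasedDecimals: [NSDecimalNumber]! = nil // illustrative global

func runLocal() {
  let numbers = makeDecimals()
  _ = numbers.reduce(0) { $0 &+ $1.intValue }
} // the last reference dies here: 1000 objc_release calls land inside the measurement

func runAliased() {
  aliasedDecimals = makeDecimals()
  let numbers: [NSDecimalNumber] = aliasedDecimals // the global keeps a reference
  _ = numbers.reduce(0) { $0 &+ $1.intValue }
} // only the local alias dies here; the array outlives the run function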

@Ben_Cohen can you confirm that testing Obj-C alloc/dealloc isn't the point of this benchmark, and that I can adjust it to measure only the map/reduce interaction with class instances, under the modified names MapReduceClass2 and MapReduceClassSmall2?
cc @Erik_Eckstein

@Ben_Cohen The proposed modification to MapReduceClass is in this commit on the pending PR:

Hmm. I wouldn't know the original intent of this benchmark, but the name suggests it's trying to benchmark class array performance. If that's the case, I'd suggest not using NSNumber, as otherwise it's really testing a weird combination of NSNumber performance and Array performance. So if we don't have a benchmark for that, I'd suggest adding a second one.

I can't really speak to the effect on the optimizer of hoisting the array creation; that's more a question for @Erik_Eckstein.

Erm… Oh. According to git blame, this benchmark was added by you, @Ben_Cohen :slight_smile:

That's why I was asking you.
I can try putting the Int in a Box and see how that behaves. I'll report back.

Haha, oops!

It looks like I misread NSDecimalNumber in the post as NSNumber. I think NSDecimalNumber is always a reference type, so that's OK, though it might be good to add a Swift class too.

I think it probably doesn't make sense to have a test that includes the deallocation. But I'm not sure whether it's better to change the existing benchmark or to add a new one.

There is still a difference between NSDecimalNumber and the class Box:

// Plain Swift reference type standing in for NSDecimalNumber.
class Box {
  var v: Int
  init(_ v: Int) { self.v = v }
}
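
The Box-based run function has the same shape as the original, just with Box as the element type; a rough sketch of the *3 variants measured below (the function name is my placeholder):

public func run_MapReduceClassBox(_ N: Int) {
  let boxes = (0..<1000).map { Box($0) }

  var c = 0
  for _ in 1...N*100 {
    let mapped = boxes.map { $0.v &+ 5 }
    c += mapped.reduce(0, &+)
  }
  CheckResults(c != 0)
}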
| Benchmark | Min (μs) | Q1 (μs) | Median (μs) | Q3 (μs) | Max (μs) | MAX_RSS (B) |
|---|---|---|---|---|---|---|
| MapReduceClass | 12521 | 12666 | 12818 | 12958 | 13460 | 311296 |
| MapReduceClass2 | 5561 | 5604 | 5648 | 5843 | 6470 | 282624 |
| MapReduceClass3 | 3518 | 3533 | 3816 | 3903 | 4377 | 253952 |
| MapReduceClassShort | 18706 | 19049 | 19178 | 19386 | 19746 | 217088 |
| MapReduceClassShort2 | 13022 | 13333 | 13497 | 13673 | 13928 | 221184 |
| MapReduceClassShort3 | 11596 | 11802 | 11986 | 12208 | 12499 | 212992 |

(Yes, I'm running this on a 2008 MBP :scream:)
The originals take more than twice as long as the *2 variants due to the deallocation at the end. The *2 variants take the setup out of the benchmark, which also removes the deallocation from the measurement. The *3 variants use Box. Should I keep both for comparison? Since this changes what's measured, they will be renamed, and then I would also lower the workload to be under 1 ms… What do you think, @Ben_Cohen and @Erik_Eckstein?

I think this makes the most sense: workload lowered by a factor of 10, keeping NSDecimalNumber; the *2 variants use the class Box.

| Benchmark | Min (μs) | Q1 (μs) | Median (μs) | Q3 (μs) | Max (μs) | MAX_RSS (B) |
|---|---|---|---|---|---|---|
| MapReduceClass2 | 351 | 352 | 371 | 391 | 567 | 241664 |
| MapReduceClassShort2 | 1097 | 1133 | 1155 | 1216 | 1447 | 204800 |
| MapReduceNSDecimalNumber | 555 | 561 | 568 | 579 | 676 | 253952 |
| MapReduceNSDecimalNumberShort | 1301 | 1340 | 1361 | 1394 | 1684 | 225280 |
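
For completeness, the registration presumably goes through the suite's BenchmarkInfo with its setUpFunction/tearDownFunction hooks; a sketch under that assumption (the tags are a guess, and decimals is the global from the earlier sketch):

public let MapReduceNSDecimalNumberInfo = BenchmarkInfo(
  name: "MapReduceNSDecimalNumber",
  runFunction: run_MapReduceNSDecimalNumber, // assumed run function name
  tags: [.validation, .algorithm], // guessed tags
  setUpFunction: { decimals = (0..<1000).map { NSDecimalNumber(value: $0) } },
  tearDownFunction: { decimals = nil }) // keeps deallocation out of the measurement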

BTW, all these measurements were done on a fully saturated CPU (worst-case scenario: an Xcode installation running in the background) with the More Robust Benchmark_Driver, which demonstrates the progress in improving the quality of measurements and the importance of low workloads for robustness.

@Ben_Cohen, could you also please respond to @Erik_Eckstein's question on that PR about the utility of the re-added benchmarks?

Sounds good.