I've been working on setup overhead in benchmarks and stumbled across strange case with MapReduceClass benchmark which involves Objective-C interop.
public func run_MapReduceClass(_ N: Int) {
let numbers = (0..<1000).map { NSDecimalNumber(value: $0) }
var c = 0
for _ in 1...N*100 {
let mapped = numbers.map { $0.intValue &+ 5 }
c += mapped.reduce(0, &+)
}
CheckResults(c != 0)
}
I have extracted the numbers initialization into the setUpFunction (which prepares IUO decimals), but this did more than remove the 600 μs setup overhead: the benchmark got about twice faster. The speedup can be reproduced even if I leave the initialization in the run method, but alias the variable:
public func run_MapReduceClassShort(_ N: Int) {
decimals = (0..<1000).map { NSDecimalNumber(value: $0) }
let numbers: [NSDecimalNumber] = decimals
...
Looking at profile in Instruments, it looks like something is causing extra ref-counting traffic. The objc_release is dominating the trace in the original version:
So, I'm not sure if this should be filed as a bug, or if demonstrating this behavior is the purpose of this benchmark… When I take the initialization out into setUpFunction, I cannot reproduce the original timings, so I'll probably need to rename the benchmark, to not mess long time tracking, right?
For comparison, this is the trace from cleaned-up version:
All this is true also for the smaller sister benchmark MapReduceClassShort. @Ben_Cohen, @Erik_Eckstein, please advise!
OK, after sleeping on this, it’s clear that when I aliased the variable into a global, I’ve extended the lifetime of that [NSDecimalNumber] outside of the runFunction, taking the deallocation out of the measurement. (It still surprises me that deallocating array of 1000 decimals takes about the same amount of time as reducing it 100 times..., some 6800 μs — while it takes 600 μs to create — all on my old machine, of course)
@Ben_Cohen can you confirm that testing of Obj-C alloc/dealloc isn’t the point of this benchmark and I can adjust it to measure only the map/reduce interaction with class instances under a modified name MapReduceClass2 and MapReduceClassSmall2?
cc @Erik_Eckstein
Hmm. The original intent of this benchmark isn't something I'd know, but from the name, it suggests it's trying to benchmark class array performance. If that's the case, I'd suggest not using NSNumber, as otherwise it's really testing a weird combo of NSNumber performance and Array performance. So I'd suggest if we don't have one for that adding a second one.
I can't really speak to the effect on the optimizer of hoisting the array creation, that's more a question for @Erik_Eckstein.
It looks like I misread NSDecimalNumber in the post as NSNumber. I think NSDecimalNumber is always a reference so that's OK, though it might be good to add a Swift class too.
I think it probably doesn't make sense to have a test that includes the deallocation. But I'm not sure whether it's better to change the existing benchmark vs add a new one.
There still is difference between NSDecimalNumber and class Box:
class Box {
var v: Int
init(_ v: Int) { self.v = v }
}
Benchmark
Min
Q1
Median
Q3
Max
MAX_RSS (b)
MapReduceClass
12521
12666
12818
12958
13460
311296
MapReduceClass2
5561
5604
5648
5843
6470
282624
MapReduceClass3
3518
3533
3816
3903
4377
253952
MapReduceClassShort
18706
19049
19178
19386
19746
217088
MapReduceClassShort2
13022
13333
13497
13673
13928
221184
MapReduceClassShort3
11596
11802
11986
12208
12499
212992
(Yes, I'm running this on 2008 MBP )
The originals are more than twice the runtime of the 2 variants due to the deallocation at the end. The 2 is taking the setup out of the benchmark, which also removes the deallocation from the measurement. The 3 is with Box. Should I keep both for comparison? Since this changes what's measured, they will be renamed and then I would also lower the workload to be under 1ms… What do you think, @Ben_Cohen and @Erik_Eckstein?
I think this makes the most sense: Lowered workload by a factor of 10, keeping NSDecimal. 2 variants are using the class Box.
Benchmark
Min
Q1
Median
Q3
Max
MAX_RSS (b)
MapReduceClass2
351
352
371
391
567
241664
MapReduceClassShort2
1097
1133
1155
1216
1447
204800
MapReduceNSDecimalNumber
555
561
568
579
676
253952
MapReduceNSDecimalNumberShort
1301
1340
1361
1394
1684
225280
BTW all these measurements were done on a fully saturated CPU (worst case scenario: Xcode installation in the background) with More Robust Benchmark_Driver, which demonstrates the progress in improving the quality of measurements and the importance of low workloads for robustness.