When I rewrite it with a normal for loop like this:
return image.rgba(of: Component.self).map
{
    var array:[Component] = []
    array.reserveCapacity($0.count << 2)
    for pixel:RGBA<Component> in $0
    {
        array.append(pixel.r)
        array.append(pixel.g)
        array.append(pixel.b)
        array.append(pixel.a)
    }
    return (array, size: image.properties.size)
}
It runs ~30% faster!
I tested on a large input and found that the .rgba(of:) call takes 488,286 clock cycles, out of a total of 570,665 for the whole function with the for loop, and 789,350 for the flatMap version. Subtracting the shared .rgba(of:) cost leaves about 82,000 cycles for the for loop and about 301,000 for the flatMap, so the flatMap itself is almost 4x slower (roughly 3.7x) than the for loop itself. What is going on?
Extra allocations both for the intermediate arrays (which don’t exist in the fast version) and when resizing the final array (which gets capacity reserved up front in the fast version).
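To make the difference concrete, here is a minimal sketch of the two shapes being compared. The RGBA struct is a hypothetical stand-in for the type in the thread, and the flatMap body is my assumption about what the slow version looked like, so treat it as an illustration rather than the original code.

```swift
// Hypothetical stand-in for the RGBA<Component> type used in the thread.
struct RGBA<Component> {
    var r: Component, g: Component, b: Component, a: Component
}

// Assumed shape of the slow version: the closure builds a temporary
// 4-element array per pixel, and the result array grows as segments
// are appended because nothing reserves capacity up front.
func flattenWithFlatMap<Component>(_ pixels: [RGBA<Component>]) -> [Component] {
    return pixels.flatMap { [$0.r, $0.g, $0.b, $0.a] }
}

// Fast version: no temporary arrays and a single up-front reservation.
func flattenWithLoop<Component>(_ pixels: [RGBA<Component>]) -> [Component] {
    var array: [Component] = []
    array.reserveCapacity(pixels.count << 2)
    for pixel in pixels {
        array.append(pixel.r)
        array.append(pixel.g)
        array.append(pixel.b)
        array.append(pixel.a)
    }
    return array
}

let pixels = [
    RGBA<UInt8>(r: 1, g: 2, b: 3, a: 4),
    RGBA<UInt8>(r: 5, g: 6, b: 7, a: 8),
]
let viaFlatMap = flattenWithFlatMap(pixels)
let viaLoop = flattenWithLoop(pixels)
```

Both functions produce the same elements in the same order; only the allocation pattern differs.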
Can you verify that this shows the same perf characteristics you are seeing?
When I tested it locally I was seeing around an 11% change:
(38460.0 - 34605)/34605.0 = 0.11140008669267447
(I am assuming that the old was the non-flatMap version and the new was the flatMap version.) But I did not really stabilize my CPU, so take that with a grain of salt.
Is there a reason why flatMap doesn't store the segments, sum their count, reserve capacity up front, and then populate it?
It would take 2 passes over the segments (one to sum the count, another to copy the values), but that hit might be worth it in cases where the segments are really small and can cause many array reallocations.
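Roughly what I have in mind, as a sketch only. flatMapReserving is a hypothetical name and not part of the standard library; it assumes the segments are Collections so that count is cheap to ask for.

```swift
extension Sequence {
    // Hypothetical helper, not a standard library API.
    func flatMapReserving<Segment: Collection>(
        _ transform: (Element) throws -> Segment
    ) rethrows -> [Segment.Element] {
        // Run the closure once per element and store every segment.
        let segments = try map(transform)
        // First pass over the segments: sum their counts.
        let total = segments.reduce(0) { $0 + $1.count }
        // Second pass: copy the values into storage reserved once.
        var result: [Segment.Element] = []
        result.reserveCapacity(total)
        for segment in segments {
            result.append(contentsOf: segment)
        }
        return result
    }
}

let flat = [1, 2, 3].flatMapReserving { [$0, $0 * 10] }
let reference: [Int] = [1, 2, 3].flatMap { [$0, $0 * 10] }
```

The trade-off is that all the segments stay alive until the copy, and the array holding them is itself an extra allocation, so this only pays off when the segments are small and numerous.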
Aren't there currently bugs where global-level declarations don't get the same optimizations as things inside types? Perhaps try putting everything statically inside an enum and see if that changes the output?
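Something like the following is the restructuring I mean. The Benchmark enum and its members are made-up names for illustration, and I have not measured whether this changes the numbers.

```swift
// Hypothetical harness: a caseless enum used purely as a namespace,
// so nothing being measured lives at global scope.
enum Benchmark {
    static let input: [Int] = Array(0 ..< 1_000)

    // flatMap version: one temporary 4-element array per input element.
    static func flatMapVersion() -> [Int] {
        return input.flatMap { [$0, $0, $0, $0] }
    }

    // Loop version: capacity reserved once, elements appended directly.
    static func loopVersion() -> [Int] {
        var array: [Int] = []
        array.reserveCapacity(input.count << 2)
        for x in input {
            for _ in 0 ..< 4 {
                array.append(x)
            }
        }
        return array
    }
}

let fromFlatMap = Benchmark.flatMapVersion()
let fromLoop = Benchmark.loopVersion()
```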
I just landed a fix so that the swiftpm-based build should just work. My suggestion: download a swift-4.2 toolchain and use its swiftpm to build with swift build --configuration release. Then run the executable swift-bench using:
@taylorswift Just merged it. You should be able to use a nightly toolchain to build the benchmarks now without needing to use cmake or anything. @Aciid is committing an integration test to make sure that it keeps on working.
I just got it to work/landed both it and the test. You should be able to build this against a development snapshot from swift.org so you don't need to build the compiler itself.
The intermediate arrays ought to be stack allocated.
But I don't think it's reasonable for the compiler to take care of the array growth reallocations. Bear in mind the closure can generate any kind of sequence. For the optimizer to look at the closure, see it's producing the same length array every iteration, then use that to generate code that multiplies the length of the original collection by that fixed length, and then pre-allocate that much space in the array beforehand, is an unrealistic expectation.
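As a small illustration of why the output length is not knowable from the input length in general, here are two closures whose segments vary per element, one of which does not even return an array. The variable names are just for this example.

```swift
// Segment length depends on the element: 1, 2, then 3 items.
let repeated: [Int] = [1, 2, 3].flatMap { Array(repeating: $0, count: $0) }

// The closure may return any Sequence, here a Range<Int>.
let ranges: [Int] = [1, 2, 3].flatMap { 0 ..< $0 }

// Both inputs have 3 elements, but nothing about the input alone
// says the outputs have 6.
let lengths = (repeated.count, ranges.count)
```

In the pixel case the segment length happens to be a constant 4, but that fact lives inside the closure body, which is why reserving capacity by hand is the practical fix.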