Why is `Array.flatMap` so slow?

I’ve got a loop that looks like this:

            return image.rgba(of: Component.self).map 
            { 
                let array:[Component] = $0.flatMap 
                {
                    [$0.r, $0.g, $0.b, $0.a]
                } 
                
                return (array, size: image.properties.size) 
            }

When I rewrite it with a normal for loop like this:

            return image.rgba(of: Component.self).map 
            { 
                var array:[Component] = []
                    array.reserveCapacity($0.count << 2)
                for pixel:RGBA<Component> in $0 
                {
                    array.append(pixel.r)
                    array.append(pixel.g)
                    array.append(pixel.b)
                    array.append(pixel.a)
                }
                
                return (array, size: image.properties.size) 
            }

It runs ~30% faster!

I tested on a large input and found the .rgba(of:) call alone takes 488,286 clock ticks, out of a total of 570,665 ticks for the whole function with the for loop, and 789,350 ticks for the flatMap version. Subtracting the shared .rgba(of:) cost, that's roughly 82,000 ticks for the loop itself versus roughly 301,000 for the flatMap itself, so the flatMap is almost 4x slower than the for loop. What is going on?

I ran this on a release configuration, of course.

Extra allocations, both for the intermediate arrays (which don't exist in the fast version) and for resizing the final array (which gets its capacity reserved up front in the fast version).
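
For context, Sequence.flatMap behaves roughly like the sketch below (simplified and generic, not the actual stdlib source; the function name is just illustrative). Every call to the closure materializes a small heap-allocated array, and the result array has to grow repeatedly because its final size isn't known up front.

// Roughly what the flatMap call above boils down to (illustrative sketch only).
func flatMapSketch<T, U>(_ elements:[T], _ transform:(T) -> [U]) -> [U]
{
    var result:[U] = []                       // no capacity reserved; final size unknown
    for element:T in elements 
    {
        let segment:[U] = transform(element)  // heap-allocates a tiny temporary array
        result.append(contentsOf: segment)    // may trigger a growth reallocation and copy
    }
    return result
}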

Isn’t this the kind of thing everyone tells us “the compiler is supposed to take care of”?

@taylorswift can you provide a small test case that shows this? Or even better, a single-source benchmark that shows this behavior?

// flatmap.swift
import func Glibc.clock

struct S 
{
    let a:Int, 
        b:Int, 
        c:Int, 
        d:Int 
}

let structured:[S] = (0 ..< 1 << 16).map 
{
    _ in
    .init(  a: Int.random(in: .min ... .max), 
            b: Int.random(in: .min ... .max), 
            c: Int.random(in: .min ... .max), 
            d: Int.random(in: .min ... .max))
}

func flattenFunctional(_ structured:[S]) -> [Int]
{
    return structured.flatMap 
    {
        [$0.a, $0.b, $0.c, $0.d]
    }
}

func flattenLoop(_ structured:[S]) -> [Int]
{
    var flattened:[Int] = []
        flattened.reserveCapacity(structured.count << 2)
    for aggregate:S in structured 
    {
        flattened.append(aggregate.a)
        flattened.append(aggregate.b)
        flattened.append(aggregate.c)
        flattened.append(aggregate.d)
    }
    
    return flattened
}

let t1:Int = clock()
let flattened1:[Int] = flattenLoop(structured)
print(clock() - t1)
print(flattened1.last as Any)

let t2:Int = clock()
let flattened2:[Int] = flattenFunctional(structured)
print(clock() - t2)
print(flattened2.last as Any)
$ swiftc -O flatmap.swift 
$ ./flatmap 
1056
Optional(1651662832467973056)
3168
Optional(1651662832467973056)

Thanks! I'll throw it in a PR in a bit.

Can you verify that this shows the same perf characteristics you are seeing?

When I tested it locally I was seeing around an 11% change:

(38460.0 - 34605)/34605.0 = 0.11140008669267447

(I am assuming that the old was the non-flatMap version and the new was the flatMap version.) But I did not really stabilize my CPU, so take that with a grain of salt.

Is there a reason why flatMap doesn't store the segments, sum their counts, reserve capacity up front, and then populate it?

It would take two passes over the segments (one to sum the counts, another to copy the values), but that hit might be worth it in cases where the segments are really small and can cause many array reallocations.
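
Something along these lines, as a rough illustration (a hypothetical two-pass flatten, not what the stdlib does today; twoPassFlatten is just a made-up name):

// Hypothetical two-pass flatten: materialize the segments, sum their counts,
// reserve the exact capacity, then copy the values over.
func twoPassFlatten<T, U>(_ elements:[T], _ transform:(T) -> [U]) -> [U]
{
    let segments:[[U]] = elements.map(transform)          // pass 1: build every segment
    let total:Int      = segments.reduce(0) { $0 + $1.count }

    var result:[U] = []
        result.reserveCapacity(total)                     // one allocation for the result
    for segment:[U] in segments                           // pass 2: copy the values
    {
        result.append(contentsOf: segment)
    }
    return result
}

It still pays for the intermediate segment arrays and keeps all of them alive at once, so it only saves the growth reallocations of the result.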

I don’t know how to run that file, but I modified my test file to use a tuple like you did, and it’s still showing the >3x difference.

// flatmap.swift 
import func Glibc.clock

struct S 
{
    let a:Int, 
        b:Int, 
        c:Int, 
        d:Int 
}

let structured:[S] = (0 ..< 1 << 16).map 
{
    _ in
    .init(  a: Int.random(in: .min ... .max), 
            b: Int.random(in: .min ... .max), 
            c: Int.random(in: .min ... .max), 
            d: Int.random(in: .min ... .max))
}

let structured2:[(Int, Int, Int, Int)] = (0 ..< 1 << 16).map 
{
    _ in
    (
        Int.random(in: .min ... .max), 
        Int.random(in: .min ... .max), 
        Int.random(in: .min ... .max), 
        Int.random(in: .min ... .max)
    )
}

func flattenFunctional(_ structured:[S]) -> [Int]
{
    return structured.flatMap 
    {
        [$0.a, $0.b, $0.c, $0.d]
    }
}

func flattenLoop(_ structured:[S]) -> [Int]
{
    var flattened:[Int] = []
        flattened.reserveCapacity(structured.count << 2)
    for aggregate:S in structured 
    {
        flattened.append(aggregate.a)
        flattened.append(aggregate.b)
        flattened.append(aggregate.c)
        flattened.append(aggregate.d)
    }
    
    return flattened
}

func flattenFunctional2(_ structured:[(Int, Int, Int, Int)]) -> [Int]
{
    return structured.flatMap 
    {
        [$0.0, $0.1, $0.2, $0.3]
    }
}

func flattenLoop2(_ structured:[(Int, Int, Int, Int)]) -> [Int]
{
    var flattened:[Int] = []
        flattened.reserveCapacity(structured.count << 2)
    for (x, y, z, w):(Int, Int, Int, Int) in structured 
    {
        flattened.append(x)
        flattened.append(y)
        flattened.append(z)
        flattened.append(w)
    }
    
    return flattened
}

let t1:Int = clock()
let flattened1:[Int] = flattenLoop(structured)
print(clock() - t1)
print(flattened1.last as Any)

let t2:Int = clock()
let flattened2:[Int] = flattenFunctional(structured)
print(clock() - t2)
print(flattened2.last as Any)


let t3:Int = clock()
let flattened3:[Int] = flattenLoop2(structured2)
print(clock() - t3)
print(flattened3.last as Any)

let t4:Int = clock()
let flattened4:[Int] = flattenFunctional2(structured2)
print(clock() - t4)
print(flattened4.last as Any)
$ swiftc -O flatmap.swift 
$ ./flatmap 
1003
Optional(-8153965953306206946)
3121
Optional(-8153965953306206946)
998
Optional(6228559059253042443)
3240
Optional(6228559059253042443)

Aren't there currently bugs where global-level declarations don't get the same optimizations as things inside types? Perhaps try putting everything statically inside an enum and see if that changes the output?

nope

import func Glibc.clock

let structured2:[(Int, Int, Int, Int)] = (0 ..< 1 << 16).map 
{
    _ in
    (
        Int.random(in: .min ... .max), 
        Int.random(in: .min ... .max), 
        Int.random(in: .min ... .max), 
        Int.random(in: .min ... .max)
    )
}

enum Flatten 
{
    static 
    func functional(_ structured:[(Int, Int, Int, Int)]) -> [Int]
    {
        return structured.flatMap 
        {
            [$0.0, $0.1, $0.2, $0.3]
        }
    }

    static 
    func loop(_ structured:[(Int, Int, Int, Int)]) -> [Int]
    {
        var flattened:[Int] = []
            flattened.reserveCapacity(structured.count << 2)
        for (x, y, z, w):(Int, Int, Int, Int) in structured 
        {
            flattened.append(x)
            flattened.append(y)
            flattened.append(z)
            flattened.append(w)
        }
        
        return flattened
    }
    
    static 
    func benchmark(_ structured:[(Int, Int, Int, Int)])
    {
        let t1:Int = clock()
        let flattened1:[Int] = loop(structured)
        print(clock() - t1)
        print(flattened1.last as Any)

        let t2:Int = clock()
        let flattened2:[Int] = functional(structured)
        print(clock() - t2)
        print(flattened2.last as Any)
    }
}

Flatten.benchmark(structured2)
$ swiftc -O flatmap.swift
$ ./flatmap 
1011
Optional(4744574673919927740)
3350
Optional(4744574673919927740)

I just landed a fix so that the swiftpm-based build should just work. My suggestion: download a swift-4.2 toolchain, use that swiftpm to build with swift build --configuration release, and then run the resulting swiftbench executable:

./swiftbench FlattenListLoop
./swiftbench FlattenListFlatMap

Keep in mind I need to land that PR first with the benchmark... but after that it should be good to go.

@taylorswift Just merged it. You should be able to use a nightly toolchain to build the benchmarks now without needing to use cmake or anything. @Aciid is committing an integration test to make sure that it keeps on working.

I got:

swift build -c release
'swiftbench' swift/swift/benchmark: warning: Ignoring duplicate product 'PrimsSplit' (static)
Compile ObjectiveCTests ObjectiveCTests.m
Compile Swift Module 'TestsUtils' (1 sources)
In file included from swift/swift/benchmark/utils/ObjectiveCTests/ObjectiveCTests.m:13:
swift/swift/benchmark/utils/ObjectiveCTests/ObjectiveCTests.h:13:9: fatal error: 'Foundation/Foundation.h' file not found

#import <Foundation/Foundation.h>
        ^~~~~~~~~~~~~~~~~~~~~~~~~

1 error generated.

It seems that benchmarks are not supported on Linux? (SR-6669)

I tried removing the Objective-C benchmarks and fixing up the Swift ones, but there seems to be a problem with the driver utils:

Linking ./.build/x86_64-unknown-linux/release/SwiftBench
swift/swift/benchmark/.build/x86_64-unknown-linux/release/DriverUtils.build/DriverUtils.swift.o(.data+0x138): error: undefined reference to '$sSo6rusageV9ru_maxrssSivpMV'
swift/swift/benchmark/.build/x86_64-unknown-linux/release/DriverUtils.build/DriverUtils.swift.o(.data+0x190): error: undefined reference to '$sSo6rusageV9ru_nivcswSivpMV'
swift/swift/benchmark/.build/x86_64-unknown-linux/release/DriverUtils.build/DriverUtils.swift.o(.data+0x1e8): error: undefined reference to '$sSo6rusageV8ru_nvcswSivpMV'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Yea. I am fixing that now. Give me a bit. I am iterating on getting it working here:

https://github.com/apple/swift/pull/20121

I finally got it to work. Something interesting I found:

#,TEST,SAMPLES,MIN(μs),MAX(μs),MEAN(μs),SD(μs),MEDIAN(μs)
315,FlattenListFlatMap,1,15943,15943,15943,0,15943
316,FlattenListLoop,1,3635,3635,3635,0,3635

So the perf difference may be more significant on Linux than on macOS. That being said, this was in a container, so take that with a grain of salt.

I just got it to work and landed both it and the test. You should be able to build this against a development snapshot from swift.org, so you don't need to build the compiler itself.

Here’s what I’m seeing:

$ .build/release/SwiftBench FlattenListLoop FlattenListFlatMap
#,TEST,SAMPLES,MIN(μs),MAX(μs),MEAN(μs),SD(μs),MEDIAN(μs)
315,FlattenListFlatMap,1,9088,9088,9088,0,9088
316,FlattenListLoop,1,2502,2502,2502,0,2502

Total performance tests executed: 2

The intermediate arrays ought to be stack allocated.

But I don't think it's reasonable to expect the compiler to take care of the array growth reallocations. Bear in mind that the closure can return any kind of sequence. Expecting the optimizer to look inside the closure, prove that it produces an array of the same length on every iteration, multiply that fixed length by the count of the original collection, and pre-allocate that much space in the result up front is unrealistic.
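
If the per-element width really is statically known, the caller can go a step further than reserveCapacity and write the components straight into uninitialized storage. Here's a sketch assuming Array.init(unsafeUninitializedCapacity:initializingWith:), which requires Swift 5.1 or later; the function name is just illustrative:

// Sketch: flatten with no intermediate arrays and no append bookkeeping.
// Requires Swift 5.1+ for Array.init(unsafeUninitializedCapacity:initializingWith:).
func flattenUninitialized(_ structured:[(Int, Int, Int, Int)]) -> [Int]
{
    return .init(unsafeUninitializedCapacity: structured.count << 2)
    {
        (buffer:inout UnsafeMutableBufferPointer<Int>, count:inout Int) in
        
        var i:Int = 0
        for (x, y, z, w):(Int, Int, Int, Int) in structured 
        {
            buffer[i    ] = x
            buffer[i + 1] = y
            buffer[i + 2] = z
            buffer[i + 3] = w
            i += 4
        }
        count = i    // tell Array how many elements were actually initialized
    }
}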
