Why is `Array.flatMap` so slow?

I’ve got a loop that looks like this:

            return image.rgba(of: Component.self).map 
            { 
                let array:[Component] = $0.flatMap 
                {
                    [$0.r, $0.g, $0.b, $0.a]
                } 
                
                return (array, size: image.properties.size) 
            }

When I rewrite it with a normal for loop like this:

            return image.rgba(of: Component.self).map 
            { 
                var array:[Component] = []
                    array.reserveCapacity($0.count << 2)
                for pixel:RGBA<Component> in $0 
                {
                    array.append(pixel.r)
                    array.append(pixel.g)
                    array.append(pixel.b)
                    array.append(pixel.a)
                }
                
                return (array, size: image.properties.size) 
            }

It runs ~30% faster!

I tested on a large input and found the .rgba(of:) call alone takes 488,286 clock ticks, out of a total of 570,665 ticks for the whole function with the for loop, and 789,350 ticks for the flatMap version. Subtracting the shared .rgba(of:) cost, that's roughly 82,000 ticks for the loop itself versus roughly 301,000 for the flatMap itself, so the flatMap is almost 4x slower than the for loop. What is going on?

I ran this on a release configuration, of course.

Extra allocations, both for the intermediate arrays (which don't exist in the fast version) and for resizing the final array (which gets its capacity reserved up front in the fast version).
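
For context, Sequence.flatMap behaves roughly like the sketch below (simplified and generic, not the actual stdlib source; the function name is just illustrative). Every call to the closure materializes a small heap-allocated array, and the result array has to grow repeatedly because its final size isn't known up front.

// Roughly what the flatMap call above boils down to (illustrative sketch only).
func flatMapSketch<T, U>(_ elements:[T], _ transform:(T) -> [U]) -> [U]
{
    var result:[U] = []                       // no capacity reserved; final size unknown
    for element:T in elements 
    {
        let segment:[U] = transform(element)  // heap-allocates a tiny temporary array
        result.append(contentsOf: segment)    // may trigger a growth reallocation and copy
    }
    return result
}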

Isn’t this the kind of thing everyone tells us “the compiler is supposed to take care of”?

@taylorswift can you provide a small test case that shows this? Or even better, a single-source benchmark that shows this behavior?

// flatmap.swift
import func Glibc.clock

struct S 
{
    let a:Int, 
        b:Int, 
        c:Int, 
        d:Int 
}

let structured:[S] = (0 ..< 1 << 16).map 
{
    _ in
    .init(  a: Int.random(in: .min ... .max), 
            b: Int.random(in: .min ... .max), 
            c: Int.random(in: .min ... .max), 
            d: Int.random(in: .min ... .max))
}

func flattenFunctional(_ structured:[S]) -> [Int]
{
    return structured.flatMap 
    {
        [$0.a, $0.b, $0.c, $0.d]
    }
}

func flattenLoop(_ structured:[S]) -> [Int]
{
    var flattened:[Int] = []
        flattened.reserveCapacity(structured.count << 2)
    for aggregate:S in structured 
    {
        flattened.append(aggregate.a)
        flattened.append(aggregate.b)
        flattened.append(aggregate.c)
        flattened.append(aggregate.d)
    }
    
    return flattened
}

let t1:Int = clock()
let flattened1:[Int] = flattenLoop(structured)
print(clock() - t1)
print(flattened1.last as Any)

let t2:Int = clock()
let flattened2:[Int] = flattenFunctional(structured)
print(clock() - t2)
print(flattened2.last as Any)
$ swiftc -O flatmap.swift 
$ ./flatmap 
1056
Optional(1651662832467973056)
3168
Optional(1651662832467973056)

Thanks! I'll throw it in a PR in a bit.

Can you verify that this shows the same perf characteristics you are seeing?

When I tested it locally I was seeing around an 11% change:

(38460.0 - 34605)/34605.0 = 0.11140008669267447

(I am assuming that the old was the non-flatMap version and the new was the flatMap version.) But I did not really stabilize my CPU, so take that with a grain of salt.

Is there a reason why flatMap doesn't store the segments, sum their counts, reserve capacity up front, and then populate it?

It would take two passes over the segments (one to sum the counts, another to copy the values), but that hit might be worth it in cases where the segments are really small and can cause many array reallocations.
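
Something along these lines, as a rough illustration (a hypothetical two-pass flatten, not what the stdlib does today; twoPassFlatten is just a made-up name):

// Hypothetical two-pass flatten: materialize the segments, sum their counts,
// reserve the exact capacity, then copy the values over.
func twoPassFlatten<T, U>(_ elements:[T], _ transform:(T) -> [U]) -> [U]
{
    let segments:[[U]] = elements.map(transform)          // pass 1: build every segment
    let total:Int      = segments.reduce(0) { $0 + $1.count }

    var result:[U] = []
        result.reserveCapacity(total)                     // one allocation for the result
    for segment:[U] in segments                           // pass 2: copy the values
    {
        result.append(contentsOf: segment)
    }
    return result
}

It still pays for the intermediate segment arrays and keeps all of them alive at once, so it only saves the growth reallocations of the result.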

I don’t know how to run that file, but I modified my test file to use a tuple like you did, and it’s still showing the >3x difference.

// flatmap.swift 
import func Glibc.clock

struct S 
{
    let a:Int, 
        b:Int, 
        c:Int, 
        d:Int 
}

let structured:[S] = (0 ..< 1 << 16).map 
{
    _ in
    .init(  a: Int.random(in: .min ... .max), 
            b: Int.random(in: .min ... .max), 
            c: Int.random(in: .min ... .max), 
            d: Int.random(in: .min ... .max))
}

let structured2:[(Int, Int, Int, Int)] = (0 ..< 1 << 16).map 
{
    _ in
    (
        Int.random(in: .min ... .max), 
        Int.random(in: .min ... .max), 
        Int.random(in: .min ... .max), 
        Int.random(in: .min ... .max)
    )
}

func flattenFunctional(_ structured:[S]) -> [Int]
{
    return structured.flatMap 
    {
        [$0.a, $0.b, $0.c, $0.d]
    }
}

func flattenLoop(_ structured:[S]) -> [Int]
{
    var flattened:[Int] = []
        flattened.reserveCapacity(structured.count << 2)
    for aggregate:S in structured 
    {
        flattened.append(aggregate.a)
        flattened.append(aggregate.b)
        flattened.append(aggregate.c)
        flattened.append(aggregate.d)
    }
    
    return flattened
}

func flattenFunctional2(_ structured:[(Int, Int, Int, Int)]) -> [Int]
{
    return structured.flatMap 
    {
        [$0.0, $0.1, $0.2, $0.3]
    }
}

func flattenLoop2(_ structured:[(Int, Int, Int, Int)]) -> [Int]
{
    var flattened:[Int] = []
        flattened.reserveCapacity(structured.count << 2)
    for (x, y, z, w):(Int, Int, Int, Int) in structured 
    {
        flattened.append(x)
        flattened.append(y)
        flattened.append(z)
        flattened.append(w)
    }
    
    return flattened
}

let t1:Int = clock()
let flattened1:[Int] = flattenLoop(structured)
print(clock() - t1)
print(flattened1.last as Any)

let t2:Int = clock()
let flattened2:[Int] = flattenFunctional(structured)
print(clock() - t2)
print(flattened2.last as Any)


let t3:Int = clock()
let flattened3:[Int] = flattenLoop2(structured2)
print(clock() - t3)
print(flattened3.last as Any)

let t4:Int = clock()
let flattened4:[Int] = flattenFunctional2(structured2)
print(clock() - t4)
print(flattened4.last as Any)
$ swiftc -O flatmap.swift 
$ ./flatmap 
1003
Optional(-8153965953306206946)
3121
Optional(-8153965953306206946)
998
Optional(6228559059253042443)
3240
Optional(6228559059253042443)

Aren't there currently bugs where global-level declarations don't get the same optimizations as things inside types? Perhaps try putting everything statically inside an enum and see if that changes the output?

nope

import func Glibc.clock

let structured2:[(Int, Int, Int, Int)] = (0 ..< 1 << 16).map 
{
    _ in
    (
        Int.random(in: .min ... .max), 
        Int.random(in: .min ... .max), 
        Int.random(in: .min ... .max), 
        Int.random(in: .min ... .max)
    )
}

enum Flatten 
{
    static 
    func functional(_ structured:[(Int, Int, Int, Int)]) -> [Int]
    {
        return structured.flatMap 
        {
            [$0.0, $0.1, $0.2, $0.3]
        }
    }

    static 
    func loop(_ structured:[(Int, Int, Int, Int)]) -> [Int]
    {
        var flattened:[Int] = []
            flattened.reserveCapacity(structured.count << 2)
        for (x, y, z, w):(Int, Int, Int, Int) in structured 
        {
            flattened.append(x)
            flattened.append(y)
            flattened.append(z)
            flattened.append(w)
        }
        
        return flattened
    }
    
    static 
    func benchmark(_ structured:[(Int, Int, Int, Int)])
    {
        let t1:Int = clock()
        let flattened1:[Int] = loop(structured)
        print(clock() - t1)
        print(flattened1.last as Any)

        let t2:Int = clock()
        let flattened2:[Int] = functional(structured)
        print(clock() - t2)
        print(flattened2.last as Any)
    }
}

Flatten.benchmark(structured2)
$ swiftc -O flatmap.swift
$ ./flatmap 
1011
Optional(4744574673919927740)
3350
Optional(4744574673919927740)

I just landed a fix so that the swiftpm-based build should just work. My suggestion: download a swift-4.2 toolchain, use that swiftpm to build with swift build --configuration release, and then run the resulting swiftbench executable:

./swiftbench FlattenListLoop
./swiftbench FlattenListFlatMap

Keep in mind I need to land that PR first with the benchmark... but after that it should be good to go.

@taylorswift Just merged it. You should be able to use a nightly toolchain to build the benchmarks now without needing to use cmake or anything. @Aciid is committing an integration test to make sure that it keeps on working.

I got:

swift build -c release
'swiftbench' swift/swift/benchmark: warning: Ignoring duplicate product 'PrimsSplit' (static)
Compile ObjectiveCTests ObjectiveCTests.m
Compile Swift Module 'TestsUtils' (1 sources)
In file included from swift/swift/benchmark/utils/ObjectiveCTests/ObjectiveCTests.m:13:
swift/swift/benchmark/utils/ObjectiveCTests/ObjectiveCTests.h:13:9: fatal error: 'Foundation/Foundation.h' file not found

#import <Foundation/Foundation.h>
        ^~~~~~~~~~~~~~~~~~~~~~~~~

1 error generated.

It seems that benchmarks are not supported on Linux? (SR-6669)

I tried removing the Objective-C benchmarks and fixing up the Swift ones, but there seems to be a problem with the driver utils:

Linking ./.build/x86_64-unknown-linux/release/SwiftBench
swift/swift/benchmark/.build/x86_64-unknown-linux/release/DriverUtils.build/DriverUtils.swift.o(.data+0x138): error: undefined reference to '$sSo6rusageV9ru_maxrssSivpMV'
swift/swift/benchmark/.build/x86_64-unknown-linux/release/DriverUtils.build/DriverUtils.swift.o(.data+0x190): error: undefined reference to '$sSo6rusageV9ru_nivcswSivpMV'
swift/swift/benchmark/.build/x86_64-unknown-linux/release/DriverUtils.build/DriverUtils.swift.o(.data+0x1e8): error: undefined reference to '$sSo6rusageV8ru_nvcswSivpMV'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Yea. I am fixing that now. Give me a bit. I am iterating on getting it working here:

https://github.com/apple/swift/pull/20121

I finally got it to work. Something interesting I found:

#,TEST,SAMPLES,MIN(μs),MAX(μs),MEAN(μs),SD(μs),MEDIAN(μs)
315,FlattenListFlatMap,1,15943,15943,15943,0,15943
316,FlattenListLoop,1,3635,3635,3635,0,3635

So the perf difference may be more significant on Linux than on macOS. That being said, this was in a container, so take that with a grain of salt.

I just got it to work and landed both it and the test. You should be able to build this against a development snapshot from swift.org, so you don't need to build the compiler itself.

Here’s what I’m seeing:

$ .build/release/SwiftBench FlattenListLoop FlattenListFlatMap
#,TEST,SAMPLES,MIN(μs),MAX(μs),MEAN(μs),SD(μs),MEDIAN(μs)
315,FlattenListFlatMap,1,9088,9088,9088,0,9088
316,FlattenListLoop,1,2502,2502,2502,0,2502

Total performance tests executed: 2

The intermediate arrays ought to be stack allocated.

But I don't think it's reasonable to expect the compiler to take care of the array growth reallocations. Bear in mind that the closure can return any kind of sequence. Expecting the optimizer to look inside the closure, prove that it produces an array of the same length on every iteration, multiply that fixed length by the count of the original collection, and pre-allocate that much space in the result up front is unrealistic.
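
If the per-element width really is statically known, the caller can go a step further than reserveCapacity and write the components straight into uninitialized storage. Here's a sketch assuming Array.init(unsafeUninitializedCapacity:initializingWith:), which requires Swift 5.1 or later; the function name is just illustrative:

// Sketch: flatten with no intermediate arrays and no append bookkeeping.
// Requires Swift 5.1+ for Array.init(unsafeUninitializedCapacity:initializingWith:).
func flattenUninitialized(_ structured:[(Int, Int, Int, Int)]) -> [Int]
{
    return .init(unsafeUninitializedCapacity: structured.count << 2)
    {
        (buffer:inout UnsafeMutableBufferPointer<Int>, count:inout Int) in
        
        var i:Int = 0
        for (x, y, z, w):(Int, Int, Int, Int) in structured 
        {
            buffer[i    ] = x
            buffer[i + 1] = y
            buffer[i + 2] = z
            buffer[i + 3] = w
            i += 4
        }
        count = i    // tell Array how many elements were actually initialized
    }
}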
