Why is `Array.flatMap` so slow?

taylorswift · October 30, 2018, 1:09am

if Swift is going to offer FP-style constructs and present them as “the preferred idiom” it should do its best to make them as performant as the iterative style. maybe this is a problem with the Swift community’s general “use the high level construct and trust the compiler to make this fast” attitude but you can’t really advance that as a language philosophy and at the same time give up and say something as heavily used and fundamental as flatMap is too hard for the compiler to optimize.

Joe_Groff · October 30, 2018, 1:53am

flatMap-ing sequences is a particularly difficult case to optimize away intermediate collections from, since the number of result elements that can arise from each input element is arbitrary, and the computation of the size can't be cleanly separated from the computation of the result. It could be done with some special case optimization passes, eventually. There's nothing wrong with writing a for loop when you need one.

Ben_Cohen · October 30, 2018, 1:54am

But saying "flatMap should be as performant as the iterative loop" is not the same as thinking that the compiler can step in and transform code beyond what is practical. There are many ways to solve this problem.

The clear difference between your iterative and declarative implementations is not the functional style. It is that in the imperative version you, the programmer, are adding extra information: you know the final size of the array and you're reserving it up front.

In this case, it is theoretically (though not practically) possible for the compiler to deduce this information. But in many cases, this wouldn't even be possible. Maybe the mapping function is complex, or is non-deterministic. Maybe it's an opaque function call.

A different approach would be to add a size hint to the flat map call. This could be useful pattern for other similar transformations like filter or joined, where the user has better knowledge of the final length or perhaps is willing to expend more memory that they might not end up using to get a performance boost.

taylorswift · October 30, 2018, 2:18am

are we sure array reallocs are the reason for a 3.5x slowdown?

Ben_Cohen · October 30, 2018, 2:46am

I’d be very surprised if they were. But it will be a small factor when once other issues are eliminated. For example, in the joined benchmark reallocation accounts for about a 10% slowdown iirc.

taylorswift · October 30, 2018, 2:48am

okay that makes more sense cause i always assumed the reallocs were pretty cheap. any idea what’s killing this flatMap then?

Ben_Cohen · October 30, 2018, 10:24pm

From some experimentation, it seems to be the intermediate temporary arrays that are the problem. You can see this by changing the input array from [(Int,Int,Int,Int)] to [[Int]]. That pretty much eliminates the disparity, at least on my Darwin machine.

But it appears you can do even better if you conform the struct to itself be a collection, allowing it to be flat mapped directly:

extension S: RandomAccessCollection {
  var startIndex: Int { return 0 }
  var endIndex: Int { return 4 }
  subscript(i: Int) -> Int {
    switch i {
    case 0: return a
    case 1: return b
    case 2: return c
    case 3: return d
    default: fatalError()
    }
  }
}

// then either
func structCollectionFunctional(_ input: [S]) -> [Int] {
  return input.flatMap { $0 }
}

// or
func structCollectionLoopNoRes(_ input: [S]) -> [Int] {
  var flattened: [Int] = []
  for s in input {
    flattened += s
  }
  return flattened
}

This seems to give the optimizer a much better handle on what's going on, resulting in big speedups with both the functional and imperative approaches.

Michael_Gottesman · October 31, 2018, 12:16am

@Ben_Cohen can you add a version of this to that benchmark and add your note about the performance characteristics?

taylorswift · October 31, 2018, 12:18am

I did not think this could possibly be true, yet my tests show the same thing! (the percent change is small because my project benchmarks don’t isolate the flatMap from the rest of the library work, but in terms of transformation overhead, that’s a lot). Unfortunately it seems to absolutely tank cross-module performance.

before:

RGBA8   (interleaved, internal): 543,430
RGBA8   (interleaved, public  ): 536,800

after:

RGBA8   (interleaved, internal): 524,484
RGBA8   (interleaved, public  ): 1,215,603

This is how i implemented it:

public 
protocol _RGBAColor:RandomAccessCollection
{
    associatedtype Component:FixedWidthInteger & UnsignedInteger
    var r:Component 
    {
        get 
    }
    var g:Component 
    {
        get 
    }
    var b:Component 
    {
        get 
    }
    var a:Component 
    {
        get 
    }
}

extension _RGBAColor 
{
    public 
    var startIndex:Int 
    {
        return 0
    }
    public 
    var endIndex:Int 
    {
        return 4
    }
    
    public 
    subscript(index:Int) -> Component 
    {
        switch index 
        {
            case 0:
                return self.r
            case 1:
                return self.g 
            case 2:
                return self.b 
            case 3:
                return self.a
            default:
                fatalError("(_RGBAColor) index \(index) out of range")
        }
    }
}

extension Array where Element:_RGBAColor
{
    ...
    
    /** Flattens this matrix of RGBA colors into an unstructured array 
        of its interleaved components.
        
        *Inlinable*.
        
        - Returns: An unstructured array containing interleaved color components 
            in red, green, blue, alpha order.
        
        - Note: In most cases, it is better to temporarily rebind a structured pixel 
            matrix to a flattened array type than to convert it to interleaved form.
    */
    @inlinable
    public 
    func interleaved() -> [Element.Element] 
    {
        return self.flatMap{ (element:Element) in element }
    }
}


    @_fixed_layout
    public
    struct RGBA<Component>:Equatable, CustomStringConvertible, _RGBAColor
        where Component:FixedWidthInteger & UnsignedInteger
    {
        /// The red component of this color. 
        public
        let r:Component
        /// The green component of this color. 
        public 
        let g:Component
        /// The blue component of this color. 
        public 
        let b:Component
        /// The alpha component of this color. 
        public 
        let a:Component

        ...
    }

Any idea what’s going on? I feel like it has something to do with inlining getting inhibited or something.

I also noticed this only compiles if I set the return type of interleaved() to [Element.Element] and not Element.Component. Aren’t they supposed to be the same type?

Also, this is really interesting, though it still has a big drawback: the RandomAccessCollection conformance has to be public for the interleaved() method to be public. How can I keep this out of my library’s public API?

I tried running this on the small example file, and I’m seeing an improvement, but still about a 70% slowdown from the loop version without RandomAccessCollection conformance.

(loop with RandomAccessCollection) 798
(flatMap with RandomAccessCollection) 1794

(loop without RandomAccessCollection) 1064
(flatMap without RandomAccessCollection) 3179

import func Glibc.clock

struct S:RandomAccessCollection
{
    let a:Int, 
        b:Int, 
        c:Int, 
        d:Int 
    
    var startIndex:Int 
    {
        return 0
    }
    
    var endIndex:Int 
    {
        return 4
    }

    subscript(index:Int) -> Int 
    {
        switch index 
        {
            case 0:
                return self.a
            case 1:
                return self.b 
            case 2:
                return self.c 
            case 3:
                return self.d
            default:
                fatalError("(_RGBAColor) index \(index) out of range")
        }
    }
}

let structured:[S] = (0 ..< 1 << 16).map 
{
    _ in
    .init(  a: Int.random(in: .min ... .max), 
            b: Int.random(in: .min ... .max), 
            c: Int.random(in: .min ... .max), 
            d: Int.random(in: .min ... .max))
}

func flattenFunctional(_ structured:[S]) -> [Int]
{
    return structured.flatMap{$0}
}

func flattenLoop(_ structured:[S]) -> [Int]
{
    var flattened:[Int] = []
        flattened.reserveCapacity(structured.count << 2)
    for aggregate:S in structured 
    {
        flattened += aggregate
    }
    
    return flattened
}


let t1:Int = clock()
let flattened1:[Int] = flattenLoop(structured)
print(clock() - t1)
print(flattened1.last as Any)

let t2:Int = clock()
let flattened2:[Int] = flattenFunctional(structured)
print(clock() - t2)
print(flattened2.last as Any)

dabrahams · November 3, 2018, 5:36am

It should be possible to do the same thing, then append a lazy flatMap, thus preserving most of the functional flavah.