If Swift is going to offer FP-style constructs and present them as “the preferred idiom”, it should do its best to make them as performant as the iterative style. Maybe this is a problem with the Swift community’s general “use the high-level construct and trust the compiler to make it fast” attitude, but you can’t advance that as a language philosophy and at the same time give up and say that something as heavily used and fundamental as flatMap is too hard for the compiler to optimize.
flatMap-ing sequences is a particularly difficult case for optimizing away intermediate collections, since the number of result elements that can arise from each input element is arbitrary, and the computation of the size can't be cleanly separated from the computation of the result. It could be done with some special-case optimization passes, eventually. There's nothing wrong with writing a for loop when you need one.
But saying "flatMap should be as performant as the iterative loop" is not the same as expecting the compiler to step in and transform code beyond what is practical. There are many ways to solve this problem.
The clear difference between your iterative and declarative implementations is not the functional style. It is that in the imperative version you, the programmer, are adding extra information: you know the final size of the array and you're reserving it up front.
In this case, it is theoretically (though not practically) possible for the compiler to deduce this information. But in many cases, this wouldn't even be possible. Maybe the mapping function is complex, or is non-deterministic. Maybe it's an opaque function call.
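To make that difference concrete, here is a minimal sketch of the two shapes being compared; the function names and bodies are illustrative assumptions, not the benchmark code under discussion:

func flattenDeclarative(_ input: [(Int, Int, Int, Int)]) -> [Int] {
    // each closure call constructs a temporary four-element array
    return input.flatMap { [$0.0, $0.1, $0.2, $0.3] }
}

func flattenImperative(_ input: [(Int, Int, Int, Int)]) -> [Int] {
    var result: [Int] = []
    // the programmer-supplied "extra information": the exact final size
    result.reserveCapacity(input.count * 4)
    for element in input {
        result.append(element.0)
        result.append(element.1)
        result.append(element.2)
        result.append(element.3)
    }
    return result
}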
A different approach would be to add a size hint to the flatMap call. This could be a useful pattern for other similar transformations like filter or joined, where the user has better knowledge of the final length, or is perhaps willing to expend memory that they might not end up using in exchange for a performance boost.
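As a rough sketch of that idea (the expectedCount label and this extension are hypothetical, not standard library API):

extension Sequence {
    // Hypothetical flatMap variant that lets the caller pass size knowledge down.
    func flatMap<SegmentOfResult: Sequence>(
        expectedCount: Int,
        _ transform: (Element) throws -> SegmentOfResult
    ) rethrows -> [SegmentOfResult.Element] {
        var result: [SegmentOfResult.Element] = []
        // reserve once up front instead of growing the buffer geometrically
        result.reserveCapacity(expectedCount)
        for element in self {
            result.append(contentsOf: try transform(element))
        }
        return result
    }
}

// Example use, when the caller knows every element expands to four components:
// let flattened = pixels.flatMap(expectedCount: pixels.count * 4) { [$0.r, $0.g, $0.b, $0.a] }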
Are we sure array reallocs are the reason for a 3.5x slowdown?
I’d be very surprised if they were. But it will be a small factor once other issues are eliminated. For example, in the joined benchmark, reallocation accounts for about a 10% slowdown, IIRC.
Okay, that makes more sense, because I always assumed the reallocs were pretty cheap. Any idea what's killing this flatMap, then?
From some experimentation, it seems to be the intermediate temporary arrays that are the problem. You can see this by changing the input array from [(Int,Int,Int,Int)] to [[Int]]. That pretty much eliminates the disparity, at least on my Darwin machine.
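A minimal sketch of that experiment, with an assumed function name: once the input is [[Int]], the closure no longer has to build a temporary array for every element:

func flattenNestedFunctional(_ input: [[Int]]) -> [Int] {
    // the inner arrays already exist, so flatMap only copies their contents
    return input.flatMap { $0 }
}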
But it appears you can do even better if you conform the struct itself to be a collection, allowing it to be flat-mapped directly:
extension S: RandomAccessCollection {
    var startIndex: Int { return 0 }
    var endIndex: Int { return 4 }
    subscript(i: Int) -> Int {
        switch i {
        case 0: return a
        case 1: return b
        case 2: return c
        case 3: return d
        default: fatalError()
        }
    }
}

// then either
func structCollectionFunctional(_ input: [S]) -> [Int] {
    return input.flatMap { $0 }
}

// or
func structCollectionLoopNoRes(_ input: [S]) -> [Int] {
    var flattened: [Int] = []
    for s in input {
        flattened += s
    }
    return flattened
}
This seems to give the optimizer a much better handle on what's going on, resulting in big speedups with both the functional and imperative approaches.
@Ben_Cohen can you add a version of this to that benchmark and add your note about the performance characteristics?
I did not think this could possibly be true, yet my tests show the same thing! (the percent change is small because my project benchmarks don’t isolate the flatMap from the rest of the library work, but in terms of transformation overhead, that’s a lot). Unfortunately it seems to absolutely tank cross-module performance.
before:
RGBA8 (interleaved, internal): 543,430
RGBA8 (interleaved, public ): 536,800
after:
RGBA8 (interleaved, internal): 524,484
RGBA8 (interleaved, public ): 1,215,603
This is how I implemented it:
public
protocol _RGBAColor:RandomAccessCollection
{
    associatedtype Component:FixedWidthInteger & UnsignedInteger

    var r:Component
    {
        get
    }
    var g:Component
    {
        get
    }
    var b:Component
    {
        get
    }
    var a:Component
    {
        get
    }
}

extension _RGBAColor
{
    public
    var startIndex:Int
    {
        return 0
    }
    public
    var endIndex:Int
    {
        return 4
    }

    public
    subscript(index:Int) -> Component
    {
        switch index
        {
        case 0:
            return self.r
        case 1:
            return self.g
        case 2:
            return self.b
        case 3:
            return self.a
        default:
            fatalError("(_RGBAColor) index \(index) out of range")
        }
    }
}

extension Array where Element:_RGBAColor
{
    ...

    /** Flattens this matrix of RGBA colors into an unstructured array
        of its interleaved components.

        *Inlinable*.

        - Returns: An unstructured array containing interleaved color components
            in red, green, blue, alpha order.

        - Note: In most cases, it is better to temporarily rebind a structured pixel
            matrix to a flattened array type than to convert it to interleaved form.
    */
    @inlinable
    public
    func interleaved() -> [Element.Element]
    {
        return self.flatMap{ (element:Element) in element }
    }
}

@_fixed_layout
public
struct RGBA<Component>:Equatable, CustomStringConvertible, _RGBAColor
    where Component:FixedWidthInteger & UnsignedInteger
{
    /// The red component of this color.
    public
    let r:Component
    /// The green component of this color.
    public
    let g:Component
    /// The blue component of this color.
    public
    let b:Component
    /// The alpha component of this color.
    public
    let a:Component

    ...
}
Any idea what’s going on? I feel like it has something to do with inlining getting inhibited or something.
I also noticed this only compiles if I set the return type of interleaved() to [Element.Element] and not [Element.Component]. Aren’t they supposed to be the same type?
Also, this is really interesting, though it still has a big drawback: the RandomAccessCollection conformance has to be public for the interleaved() method to be public. How can I keep this out of my library’s public API?
I tried running this on the small example file, and I’m seeing an improvement, but still about a 70% slowdown compared to the loop version without the RandomAccessCollection conformance.
(loop with RandomAccessCollection) 798
(flatMap with RandomAccessCollection) 1794
(loop without RandomAccessCollection) 1064
(flatMap without RandomAccessCollection) 3179
import func Glibc.clock

struct S:RandomAccessCollection
{
    let a:Int,
        b:Int,
        c:Int,
        d:Int

    var startIndex:Int
    {
        return 0
    }
    var endIndex:Int
    {
        return 4
    }

    subscript(index:Int) -> Int
    {
        switch index
        {
        case 0:
            return self.a
        case 1:
            return self.b
        case 2:
            return self.c
        case 3:
            return self.d
        default:
            fatalError("(_RGBAColor) index \(index) out of range")
        }
    }
}

let structured:[S] = (0 ..< 1 << 16).map
{
    _ in
    .init(  a: Int.random(in: .min ... .max),
            b: Int.random(in: .min ... .max),
            c: Int.random(in: .min ... .max),
            d: Int.random(in: .min ... .max))
}

func flattenFunctional(_ structured:[S]) -> [Int]
{
    return structured.flatMap{ $0 }
}

func flattenLoop(_ structured:[S]) -> [Int]
{
    var flattened:[Int] = []
    flattened.reserveCapacity(structured.count << 2)
    for aggregate:S in structured
    {
        flattened += aggregate
    }
    return flattened
}

let t1:Int = clock()
let flattened1:[Int] = flattenLoop(structured)
print(clock() - t1)
print(flattened1.last as Any)

let t2:Int = clock()
let flattened2:[Int] = flattenFunctional(structured)
print(clock() - t2)
print(flattened2.last as Any)
It should be possible to do the same thing, then append a lazy flatMap, thus preserving most of the functional flavah.
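For example, a hedged sketch using the S type from the benchmark above (the function name is an assumption): flatten lazily through the collection conformance and collect the result once at the end:

func flattenLazyFunctional(_ structured: [S]) -> [Int]
{
    // `lazy` avoids building intermediate per-element arrays; the final
    // Array(_:) call collects the flattened elements (its storage may still
    // grow as it goes, since the total count is not known up front).
    return Array(structured.lazy.flatMap{ $0 })
}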