First I’d like to apologize for my longwinded ways. Recent `Sequence`-related threads in the forums lead me to think we are in a sort of “Blind men and an elephant” situation. I’m sorry if this will read like an origin story, but I think it’s better to describe all the nitty-gritty of my experience and let you decide what is relevant. I cannot do that alone, as I’m confused by it all. Perhaps I’m the only blind man here and someone with a more holistic perspective will be able to shed light on it for me and calm my growing worries. Let’s start with my motivating example:
Mandelbrot Set Playground
After the Swift introduction in 2014 I was very excited! Based on previous experience, I had filed Swift in my head as a statically compiled version of Scala, due to the similar fusion of functional and object-oriented programming styles, emphasis on generic code, strong type system and the need for seamless interoperability with the established platform predecessor (Objective-C and Java, respectively). But after the initial exploration I ultimately postponed switching to the language because of compiler and Xcode instability.
After a two-year hiatus, I restarted my intense exploration of the Swift language about a year and a half ago, in December 2016. Inspired by all the marvelous Swift Playgrounds that were coming out, I thought I’d teach my son about complex numbers and related math through the exploration of the amazing Mandelbrot set. Its visualization lends itself naturally to modeling with a sequence of numbers derived by iterative application of a simple function. In my opinion the imperative-style algorithm as expressed in most procedural languages terribly obfuscates the beautiful simplicity of the underlying calculation. I’m an ardent proponent of a descriptive style of programming, expressing your intent in terms of higher level concepts, rather than telling a machine step-by-step what to do. To my eyes, the functional programming style enables a much more natural expression of the Mandelbrot set computation, so I set out to build a playground based on this strikingly simple Haskell code.
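To give a concrete flavor of what I mean, here is a minimal sketch using today’s standard library (not the actual playground code; the tiny `Complex` helper and the `escapeTime` name are mine, defined only for illustration) of the escape-time iteration z → z·z + c written as a lazy sequence pipeline:

```swift
// A minimal sketch, not the actual playground code.
struct Complex {
    var re, im: Double
    static func + (a: Complex, b: Complex) -> Complex {
        return Complex(re: a.re + b.re, im: a.im + b.im)
    }
    static func * (a: Complex, b: Complex) -> Complex {
        return Complex(re: a.re * b.re - a.im * b.im,
                       im: a.re * b.im + a.im * b.re)
    }
    var normSquared: Double { return re * re + im * im }
}

/// Roughly: how many iterates of z → z·z + c stay within the escape radius of 2,
/// capped at `limit`.
func escapeTime(_ c: Complex, limit: Int) -> Int {
    return sequence(first: c) { z in z * z + c }   // Haskell: iterate (λz → z² + c) c
        .prefix(limit)                             // take limit
        .prefix(while: { $0.normSquared <= 4 })    // takeWhile ((<= 2) . magnitude)
        .reduce(0) { count, _ in count + 1 }       // length
}
```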
Looking at my code archive, I see that the Mandelbrot set was also the toy example I was playing with back in December 2014, probably inspired by Colin Eberhardt’s article. Swift’s collections framework was extremely anemic at the time. I had to roll my own implementations of the `take`, `takeWhile` and `iterate` functions from Haskell’s Prelude.
These I originally built as free functions producing `Generator`s. But serious performance issues made me try numerous variations (using `GeneratorOf` or custom `Sequence`s) to find the culprit or some workaround. Some of these were crashing at runtime when compiled with optimizations, so after filing a bug report I tabled the project.
When I dug it out after the two-year break and rewrote it to use `Iterator`s in Swift 3, I still had to manually back-port `prefix(while:)`, the `takeWhile` equivalent from SE-0045, before it shipped in Swift 3.1. The compiler was no longer crashing left and right, but the performance issues were still present.
Core Iteration
This gist FuncVsImperative shows the different kinds of iteration-and-summing variations I was exploring (excuse the few duplicated inconsistencies left over from the migration to Swift 3; that’s all I found in my archive). Basically I was doing a form of higher-level monkey testing or fuzzing: trying different but semantically equivalent expressions of the basic iteration patterns and their combinations (chained sequence operations) and observing their performance. To break it all down for myself, I started with measuring and comparing the performance of summing an integer range using `for-in`, manual `while` looping with an `Iterator`, and the `reduce` method, then built up to combining functions into bigger blocks.
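For reference, these are the three semantically equivalent summation variants I mean, sketched from memory rather than copied from the gist:

```swift
let range = 1..<10_000

// 1. for-in loop
var forInSum = 0
for i in range { forInSum += i }

// 2. manual while loop driving an Iterator
var whileSum = 0
var iterator = range.makeIterator()
while let i = iterator.next() { whileSum += i }

// 3. reduce
let reduceSum = range.reduce(0, +)

precondition(forInSum == whileSum && whileSum == reduceSum)
```

In an ideal world all three would compile down to the same tight loop.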
It was quite disturbing that after such intense investigation I wasn’t able to build much intuition for writing code that performed well. In my experience, high-level Swift code in functional style led to unpredictable results with regard to the performance of the final optimized build. The Swift compiler was at this time routinely tripped up by unknown (or hard to judge in advance) factors into a state where it failed to inline code or produce specializations for generic algorithms, or even introduced boxing for primitive types, making you pay the full price of reference counting for each and every element of the sequence processed in a tight loop. The variation in benchmark results between functionally equivalent expressions was sometimes two orders of magnitude.
Zero Cost Abstraction?
In my implementations I was initially striving for minimal code, leading me to build on top of `AnyIterator` and `AnySequence`, which have convenient closure-taking initializers.
Since my initial attempts had largely failed, probably because I had an incorrect model of a sufficiently smart compiler in my head, I turned my attention to the source code of the standard library to find out how to write performant code that plays to the strengths and avoids the pitfalls of the Swift compiler as it currently exists.
Judging by the standard library’s proliferation of concrete types implementing the `Sequence` and `Iterator` protocols, it seems that building on concrete types is the way to go. I wondered why creating ad-hoc `Iterator`s and `Sequence`s using functional composition would be inherently slower than pre-baking them as structs or classes. I feel this proliferation is a sign of a certain immaturity of the compiler at the time those parts of the standard library were written.
Cost of Type Erasure
I now realize that the low performance of my initial functional implementation comes from the use of `AnyIterator`. My penchant for concise code kept me reducing the implementation until I got this compact body inside the iterator (effectively 3 lines that aren’t boilerplate):
```swift
func takeWhile<S : Sequence>(
    _ includeElement: @escaping (S.Iterator.Element) -> Bool,
    source: S
) -> AnyIterator<S.Iterator.Element> {
    var takeMore = true
    var it = source.makeIterator()
    return AnyIterator {
        // Stop permanently after the first element that fails the predicate.
        guard takeMore, let e = it.next() else { return nil }
        takeMore = includeElement(e)
        return takeMore ? e : nil
    }
}
```
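For completeness, a hypothetical call site (the numbers are arbitrary); since `AnyIterator` also conforms to `Sequence`, the result can be consumed directly:

```swift
let belowFive = takeWhile({ $0 < 5 }, source: 1...100)
print(Array(belowFive))   // [1, 2, 3, 4]
```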
At first glance, `AnyIterator` seemed like the right tool:

- Convenient closure-taking initializer.
- Trivial `Sequence` protocol conformance.
- Encapsulation of uninteresting implementation details.
- Stops propagation of the deeply nested composite types all over the code base.
I didn’t realize that it is also a hard optimization barrier that currently stops the compiler from generating anything remotely fast. It is a performance trap!
If I understand correctly, one of the big ideas behind the Swift language is that most of it is defined in the standard library. Even primitive types like `Int` are conceptually `struct`s, and it’s the optimizing compiler that eliminates the cost of this abstraction, producing code that performs on par with languages that have them built in.
It seems like this philosophy was used to implement the type erasure using classes or capturing closures (which are classes in disguise, if I’m not mistaken), but the zero-cost abstraction seems to fall on its face in this case. Maybe I’m mistaken here, but as I understand it, classes (and capturing closures) are always heap allocated and passing them to functions always incurs reference counting costs. Is there a more lightweight way to achieve type erasure for value-based types? Can I have a type erasure of `Iterator`/`Sequence` implemented as a `struct` that wouldn’t involve reference counting? Is our generics story still missing an important part? Is there anything on Swift’s roadmap that will improve this? Will generalized existentials also impact performance or just the expressivity? I have a feeling that the change of default calling convention to `borrowed` that Michael Gottesmann is working on might help here, but is that enough?
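To illustrate what I keep falling back to in the meantime (a sketch with names of my own choosing, and not real type erasure at all): a concrete generic struct that keeps the underlying iterator type visible in its signature, so the optimizer can specialize it, at the cost of leaking deeply nested types everywhere. Note that the stored predicate closure is still a reference-counted object of its own.

```swift
struct TakeWhileIterator<I: IteratorProtocol>: IteratorProtocol, Sequence {
    var base: I
    let shouldTake: (I.Element) -> Bool
    var done = false

    init(_ base: I, while shouldTake: @escaping (I.Element) -> Bool) {
        self.base = base
        self.shouldTake = shouldTake
    }

    mutating func next() -> I.Element? {
        // Stop permanently once the predicate fails or the base runs dry.
        guard !done, let element = base.next(), shouldTake(element) else {
            done = true
            return nil
        }
        return element
    }
}

// let sum = TakeWhileIterator((1...).makeIterator(), while: { $0 < 100 })
//     .reduce(0, +)
```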
Unfold Sequence
Creating concrete types for all iterators and sequences for performance reasons, when their functional body can be expressed in 2–3 lines of code, seemed really unfortunate to me. So I kept searching and found the `sequence` function that was introduced in SE-0094. Thanks to the perseverance of @Lily_Ballard and @Erica_Sadun it made it through the rocky evolution review and landed in Swift 3. I totally fell in love with this beautifully elegant creature!
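For anyone who hasn’t played with it, the two flavors look roughly like this (they loosely correspond to Haskell’s `iterate` and `unfoldr`):

```swift
// Infinite sequence: 1, 2, 4, 8, … (lazily generated, nothing runs yet)
let powersOfTwo = sequence(first: 1) { $0 * 2 }

// Stateful variant: 3, 2, 1, then nil ends the sequence
let countdown = sequence(state: 3) { (n: inout Int) -> Int? in
    guard n > 0 else { return nil }
    defer { n -= 1 }
    return n
}

print(Array(countdown))              // [3, 2, 1]
print(Array(powersOfTwo.prefix(5)))  // [1, 2, 4, 8, 16]
```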
Since I was back-porting `prefix(while:)` from Swift 3.1 anyway, I also converted the `AnyIterator`-based `takeWhile` into a `sequence`-based one as a code kata exercise. It was fun!
```swift
func takeWhile<S: Sequence>(
    while predicate: @escaping (S.Element) -> Bool,
    seq: S
) -> UnfoldSequence<S.Element, PredicatedIterator<S.Iterator>> {
    return sequence(
        state: PredicatedIterator(predicate, seq.makeIterator()),
        next: { (state: inout PredicatedIterator<S.Iterator>) -> S.Element? in
            guard let e = state.iterator.next() else { return nil }
            return state.predicate(e) ? e : nil
        })
}
```
The main body of the closure reduces to just two lines, because `UnfoldSequence` takes care of satisfying the post-`nil` guarantee introduced in SE-0052, hiding its cost from you.
The `PredicatedIterator` would ideally look like this:
```swift
public typealias PredicatedIterator = (predicate: Predicate, iterator: Self.Iterator)
```
but that crashes the compiler, because re-abstraction for `inout` values is not yet implemented. To work around this limitation, it’s defined as:
```swift
public struct PredicatedIterator<I: IteratorProtocol> {
    var predicate: (I.Element) -> Bool
    var iterator: I
    init(_ _predicate: @escaping (I.Element) -> Bool, _ _iterator: I) {
        predicate = _predicate
        iterator = _iterator
    }
}
```
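A hypothetical call site, just to show the resulting shape (the concrete `UnfoldSequence<Int, PredicatedIterator<…>>` type stays conveniently hidden behind type inference):

```swift
let belowTen = takeWhile(while: { $0 < 10 }, seq: 1...100)
print(Array(belowTen))   // [1, 2, 3, 4, 5, 6, 7, 8, 9]
```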
I came up with the following `sequence`-based replacements for the concrete Sequence and Iterator types produced by various sequence methods: `enumerated`, `zip`, `prefix`, `prefix(while:)`, `dropFirst`, `drop(while:)`.
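As one example of what these look like, here is a sketch of the `zip` replacement (the name `zipSeq` is mine, to avoid clashing with the standard library’s `zip`; the tuple state holds only iterators, no closures, so it shouldn’t hit the same `inout` re-abstraction limitation as above):

```swift
func zipSeq<A: Sequence, B: Sequence>(_ a: A, _ b: B)
    -> UnfoldSequence<(A.Element, B.Element), (A.Iterator, B.Iterator)> {
    return sequence(state: (a.makeIterator(), b.makeIterator())) {
        (state: inout (A.Iterator, B.Iterator)) -> (A.Element, B.Element)? in
        guard let x = state.0.next(), let y = state.1.next() else { return nil }
        return (x, y)
    }
}

// Array(zipSeq(1...3, ["a", "b", "c"]))  // [(1, "a"), (2, "b"), (3, "c")]
```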
Why This, Why That and Why?
The reason I started to explore this area is that I made a mistake when Swift 3.1 made my back-port of `prefix(while:)` unnecessary. I thought I had discovered a major performance regression in Swift 3.1, but the real issue was that my back-ported version in 3.0 used a lazy implementation, and I wasn’t calling `.lazy` after switching to the standard library implementation in 3.1. That’s how I discovered that chaining several eager sequence operations is much slower than chaining lazy sequence operations. My questions about the implementation details of `AnySequence`/`AnyIterator` on the compiler list went mostly without response (maybe @dabrahams can chime in now?).
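Concretely, the eager-versus-lazy difference I mean is between chains like these two (the numbers are only placeholders):

```swift
// A plain Sequence (not a Collection), like the ones produced by `sequence`:
let numbers = sequence(first: 0) { $0 < 4095 ? $0 + 1 : nil }

// Eager: each step goes through the default Sequence implementation
// and returns a type-erased AnySequence.
let eagerSum = numbers.drop(while: { $0 < 1024 })
                      .prefix(while: { $0 < 3072 })
                      .reduce(0, +)

// Lazy: the same chain goes through the lazy wrappers instead and is
// evaluated in a single fused pass over the elements.
let lazySum = numbers.lazy.drop(while: { $0 < 1024 })
                          .prefix(while: { $0 < 3072 })
                          .reduce(0, +)
```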
This led to the exploration of the curiously ad-hoc implementations of the default `Sequence` operations, with their strange mix of eager and lazy implementations. The `prefix` and `drop(while:)` operations were implemented lazily using internal classes, while the other operations are eager and accumulate their results in an `Array`. All of this is masked by them all returning the type-erased `AnySequence`. Then there’s the wholly parallel world of `LazySequence`. If it interests you, I’ve documented some of the history of how it all came to be in the discussion of the merged PR #9811 that improved the performance of `prefix` and `drop(while:)` by 50% just by switching them from `class` to `struct`.
That PR was originally just a benchmark to provide supporting evidence for another PR seeking to switch all eager `Sequence` methods to return `Array` directly instead of `AnySequence` (but that was crashing the compiler a year ago…), as described in this pre-pitch proposal.
Cost of Abstractions, Revisited
The table below summarizes the runtimes (in μs) measured by the sequence slicing benchmarks I added a year ago. While adding support for GYB (Generate Your Boilerplate) in benchmarks, it was easy to add more comprehensive performance tests. All of these benchmarks process a sequence of 4096 `Int` elements:
- `DropFirst` and `DropWhile` skip the first 1024 elements and sum the remaining 3072.
- `DropLast` drops the last 3072 elements and sums the first 1024.
- `Prefix` and `PrefixWhile` sum the first 3072 elements and skip the rest.
- `Suffix` skips the first 3072 elements and sums the last 1024.
This means part of the sequence is processed internally by the operation and the rest is vended out as a new `SubSequence` for processing in a `for-in` loop. The use of the primitive value type `Int` here is significant, because it stresses how good the Swift compiler is at eliminating the abstractions. If we were iterating over sequences of classes, we would have to incur the cost of reference counting as the elements are passed between the sequence’s `Iterator`s and our summation loop etc. Another important factor is how well the compiler can specialize the generic `Sequence` algorithms for the concrete case of our `Int` summation. One more aspect is turning the virtual dispatch of methods into direct dispatch. In an ideal case all these abstractions should fall away and yield code that performs on par with manual summation in a `for` loop in C.

We are processing 3 basic underlying sequence types: `Array`, `Range` and `UnfoldSequence`. The first two are also `Collection`s that have optimized implementations of the slicing methods defined in the `Sequence` protocol. This means they can skip around in constant time, be processed from both sides and can return optimized `SubSequence` types like `ArraySlice`. The `UnfoldSequence` returned from the `sequence` function represents a generated sequence, which must be processed like a stream, one element at a time, from the beginning, using the default implementations of the `Sequence` slicing operations.
Since all the default implementations of the `Sequence` slicing operations return the type-erased `AnySequence` as their `SubSequence`, wrapping the 3 base types in it simulates the operation being performed as the second method in a chain and stresses how well the compiler deals with the type erasure. (This is the case with the default eager methods on `Sequence`, which wrap an `Array` in `AnySequence`… but I now realize that that particular case isn’t measured…) One factor is the ability to see through it and use the optimized methods from the refined `Collection` types, another is dealing with the closures used to hide the underlying type.
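To make the setup concrete, this is roughly the shape of one such benchmark body; it’s a sketch, not the actual code from the Swift Benchmark Suite, and the function name is mine:

```swift
// Roughly the DropFirst-over-AnySequence case: wrap the base sequence,
// apply the slicing operation and sum the vended SubSequence in a for-in loop.
func runDropFirstAnySequence(_ n: Int) {
    for _ in 1...n {
        let s = AnySequence(sequence(first: 0) { $0 < 4095 ? $0 + 1 : nil })
        var sum = 0
        for element in s.dropFirst(1024) {
            sum += element
        }
        precondition(sum == (1024..<4096).reduce(0, +))
    }
}
```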
Similarly to `sequence`, the underlying `IndexingIterator` from `Range` is wrapped inside `AnySequence` to test the default `Sequence` slicing operations without access to the optimized implementations from `Collection`. To test a different existential collection, `Array` is wrapped in `AnyCollection` instead of `AnySequence`. (I’m not sure if that is used anywhere in practice, but since it exists in the standard library, I took the liberty to test its performance, too…)
Another measured performance axis is the use of `.lazy` to switch from eager to lazy processing. Overall this leads to an exploding number of benchmarks: 14 sequences per operation, all obtained as variations on the 3 tested base types.
Sequence | DropFirst | DropLast | DropWhile | Prefix | PrefixWhile | Suffix |
---|---|---|---|---|---|---|
AnyCollection | 179 | 71 | 232 | 180 | 334 | 71 |
AnyCollectionLazy | 327093 | 110117 | 332 | 327441 | 238 | 110531 |
AnySeqCRangeIter | 87412 | 26689 | 73902 | 68193 | 64800 | 22462 |
AnySeqCRangeIterLazy | 87224 | 26864 | 331 | 68367 | 237 | 22479 |
AnySeqCntRange | 160 | 51 | 212 | 160 | 315 | 51 |
AnySeqCntRangeLazy | 160 | 51 | 332 | 160 | 237 | 51 |
AnySequence | 28524 | 39602 | 28833 | 23534 | 71075 | 35484 |
AnySequenceLazy | 28464 | 40369 | 11557 | 23572 | 8962 | 35588 |
Array | 162 | 46 | 230 | 123 | 326 | 59 |
ArrayLazy | 198 | 56 | 507 | 159 | 197 | 69 |
CountableRange | 77 | 26 | 79 | 77 | 81 | 26 |
CountableRangeLazy | 77 | 26 | 254 | 77 | 77 | 26 |
Sequence | 10767 | 9126 | 8098 | 8148 | 6288 | 34846 |
SequenceLazy | 11842 | 8993 | 193 | 9027 | 116 | 35574 |
These are results from my machine as measured during my attempt to improve the robustness of the Swift Benchmark Suite. Here’s this table in Numbers, with various views comparing:
- Sequence Slowdown relative to fastest Operation
- Operation Slowdown relative to fastest Sequence
- Lazy vs Eager
- Slowdown Any vs Direct
In my opinion, these results are not a very good look. `AnyCollectionLazy` can be 4,248x slower than `CountableRange` in 4 out of 6 operations. `AnyCollectionLazy` is 1,376x slower doing `Prefix` than the almost equivalent `PrefixWhile`. So let’s ignore that one for the moment. Making the operation lazy can make it 3x slower, or 40x faster. Wrapping something in `AnySequence` usually makes it 2–4x slower, except for the cases when it’s 11x, 60x or 77x slower. The `UnfoldSequence` stands out as 60x slower than the other two sequences when used directly, with the exception of the lazy `DropWhile` and `PrefixWhile` variants, which perform completely on par. More on this later.
What is going on there?!? Well, in the case of `DropFirstAnyCollectionLazy` we have apparently achieved Kandinsky and Mondrian levels of abstraction and the Swift compiler just gave up on optimizing this code. Looking at the stack trace in Instruments shows a Jackson Pollock-esque mix of protocol witness functions nested 6 levels deep for each wrapped generic protocol implementation, sprinkled with `swift_unknownRetain`, `swift_getGenericMetadata`, `swift_unknownRelease`, `_swift_release_dealloc`. The trace from Instruments is some 183 calls deep.
The contrast with how effectively the Swift compiler eliminates the abstractions for the underlying `Range` without the type erasure is stunning: the whole `DropFirstCountableRange`, as well as `DropFirstCountableRangeLazy`, boils down to a single call to `protocol witness for static Equatable.== infix(_:_:) in conformance Int`.
I thought in all fairness it would probably make sense to also test a benchmark that’s based on the `IndexingIterator` (as is the case with `Range`), used directly as a `Sequence` and wrapped in `AnySequence`. So I tested `StrideTo`. The `DropWhileAnyStrideTo` didn’t fail so horribly: its trace was only 95 calls deep and it was about an order of magnitude faster than the worst case.
So to sum up, using type erasure with the wrappers provided by the standard library can in the worst case lead to:
- boxing of the value type, which incurs the reference counting penalty;
- failure to specialize the generic algorithm, which results in performing all operations on the element type using the runtime generics witness table.
Procedural Sidenote
Since adding these benchmarks, one year has passed. I have been sidetracked by working on improving the robustness of the measurements in the Swift Benchmark Suite (SBS) and have not followed up on the above-mentioned anomalous results. Their mere presence in the SBS did not magically improve the situation. We seem to be lacking good visibility into the complex web of performance interactions in Swift. Quite a few important and core parts seem to lack thorough coverage, and without someone taking a personal interest, not much happens after a benchmark is added. I probably should have at least filed bugs for the slow benchmarks, as coming up with SIL optimization passes is too far out of my league at the moment and I don’t know how else I can help. I’m sorry about failing to follow up!