Troubling `Sequence`

First I’d like to apologize for my longwinded ways. Recent Sequence-related threads in the forums lead me to think we are in a sort of “Blind men and an elephant” situation. I’m sorry if this reads like an origin story, but I think it’s better to describe all the nitty-gritty of my experience and let you decide what is relevant. I cannot do that alone, as I’m confused by it all. Perhaps I’m the only blind man here and someone with a more holistic perspective will be able to enlighten me and calm my growing worries. Let’s start with my motivating example:

Mandelbrot Set Playground

After the Swift introduction in 2014 I was very excited! Based on previous experiences, I had filed Swift in my head as a statically compiled version of Scala, due to the similar fusion of functional and object-oriented programming styles, emphasis on generic code, strong type system and the need for seamless interoperability with the established platform predecessor (Objective-C and Java, respectively). But after the initial exploration I ultimately postponed switching to the language because of compiler and Xcode instability.

After a two-year hiatus, I restarted the intense exploration of the Swift language about a year and a half ago, in December 2016. Inspired by all the marvelous Swift Playgrounds that were coming out, I thought I’d teach my son about complex numbers and related math through the exploration of the amazing Mandelbrot set. Its visualization lends itself naturally to modeling using a sequence of numbers derived by iterative application of a simple function. In my opinion the imperative-style algorithm as expressed in most procedural languages terribly obfuscates the beautiful simplicity of the underlying calculation. I’m an ardent proponent of a descriptive style of programming, expressing your intent in terms of higher-level concepts rather than telling a machine step-by-step what to do. To my eyes, the functional programming style enables a much more natural expression of the Mandelbrot set computation, so I set out to build a playground based on this strikingly simple Haskell code :heart_eyes:.

Looking at my code archive, I see that the Mandelbrot set was also the toy example I was playing with back in December 2014, probably inspired by Colin Eberhardt’s article. Swift’s collections framework was extremely anemic at the time. I had to roll my own implementations of the take, takeWhile and iterate functions from Haskell’s Prelude.

I originally built these as free functions producing Generators. But serious performance issues made me try numerous variations (using GeneratorOf or custom Sequences) to find the culprit or some workaround. Some of these were crashing at runtime when compiled with optimizations, so after filing a bug report I tabled the project.

When I dug the project out after a two-year break and rewrote it to use Iterators in Swift 3, I still had to manually back-port the takeWhile equivalent, prefix(while:), from SE-0045 before it shipped in Swift 3.1. The compiler was no longer crashing left and right, but the performance issues were still present.

Core Iteration

This gist FuncVsImperative shows the different kinds of iteration-and-summing variations I was exploring (excuse the few duplicate inconsistencies after the migration to Swift 3; that’s all I found in my archive). Basically I was doing a form of higher-level monkey testing or fuzzing: trying different but semantically equivalent expressions of the basic iteration patterns and their combinations (chained sequence operations) and observing their performance. To break it all down for myself, I started with measuring and comparing the performance of summing an integer range using for-in, manual while looping with an Iterator, and the reduce method. Then I built up to combining functions into bigger blocks.
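For illustration, here is a minimal sketch of the kind of semantically equivalent summation variants I was comparing; the function names are illustrative, not taken from the gist:

```swift
// 1. for-in loop
func sumForIn(_ numbers: ClosedRange<Int>) -> Int {
    var total = 0
    for i in numbers { total += i }
    return total
}

// 2. manual while loop driving an Iterator
func sumIterator(_ numbers: ClosedRange<Int>) -> Int {
    var total = 0
    var it = numbers.makeIterator()
    while let i = it.next() { total += i }
    return total
}

// 3. functional reduction
func sumReduce(_ numbers: ClosedRange<Int>) -> Int {
    return numbers.reduce(0, +)
}
```

All three compute the same sum; only the performance of the optimized builds differs.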

It was quite disturbing that after such intense investigations I wasn’t able to build much intuition for writing code that performed well. In my experience, high-level Swift code in functional style led to unpredictable results with regards to the performance of the final optimized build. The Swift compiler was at this time routinely tripped by unknown (or hard to judge in advance) factors into a state where it failed to inline code or produce specializations for generic algorithms, or even introduced boxing for primitive types, making you pay the full price of reference counting for each and every element of a sequence processed in a tight loop. The variation in benchmarks of functionally equivalent expressions was sometimes two orders of magnitude.

Zero Cost Abstraction?

In my implementations I was initially striving for minimal code, leading me to build on top of AnyIterator and AnySequence, which have convenient closure-taking initializers.

Since my initial attempts had largely failed, probably because I had an incorrect model of a sufficiently smart compiler in my head, I turned my attention towards the source code of the standard library to find out how to write performant code that takes advantage of the strengths and avoids the pitfalls of the Swift compiler as it currently exists.

Judging by the standard library’s proliferation of concrete types implementing the Sequence and Iterator protocols, it seems that building on concrete types is the way to go. I wondered why creating ad-hoc Iterators and Sequences using functional composition would be inherently slower than pre-baking them as structs or classes. I feel this proliferation is a sign of a certain immaturity of the compiler at the time those parts of the standard library were written.

Cost of Type Erasure

I now realize that the low performance of my initial functional implementation comes from the use of AnyIterator. My penchant for concise code kept me reducing the implementation until I got this compact body inside the iterator (effectively 3 lines that aren’t boilerplate):

func takeWhile<S : Sequence>(
    _ includeElement: @escaping (S.Iterator.Element) -> Bool,
    source: S
    ) ->  AnyIterator<S.Iterator.Element> {
    var takeMore = true
    var it = source.makeIterator()
    
    return AnyIterator {
        guard takeMore, let e = it.next() else { return nil }
        takeMore = includeElement(e)
        return takeMore ? e : nil
    }
}

At first glance, AnyIterator seemed like the right tool:

  • Convenient closure-taking initializer.
  • Trivial Sequence protocol conformance.
  • Encapsulation of uninteresting implementation details.
  • Stops propagation of the deeply nested composite types all over the code base.

I didn’t realize that it is also a hard optimization barrier that currently stops the compiler from generating anything remotely fast. It is a performance trap!

If I understand correctly, one of the big ideas behind the Swift language is that most of it is defined in the standard library. Even primitive types like Int are conceptually structs, and it’s the optimizing compiler that eliminates the cost of this abstraction, producing code that performs on par with languages that have them built in.

It seems like this philosophy was used to implement type erasure using classes or capturing closures (which are classes in disguise, if I’m not mistaken), but the zero-cost abstraction seems to fall flat on its face in this case. Maybe I’m mistaken here, but as I understand it, classes (and capturing closures) are always heap allocated and passing them to functions always incurs reference counting costs. Is there a more lightweight way to achieve type erasure for value types? Can I have a type erasure of Iterator/Sequence implemented as a struct that wouldn’t involve reference counting? Is our generics story still missing an important part? Is there anything on Swift’s roadmap that will improve this? Will generalized existentials also impact performance, or just expressivity? I have a feeling that the change of the default calling convention to borrowed that Michael Gottesman is working on might help here, but is that enough?

Unfold Sequence

Creating concrete types for all iterators and sequences for performance reasons, when their functional body can be expressed in 2–3 lines of code, seemed really unfortunate to me. So I kept searching and found the sequence function that was introduced in SE-0094. Thanks to the perseverance of @Lily_Ballard and @Erica_Sadun it made it through the rocky evolution review and landed in Swift 3. I totally fell in love with this beautifully elegant creature!

Since I was back-porting prefix(while:) from Swift 3.1 anyway, I’ve also converted the AnyIterator-based takeWhile into a sequence-based one as a code kata exercise. It was fun!

func takeWhile<S: Sequence>(
    while predicate: @escaping (S.Element) -> Bool,
    seq: S
    ) -> UnfoldSequence<S.Element, PredicatedIterator<S.Iterator>> {
    return sequence(
        state: PredicatedIterator(predicate, seq.makeIterator()),
        next: { (state: inout PredicatedIterator<S.Iterator>) -> S.Element? in
            guard let e = state.iterator.next() else { return nil }
            return state.predicate(e) ? e : nil
    })
}

The main body of the closure reduces to just two lines, because the UnfoldSequence takes care of satisfying the post-nil guarantee introduced in SE-0052, hiding its cost from you.

The PredicatedIterator would ideally look like this:

public typealias PredicatedIterator = (predicate: Predicate, iterator: Self.Iterator)

but that crashes the compiler, because re-abstraction for inout values is not yet implemented. To work around this limitation, it’s defined as:

public struct PredicatedIterator<I: IteratorProtocol> {
    var predicate: (I.Element) -> Bool
    var iterator: I
    init(_ _predicate: @escaping (I.Element) -> Bool, _ _iterator: I) {
        predicate = _predicate
        iterator = _iterator
    }
}
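For comparison, here is a hypothetical sketch of the same idea without the helper struct, making the iterator itself the state and capturing the predicate in the closure (takeWhileSeq is my name, purely for illustration). Note that this reintroduces a capturing closure, which the state-based formulation above tries to avoid:

```swift
// Hypothetical variant: the iterator is the state, the predicate is captured.
func takeWhileSeq<S: Sequence>(
    _ predicate: @escaping (S.Element) -> Bool,
    _ source: S
) -> UnfoldSequence<S.Element, S.Iterator> {
    return sequence(
        state: source.makeIterator(),
        next: { (it: inout S.Iterator) -> S.Element? in
            guard let e = it.next(), predicate(e) else { return nil }
            return e
        })
}

let head = Array(takeWhileSeq({ $0 < 4 }, 1...10))  // [1, 2, 3]
```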

I came up with the following sequence-based replacements for the concrete Sequence and Iterator types produced by various sequence methods: enumerated, zip, prefix, prefix(while:), dropFirst, drop(while:).
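As a hedged sketch of what one such replacement could look like, here is a hypothetical enumerated rebuilt on top of sequence(state:next:) (illustrative only, not the standard library implementation):

```swift
// Hypothetical enumerated replacement: index and iterator travel as state.
func enumeratedSeq<S: Sequence>(
    _ source: S
) -> UnfoldSequence<(Int, S.Element), (Int, S.Iterator)> {
    return sequence(
        state: (0, source.makeIterator()),
        next: { (state: inout (Int, S.Iterator)) -> (Int, S.Element)? in
            guard let e = state.1.next() else { return nil }
            defer { state.0 += 1 }  // advance the index after vending the pair
            return (state.0, e)
        })
}

let pairs = Array(enumeratedSeq(["a", "b"]))  // [(0, "a"), (1, "b")]
```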

Why This, Why That and Why? :headphones:

The reason I started to explore this area is that I made a mistake when Swift 3.1 made my back-port of prefix(while:) unnecessary. I thought I had discovered a major performance regression in Swift 3.1, but the real issue was that my back-ported version in 3.0 used the lazy implementation, while I wasn’t calling the .lazy method after switching to the standard library implementation in 3.1.

That’s how I discovered that chaining several eager sequence operations is much slower than chaining lazy sequence operations. My questions about the implementation details of AnySequence/AnyIterator on the compiler list went mostly without response (maybe @dabrahams can chime in now? :sunglasses:).

This led to the exploration of the curiously ad-hoc implementations of the default Sequence operations, with their strange mix of eager and lazy implementations. The prefix and drop(while:) operations were implemented lazily using internal classes, while the other operations were eager and accumulated the results in an Array. All of this is masked by them all returning the type-erased AnySequence. Then there’s the wholly parallel world of LazySequence. If it interests you, I’ve documented some of the history of how it all came to be in the discussion of the merged PR #9811 that improved the performance of prefix and drop(while:) by 50% just by switching them from class to struct.

That PR was originally just a benchmark to provide supporting evidence for another PR seeking to switch all eager Sequence methods to return Array directly instead of AnySequence (but that was crashing the compiler a year ago…), as described in this pre-pitch proposal.

Cost of Abstractions, Revisited

The table below is a summary of runtimes (μs) measured from the sequence slicing benchmarks added a year ago. After adding support for GYB (Generate Your Boilerplate) to the benchmarks, it was easy to add more comprehensive performance tests. All of these benchmarks process a sequence of 4096 Int elements:

  • DropFirst and DropWhile skip the first 1024 elements and sum the remaining 3072.
  • DropLast drops the last 3072 elements and sums the first 1024.
  • Prefix and PrefixWhile sum the first 3072 elements and skip the rest.
  • Suffix skips the first 3072 elements and sums the last 1024.

This means part of the sequence is processed internally by the operation and the rest is vended out as a new SubSequence for processing in a for-in loop. The use of the primitive value type Int here is significant, because it stresses how good the Swift compiler is at eliminating the abstractions. If we were iterating over sequences of classes, we would have to incur the cost of reference counting as the elements are passed between the sequence’s Iterators and our summation loop. Another important factor is how well the compiler can specialize the generic Sequence algorithms for the concrete case of our Int summation. One more aspect is turning the virtual dispatch of methods into direct dispatch. In an ideal case all these abstractions should fall away and yield code that performs on par with manual summation in a C for loop.

We are processing 3 basic underlying sequence types: Array, Range and UnfoldSequence. The first two are also Collections that have optimized implementations of the slicing methods defined in the Sequence protocol. This means they can skip around in constant time, be processed from both sides and can return optimized SubSequence types like ArraySlice. The UnfoldSequence returned from the sequence function represents a generated sequence, which must be processed like a stream, one element at a time, from the beginning, using the default implementations of the Sequence slicing operations.

Since all the default implementations of the Sequence slicing operations return the type-erased AnySequence as their SubSequence, wrapping the 3 base types in it simulates an operation performed as the second method in a chain and stresses how well the compiler deals with the type erasure. (This is the case with the default eager methods on Sequence, which wrap an Array in AnySequence… but I now realize that that particular case isn’t measured…) One factor is the ability to see through it and use the optimized methods from the refined Collection types; another is dealing with the closures used to hide the underlying type.
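To make the setup concrete, here is a rough sketch of how the three base sequences and the AnySequence wrapping can be constructed (illustrative code, not the actual harness from the Swift Benchmark Suite):

```swift
// The three base sequence types under test:
let array  = Array(0..<4096)
let range  = 0..<4096
let unfold = sequence(first: 0) { $0 < 4095 ? $0 + 1 : nil }

// Wrapping in AnySequence simulates the SubSequence returned by a
// previous operation in a method chain:
let erasedArray = AnySequence(array)

// The DropFirst operation: skip the first 1024 elements, sum the rest.
let direct  = array.dropFirst(1024).reduce(0, +)
let ranged  = range.dropFirst(1024).reduce(0, +)
let erased  = erasedArray.dropFirst(1024).reduce(0, +)
let streamy = unfold.dropFirst(1024).reduce(0, +)
// All four compute the same sum; only their runtimes differ wildly.
```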

Similarly to sequence, the underlying IndexingIterator from Range is wrapped inside AnySequence to test the default Sequence slicing operations without access to the optimized implementations from Collection. To test a different existential collection, Array is wrapped in AnyCollection instead of AnySequence. (I’m not sure if that is used anywhere in practice, but since it exists in the standard library, I took the liberty of testing its performance, too…)

Another measured performance axis is the use of the .lazy method to switch from eager to lazy processing. Overall this leads to an exploding number of benchmarks: 14 sequences per operation, all obtained as variations of the 3 tested base types.

| Sequence             | DropFirst | DropLast | DropWhile | Prefix | PrefixWhile | Suffix |
|----------------------|----------:|---------:|----------:|-------:|------------:|-------:|
| AnyCollection        |       179 |       71 |       232 |    180 |         334 |     71 |
| AnyCollectionLazy    |    327093 |   110117 |       332 | 327441 |         238 | 110531 |
| AnySeqCRangeIter     |     87412 |    26689 |     73902 |  68193 |       64800 |  22462 |
| AnySeqCRangeIterLazy |     87224 |    26864 |       331 |  68367 |         237 |  22479 |
| AnySeqCntRange       |       160 |       51 |       212 |    160 |         315 |     51 |
| AnySeqCntRangeLazy   |       160 |       51 |       332 |    160 |         237 |     51 |
| AnySequence          |     28524 |    39602 |     28833 |  23534 |       71075 |  35484 |
| AnySequenceLazy      |     28464 |    40369 |     11557 |  23572 |        8962 |  35588 |
| Array                |       162 |       46 |       230 |    123 |         326 |     59 |
| ArrayLazy            |       198 |       56 |       507 |    159 |         197 |     69 |
| CountableRange       |        77 |       26 |        79 |     77 |          81 |     26 |
| CountableRangeLazy   |        77 |       26 |       254 |     77 |          77 |     26 |
| Sequence             |     10767 |     9126 |      8098 |   8148 |        6288 |  34846 |
| SequenceLazy         |     11842 |     8993 |       193 |   9027 |         116 |  35574 |

These are results from my machine as measured during my attempt to improve the robustness of the Swift Benchmark Suite. Here’s this table in Numbers, with various views comparing:

  • Sequence Slowdown relative to fastest Operation
  • Operation Slowdown relative to fastest Sequence
  • Lazy vs Eager
  • Slowdown Any vs Direct

In my opinion, these results are not a good look. AnyCollectionLazy can be 4,248x slower than CountableRange in 4 out of 6 operations. AnyCollectionLazy is 1,376x slower doing Prefix than the almost equivalent PrefixWhile. So let’s ignore that one for the moment. Making an operation lazy can make it 3x slower, or 40x faster. Wrapping something in AnySequence usually makes it 2-4x slower, except in the cases when it’s 11x, 60x or 77x slower. The UnfoldSequence stands out as 60x slower than the other two sequences when used directly, with the exception of the lazy DropWhile and PrefixWhile variants, which perform completely on par. More on this later.

What is going on there?!? Well, in the case of DropFirstAnyCollectionLazy we have apparently achieved Kandinsky and Mondrian levels of abstraction and the Swift compiler just gave up on optimizing this code. Looking at the stack trace in Instruments shows a Jackson Pollock-esque mix of protocol witness functions nested 6 levels deep for each wrapped generic protocol implementation, sprinkled with swift_unknownRetain, swift_getGenericMetadata, swift_unknownRelease and _swift_release_dealloc. The trace from Instruments is some 183 calls deep.

![DropFirstAnyCollectionLazy%20InstrumetsTrace|167x500](upload://vT1dRp0e62Art9YKzDsfumHJexi.png)

I don't know why this image doesn't appear here… Discourse bug? Grab it from here instead:

https://imgur.com/a/mNWMFUX

Contrasting this with how effectively the Swift compiler eliminates the abstractions for the underlying Range without the type erasure is stunning: both DropFirstCountableRange and DropFirstCountableRangeLazy boil down to a single call to the protocol witness for static Equatable.== infix(_:_:) in conformance Int.

I thought in all fairness it would probably make sense to test another benchmark based on the IndexingIterator (as is the case with Range), both used directly as a Sequence and wrapped in AnySequence. So I’ve tested StrideTo. The DropWhileAnyStrideTo didn’t fail so horribly: its trace was only 95 calls deep and it was about an order of magnitude faster than the worst case.

So to sum up, using type erasure with the wrappers provided by the standard library can in the worst case lead to:

  • boxing of the value type, which incurs the reference counting penalty;
  • failure to specialize the generic algorithm, which results in performing all operations on the element type using the runtime generics witness table.

Procedural Sidenote

Since adding these benchmarks, one year has passed. I have been sidetracked by working on improving the robustness of the measurements in the Swift Benchmark Suite (SBS) and have not followed up on the above-mentioned anomalous results. Their mere presence in the SBS did not magically improve the situation. We seem to lack good visibility into the complex web of performance interactions in Swift. Quite a few important and core parts seem to lack thorough coverage, and without personal interest, not much happens after a benchmark is added. I probably should have at least filed bugs for the slow benchmarks, as coming up with SIL optimization passes is far out of my league at the moment and I don’t know how else I can help. I’m sorry about failing to follow up!


Benchmarking the Strange Attractors

The Mandelbrot set is the set of complex numbers c for which the function f_c(z) = z² + c does not diverge when iterated from z = 0, i.e., for which the sequence f_c(0), f_c(f_c(0)), etc., remains bounded in absolute value.

The simplest algorithm for generating a representation of the Mandelbrot set is known as the “escape time” algorithm. A repeating calculation is performed for each x, y point in the plot area and based on the behavior of that calculation, a color is chosen for that pixel.

Coming back to the mesmerizing fractals, I have continued to play with the MandelbrotSwifty toy project after porting it to Swift 4, so that I can come up with a new benchmark to add to the SBS that would stress the chaining of sequence slicing methods in a somewhat realistic manner.

I think this toy example is quite interesting for exploring the ability of the Swift compiler to eliminate higher-level abstractions. At its core it requires extending the language with a new value type, the complex number (which isn’t at the moment provided by the standard library). There is a well-known iterative implementation that gives us a clear baseline, and the implementation can be composed of several higher-level functional concepts that make for a good stress test of specialization and devirtualization in the compiler. The functional expression of the problem is at its core the most fundamental kind of reduction and iteration. It allows us to explore the cost of these higher-level abstractions in isolation, but even more importantly in various combinations. In my experience it is this interaction that currently brings out various edge cases where the compiler and the basic building blocks provided by the standard library fail to deliver reliable performance.

So far I am still in an exploratory phase, where I have come up with tens of combinations of basic programming patterns and underlying implementations to express the same computation, always tweaking the implementation in search of a high-level expression of the underlying concept that performs reasonably close to the imperative one.

The core iterative computation at the heart of the Mandelbrot set produces an infinite sequence of complex numbers. The escape time algorithm then puts an upper bound on the sequence, when one of two conditions is met:

  • the number exceeds the set iteration limit and is considered to be part of the Mandelbrot set, or
  • the iterated number falls outside of the 2-unit circle around the origin, is definitely diverging under iteration to infinity, and is therefore not part of the Mandelbrot set.

Finally we count the length of the produced sequence and choose a corresponding color (or symbol, in our text style) showing how long it took for it to escape.

We first generate our “pixel grid” — a 2D Array of complex numbers. The next step uses the provided “renderer”, a function of type (ℂ) -> Int, that computes the “color” as the length of the iterated sequence. The final color transformation to ASCII art is shared between all implementations and is performed as a functional reduction from a 2-level map of complex numbers.

let asciiGradient = Array(" .,:;|!([$O0*%#@")
func toAscii(_ n: Int) -> Character { return asciiGradient[n - 1] }

let maxIter = asciiGradient.count // 16
func side(_ size: Int, _ start: Double, _ end: Double) -> StrideTo<Double> {
    return stride(from: start, to: end,
                  by: (end - start) / Double(size))
}
let m = 10 // use 1 when printing art
let w = 64 * m
let h = 32 * m

let sideX = side(w, -2, 2)
let sideY = side(h, -2, 2)

let grid = sideY.map({ y in sideX.map({ x in ℂ(x, i:y) }) })

func renderMandelbrot (_ renderer: (ℂ) -> Int) -> String {
    return grid.map {
        String($0.map(renderer).map(toAscii))
        }.joined(separator: "\n")
}

We are interested in the performance of various renderers relative to the baseline set by the following function in imperative style:

func imperative(c: ℂ) -> Int {
    var z = ℂ(0)
    var i = 0;
    repeat {
        z = z*z + c
        i += 1
    } while (z.isPotentiallyInSet() && i < maxIter)
    return i
}

For consistency the exit condition for escaping points is implemented as extension on the ComplexNumber type:

typealias ℂ = ComplexNumber
extension ℂ {
    @inline(__always)
    func isPotentiallyInSet() -> Bool {
        return (Re * Re + Im * Im) <= 4
    }
}

Following is an example of the most structurally equivalent translation from the original Haskell to Swift that I came up with using sequence to generate the iterated quadratic function application and using the Sequence methods prefix(while:) and prefix to slice and dice it:

func slpwpc(c: ℂ) -> Int {
    return sequence(first: ℂ(0), next: {z in z * z + c})
        .lazy
        .prefix(while: {$0.isPotentiallyInSet()})
        .prefix(maxIter)
        .count()
}

It is 33x slower than the imperative baseline. The count is a custom Sequence extension that computes the length of the resulting sequence:

extension Sequence {   
    @inline(__always)
    public func count() -> Int {
        return reduce(0, {i, _ in i + 1})
    }
}

These were just two examples of renderers. As I said, during this exploratory phase I have created maybe a hundred different versions. The main axes of exploration are different styles of:

  • computing the length of the sequence
    • purely functional reduction (.reduce)
    • inout reduction (.reduce(into:))
    • enumeration (.enumerated)
    • manual counting
  • expression of the two limiting conditions
    • depth of nested function calls (or inversely, level of manual inlining)
    • order of operations performed
  • various re-implementations of the sequence operations (for the composed variants)
    • lazy prefix reimplementation in style of SE-0045
    • sequence based generator-like structures (using both first and state variants)
      • variants using defer keyword
      • use of inline closures vs functions for the next parameter

My main point, and the end goal we should aim for, is that all sequence operations (that also return sequences) must support method chaining without significant loss of performance. After optimization they should behave as if we were adding if statements and local variables into a manual for loop. Their current state of performance is unacceptable.

Frankly, I am quite lost in the sea of implementations I have created. I believe functional composition deserves much more thorough coverage by performance tests than we currently have, therefore I think the next step is to take the patterns I came up with and convert them into a format suitable for the SBS, while generating all interesting permutations using GYB. Is this a research direction worthy of further exploration?

Observations from the Preliminary Results

I’m not going to post any tabulated results here, but encourage you to run the MandelbrotSwifty project for yourself. Suffice it to say that the performance varies by several orders of magnitude. I haven’t yet analyzed all the reasons in much detail. Perhaps the main issue here is the proper calibration of my expectations: it’s hard to say what is unreasonable wishful thinking on my part (a sufficiently smart compiler) and what are real performance bugs. So this is a grab bag of things that strike me as suspicious.

Default Sequence Operations don’t Compose Well

I’m not talking about the default eager methods, because chaining them processes the sequence as many times as the length of the method chain — from this perspective they don’t compose at all. My focus is on the lazy methods, which should in theory be fused together into a single optimized block of code that passes over the base sequence once, producing the resulting sequence.
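A self-contained illustration of the difference, using only stdlib API:

```swift
let base = 0..<1_000

// Eager: filter materializes a whole intermediate Array before
// prefix ever runs; each chained method is a separate pass.
let eager = base.filter { $0 % 2 == 0 }.prefix(10).reduce(0, +)

// Lazy: the operations should fuse into a single pass that stops
// as soon as the first 10 even numbers have been consumed.
let fused = base.lazy.filter { $0 % 2 == 0 }.prefix(10).reduce(0, +)
// both: 90
```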

I’ve analyzed why chained calls to prefix(while:) and prefix on LazySequence don’t compose that well from a performance standpoint. It’s because the prefix implementation was created before the LazySequence APIs and was never refactored to match that style. In particular you always get the base sequence wrapped in AnySequence, and the final reduction is dominated by going over the AnyIterator in unspecialized generic calls to the protocol witness for IteratorProtocol. I rewrote it to match the implementation from SE-0045 and got a 10x speedup. More on that below.

So some of the problems stem from suboptimal implementation — the default sequence methods carry significant technical debt, in my opinion, due to piecewise, over-time additions and no refactoring.

But some of the problems could be caused by design: putting type erasure using AnySequence into this iteration-heavy API seems like a particularly ill-advised choice. Or maybe it is more the suboptimal implementation of those type erasers? AnySequence doesn’t appear to forward the .lazy call to the underlying wrapped sequence, which might be the cause of the miserable performance when basic sequences get wrapped in it. I haven’t tested this hypothesis yet.

Performance Impact of Closures is Inconsistent

This is, in my opinion, the most “swifty” expression of the Mandelbrot renderer:

func mandelbrot_(_ c: ℂ) -> Int {
    return sequence(first: ℂ(0), next: {z in
        guard z.isPotentiallyInSet() else {return nil}
        return z * z + c
    }).lazy.prefix(maxIter).count()
}

It fuses one limiting bound of the infinite sequence into the inline closure with the guard statement and leaves the other one for the prefix method. The closure for the next parameter captures the value of c. We’re using the first variant of sequence, which internally contains another capturing closure. Using the master dev toolchain snapshot from May 30, 2018, this is 28x slower than the imperative version. But as we’ll see later, this is mainly due to the prefix implementation’s use of AnySequence.

First vs. State

It’s interesting to observe the difference between the two forms of the sequence function. The state form propagates the types across the call chain, while the first form performs a kind of closure-based type erasure, albeit with a strange (T, Bool) state type. As I was worried about the cost of that closure, I reformulated the Mandelbrot set iteration to use sequence(state:next:):

typealias OrbitState = (z: ℂ, c: ℂ)

func mandelbrotS(_ c: ℂ) -> Int {
    return sequence(state: (ℂ(0), c), next: {
        (state: inout OrbitState) -> ℂ? in
        let (z, c) = state
        guard z.isPotentiallyInSet() else {return nil}
        state.z = z * z + c
        return z
    }).prefix(maxIter).count()
}

This version doesn’t capture the c value inside the closure, but stores it explicitly in the tuple passed to the state argument. In turn, the inout State argument passed to next allows us to write it as a non-capturing closure expression. I know it is not a pure function by the traditional definition, because it takes an inout parameter and therefore has a side effect, but if we view that as a manual optimization technique, then it is kind of pure. It has a shape similar to reduce(into:), which is the hybrid inout variant of the traditional, purely functional reduce.

This seems to be an improvement over the first version of our Mandelbrot renderer; it’s just 17x slower than the imperative baseline.

Originally I fantasized that using sequence(state:next:) to implement all the sequence methods would enable better chaining, where all the iteration state could potentially be promoted to the stack, eliminating reference-counting traffic for heap-allocated structures, but it ultimately didn’t pan out like that with the current state of the compiler. Maybe this is limited by the extra bounds check mandated by SE-0052.

Closure Expressions vs. Function References

The function signatures of both versions of sequence declare the next argument as an @escaping closure, because it will be stored inside the UnfoldSequence for later use. Wondering about the cost of this closure, I have tried variants that declare it inline as a closure expression as well as passing in a function reference. According to the Swift Language Guide, global functions are closures that have a name and do not capture any values. Methods are functions that are associated with a particular type. If I’m not mistaken, under the hood a capturing closure looks a lot like an instance of a class: a heap-allocated, reference-counted thing. I was hoping that passing a reference to a function would have a lighter footprint, making it easier for the compiler to eliminate the abstractions, given there should be no need for reference counting to keep an instance alive.
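Concretely, the two variants look like this (doubler is an illustrative name, not from the project):

```swift
// A named function that can be passed by reference as the next argument.
func doubler(_ x: Int) -> Int? { return x < 64 ? x * 2 : nil }

// Inline closure expression:
let viaClosure = Array(sequence(first: 1, next: { x in x < 64 ? x * 2 : nil }))

// Function reference:
let viaFunction = Array(sequence(first: 1, next: doubler))
// both: [1, 2, 4, 8, 16, 32, 64]
```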

I have used this idea in two places: the base sequence that generates the complex numbers, and the sequence-based reimplementations of prefix and prefix(while:). The performance impact of these variations was rather inconsistent. Sometimes there was no impact; sometimes the function reference performed worse than the closure expression defined inline. I have tested defining the function that was passed by reference both as a normal internal instance method and as a static method, both in an extension of Sequence; there was no difference between them. This seems rather surprising, and I’m not sure why.
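For reference, the three ways of supplying next that I compared can be sketched like this. Integer doubling stands in for the complex-number generator, and all names here are hypothetical:

```swift
// A free function, standing in for the "global function" case.
func doubleOrStop(_ x: Int) -> Int? { return x < 100 ? x * 2 : nil }

// A static method, standing in for the "static func in an extension" case.
enum Doubler {
    static func next(_ x: Int) -> Int? { return x < 100 ? x * 2 : nil }
}

// 1. Inline closure expression.
let a = Array(sequence(first: 1, next: { (x: Int) -> Int? in x < 100 ? x * 2 : nil }))
// 2. Reference to a free function.
let b = Array(sequence(first: 1, next: doubleOrStop))
// 3. Reference to a static method.
let c = Array(sequence(first: 1, next: Doubler.next))

print(a)  // [1, 2, 4, 8, 16, 32, 64, 128]
```

All three are semantically identical; only the way the compiler sees the `next` argument differs.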

I have a hunch that sometimes there is a bit of extra reference counting thrown in for good measure. It looks like there are some strange de-optimizations going on.

I haven’t yet explored this area thoroughly, although there has been some improvement for certain instances of this problem between the Swift 4.1.2 that ships with Xcode 9.4 and master: with master it’s more likely that a closure expression and a function reference perform the same.

Best Of and Worst Of

Just this week I was finally able to cobble together a composed “swifty” version that performs on par with the imperative baseline:

func mandelbrotL(_ c: ℂ) -> Int {
    return sequence(first: ℂ(0), next: {z in
        guard z.isPotentiallyInSet() else {return nil}
        return z * z + c
    }).lazy.prefixL(maxIter).count()
}

If you think this looks familiar, you’re right. The only difference from mandelbrot_ above is the use of .prefixL, a proper LazySequence implementation of the method in the style of SE-0045. Quite anticlimactic, if I may say so.

Interestingly, using prefixL with the sequence(state:next:) variants (with a closure expression or a function reference for the next argument) results in a 1.7x slowdown compared to the imperative baseline.

The naming convention used for the composed-style renderers starts with s when the complex numbers are generated from a sequence(first:next:) base, and with _s for sequence(state:next:). So if a name starts with an underscore, there is no capturing closure inside the base sequence. The rest of the name is composed from prefixes of the method names in the chain. See the example below.

As shown earlier, the composed renderer slpwpc was 33x slower than the imperative baseline (IB); _slpwpc is 22x IB. Manually inlining the c.isPotentiallyInSet condition into the base sequence generator, as was done in the fused renderers, clearly helps a bit: mandelbrot_ is 28x IB, mandelbrotS is 18x IB.

My previous best effort for composed renderers was based on fusing the two conditions into a single prefix(while:) while operating on an enumerated sequence. This works around the performance-killing prefix implementation, completely eliminating the need for it:

func _selpwc(c: ℂ) -> Int {
    return sequence(state: (ℂ(0), c), next: orbit)
        .enumerated()
        .lazy
        .prefix(while: {$1.isPotentiallyInSet() && $0 < maxIter})
        .count()
}

This is 2.6x IB, while the selpwc is 12x IB.

So how good is prefixL (abbreviated as pL in function names here) at improving the performance of these composed-style renderers? The _slpLpwc is 2.7x IB and slpLpwc is 12x IB, both matching the previous best. Not bad…

Side note: the order of methods with different implementation complexity also matters for the final performance. Just swapping prefix and prefix(while:) is absolutely disastrous: _slppwc is 101x IB, slppwc is 110x IB. Even this worst case is an improvement in the master dev toolchain; the same code compiled with the Swift 4.1.2 that ships with Xcode 9.4 was giving me slowdowns of 144x and 148x IB, respectively.

Type Erasure to the Rescue?! :exploding_head:

From my earlier tests in Swift 3, I had concluded that type erasure was disastrous for performance. But due to my stupid naming convention for the alternative prefix methods, I mistyped one more underscore and accidentally used an _AnyIterator-based reimplementation of the prefix method I had lying around in my code from way back when. _AnyIterator is just a renamed old version of _ClosureBasedIterator, ripped out of the AnyIterator implementation.

extension Sequence {
    @_inlineable
    public func ___prefix(_ maxLength: Int) -> _AnyIterator<Iterator.Element> {
        var iterator = makeIterator()
        var count = maxLength

        return _AnyIterator {
            count -= 1
            return (0 <= count) ? iterator.next() : nil
        }
    }
}
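If _AnyIterator is essentially the old _ClosureBasedIterator under a new name, I imagine it looks roughly like the following single-closure struct. This is a sketch under that assumption, not the actual stdlib source; SketchAnyIterator and sketchPrefix are names I made up:

```swift
// Sketch: a struct-only eraser wrapping one closure, with no class box.
struct SketchAnyIterator<Element>: IteratorProtocol, Sequence {
    private let _next: () -> Element?
    init(_ next: @escaping () -> Element?) { self._next = next }
    func next() -> Element? { return _next() }
}

extension Sequence {
    // The ___prefix from above, re-expressed against the sketch type.
    func sketchPrefix(_ maxLength: Int) -> SketchAnyIterator<Element> {
        var iterator = makeIterator()
        var count = maxLength
        return SketchAnyIterator {
            count -= 1
            return (0 <= count) ? iterator.next() : nil
        }
    }
}

let first3 = Array((1...10).sketchPrefix(3))
print(first3)  // [1, 2, 3]
```

Because the struct conforms to both IteratorProtocol and Sequence, it can be consumed directly by Array and for-in without any further wrapping.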

To my great surprise, I got this: mandelbrot__a is 1x IB, same as mandelbrotL! Moreover, using this with the sequence(state:next:)-based variants mandelbrotS and mandelbrotSS (the latter uses a static function for next instead of a closure expression) also yields the same: both mandelbrotS__a and mandelbrotSS__a are 1x IB! This is even better than the 1.7x IB we got by using prefixL on the same cases.

Trying this with the composed-style renderers blows the mind again: sl____pw___pc, _sl____pw___pc, sl____pws___pc and _sl____pws___pc are all at 1x IB, overtaking the previous best composed renderer.

Notice how all of these implementations are even more robust than the prefixL-based versions: they don’t care whether the base sequence uses a capturing closure or how the while predicate gets defined. All the abstractions disappear and we end up with the same performance as the imperative baseline. It’s as if erasing the underlying sequence type inside a closure spurred the compiler into much more aggressive inlining.

Or maybe this is a result of simply ignoring the iterator post-nil guarantee (SE-0052)?

Just to be sure I have not been wrongly accusing type erasure of killing performance before, I have retested with a new variant, prefixAI, that builds on top of the AnyIterator provided by the standard library. So: mandelbrotAI is 32x IB, mandelbrotSAI and mandelbrotSSAI are 19x IB. This discrepancy is clearly caused by the much more complex internals of AnyIterator, which uses two different types of type erasers:

  • struct _ClosureBasedIterator — used for trailing closure init.
  • class _AnyIteratorBoxBase — used for wrapping existing types conforming to IteratorProtocol.

To accommodate both forms, the first one always gets wrapped, along with all its performance optimization potential, inside another internal final class, _IteratorBox. No wonder the optimizer chokes on it.
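From reading the stdlib source, my mental model of this double wrapping is roughly the following. This is a heavily simplified sketch with made-up names; the real internals are _ClosureBasedIterator, _AnyIteratorBoxBase and _IteratorBox:

```swift
// Abstract base class: every next() goes through dynamic dispatch.
class IteratorBoxBase<Element> {
    func next() -> Element? { fatalError("abstract") }
}

// Box for any existing IteratorProtocol conformer.
final class IteratorBox<Base: IteratorProtocol>: IteratorBoxBase<Base.Element> {
    var base: Base
    init(_ base: Base) { self.base = base }
    override func next() -> Base.Element? { return base.next() }
}

// The closure-based struct, analogous to _ClosureBasedIterator.
struct ClosureBasedIterator<Element>: IteratorProtocol {
    let body: () -> Element?
    mutating func next() -> Element? { return body() }
}

// The public face: even the closure form gets wrapped in the class box,
// hiding the closure behind two layers from the optimizer.
struct BoxedAnyIterator<Element>: IteratorProtocol {
    let box: IteratorBoxBase<Element>
    init(_ body: @escaping () -> Element?) {
        self.box = IteratorBox(ClosureBasedIterator(body: body))
    }
    func next() -> Element? { return box.next() }
}

var n = 0
let it = BoxedAnyIterator<Int> { n += 1; return n <= 3 ? n : nil }
var collected: [Int] = []
while let x = it.next() { collected.append(x) }
print(collected)  // [1, 2, 3]
```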

I don’t quite know what to make of all this. It feels like there is quite some potential to come up with a simpler and better-performing hybrid unification of the type-erasing form of sequence and _ClosureBasedIterator that could replace all the custom types floating around the Sequence API. IMHO this is a very promising direction for further exploration, wouldn’t you agree?


Notes on sequence

During my research I have kept notes of relevant points from the history of sequence design and implementation. FWIW, here are my loosely structured notes, if you’re into that sort of Swift Evolution archeology.

SE-0045 Review & Announcement summary:
SE-0094 Review and Acceptance with Revision:

Kevin mentions that UnfoldSequence is equivalent to AnyIterator, Dave reacts:

The design of AnySequence and AnyIterator dates from a time when the compiler was very immature and many design avenues we might have taken were not available. I find the sequence forms to be superior in general, and IMO at some point we should re-evaluate the interfaces to AnySequence and AnyIterator.

!!!!!!!!!!!!!!!!!!!!!!!

Kevin realizes UnfoldSequence<T> would need to perform type erasure using a capturing closure, and changes the implementation signature to UnfoldSequence<T, (T?, Bool)>.

I think it's better to diverge slightly from the proposal than it is to introduce unnecessary (albeit small) performance cost.

Long type signatures are a known fact of life as Kevin notes:

LazyMapSequence<LazyFilterSequence<LazyMapSequence<Self.Elements, ElementOfResult?>>, ElementOfResult>
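Such signatures fall out mechanically from chaining: each lazy operation wraps the type of the previous one. A quick way to see it for yourself (the wrapper names in the comments are approximate and vary slightly between Swift versions):

```swift
let nested = [1, 2, 3, 4].lazy
    .map { $0 * $0 }         // roughly LazyMapSequence<[Int], Int>
    .filter { $0 % 2 == 0 }  // roughly LazyFilterSequence<LazyMapSequence<...>>
    .map { $0 + 1 }          // roughly LazyMapSequence<LazyFilterSequence<...>, Int>

print(type(of: nested))
print(Array(nested))  // [5, 17]
```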


Crying Wolf?

I’m truly sorry the ending will probably sound like an angry and misinformed rant. From my quite limited understanding, I really am very worried about the future; I hope Swift’s roadmap contains solutions and I just don’t see the carefully planned road from here to there. Any and all explanations, as well as pointers to sources for further study, are much appreciated!

Disclaimer: I should fully disclose the levels of my ignorance of the key aspects here, mainly ABI stability. I’m probably overestimating the limits imposed by it. I’ve read all the manifestos, but I don’t grasp all the implications, especially how they interact. I don’t understand how the major future changes will be introduced into the Swift language: ownership, generalized existentials, coroutines…

From my perspective, getting the implementation of the Sequence APIs right is of paramount importance. If we are ever to free ourselves from the need to manually write for and while loops in imperative style to get reasonable performance, we must get these basic building blocks for high-level iteration right.

Are we aiming to have zero-cost abstractions like Rust here? How can we achieve this, when the true cost of abstractions in Swift is so unpredictable and potentially very high? How can this be done currently, without the major enabling features Rust has? (cc @Graydon_Hoare) I don’t mean to insult anyone or disparage all the hard work that has already gone into Swift. But from my limited experience documented above, when it comes to eliminating abstractions, the comparison between Swift and Rust is pretty stark. At the moment Swift comes out looking like a junior Heldenkompiler with a β-STDLIB. Am I completely misguided here, and is zero-cost high-level iteration an explicit non-goal at this point in time? If so, how can this be improved down the road?

Core Abstractions in ABI Stable Future

It looks like core team members (@dabrahams, @Joe_Groff) are also concerned about design aspects surrounding the Sequence protocol:

(Though, I am quite skeptical of the practical feasibility of Dave’s suggestion, given all the fragile performance documented above.)

The implementation of the Sequence API contains a seemingly random assortment of lazy and eager sequence wrappers. This proliferation of concrete types and incoherent design choices seems particularly smelly to me…

I keep asking about the implications of ABI stability for the future evolution of Swift. Do we need to get the basic building blocks of high-level (functional) iteration right before declaring ABI stability? Maybe I’m wrong here, but my current understanding is that:

  • inlining is the mother of all optimizations and
  • ABI stability will freeze the current state, forever limiting the potential gains to be had.

Therefore quite a lot of care should be put into correctly annotating methods with attributes like @inlinable and @usableFromInline for all types that are part of the stdlib, because declaring ABI stability will erect some kind of resilience barriers (I’m hand-waving heavily, as I don’t really know what I’m talking about here), and Swift’s ever-increasing commitment to backwards compatibility will require maintaining the current code practically forever. New APIs can of course be added on top later. But I think this implies that whatever parts of the Swift implementation are later superseded by newer and better replacements will still be carried around as old baggage? I’ve noticed the idea thrown around of some kind of split of the standard library that could house the deprecated parts. Is this on the roadmap as a potential solution?

I’m assuming that some kind of systematic code audit of the standard library was performed in preparation for ABI stability. Why is the current state of the implementation surrounding Sequence deemed not a major blocker for Swift 5? I think that at minimum a major refactoring is in order.

I believe at a minimum the following steps are necessary before shipping Swift 5 (absent a major redesign of `Sequence`):
  • Implement LazySequence versions of dropFirst, dropLast, prefix and suffix, modeled after the prefix(while:) and drop(while:) implementations (SE-0045).
  • Investigate unifying the implementation, building the eager methods from an Array initialized with the lazy sequence. This would eliminate several internal types used to define these operations as custom sequences, giving us a tiny code-size win and a smaller potential bug surface, with a single place to focus the optimizations.
  • Investigate removing the AnySequence wrapper for eager methods in favor of returning the Array directly.

(The last suggestion is based on my vague impression that the lazy versions are never wrapped in AnySequence and still conform to the protocol requirements — I’m probably missing a key insight here.)
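The unification idea in the second bullet could be sketched like this. Here eagerPrefix is a hypothetical name, not a stdlib method, and I’m assuming the eager result can simply be an Array of the lazy computation:

```swift
extension Sequence {
    // Hypothetical: the eager method defined as an Array of the lazy one,
    // instead of returning a bespoke wrapper such as AnySequence.
    func eagerPrefix(_ maxLength: Int) -> [Element] {
        return Array(self.lazy.prefix(maxLength))
    }
}

let powers = sequence(first: 1, next: { (x: Int) -> Int? in x < 100 ? x * 2 : nil })
let xs = powers.eagerPrefix(4)
print(xs)  // [1, 2, 4, 8]
```

One definition site to optimize, one concrete return type, and the lazy path stays the single source of truth for the operation’s semantics.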

Dave’s and Joe’s concerns probably imply more serious design changes as well.
What is the current game plan?

4 Likes

Who said we want that? You will have to wrest imperative coding from my cold dead hands...

2 Likes

No one is suggesting that imperative should be worse, so please don't prevent those who prefer other styles from improving the language and compiler.

4 Likes

I suggest, to increase the chances of getting substantial feedback from knowledgeable (but busy) core team members, that somebody write a TL;DR with the essence of this elaboration ; )

8 Likes

I suggest you watch Guy Steele's talk from 2009:
Organizing Functional Code for Parallel Execution;
or, foldl and foldr Considered Slightly Harmful

Here are the slides. Key takeaway from page 70:

1 Like

Sorry, but I'm not a follower of the church of Haskell. And I mostly don't need highly parallelized code, and I'm willing to take big slowdowns if they protect me from having to use functional programming. Maintainability is much more important.

Again, none of this is going to make imperative code perform worse. The goal here is to make, as well as possible, functional code perform as well as its imperative equivalent. If you don't use functional code, then this doesn't directly affect you at all.

2 Likes

You've made this same kind of comment in other threads, where you try to shut down discussion of things you clearly state don't concern you because you don't use them. You think functional code is less maintainable, but that's bias speaking. No one style is more maintainable than any other; it all depends on who is doing the maintenance. And again, it doesn't concern you, because no one is asking, never mind forcing, you to use such a style.

You are clearly trying to be a hindrance, and not to provide any constructive criticism. Please stop.

2 Likes

@Avi Sorry to say, but you are misrepresenting his post. And tbh it sounds more like you are trying to shut him down. He just stated his own opinions. He did NOT in any way even remotely try to "shut down" anything or anybody. If you have different opinions: great, post yours. If you find he presents a weak case and you want him to argue more substantially: great. But please stop this modern style of declaring disliked opinions as invalid participation.

Yes, being a hindrance concerning FP is intended. I don't want Swift to become FP-focused. The std library is too strongly influenced by it already. So it IS a problem for me: the more it invades the std lib, the more I have to use it.

So what should I do to prevent this, instead of taking part in FP feature discussions and voicing my concerns?

Well, in terms of computability and efficiency, the imperative and functional styles of programming are equivalent. A really good compiler can show that by compiling two equivalent programs, one imperative and one functional, to the exact same binary.

What Pavol is asking for here (I think; his post is pretty long) is that this equivalence be reached, such that both styles get the efficiency they deserve.

1 Like

And Pavol, you're right in wanting to be able to get good performance from FP. But parallel execution isn't available for all problems; there are certain structures of algorithms that can't be parallelized. Pinkamena is right in that regard: you don't need FP for inherently sequential programs, even though you might be able to argue otherwise for parallel ones.

1 Like

This would be a state I'm fine with. What set me off was this:

If we are ever to free ourselves from the need to manually write for and while loops in imperative style to get a reasonable performance

This led me to think the proposal was to replace for and while in Swift.

1 Like

Also, we should remember that a lot of this FP functionality is based under the hood on for/while loops, and if we don't want this, we can implement the behavior ourselves directly. This is more about trying to mitigate the performance limitations of FP constructs, I think, which is quite reasonable.

1 Like

Ah, no worries. I'm sure that, even if that's what he meant (which he might've, but I really dunno), there's no chance we'd ever take fors and whiles out of Swift. They are very useful, both to express an algorithm and to get it done. And I'm sure most of the community, including the Core Team, would agree on that :slight_smile:

2 Likes

If you read attentively, you would understand that I chose my words quite carefully. The quoted sentence says that the current state of Swift forces one to write in imperative style for performance reasons. It is a one-way dictate. No one is pushing anyone to adopt functional programming against their will. FP as applied to Swift is just about giving names to the most useful and generally reusable patterns, lifting them one level higher. If you see map, reduce, filter or prefix(while:), you not only know that we are iterating under the hood, you also see the author clearly communicating why we’re doing so. You shouldn’t need to simulate all the loop’s side effects step by step in your head, as is the case with manual iteration. It’s quite freeing once you internalize this handful of higher-level concepts into your mental vocabulary!

Swift, like all modern programming languages, has recognized these benefits and provides the basic building blocks as part of the standard library. It’s the compiler’s job to assemble all the pieces into a tight and efficient loop. This thread is about Swift failing to do so in surprisingly many cases that are just a small step off the beaten path.
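To illustrate what I mean by naming the pattern instead of simulating the loop, compare these two spellings of the same trivial computation (made up for this post, not from the playground):

```swift
let values = [3, 1, 4, 1, 5, 9, 2, 6]

// Imperative: the reader replays the loop to recover the intent.
var sumImperative = 0
for v in values {
    if v % 2 == 0 {
        sumImperative += v * v
    }
}

// Declarative: the names announce the intent up front.
let sumFunctional = values.lazy
    .filter { $0 % 2 == 0 }
    .map { $0 * $0 }
    .reduce(0, +)

print(sumImperative, sumFunctional)  // 56 56
```

Both compute the sum of squares of the even elements; the whole point of this thread is that the second form should not have to cost more than the first.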

I’m willing to patiently educate anyone who encounters the high-level iteration concepts for the first time about their benefits, but I would prefer if this thread could stay on the original topic, instead of devolving into a religious war.

5 Likes