Troubling `Sequence`

Benchmarking the Strange Attractors

The Mandelbrot set is the set of complex numbers c for which the function f_c(z) = z² + c does not diverge when iterated from z = 0, i.e., for which the sequence f_c(0), f_c(f_c(0)), etc., remains bounded in absolute value.

The simplest algorithm for generating a representation of the Mandelbrot set is known as the “escape time” algorithm. A repeating calculation is performed for each x, y point in the plot area and based on the behavior of that calculation, a color is chosen for that pixel.

Coming back to the mesmerizing fractals: I have continued to play with the MandelbrotSwifty toy project after porting it to Swift 4, so that I can come up with a new benchmark to be added to the SBS that would stress the chaining of the sequence slicing methods in a somewhat realistic manner.

I think this toy example is quite interesting for exploring the ability of the Swift compiler to eliminate higher-level abstractions. At its core it requires extending the language with a new value type, a complex number (which isn’t at the moment provided by the standard library). There is a well-known iterative implementation that gives us a clear baseline, and the implementation can be composed of several higher-level functional concepts that make for a good stress test of specialization and devirtualization in the compiler. The functional expression of the problem is at its core the most fundamental kind of reduction and iteration. It allows us to explore the cost of these higher-level abstractions in isolation, but even more importantly in various combinations. In my experience it is this interaction that currently brings out various edge cases where the compiler, and the basic building blocks provided by the standard library, fail to deliver reliable performance.

So far I am still in an exploratory phase, where I have come up with tens of combinations of the basic programming patterns and underlying implementations to express the same computation, always tweaking the implementation in search of a high-level expression of the underlying concept that performs reasonably close to the imperative expression.

The core iterative computation at the heart of the Mandelbrot set produces an infinite sequence of complex numbers. The escape time algorithm then puts an upper bound on the sequence, terminating when one of two conditions is met:

  • the iteration count exceeds the set limit and the point is considered to be part of the Mandelbrot set, or
  • the iterated number falls outside of the circle of radius 2 around the origin, is definitely diverging under iteration to infinity, and is therefore not part of the Mandelbrot set.

Finally we count the length of the produced sequence and choose a corresponding color (or symbol in our text style) showing how long it took for it to escape.

We first generate our “pixel grid” — a 2D Array of complex numbers. The next step uses the provided “renderer”, a function of type (ℂ) -> Int, that computes the “color” as the length of the iterated sequence. The final color transformation to ASCII art is shared by all implementations and is performed as a functional reduction over the two-level map of complex numbers.

let asciiGradient = Array(" .,:;|!([$O0*%#@")
func toAscii(_ n: Int) -> Character { return asciiGradient[n - 1] }

let maxIter = asciiGradient.count // 16
func side(_ size: Int, _ start: Double, _ end: Double) -> StrideTo<Double> {
    return stride(from: start, to: end,
                  by: (end - start) / Double(size))
}
let m = 10 // use 1 when printing art
let w = 64 * m
let h = 32 * m

let sideX = side(w, -2, 2)
let sideY = side(h, -2, 2)

let grid = sideY.map({ y in sideX.map({ x in ℂ(x, i:y) }) })

func renderMandelbrot (_ renderer: (ℂ) -> Int) -> String {
    return grid.map {
        String($0.map(renderer).map(toAscii))
        }.joined(separator: "\n")
}

We are interested in the performance of various renderers relative to the baseline set by the following function in imperative style:

func imperative(c: ℂ) -> Int {
    var z = ℂ(0)
    var i = 0
    repeat {
        z = z*z + c
        i += 1
    } while z.isPotentiallyInSet() && i < maxIter
    return i
}
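
For illustration, any renderer with the (ℂ) -> Int shape can be plugged straight into renderMandelbrot. The driver line below is a hypothetical example, not one of the project snippets:

print(renderMandelbrot(imperative)) // with m = 1 above, prints the ASCII art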

For consistency, the exit condition for escaping points is implemented as an extension on the ComplexNumber type:

typealias ℂ = ComplexNumber
extension ℂ {
    @inline(__always)
    func isPotentiallyInSet() -> Bool {
        return (Re * Re + Im * Im) <= 4
    }
}
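
The snippets above assume a ComplexNumber value type with Re/Im properties and the usual arithmetic operators. Here is a minimal sketch of such a type, just so the examples are self-contained; the actual implementation in MandelbrotSwifty may differ in details:

struct ComplexNumber {
    // Uppercase property names match the usage in isPotentiallyInSet() above.
    let Re: Double
    let Im: Double

    // Covers both initializer forms used in the examples: ℂ(0) and ℂ(x, i: y).
    init(_ re: Double, i im: Double = 0) {
        Re = re
        Im = im
    }

    static func + (lhs: ComplexNumber, rhs: ComplexNumber) -> ComplexNumber {
        return ComplexNumber(lhs.Re + rhs.Re, i: lhs.Im + rhs.Im)
    }

    static func * (lhs: ComplexNumber, rhs: ComplexNumber) -> ComplexNumber {
        return ComplexNumber(lhs.Re * rhs.Re - lhs.Im * rhs.Im,
                             i: lhs.Re * rhs.Im + lhs.Im * rhs.Re)
    }
}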

The following is the most structurally equivalent translation from the original Haskell to Swift that I came up with, using sequence to generate the iterated quadratic function application and the Sequence methods prefix(while:) and prefix to slice and dice it:

func slpwpc(c: ℂ) -> Int {
    return sequence(first: ℂ(0), next: {z in z * z + c})
        .lazy
        .prefix(while: {$0.isPotentiallyInSet()})
        .prefix(maxIter)
        .count()
}

It is 33x slower than the imperative baseline. The count is a custom Sequence extension that computes the length of the resulting sequence:

extension Sequence {   
    @inline(__always)
    public func count() -> Int {
        return reduce(0, {i, _ in i + 1})
    }
}

These were just two examples of renderers. As I said, during this exploratory phase I have created maybe a hundred different versions. The main axes of exploration are different styles of:

  • computing the length of the sequence
    • purely functional reduction (.reduce)
    • inout reduction (.reduce(into:))
    • enumeration (.enumerated)
    • manual counting
  • expression of the two limiting conditions
    • depth of nested function calls (or inversely, level of manual inlining)
    • order of operations performed
  • various re-implementations of the sequence operations (for the composed variants)
    • lazy prefix reimplementation in style of SE-0045
    • sequence based generator-like structures (using both first and state variants)
      • variants using defer keyword
      • use of inline closures vs functions for the next parameter

My main point, and the end goal we should aim for, is that all sequence operations (that also return a sequence) must support method chaining without significant loss of performance. After optimization they should behave as if we were adding if statements and local variables into a manual for loop. Their current state of performance is unacceptable.

Frankly, I am quite lost in the sea of implementations I have created. I believe functional composition deserves much more thorough coverage by performance tests than we currently have, therefore I think the next step is to take the patterns I came up with and convert them into a format suitable for the SBS, while generating all interesting permutations using GYB. Is this a research direction worthy of further exploration?

Observations from the Preliminary Results

I’m not going to post any tabulated results here, but encourage you to run the MandelbrotSwifty project for yourself. Suffice it to say that the performance varies by several orders of magnitude. I haven’t yet analyzed all the reasons in much detail. Perhaps the main issue here is proper calibration of my expectations: It’s hard to say what is unreasonable wishful thinking on my part (sufficiently smart compiler) and what are real performance bugs. So this is a grab bag of things that strike me as suspicious.

Default Sequence Operations don’t Compose Well

I’m not talking about the default eager methods, because chaining them processes the sequence as many times as the length of the method chain — from this perspective they don’t compose at all. My focus is on the lazy methods, which should in theory be fused together into a single optimized block of code that passes over the base sequence once, producing the result sequence.
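
To make that distinction concrete, here is a toy example of my own, unrelated to the Mandelbrot code:

let numbers = Array(1...1_000)

// Eager: each call materializes an intermediate Array, so the data is
// traversed once per method in the chain.
let eager = numbers.map { $0 * 2 }.filter { $0 % 3 == 0 }.prefix(10)

// Lazy: in theory the chain fuses into a single on-demand pass over the
// base collection, with no intermediate storage.
let fused = numbers.lazy.map { $0 * 2 }.filter { $0 % 3 == 0 }.prefix(10)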

I’ve analyzed why chained calls to prefix(while:) and prefix on LazySequence don’t compose that well from a performance standpoint. It’s because the prefix implementation was created before the LazySequence APIs and never refactored to match that style. In particular, you always get the base sequence wrapped in AnySequence, and the final reduction is dominated by going over the AnyIterator in unspecialized generic calls to the protocol witness for IteratorProtocol. I rewrote it to match the implementation style from SE-0045 and got a 10x speedup. More on that below.

So some of the problems stem from suboptimal implementation — the default sequence methods carry significant technical debt, in my opinion, due to piecemeal additions over time with no refactoring.

But some of the problems could be caused by design: putting type erasure using AnySequence into this iteration-heavy API seems like a particularly ill-advised choice. Or maybe it is more a matter of suboptimal implementation of those type erasers? AnySequence doesn’t appear to forward the .lazy call to the underlying wrapped sequence, which might be the cause of the miserable performance when basic sequences get wrapped in it. I haven’t tested this hypothesis yet.

Performance Impact of Closures is Inconsistent

This is, in my opinion, the most “swifty” expression of the Mandelbrot renderer:

func mandelbrot_(_ c: ℂ) -> Int {
    return sequence(first: ℂ(0), next: {z in
        guard z.isPotentiallyInSet() else {return nil}
        return z * z + c
    }).lazy.prefix(maxIter).count()
}

It fuses one limiting bound of the infinite sequence into the inline closure with the guard statement and leaves the other one for the prefix method. The closure for the next parameter captures the value of c. We’re using the first variant of sequence, which internally contains another capturing closure. Using the master dev toolchain snapshot from May 30, 2018, this is 28x slower than the imperative version. But as we’ll see later, this is mainly due to the prefix implementation’s use of AnySequence.

First vs. State

It’s interesting to observe the difference between the two forms of the sequence functions. The state form propagates the types across the call chain, while the first form performs a kind of closure-based type erasure, albeit with a strange (T?, Bool) state type (see the sketch at the end of this section). As I was worried about the cost of that closure, I have reformulated the Mandelbrot set iteration to use sequence(state:next:):

typealias OrbitState = (z: ℂ, c: ℂ)

func mandelbrotS(_ c: ℂ) -> Int {
    return sequence(state: (ℂ(0), c), next: {
        (state: inout OrbitState) -> ℂ? in
        let (z, c) = state
        guard z.isPotentiallyInSet() else {return nil}
        state.z = z * z + c
        return z
    }).prefix(maxIter).count()
}

This version doesn’t capture the c value inside the closure, but stores it explicitly in the tuple passed to the state argument. In turn, the inout state argument passed to next allows us to write it as a non-capturing closure expression. I know it is not a pure function by the traditional definition, because it takes an inout parameter and therefore has a side effect, but if we view that as a manual optimization technique, then it is kind of pure. It has a shape similar to reduce(into:), which is a hybrid inout variant of the traditional, purely functional reduce.

This seems to be an improvement over the first version of our Mandelbrot renderer: it’s just 17x slower than the imperative baseline.

Originally I was fantasizing that using sequence(state:next:) to implement all the sequence methods would enable better chaining, where all the iteration state could potentially be promoted to the stack, eliminating reference-counting traffic for heap-allocated structures, but it ultimately didn’t pan out like that with the current state of the compiler. Maybe this is limited by the extra checking mandated by SE-0052.
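
As an aside, here is a rough sketch (my own paraphrase, not the literal standard library source) of how the first form can be expressed in terms of the state form. It shows where the strange (T?, Bool) state type comes from: the optional holds the current value and the Bool tracks whether the seed has been returned yet.

func sequenceFirstSketch<T>(first: T, next: @escaping (T) -> T?)
    -> UnfoldSequence<T, (T?, Bool)> {
    return sequence(state: (Optional(first), false), next: {
        (state: inout (T?, Bool)) -> T? in
        guard let current = state.0 else { return nil } // exhausted: stay nil
        if !state.1 {
            state.1 = true   // mark the seed as consumed
            return current   // first call returns the seed unchanged
        }
        let following = next(current)
        state.0 = following  // advance (or terminate, if next returned nil)
        return following
    })
}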

Closure Expressions vs. Function References

The function signatures of both versions of sequence declare the next argument as an @escaping closure, because it will be stored inside the UnfoldSequence for later use. Wondering about the cost of this closure, I have tried variants that declare it inline as a closure expression as well as passing in a function reference. According to the Swift Language Guide, global functions are closures that have a name and do not capture any values. Methods are functions that are associated with a particular type. If I’m not mistaken, under the hood the capturing closure looks a lot like an instance of a class: a heap-allocated, reference-counted thing. I was hoping that passing a reference to a function would have a lighter footprint, making it easier for the compiler to eliminate the abstractions, given there should be no need for reference counting to keep an instance alive.

I have used this idea in two places: the base sequence that’s generating the complex numbers, and then in the sequence-based reimplementations of prefix and prefix(while:). The performance impact of these variations was rather inconsistent. Sometimes it had no impact, sometimes the function reference performed worse than the closure expression defined inline. I have tested defining the function that was passed by reference both as a normal internal instance method and as a static method, both in the extension of Sequence — there was no difference between them. This seems rather surprising, and I’m not sure why.

I have a hunch that sometimes there is a bit of extra reference counting thrown in for good measure. It looks like there are some strange de-optimizations going on.

I haven’t yet explored this area thoroughly, although there has been some improvement for certain instances of this problem between Swift 4.1.2, which ships with Xcode 9.4, and master — with master it’s more likely that a closure expression and a function reference perform the same.

Best Of and Worst Of

Just this week I was finally able to cobble together a composed “swifty” version that performs on par with the imperative baseline:

func mandelbrotL(_ c: ℂ) -> Int {
    return sequence(first: ℂ(0), next: {z in
        guard z.isPotentiallyInSet() else {return nil}
        return z * z + c
    }).lazy.prefixL(maxIter).count()
}

If you think this looks familiar, you’re right. The only difference from mandelbrot_ above is the use of .prefixL — a proper LazySequence implementation of the method, like those from SE-0045. Quite anticlimactic, if I may say.
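
For reference, a prefixL along those lines might look roughly like the following. This is my reconstruction in the spirit of SE-0045, not the actual code from MandelbrotSwifty; a full version would also conform the wrapper to LazySequenceProtocol so further lazy methods can be chained after it.

struct _PrefixLIterator<Base: IteratorProtocol>: IteratorProtocol {
    var base: Base
    var remaining: Int

    mutating func next() -> Base.Element? {
        guard remaining > 0 else { return nil }
        remaining -= 1
        return base.next()
    }
}

struct _PrefixLSequence<Base: Sequence>: Sequence {
    let base: Base
    let maxLength: Int

    func makeIterator() -> _PrefixLIterator<Base.Iterator> {
        return _PrefixLIterator(base: base.makeIterator(), remaining: maxLength)
    }
}

extension LazySequenceProtocol {
    // No AnySequence in sight: the base sequence type is preserved,
    // so the optimizer has a chance to specialize the whole chain.
    func prefixL(_ maxLength: Int) -> _PrefixLSequence<Elements> {
        return _PrefixLSequence(base: elements, maxLength: maxLength)
    }
}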

Interestingly, using prefixL with the sequence(state:next:) variants (with a closure expression or a function reference for the next argument) results in a 1.7x slowdown compared to the imperative baseline.

The naming convention used for the composed-style renderers starts with s when the complex numbers are generated from a sequence(first:next:) base and with _s for sequence(state:next:). So if it starts with an underscore, it means there’s no capturing closure inside the base sequence. The rest of the name is composed of prefixes of the method names in the chain. See the example below.

As shown earlier, the composed renderer slpwpc was 33x slower than the imperative baseline (IB); _slpwpc is 22x IB. Manually inlining the isPotentiallyInSet condition into the base sequence generator, as was done in the fused renderers, clearly helps a bit: mandelbrot_ is 28x IB, mandelbrotS is 18x IB.

My previous best effort for composed renderers was based on fusing the two conditions into a single prefix(while:) while operating on an enumerated sequence. This works around the performance-killing prefix implementation by completely eliminating the need for it:

func _selpwc(c: ℂ) -> Int {
    return sequence(state: (ℂ(0), c), next: orbit)
        .enumerated()
        .lazy
        .prefix(while: {$1.isPotentiallyInSet() && $0 < maxIter})
        .count()
}

This is 2.6x IB, while the selpwc is 12x IB.
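
(Here orbit is presumably the state-based iteration step from mandelbrotS, extracted into a named function that can be passed by reference, along these lines:)

func orbit(_ state: inout OrbitState) -> ℂ? {
    let (z, c) = state
    guard z.isPotentiallyInSet() else { return nil }
    state.z = z * z + c
    return z
}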

So how good is prefixL (abbreviated as pL in function names here) at improving the performance of these composed-style renderers? The _slpLpwc is 2.7x IB and slpLpwc is 12x IB, both matching the previous best. Not bad…

Side note: the order of methods with different implementation complexity also matters for the final performance. Just swapping the prefix and prefix(while:) is absolutely disastrous: _slppwc is 101x IB, slppwc is 110x IB. Even this worst case is already an improvement in the master dev toolchain; the same code compiled with Swift 4.1.2, which ships with Xcode 9.4, was giving me slowdowns of 144x and 148x IB respectively.

Type Erasure to the Rescue?! 🤯

From my earlier tests in Swift 3, I had concluded that type erasure was disastrous for performance. But due to my stupid naming convention for alternative prefix methods, I mistyped one more underscore and accidentally used an _AnyIterator-based reimplementation of the prefix method I had lying around in my code from way back when. _AnyIterator is just a renamed old version of _ClosureBasedIterator, ripped out of the AnyIterator implementation.
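
Based on that description, such a closure-based iterator looks roughly like this (a paraphrase; both the project’s _AnyIterator and the stdlib’s internal _ClosureBasedIterator may differ in details). Conforming it to Sequence as well lets us call count() directly on the result:

public struct _AnyIterator<Element>: IteratorProtocol, Sequence {
    private let _next: () -> Element?

    public init(_ next: @escaping () -> Element?) {
        self._next = next
    }

    public func next() -> Element? {
        return _next()
    }
}

The accidentally used ___prefix reimplementation builds on top of it: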

extension Sequence {
    @_inlineable
    public func ___prefix(_ maxLength: Int) ->
        _AnyIterator<Iterator.Element> {
            var iterator = makeIterator()
            var count = maxLength
            
            return _AnyIterator {
                count -= 1
                return (0 <= count) ? iterator.next() : nil
            }           
    }
}

To my great surprise, I got this: mandelbrot__a is 1x IB, same as mandelbrotL! Moreover, using this with the sequence(state:next:)-based variants mandelbrotS and mandelbrotSS (the latter uses a “static” function for next instead of a closure expression) also yields the same: both mandelbrotS__a and mandelbrotSS__a are 1x IB! This is even better than the 1.7x IB we got by using prefixL on the same cases.

Trying this with the composed-style renderers blows the mind again: sl____pw___pc, _sl____pw___pc, sl____pws___pc and _sl____pws___pc — all at 1x IB, overtaking the previous best composed renderer.

Notice how all of these implementations are even more robust than the prefixL-based versions when it comes to not caring about whether the base sequence is using a closure or how the while predicate gets defined. All the abstractions disappear and we end up with the same performance as the imperative baseline. It’s as if erasing the underlying sequence type inside a closure spurred the compiler into much more aggressive inlining.

Or maybe this is a result of simply ignoring the iterator post-nil guarantee (SE-0052)?

Just to be sure I have not been wrongly accusing type erasure of killing performance before, I have retested with a new variant, prefixAI, that builds on top of the AnyIterator provided by the standard library. So: mandelbrotAI is 32x IB, mandelbrotSAI and mandelbrotSSAI are 19x IB. This discrepancy is clearly caused by the much more complex internals of AnyIterator, which uses two different types of type erasers:

  • struct _ClosureBasedIterator — used for trailing closure init.
  • class _AnyIteratorBoxBase — used for wrapping existing types conforming to IteratorProtocol.

To accommodate both forms, the first one always gets wrapped — along with all the performance optimization potential — inside another internal final class _IteratorBox. No wonder the optimizer chokes on it.
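
Structurally, the layering looks roughly like this (a paraphrased sketch with my own names, not the literal stdlib source):

// Abstract base class used to erase the type of an arbitrary iterator.
class IteratorBoxBaseSketch<Element>: IteratorProtocol {
    func next() -> Element? { fatalError("abstract") }
}

// Concrete box holding the wrapped iterator; adds a class allocation
// and virtual dispatch on every call to next().
final class IteratorBoxSketch<Base: IteratorProtocol>: IteratorBoxBaseSketch<Base.Element> {
    var base: Base
    init(_ base: Base) { self.base = base }
    override func next() -> Base.Element? { return base.next() }
}

struct AnyIteratorSketch<Element>: IteratorProtocol {
    let box: IteratorBoxBaseSketch<Element>

    // Wrapping an existing iterator: one layer of boxing.
    init<I: IteratorProtocol>(_ base: I) where I.Element == Element {
        box = IteratorBoxSketch(base)
    }

    // Trailing-closure form: the closure-based iterator still ends up boxed.
    init(_ body: @escaping () -> Element?) {
        box = IteratorBoxSketch(_AnyIterator(body))
    }

    func next() -> Element? { return box.next() }
}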

I don’t quite know what to make of all this. It feels like there is quite some potential to come up with some kind of simpler and better performing hybrid unification of the type-erasing form of sequence and _ClosureBasedIterator, which could replace all the custom types floating around the Sequence API. IMHO this is a very promising direction for further exploration, wouldn’t you agree?


Notes on sequence

During my research I have kept notes of relevant points from the history of sequence design and implementation. FWIW, here are my loosely structured notes, if you’re into that sort of Swift Evolution archeology.

SE-0045 Review & Announcement summary:
SE-0094 Review and Acceptance with Revision:

Kevin mentions that UnfoldSequence is equivalent to AnyIterator; Dave reacts:

The design of AnySequence and AnyIterator dates from a time when the compiler was very immature and many design avenues we might have taken were not available. I find the sequence forms to be superior in general, and IMO at some point we should re-evaluate the interfaces to AnySequence and AnyIterator.

!!!!!!!!!!!!!!!!!!!!!!!

Kevin realizes UnfoldSequence<T> would need to perform type erasure using a capturing closure, and changes the implementation signature to UnfoldSequence<T, (T?, Bool)>.

I think it's better to diverge slightly from the proposal than it is to introduce unnecessary (albeit small) performance cost.

Long type signatures are a known fact of life, as Kevin notes:

LazyMapSequence<LazyFilterSequence<LazyMapSequence<Self.Elements, ElementOfResult?>>, ElementOfResult>


Crying Wolf?

I’m truly sorry the ending will probably sound like an angry and misinformed rant. From my quite limited understanding, I really am very worried about the future; I hope Swift’s roadmap contains solutions and I just don’t see the carefully planned road from here to there. Any and all explanations as well as pointers to sources for further study are much appreciated!

Disclaimer: I should fully disclose the level of my ignorance of the key aspects here — mainly ABI stability. I’m probably overestimating the limits imposed by it. I’ve read all the manifestos, but I don’t grasp all the implications, especially how they interact. I don’t understand how the major future changes will be introduced into the Swift language: ownership, generalized existentials, coroutines…

From my perspective, getting the implementation of the sequence APIs right is of paramount importance. If we are ever to free ourselves from the need to manually write for and while loops in imperative style to get reasonable performance, we must get these basic building blocks for high-level iteration right.

Are we aiming to have zero-cost abstractions like Rust here? How can we achieve this when the true cost of abstractions in Swift is so unpredictable and potentially very high? How can this be done currently, without the major enabling features Rust has? (cc @Graydon_Hoare) I don’t mean to insult anyone or disparage all the hard work that has gone into Swift already. But from my limited experience documented above, when it comes to eliminating abstractions the comparison between Swift and Rust is pretty stark. At the moment Swift comes out looking like a junior Heldenkompiler with a β-STDLIB. Am I completely misguided here, and is zero-cost high-level iteration an explicit non-goal at this point in time? If so, how can this be improved down the road?

Core Abstractions in an ABI-Stable Future

It looks like core team members (@dabrahams, @Joe_Groff) are also concerned about design aspects surrounding the Sequence protocol:

(Though, I am quite skeptical of the practical feasibility of Dave’s suggestion, given all the fragile performance documented above.)

The implementation of the Sequence API contains a seemingly random assortment of lazy and eager sequence wrappers. This proliferation of concrete types and incoherent design choices seems particularly smelly to me…

I keep asking about the implications of ABI stability for the future evolution of Swift. Do we need to get the basic building blocks of high-level (functional) iteration right, before declaring ABI stability? Maybe I’m wrong here, but my current understanding is that:

  • inlining is the mother of all optimizations and
  • ABI stability will freeze the current state, forever limiting the potential gains to be had.

Therefore quite a lot of care should be put into correctly annotating methods with attributes like @inlinable and @usableFromInline for all types that are part of the stdlib, because declaring ABI stability will erect some kind of resilience barriers (I’m hand-gesturing heavily, as I don’t really know what I’m talking about here), and Swift’s ever-increasing commitment to backwards compatibility will require maintaining the current code practically forever. New APIs could of course be added on top later. But I think it implies that whatever parts of the Swift implementation are later superseded by newer and better replacements will still be carried around as old baggage? I’ve noticed the idea thrown around of some kind of split of the standard library that could house the deprecated parts. Is this on the roadmap as a potential solution?

I’m assuming that some kind of systematic code audit of the standard library was performed in preparation for ABI stability. Why is the current state of the implementation surrounding Sequence deemed not a major blocker for Swift 5? I think that, at minimum, a major refactoring is in order.

I believe at minimum the following steps are necessary before shipping Swift 5 (absent a major redesign of `Sequence`):
  • Implement LazySequence versions for dropFirst, dropLast, prefix and suffix modeled after the prefix(while:) and drop(while:) implementation (SE-0045).
  • Investigate unifying the implementation, building the eager methods with an Array initialized from the lazy sequence (see the sketch below). This would eliminate several internal types used to define these operations as custom sequences, giving us a tiny code size win and a smaller potential bug surface, with a single place to focus the optimizations.
  • Investigate removing the AnySequence wrapper for eager methods in favor of returning the Array directly.

(The last suggestion is based on my vague impression that the lazy versions are never wrapped in AnySequence and still conform to the protocol requirements — I’m probably missing a key insight here.)
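
As an illustration of the second bullet above, the unification could be as simple as defining the eager methods in terms of their lazy counterparts plus an Array initializer. This is a sketch with a hypothetical name, using the existing lazy prefix(while:) as a stand-in:

extension Sequence {
    // Illustrative only; not the actual stdlib definition.
    func eagerPrefixSketch(while predicate: @escaping (Iterator.Element) -> Bool) -> [Iterator.Element] {
        // Reuse the lazy implementation and materialize the result.
        return Array(self.lazy.prefix(while: predicate))
    }
}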

Dave’s and Joe’s concerns probably imply more serious design changes as well.
What is the current game plan?
