Best practices for optimizing performance-critical code using Swift Arrays?

RonAvitzur · November 16, 2021, 12:32am

While translating 35-year-old C++ code (ye olde PowerPC Graphing Calculator), I've run into performance issues in a few places. Where the old C code used local fixed-size stack-allocated arrays, translating that in the natural way to using Swift Array introduced overhead which Instruments showed to be the hot spot.

The performance sensitive routine is a numeric function evaluate<T>() where T is either Double or a struct of two or four Doubles.

The first attempt at translating to Swift used Array[T] so the start of the function looked like this (since stack allocation is essentially free in the original C code)

    var stack = [T](repeating: T.nan, count: maxStackDepth)
    var cseVariables  = [T](repeating: T.nan, count: numCSEvariables)
    var loopVariables = [T](repeating: T.nan, count: numLoopVariables)

Instruments showed significant allocation and ARC overhead both here and accessing the arrays, even with safety checks disabled. My next attempt at optimizing this combines the three arrays as UnsafeMutableBufferPointers:

    let stack = UnsafeMutableBufferPointer<T>.allocate(capacity: maxStackDepth + numCSEvariables + numLoopVariables)
    let cseVariables  = UnsafeMutableBufferPointer(rebasing: stack[maxStackDepth...])
    let loopVariables = UnsafeMutableBufferPointer(rebasing: stack[(maxStackDepth + numCSEvariables)...])
    defer { stack.deallocate() }

That was a significant performance improvement, but allocation on every call was still a hot spot in the profile. My next step was to hoist the allocation out of the call to evaluate. However, evaluate is a generic function, but its class is not a generic type. The class has specific calls to evaluateAsComplex, evaluateAsInterval, or evaluateAsDouble which call a specialization of the generic evaluate. So to allocate the stack outside the generic function, I tried this in Program.init:

        let n = maxStackDepth + numCSEvariables + numLoopVariables
        self.stack = UnsafeMutableRawBufferPointer.allocate(byteCount: n * MemoryLayout<ComplexInterval>.stride, alignment: MemoryLayout<Double>.alignment)

and then the various evaluate calls look like:

func evaluateAsComplex(x: Complex) -> Complex {
    program.run(x: x, on: program.stack.bindMemory(to: Complex.self))
}
func evaluateAsInterval(x: Interval) -> Interval {
    program.run(x, on: program.stack.bindMemory(to: Interval.self)).0
}

stack is passed into evaluate rather than allocated on each call. It "seems to work", as far as I can tell. It is significantly faster. Yet, as a Swift newbie, I have concerns.

Are there best practices in Swift for when using Array introduces too much overhead? Is there a better way to do this? Are there hidden assumptions in the code above that will bite me later? Will SE-0322 provide a better solution? Should I just wait and use that instead? Having just watched Safely manage pointers in Swift - WWDC20 - Videos - Apple Developer are there other resources I should study to better understand the issues and approaches here?

dabrahams · November 19, 2021, 2:44am

Hi Ron,

RonAvitzur:

My next attempt at optimizing this combines the three arrays as UnsafeMutableBufferPointers:

    let stack = UnsafeMutableBufferPointer<T>.allocate(capacity: maxStackDepth + numCSEvariables + numLoopVariables)
    let cseVariables  = UnsafeMutableBufferPointer(rebasing: stack[maxStackDepth...])
    let loopVariables = UnsafeMutableBufferPointer(rebasing: stack[(maxStackDepth + numCSEvariables)...])
    defer { stack.deallocate() }

That was a significant performance improvement,

Okay, did you try just using one Array? I'd be really surprised if the speedup you saw here wasn't due to eliminating two of the 3 allocations.

My next step was to hoist the allocation out of the call to evaluate .

Hoisting the Array might be a good idea too.

IIUC, program.stack is effectively a global variable here. That's awful, but if you can get away with it, it might be your cleanest answer. But try to use an Array rather than raw memory if you can. And if you can't use an Array, there's always UnsafeRawBufferPointer or Data.

Are there best practices in Swift for when using Array introduces too much overhead?

Have you seen Array.withUnsafe[Mutable]BufferPointer? That's always been enough for me to get speed, if I discover safety checks are slowing down my array code..

Will SE-0322 provide a better solution?

Quite possibly, if you really need to use unsafe memory, it would.

If you could post a reduced example that shows what you're doing and demonstrates that it's slow, I could see if there's a way to speed it up without resorting to unsafe code.

Further musings…

Sadly, Swift doesn't expose an alloca-like facility. I found a post where @jrose claimed to have observed heap-to-stack optimization for non-escaping classes, which presumably should work for Array just as well as ManagedBuffer, but I can't reproduce it with either one.

You can use tuples to build a fixed-sized array on the stack, but as the article notes, there's no way to avoid an initialization cost.

If you're willing to mix C and Swift, maybe you should consider something like this: https://github.com/dabrahams/alloca-swift, which seems to work (though I haven't timed it). It's really inconvenient that you have to funnel all the data through a void*, but maybe you can make something better out of it.

scanon · November 19, 2021, 3:00am

IIRC, Array isn't actually non-escaping from the compiler's viewpoint due to an unfortunate implementation detail with capacity, which blocks this optimization from triggering. @David_Smith probably remembers the specifics.

ksluder · November 19, 2021, 5:40am

Swift 5.6 does, in fact, thanks to @grynspan’s SE-0322: Temporary uninitialized buffers

David_Smith · November 19, 2021, 5:52am

It works in some cases, but yeah, if we stop calling malloc_size it gets much more reliable. I have a joint PR in the works with @Michael_Gottesman that preliminarily delivers some nice 4x wins on a few benchmarks.

RonAvitzur · November 20, 2021, 5:50pm

I had not. When I tried that in the reduced example, it was the slowest variation, which I do not understand at all. But when used with an .withUnsafeMutableBufferPointer closure it is substantially faster.

Would that require passing it as an inout parameter to avoid making a local copy?

It is a property of the CompiledExpr class that holds the program representation of an Expr.

Yes. I've included it in the test cases below.

Here is a reduced example that shows some of the context and illustrates the performance differences. The performance-critical section is copy/pasted six times with variations.

When the code Array was a var property, timing was substantially slower. I should understand why, but I do not.

//  ArrayOverheadTimingTests Created by avitzur on 11/19/21.

import Foundation

/// An attempt to extract a skeleton of Graphing Calculator's performance critical innermost loop
/// to investigate abstraction penalties imposed by using Swift Array
///
/// For context:
///     Expr is the tree representation of a user-inputted equation (not necessary for this sample)
///     CompiledExpr holds a linear representation optimized for evaluation
///         Program is the list of instructions
///
///     a single compiledExpr instance can be called to evaluate on different numeric types (and also, not shown here, on a vector of input values)
///
///     program.run is copy/pasted with six variations below to compare timing of different ways
///
///     Building this as a command-line tool, optimized for speed, with Safety Checks disabled, outputs this on my M1 Mini:
///        run0 returns 10100.0 in 1.96s: Local Arrays
///        run1 returns 10100.0 in 1.41s: Local Arrays, pre-allocated, manually indexed
///        run2 returns 10100.0 in 2.13s: Combined Local Array, pre-allocated, manually indexed
///        run3 returns 10100.0 in 0.63s: Combined Local Array, pre-allocated, manually indexed, withUnsafeMutableBufferPointer
///        run4 returns 10100.0 in 0.58s: Passing pre-allocated UnsafeMutableRawBufferPointer
///        run5 returns 10100.0 in 0.57s: Passing pre-allocated UnsafeMutableRawBufferPointer without rebasing
///        run0 returns Interval(min: 10100.0, max: 10100.0) in 2.18s: Local Arrays
///        run1 returns Interval(min: 10100.0, max: 10100.0) in 1.51s: Local Arrays pre-allocated, manually indexed
///        run2 returns Interval(min: 10100.0, max: 10100.0) in 2.12s: Combined Local Array pre-allocated, manually indexed
///        run3 returns Interval(min: 10100.0, max: 10100.0) in 0.78s: Combined Local Array pre-allocated, manually indexed, withUnsafeMutableBufferPointer
///        run4 returns Interval(min: 10100.0, max: 10100.0) in 0.76s: Passing pre-allocated UnsafeMutableRawBufferPointer
///        run5 returns Interval(min: 10100.0, max: 10100.0) in 0.72s: Passing pre-allocated UnsafeMutableRawBufferPointer without rebasing
///        Program ended with exit code: 0


// Since I'm unsure whether or in what ways generic specialization contributes to performance issues,
// here is a toy protocol with two numeric types for timing comparison
protocol Computable {
    static var nan: Self { get }
    var isNaN: Bool { get }
    init(_ x: Double)
    static func +(a: Self, b: Self) -> Self
    static func /(a: Self, b: Self) -> Self
    static func >(a: Self, b: Self) -> Bool
}

extension Double: Computable {}

struct Interval : Computable {
    var min: Double
    var max: Double
    var isNaN: Bool { min.isNaN || max.isNaN }
    static let nan = Interval(.nan)
    init(_ x: Double) {
        self.min = x
        self.max = x
    }
    static func +(a: Self, b: Self) -> Self { Interval(a.min + b.min) }
    static func /(a: Self, b: Self) -> Self { Interval(a.min / b.min) }
    static func >(a: Self, b: Self) -> Bool { a.min > b.min }
}


enum Instruction: Equatable {
    case constant(Double)          // push constant onto stack
    case x                         // push graphing variable onto stack
    case getVariable(Int)          // push CSE variable onto stack
    case setVariable(Int)          // copy top-of-stack into CSE variable
    case getLoopVariable(Int)      // push loop variable onto stack
    case setLoopVariable(Int)      // copy top-of-stack into loop variable
    case pop                       // discard top-of-stack
    case loopIf(_ variable: Int, _ branchDistance: Int) // for looping on sums and products
    case incrementAndLoop(_ variable: Int, _ branchDistance: Int)
    case plus
    case result
}

class Program {
    let code: [Instruction] // TODO: change to var code and observe timing changes
    let numCSEvariables: Int       // size of local cseVariables  in run()
    let numLoopVariables: Int      // size of local loopVariables in run()
    let maxStackDepth: Int         // number of elements needed for stack[]

    var stack: UnsafeMutableRawBufferPointer // pre-allocate memory needed by program.run

    init() {
        code = [ // Sample program to evaluate the (sum of k from 0 to x of k) + (sum of k from 0 to x of k)
            .constant(0),
            .setLoopVariable(0),
            .x,
            .setLoopVariable(1),
            .pop,
            .loopIf(0, 3),
            .getLoopVariable(0),
            .plus,
            .incrementAndLoop(0, 3),
            .setVariable(0),
            .getVariable(0),
            .plus,
            .result
        ]
        numCSEvariables = 1
        numLoopVariables = 2
        maxStackDepth = 4
        let n = maxStackDepth + numCSEvariables + numLoopVariables
        self.stack = UnsafeMutableRawBufferPointer.allocate(byteCount: n * MemoryLayout<Interval>.stride, alignment: MemoryLayout<Double>.alignment)
    }

    // Using Array<T> in the most natural way
    func run0<T: Computable>(x: T) -> T {
        var stack: [T] = []
        var cseVariables  = [T](repeating: T.nan, count: numCSEvariables)
        var loopVariables = [T](repeating: T.nan, count: numLoopVariables)
        var y = T.nan // top of stack held separately

        var  pop: T       { stack.removeLast() }
        func push(_ x: T) { stack.append(x) }

        var pc = 0
        execution: while true {
            switch code[pc] {
            case .constant(let x):        push(y); y = T(x)
            case .x:                      push(y); y = x
            case .getVariable(let i):     push(y); y = cseVariables[i]
            case .getLoopVariable(let i): push(y); y = loopVariables[i]
            case .setVariable(let i):     cseVariables[i] = y
            case .setLoopVariable(let i): loopVariables[i] = y
            case .plus:                   y = pop + y
            case .pop:                    y = pop
            case .result:                 break execution
            case .loopIf(let loopIndex, let branchDistance):
                let k    = loopVariables[loopIndex]
                let kMax = loopVariables[loopIndex + 1]
                if  k > kMax { pc += branchDistance }
            case .incrementAndLoop(let loopIndex, let branchDistance):
                loopVariables[loopIndex] = loopVariables[loopIndex] + T(1)
                pc -= branchDistance
                continue // skip pc += 1
            }
            pc += 1
            if !(y is Double) && y.isNaN && !stack.isEmpty {
                return .nan
            }
        }
        return y
    }

    // Pre-allocate stack to avoid over overhead imposed by append/removeLast, keep track of position with stack pointer index sp
    func run1<T: Computable>(x: T) -> T {
        var stack = [T](repeating: T.nan, count: maxStackDepth)
        var cseVariables  = [T](repeating: T.nan, count: numCSEvariables)
        var loopVariables = [T](repeating: T.nan, count: numLoopVariables)
        var sp = 0
        var y = T.nan // top of stack held separately

        var  pop: T       { sp -= 1; return stack[sp + 1] }
        func push(_ x: T) { sp += 1; stack[sp] = x }

        var pc = 0
        execution: while true {
            switch code[pc] {
            case .constant(let x):        push(y); y = T(x)
            case .x:                      push(y); y = x
            case .getVariable(let i):     push(y); y = cseVariables[i]
            case .getLoopVariable(let i): push(y); y = loopVariables[i]
            case .setVariable(let i):     cseVariables[i] = y
            case .setLoopVariable(let i): loopVariables[i] = y
            case .plus:                   y = pop + y
            case .pop:                    y = pop
            case .result:                 break execution
            case .loopIf(let loopIndex, let branchDistance):
                let k    = loopVariables[loopIndex]
                let kMax = loopVariables[loopIndex + 1]
                if  k > kMax { pc += branchDistance }
            case .incrementAndLoop(let loopIndex, let branchDistance):
                loopVariables[loopIndex] = loopVariables[loopIndex] + T(1)
                pc -= branchDistance
                continue // skip pc += 1
            }
            pc += 1
            if !(y is Double) && y.isNaN && sp != 0 {
                return .nan
            }
        }
        return y
    }

    // combine the 3 Array<T> to reduce construction/destruction overhead
    func run2<T: Computable>(x: T) -> T {
        var stack = [T](repeating: T.nan, count: maxStackDepth + numCSEvariables + numLoopVariables)
        var sp = 0
        var y = T.nan // top of stack held separately

        var  pop: T       { sp -= 1; return stack[sp + 1] }
        func push(_ x: T) { sp += 1; stack[sp] = x }

        var pc = 0
        execution: while true {
            switch code[pc] {
            case .constant(let x):        push(y); y = T(x)
            case .x:                      push(y); y = x
            case .getVariable(let i):     push(y); y = stack[i + maxStackDepth]
            case .getLoopVariable(let i): push(y); y = stack[i + maxStackDepth + numCSEvariables]
            case .setVariable(let i):     stack[i + maxStackDepth] = y
            case .setLoopVariable(let i): stack[i + maxStackDepth + numCSEvariables] = y
            case .plus:                   y = pop + y
            case .pop:                    y = pop
            case .result:                 break execution
            case .loopIf(let loopIndex, let branchDistance):
                let k    = stack[maxStackDepth + numCSEvariables + loopIndex]
                let kMax = stack[maxStackDepth + numCSEvariables + loopIndex + 1]
                if  k > kMax { pc += branchDistance }
            case .incrementAndLoop(let loopIndex, let branchDistance):
                stack[maxStackDepth + numCSEvariables + loopIndex] = stack[maxStackDepth + numCSEvariables + loopIndex] + T(1)
                pc -= branchDistance
                continue // skip pc += 1
            }
            pc += 1
            if !(y is Double) && y.isNaN && sp != 0 {
                return .nan
            }
        }
        return y
    }

    // access stack .withUnsafeMutableBufferPointer
    func run3<T: Computable>(x: T) -> T {
        var stack = [T](repeating: T.nan, count: maxStackDepth + numCSEvariables + numLoopVariables)
        return stack.withUnsafeMutableBufferPointer{ stack in
            var sp = 0
            var y = T.nan // top of stack held separately

            var  pop: T       { sp -= 1; return stack[sp + 1] }
            func push(_ x: T) { sp += 1; stack[sp] = x }

            var pc = 0
            execution: while true {
                switch code[pc] {
                case .constant(let x):        push(y); y = T(x)
                case .x:                      push(y); y = x
                case .getVariable(let i):     push(y); y = stack[i + maxStackDepth]
                case .getLoopVariable(let i): push(y); y = stack[i + maxStackDepth + numCSEvariables]
                case .setVariable(let i):     stack[i + maxStackDepth] = y
                case .setLoopVariable(let i): stack[i + maxStackDepth + numCSEvariables] = y
                case .plus:                   y = pop + y
                case .pop:                    y = pop
                case .result:                 break execution
                case .loopIf(let loopIndex, let branchDistance):
                    let k    = stack[maxStackDepth + numCSEvariables + loopIndex]
                    let kMax = stack[maxStackDepth + numCSEvariables + loopIndex + 1]
                    if  k > kMax { pc += branchDistance }
                case .incrementAndLoop(let loopIndex, let branchDistance):
                    stack[maxStackDepth + numCSEvariables + loopIndex] = stack[maxStackDepth + numCSEvariables + loopIndex] + T(1)
                    pc -= branchDistance
                    continue // skip pc += 1
                }
                pc += 1
                if !(y is Double) && y.isNaN && sp != 0 {
                    return .nan
                }
            }
            return y
        }
    }


    // pre-allocate stack outside of call
    func run4<T: Computable>(x: T, on stack: UnsafeMutableBufferPointer<T>) -> T {
        let cseVariables  = UnsafeMutableBufferPointer(rebasing: stack[maxStackDepth...])
        let loopVariables = UnsafeMutableBufferPointer(rebasing: stack[(maxStackDepth + numCSEvariables)...])
        var sp = 0
        var y = T.nan // top of stack held separately

        var  pop: T       { sp -= 1; return stack[sp + 1] }
        func push(_ x: T) { sp += 1; stack[sp] = x }

        var pc = 0
        execution: while true {
            switch code[pc] {
            case .constant(let x):        push(y); y = T(x)
            case .x:                      push(y); y = x
            case .getVariable(let i):     push(y); y = cseVariables[i]
            case .getLoopVariable(let i): push(y); y = loopVariables[i]
            case .setVariable(let i):     cseVariables[i] = y
            case .setLoopVariable(let i): loopVariables[i] = y
            case .plus:                   y = pop + y
            case .pop:                    y = pop
            case .result:                 break execution
            case .loopIf(let loopIndex, let branchDistance):
                let k    = loopVariables[loopIndex]
                let kMax = loopVariables[loopIndex + 1]
                if  k > kMax { pc += branchDistance }
            case .incrementAndLoop(let loopIndex, let branchDistance):
                loopVariables[loopIndex] = loopVariables[loopIndex] + T(1)
                pc -= branchDistance
                continue // skip pc += 1
            }
            pc += 1
            if !(y is Double) && y.isNaN && sp != 0 {
                return .nan
            }
        }
        return y
    }

    // pre-allocate stack outside of call, single array without rebasing
    func run5<T: Computable>(x: T, on stack: UnsafeMutableBufferPointer<T>) -> T {
        var sp = 0
        var y = T.nan // top of stack held separately

        var  pop: T       { sp -= 1; return stack[sp + 1] }
        func push(_ x: T) { sp += 1; stack[sp] = x }

        var pc = 0
        execution: while true {
            switch code[pc] {
            case .constant(let x):        push(y); y = T(x)
            case .x:                      push(y); y = x
            case .getVariable(let i):     push(y); y = stack[i + maxStackDepth]
            case .getLoopVariable(let i): push(y); y = stack[i + maxStackDepth + numCSEvariables]
            case .setVariable(let i):     stack[i + maxStackDepth] = y
            case .setLoopVariable(let i): stack[i + maxStackDepth + numCSEvariables] = y
            case .plus:                   y = pop + y
            case .pop:                    y = pop
            case .result:                 break execution
            case .loopIf(let loopIndex, let branchDistance):
                let k    = stack[maxStackDepth + numCSEvariables + loopIndex]
                let kMax = stack[maxStackDepth + numCSEvariables + loopIndex + 1]
                if  k > kMax { pc += branchDistance }
            case .incrementAndLoop(let loopIndex, let branchDistance):
                stack[maxStackDepth + numCSEvariables + loopIndex] = stack[maxStackDepth + numCSEvariables + loopIndex] + T(1)
                pc -= branchDistance
                continue // skip pc += 1
            }
            pc += 1
            if !(y is Double) && y.isNaN && sp != 0 {
                return .nan
            }
        }
        return y
    }
}

class CompiledExpr {
    var program: Program = Program()

    func evaluate0(x: Double)             -> Double   { program.run0(x: x) }
    func evaluate1(x: Double)             -> Double   { program.run1(x: x) }
    func evaluate2(x: Double)             -> Double   { program.run2(x: x) }
    func evaluate3(x: Double)             -> Double   { program.run3(x: x) }
    func evaluate4(x: Double)             -> Double   { program.run4(x: x, on: program.stack.bindMemory(to: Double.self)) }
    func evaluate5(x: Double)             -> Double   { program.run5(x: x, on: program.stack.bindMemory(to: Double.self)) }

    func evaluateAsInterval0(x: Interval) -> Interval { program.run0(x: x) }
    func evaluateAsInterval1(x: Interval) -> Interval { program.run1(x: x) }
    func evaluateAsInterval2(x: Interval) -> Interval { program.run2(x: x) }
    func evaluateAsInterval3(x: Interval) -> Interval { program.run3(x: x) }
    func evaluateAsInterval4(x: Interval) -> Interval { program.run4(x: x, on: program.stack.bindMemory(to: Interval.self)) }
    func evaluateAsInterval5(x: Interval) -> Interval { program.run5(x: x, on: program.stack.bindMemory(to: Interval.self)) }
}

func time<T: Computable>(_ functionName: String, _ description: String, function: () -> T) {
    let iterations = 1000000
    let start = DispatchTime.now().uptimeNanoseconds
    var sum: T = T(0)
    for _ in 1 ... iterations {
        sum = sum + function()
        }
    sum = sum / T(Double(iterations))
    let end = DispatchTime.now().uptimeNanoseconds
    print("\(functionName) returns \(sum) in \(String(format: "%.2f", Double(end - start)/1.0e9))s: \(description)")
}

func doTimingComparisons() {
    let f = CompiledExpr()
    time("run0", "Local Arrays") { f.evaluate0(x: 100) }
    time("run1", "Local Arrays, pre-allocated, manually indexed")                                         { f.evaluate1(x: 100) }
    time("run2", "Combined Local Array, pre-allocated, manually indexed")                                 { f.evaluate2(x: 100) }
    time("run3", "Combined Local Array, pre-allocated, manually indexed, withUnsafeMutableBufferPointer") { f.evaluate3(x: 100) }
    time("run4", "Passing pre-allocated UnsafeMutableRawBufferPointer")                                   { f.evaluate4(x: 100) }
    time("run5", "Passing pre-allocated UnsafeMutableRawBufferPointer without rebasing")                  { f.evaluate5(x: 100) }

    time("run0", "Local Arrays")                                                                          { f.evaluateAsInterval0(x: Interval(100)) }
    time("run1", "Local Arrays pre-allocated, manually indexed")                                          { f.evaluateAsInterval1(x: Interval(100)) }
    time("run2", "Combined Local Array pre-allocated, manually indexed")                                  { f.evaluateAsInterval2(x: Interval(100)) }
    time("run3", "Combined Local Array pre-allocated, manually indexed, withUnsafeMutableBufferPointer")  { f.evaluateAsInterval3(x: Interval(100)) }
    time("run4", "Passing pre-allocated UnsafeMutableRawBufferPointer")                                   { f.evaluateAsInterval4(x: Interval(100)) }
    time("run5", "Passing pre-allocated UnsafeMutableRawBufferPointer without rebasing")                  { f.evaluateAsInterval5(x: Interval(100)) }
}

doTimingComparisons()

dabrahams · November 22, 2021, 4:28am

Hi @RonAvitzur,

Looking through your example, I found I could make run0 as fast as run1 by calling reserveCapacity on the stack array to make sure it had enough space.

But the thing that most strikes me about this is that your array is tiny; just 7 elements long. It's no surprise, then, that the overheads of array allocation are swamping other effects. Certainly reallocating the arrays is going to be very costly. It seems like you can afford to keep around an array for every type you're working on.

Is this really representative of your actual problem? What are the numbers going to actually be like:

How many different types (Double, Interval, Complex…)
How many CSE variables
How many loop variables
How deep is the stack going to get, really?

-Dave

RonAvitzur · November 22, 2021, 3:27pm

Hi @dabrahams,

Thank you for spending time on this.

Six: Double, Interval, Complex, ComplexInterval, [Double], and [Complex]. In the vectorized cases where the input values are [Double] or [Complex], the local array sized needed are multiplied by the length of the input array, but those cases have a qualitatively different profile dominated by Accelerate vDSP calls doing the calculations a vector at a time, so I did not include that in the sample code.

The short answer is I don't actually know, since the old code used alloca for this (when possible), it had not previously come up for detailed investigation. The numbers depend on the equations the user entered. In the vast majority of cases, the numbers are quite tiny. The number of loop variables is now many extended sums or products (Σ or Π notation), the CSE variables depend on how many identical subexpressions appear, the stack depth is the number of terms in the longest sum or product plus the depth of the Expr tree, which is usually shallow. All of which is complicated by function definitions, variable definitions, derivatives and matrix operations in the user's document, all of which are symbolically expanded in place to create the Expr used for generating the Program above. In this test case, https://twitter.com/RonAvitzur/status/1461023215724097536, the stack depth is 80 and that is an extreme case.

Yes, I think so.

dabrahams · November 22, 2021, 7:12pm

From your Twitter thread, it seems like you've gotten pretty deeply into it—that's a pretty impressive saga. I could try to help more, but I might need to get my hands on the whole codebase. A few other things came up given what I can see:

Is there a reason you want Program and CompiledExpr to remain classes?
Is there a reason they shouldn't be generics parameterized on your data type? Then you could just store the array in a property.
You might save some arithmetic ops by encoding the numCSEVariables offset into the loop variable indices.
Given that you are trying to micro-optimize this code, dropping down to unsafe operations may be inevitable, but you should maybe reckon with your tolerance for unsafety. Personally I'm fine turning off checks that protect me from my own bugs once I know my code is right, but some of your optimizations create a security risk if maxStackDepth is exceeded, which is up to the user. Is that acceptable?
Once you are sure of your arithmetic you can bypass overflow checking without turning off safety altogether by using modulus arithmetic (e.g. &+)
If maxStackDepth is a global constant, you might be better off storing it in a global rather than in instances; I haven't looked at the disassembly to see, but dynamic initialization logic could prevent it from being optimized.
I think there are some small errors in your stack logic in run2, and you can make the code a bit cleaner by pushing and popping at the end of the stack. Not that any of this is going to improve performance…

    // combine the 3 Array<T> to reduce construction/destruction overhead
    func run2<T: Computable>(x: T) -> T {
        let countVariables = numCSEvariables + numLoopVariables
        var stack = [T](repeating: T.nan, count: countVariables + maxStackDepth)
        var y = T.nan // top of stack held separately
        var sp = countVariables

        var  pop: T       { sp -= 1; return stack[sp] }
        func push(_ x: T) { stack[sp] = x; sp += 1 }

        var pc = 0
        execution: while true {
            switch code[pc] {
            case .constant(let x):        push(y); y = T(x)
            case .x:                      push(y); y = x
            case .getVariable(let i):     push(y); y = stack[i]
            case .getLoopVariable(let i): push(y); y = stack[i + numCSEvariables]
            case .setVariable(let i):     stack[i] = y
            case .setLoopVariable(let i): stack[i + numCSEvariables] = y
            case .plus:                   y = pop + y
            case .pop:                    y = pop
            case .result:                 break execution
            case .loopIf(let loopIndex, let branchDistance):
                let k    = stack[numCSEvariables + loopIndex]
                let kMax = stack[numCSEvariables + loopIndex + 1]
                if  k > kMax { pc += branchDistance }
            case .incrementAndLoop(let loopIndex, let branchDistance):
                stack[numCSEvariables + loopIndex] += 1.0
                pc -= branchDistance
                continue // skip pc += 1
            }
            pc += 1
            if !(y is Double) && y.isNaN && sp != 0 {
                return .nan
            }
        }
        return y
    }

HTH,
Dave

RonAvitzur · November 22, 2021, 7:42pm

Program should be a value type and it was a struct last week, but profiling showed that ARC overhead due to multiple Array and class properties was significant. https://twitter.com/RonAvitzur/status/1459183687883386886?s=20

@Ben_Cohen said "We really need “indirect struct” to stop people having to use classes for that particular optimization." https://twitter.com/AirspeedSwift/status/1459282504381124609

I haven't looked into making CompiledExpr a struct. I don't think there's any fundamental need for it to be a reference type; that was one of countless decisions driven by starting with a mechanical translation from the C++ version.

To avoid the need for duplicate CompiledExprs and Programs when the algorithms using them need to make both Scalar and Interval or Scalar and Vectorized evaluations of the same equations.

I've resigned myself to that, but have hope that https://github.com/apple/swift-evolution/blob/main/proposals/0322-temporary-buffers.md will help in keeping the unsafe operations to a more limited scope and strongly typed.

That is a difficult question. My goal at the moment is bug-for-bug compatibility with the shipping version. If the risk is no larger than in the C++ version, I can accept it for now.

It is computed from the code [Instruction] array, per-CompiledExpr, as part of the transformation from the Expr tree to the linear Program. The function computing the stack needed is very simple. The bigger risk are bugs in program.run().

I had forgotten about that, thanks!

dabrahams · November 22, 2021, 9:48pm

~~Maybe~~Definitely I don't understand the context well enough, but from what I can see here it looks like that would be an acceptable cost; after all, you're already talking about keeping around an array for each datatype anyhow(?)

dabrahams · November 22, 2021, 9:54pm

Also, I once did some work with optimized type-erased array buffers… but I'm not sure it's worth your time to dig into it. It was used successfully by GitHub - borglab/SwiftFusion, FWIW..

RonAvitzur · November 22, 2021, 9:58pm

Possibly an acceptable performance cost; a trivial one if I copied the Program Instructions rather than re-built it. It would be a slightly annoying tiny bit of source code clutter to keep different CompiledExpr properties solely for the purpose of evaluating them on different types, but that could be a reasonable tradeoff for more robust type safety and less of of Unsafe APIs. That's something for me to think about more in context where this is used.

I started this translation as a mere feasibility study and a vehicle to learn Swift & SwiftUI with little concern for performance or shipping. Now, though it has reached the point where feasibility is a given and I'm starting to think about what's needed to ship a new release to customers, which is an entirely different set of constraints.

dabrahams · November 27, 2021, 6:06pm

I have to admit I’m a bit mystified as to why this would still be an issue now that we’re using the guaranteed calling convention. It seems to me that we shouldn’t normally be paying ARC costs except where an argument needs to escape