Swift Performance

There is a larger optimization there around structural arc opt. It also has interesting possibilities around improving arc perf when traversing generic graphs (I.e. once you understand you are traversing the graph and the graphs structure you can just retain the node you start at)

3 Likes

I realize that is a little cryptic. For a long time, there has a been a (seemingly) weird perf difference in between traversing a graph recursively vs imperatively (with the recursive case having no ARC traffic IIRC in the loop and the imperative one ... lets not get into it...). The key thing is the optimizer needs to analyze the loop and understand that one is walking such a structure and the pattern of (lets for instance) imply that if one part of the graph is alive, all of it must be.

The easiest example to see of this is traversing a linked list imperatively vs recursively.

2 Likes

I just want to elaborate a bit more on this.

Specifically, for a long time there has been this hypothesis that in most programs, the vast majority of objects never leave the thread in which they are allocated. This hypothesis results in optimization approaches that attempt to determine the thread that an object is in, and if the object is still in that thread when an ARC operation occurs, avoid the atomics since we haven't left.

In modern UI apps this is not as true since much of the allocation of data, processing of data, loading data of the network happen on side threads in order to not block the main thread and maintain responsiveness. Once the work is completed, the data is then passed off to the main thread where it remains until program exit or deallocation. This forces upon us a slightly different dictum: /most/ objects are "eventually" thread local.

This suggests that what is important in terms of optimization is /not/ being able to recognize if an object is still in the thread that it was allocated in, but in recognizing that steady state when the object has been passed to its final destination thread (generally the main UI thread).

The first optimization Dave is talking about is around the ability of COW to force an object to become thread local again dynamically. COW doesn't care about where the object came frame, what thread it was allocated on, if the object was returned from a dynamically dispatched method, etc. Once the value has been uniqued, we can use non-atomic operations. In terms of the optimizer rewriting retains releases this would as Dave mentioned necessarily require us to use escape analysis since we can only maintain this property until the object escapes. There are tricks that one could play around using mutating getters/etc to make this uniqueness criterion happen in more dynamic places.

When we tried such an optimization many years ago we found that it helped performance inside Dictionary (array was helped as well since this pass also eliminated unneeded uniqueness checks). Sadly, I don't think it was ever upstreamed in the repo by the person who was messing with it. That being said, I am a fan of that optimization. Patches welcome = p.

Another 2nd optimization that Dave has mentioned here before in a previous thread, but I would like to re-iterate is around re-localizing just the reference count itself to a thread. Specifically, for a thread to guarantee that an object stays alive in a thread safe way, it only needs to perform a single atomic ref count and decrement the object when thread thead is torn down. So, clearly having an atomic ref count > 1 is not needed for the object to stay alive in the thread. So why do we always do these atomic ref counts? To maintain the book keeping. So one could imagine a world where we have some representation of a thread local ref count that manages that single atomic ref count that binds the object's lifetime to the thread. To work well we would need the concurrency model.

Fun thread. = ).

21 Likes

This hasn’t been my experience at all (except of course for Swift 1 & 2). I recently had a bake-off with a very veteran programmer who had written naive C, hand-tuned assembly, and hand-tuned MMX and SSE code to run through a buffer of gigabytes of UTF-8 data, and my four-line swift code was something like 10x faster than their fastest (and much, much longer) implementation.

That’s just one data point, of course, but another is that VERY early on when I was switching from Obj-C to Swift I found swift was 3x slower...then it was about the same speed, then it was 3x faster. Now nobody is going to accuse Obj-C of being a speed daemon but I was really impressed because, like, I was going to be using Obj-C if Swift hadn’t come along.

Note that while I always welcome more speed, it’s definitely not #1 for what I want in a language. Processors get so much faster every year, but my ability to write and understand code doesn’t get that much better unless the language itself improves, and that’s where Swift shines.

But I’m glad you started this conversation — at the very least it lets us get another data point on speeds of languages. I think the first step would be to rewrite the Swift code to use Swift fastest practices and see how it comes out.

-Wil

6 Likes

Ok, I got this to be 4x faster (almost exactly) on my machine by using an UnsafeMutableBufferPointer. This may be considered cheating, but it’s what I’d do in real life to eliminate retain/release/safe memory access.

Not that ¼ of the remaining time is from the random number generator!

Here’s my code.

import Foundation

typealias NodeIndex = Int


// MARK: -Node
fileprivate struct Node {
    // MARK: properties
    var x: Int
    var y: Int = Int.random(in: 0...Int.max)
    var left: NodeIndex?
    var right: NodeIndex?
}


// MARK: -Tree
fileprivate class Tree {
    // MARK: methods
    func contains(_ value: Int) -> Bool {
        let splitted = split(root: mRoot, value: value)
        mRoot = merge(lower: splitted.lower, equal: splitted.equal, greater: splitted.greater)
        return splitted.equal != nil
    }
    
    func insert(_ value: Int) {
        var splitted = split(root: mRoot, value: value)
        if splitted.equal == nil {
            splitted.equal = nextNodeIndex
            nodeHolder[nextNodeIndex] = Node(x: value)
            nextNodeIndex += 1
        }
        mRoot = merge(lower: splitted.lower, equal: splitted.equal, greater: splitted.greater)
    }
    
    func erase(_ value: Int) {
        let splited = split(root: mRoot, value: value)
        mRoot = merge(lower: splited.lower, greater: splited.greater)
    }

    func merge(lower: NodeIndex?, equal: NodeIndex?, greater: NodeIndex?) -> NodeIndex? {
        return merge(lower: merge(lower: lower, greater: equal), greater: greater)
    }

    func merge(lower: NodeIndex?, greater: NodeIndex?) -> NodeIndex? {
        if let lower = lower, let greater = greater {
            if nodeHolder[lower]!.y < nodeHolder[greater]!.y {
                nodeHolder[lower]!.right = merge(lower: nodeHolder[lower]!.right, greater: greater)
                return lower
            } else {
                nodeHolder[greater]!.left = merge(lower: lower, greater: nodeHolder[greater]!.left)
                return greater
            }
        } else if lower == nil {
            return greater
        } else {
            return lower
        }

    }

    func split(root: NodeIndex?, value: Int) -> (lower: NodeIndex?, equal: NodeIndex?, greater: NodeIndex?) {
        let (lower, equalGreater) = splitBinary(root: root, value: value)
        let (equal, greater) = splitBinary(root: equalGreater, value: value + 1)
        return (lower: lower, equal: equal, greater: greater)
    }

    func splitBinary(root: NodeIndex?, value: Int) -> (NodeIndex?, NodeIndex?) {
        guard let root = root else { return (nil, nil) }

        if nodeHolder[root]!.x < value {
            let splitPair = splitBinary(root: nodeHolder[root]!.right, value: value)
            nodeHolder[root]!.right = splitPair.0
            return (root, splitPair.1)
        } else {
            let splitPair = splitBinary(root: nodeHolder[root]!.left, value: value)
            nodeHolder[root]!.left = splitPair.1
            return (splitPair.0, root)
        }
    }


    // MARK: private properties
    private var mRoot: NodeIndex?
    private var nodeHolder = UnsafeMutableBufferPointer<Node?>.allocate(capacity: 1_000_000)
    private var nextNodeIndex: NodeIndex = 0
}


// MARK: -The Test
func runNaive() -> Int
{
    let tree = Tree()
    var current = 5
    var result = 0

    for i in 1..<1_000_000 {
        current = (current * 57 + 43) % 10007
        switch i % 3 {
            case 0: tree.insert(current)
            case 1: tree.erase(current)
            case 2: if tree.contains(current) { result += 1 }
            default: break
        }
    }
    return result
}

let startTime = Date().timeIntervalSince1970
print("Naive result \(runNaive())")
print("total time: \(Date().timeIntervalSince1970 - startTime)s")
4 Likes

Definitely bug in my book.

interesting, thanks for posting this.

I have a question:

What are the resources for writing Swift in a performant way?
Sure, I watched a few talks I randomly found etc. But is there anything "official"?

From my standpoint - I've been learning to code for 2 years now, almost exclusively Swift - it's kind of hard to find good docs on topics like these. The "Swift book" is great, but I don't feel it mentions things like performance enough.

Reason why I'm writing this is that I started reading Rust's book, and since performance is one of the main focus points of Rust, it's mentioned quite often - and in a (relatively) beginner-friendly way (i.e. there's something like "The default Integer type in Rust is i32 because 32-bit integers are the fastest in most situations" and things like that)

What I'm trying to say is: I want to learn how to write performant Swift. But I feel like if I want to be able to do that, I need to dive into the implementation of Swift itself...

1 Like

This is an interesting - and old - challenge. As programming languages & libraries have evolved over the decades - with ever-increasing layers of abstraction - it becomes harder and harder to intuit what any given piece of code is going to do when it really comes down to it. Performance is particularly challenging because it’s so often given the short stick re. documentation and the like, but the same goes for notable other topics like concurrency safety & reentrancy.

Inevitably what this means for the practicioner is that you need to rely on investigation, experience, and tools. Mystery tends to promote mysticism, with all the associated dogma, cargo-culting, and rituals - things like that rather disturbing quote from the Rust documentation. Avoid that sort of thing. Use tools (e.g. Instruments) to see what’s going on in practice, and work from there. To accelerate that beyond your own personal experience, search out case studies and focus on techniques, moreso than assertions, rules, and “You won’t believe these 10 simple steps to optimise your code!” clickbait.

4 Likes

Thanks for your reply! :sunglasses:

I feel that what you write is true - and that's my problem.
I'm fine with doing my own reasearch - as I HAVE to with almost anything else because the documentation is often short or even missing.

Yes, there are many great resources for learning Swift & related stuff - but sometimes there is nothing, or hard to find. I don't even know where to start...
Except for mentioned Instruments of course - but that's another thing: I don't believe Instruments is available on Linux, or is it?

That's why I mentioned Rust specifically - they have tons of official docs, that are in my opinion easier to read than Swift's.

P.S.:
I don't think there's necessarily anything wrong with the provided i32 example - because they were explaining why they went with 32 bit-singed Int for their "base" Int to someone just starting out with Rust in a beginner friendly way.

The same argument could be used about Swift's "If you're unsure on which to use, just pick struct" IMO.
Sure - that's not a good advice to give to someone experienced, but it gives someone just starting out a peace of mind in a way of "I don't have to worry about this for now, I'll learn more when I need to"

Not sure if it was already mentioned in this thread, but I highly recommend this WWDC 2016 gem: Understanding Swift Performance - WWDC16 - Videos - Apple Developer

Please take your time and watch this session; it has a lot of useful tips and also goes in depth about how the language operates with data structures. It may give you ideas how to start writing performant code in Swift.

3 Likes

I think one of the best ways to reason about performance is through metrics.

It would be awesome if there were a very clear and recommended way to gather performance metrics so that people can easily test their code to identify what really is "faster" for their program (there may be and I just haven't seen it).

I'm sure there probably are 3rd party libraries or other ways one could go about doing that themself, but take python for example which has the timeit module. It would be great if swift had something official like that which made it really easy to test code for performance.

5 Likes

You'll find plenty of instructions online for kernel-centric tools, e.g. gprof, oprofile, perf, etc. They're ugly and difficult to use and typically provide very limited analysis depth, but they're reliable and fairly straightforward once you memorise their arcane incantations. They can get you started, but you'll very quickly hit their limitations, or get frustrated with their difficult user experience.

There used to be Zoom, a true Linux-native profiler, written by some of the folks that worked on Shark back in the day, but unfortunately it's discontinued (and while free to download now, the source isn't available). If you can still get it to work on your Linux install of choice, it's likely your best option as a beginner.

There's also some degree of support for Linux from Intel's VTune. VTune's the most powerful profiling tool out there these days, on any platform, but it has a reputation for being ugly, difficult to use, and unintuitive. That was true last time I used it, several years back, but maybe it's improved since then. It's worth a look, at least.

If nothing else, there's always the Wikipedia list of profiling tools. I haven't tried everything there - maybe there's some unrecognised gems.

But they don't need to present dubious justifications - they need only say "i32 is the default". Seeding people's minds with misleading generalisations about performance has no upside and the very real risk of establishing idiotic dogma.

2 Likes

One option is the time measurement support in XCTest. On macOS it has some convenient Xcode integrations, and it is supported at least in basic form on Linux (I haven't used it there, so I cannot attest to its maturity on that platform).

Keep in mind, however, that timeit-style tools are not the best way to profile your code. They are almost always used as the performance equivalent of printf-debugging. Use a real profiler if you're doing active analysis & optimisation - it takes much less effort and you'll get much more useful results.

If, on the other hand, your goal is merely to track performance over time in a simple fashion, that's exactly what XCTest's measurement capabilities are about. (and when it detects a regression, or you otherwise decide you need to improve things, you can run your cases under a profiler to get more insight)

2 Likes

I mentioned timeit, but I completely forgot about the python profile/cProfile module as well.

Thanks a lot! Definitely will check it out! :+1:t2::sunglasses:

Thanks! :+1:t2::sunglasses:

I'm using a Mac, so I'm more than fine with Xcode. Still thanks for your suggestions - maybe they'll come in handy some day later!

Wasn't that the purpose of GitHub - apple/swift-metrics: Metrics API for Swift ?

The swift metrics package provides the means to collect, record metrics for a variety of purposes. The package itself doesn't measure anything, but provides an apparatus to collect whatever you want to measure.

2 Likes

Is immortality a one way street?

Imagine so. Once you make an object immortal, then its stored ref count is no longer representative of its "true ref count", so ARC is unable to ever manage it correctly ever again. Is that correct?