Swift Performance

jamesgh · September 12, 2019, 11:41am

Hi all,

I want to know if there are any plans to improve Swift's performance. Benchmark after benchmark shows... pretty dismal... metrics for Swift when compared to other systems languages.

For example, in GitHub - frol/completely-unscientific-benchmarks: Naive performance comparison of a few programming languages (JavaScript, Kotlin, Rust, Swift, Nim, Python, Go, Haskell, D, C++, Java, C#, Object Pascal, Ada, Lua, Ruby) - the naive Swift implementation takes 12x longer than the C implementation. For reference, the JavaScript implementation takes 6.8x longer. I added a non-refcounted version to that repo that brings the time down to 4.5x.

Now you could make an argument that benchmarks aren't necessarily representative of real-world use-cases. And that's absolutely true. But you can also look at this benchmark:

This is a network driver implemented in various languages. Swift does about as well as JavaScript on this test too. And this IS actually a somewhat worrying one, because servers deal with packets, and if the idea is to use Swift on the server, it should be able to perform well at the things servers do.

Anyway, something I've been wondering about for a while.

jawbroken · September 12, 2019, 1:15pm

I have no real opinion on your general point, except that you should consider filing bugs and contributing benchmarks where you see room for improvement, but I doubt two random benchmarks are a good way to judge this. e.g. at a very brief glance it seems like the C++ version is building a treap of 32-bit integers and the Swift version is building a treap of 64-bit integers. And I would expect the classes to be final where performance was a concern. I don't know if those things would make much of a difference either way, but it doesn't give the immediate impression that a lot of care was taken to make the implementations comparable.
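To make the two suggestions above concrete, here is a minimal sketch of a node type declared `final` with 32-bit keys. The type and property names are hypothetical and are not taken from the benchmark repository; whether either change matters for the measured time is not something this sketch shows.

```swift
// Hypothetical treap node; names are illustrative, not from the benchmark repo.
// `final` tells the compiler the class cannot be subclassed, so method and
// property accesses do not need dynamic dispatch.
final class TreapNode {
    let key: Int32          // 32-bit key, as the C++ version appears to use
    let priority: UInt32
    var left: TreapNode?
    var right: TreapNode?

    init(key: Int32, priority: UInt32) {
        self.key = key
        self.priority = priority
    }
}

let root = TreapNode(key: 5, priority: 42)
root.left = TreapNode(key: 3, priority: 7)

// Int is pointer-sized, so on a 64-bit platform it is twice as wide as Int32.
print(MemoryLayout<Int32>.size, MemoryLayout<Int>.size)
```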

jamesgh · September 12, 2019, 1:24pm

These two are just two examples, sure. If it was easy to make Swift code run quickly I'm sure these benchmarks would show better scores for Swift. If it's a slog just to get it to run slightly faster than JavaScript (which it seems to be at this point in time) I don't know what you expect from people. You can sit there and go "oh this benchmark is suboptimal because your code is wrong", but the fact of the matter is they've written the same benchmark in relatively obscure languages like Nim and Crystal and come out much faster than Swift.

scanon · September 12, 2019, 1:26pm

A cursory glance at Instruments shows that the majority of the time in the Swift implementations (both "naive" and "unmanaged") is spent in retain/release traffic, which suggests that there's an easy 2-3x performance improvement to be had with very minor changes (and also that further improvements to how retain/release are handled at the language level would net big wins here; cc @Michael_Gottesman).

jamesgh · September 12, 2019, 1:33pm

That was what I noticed when running that benchmark to begin with too -- I tried to get the "unmanaged" version to avoid the retain/releases, but maybe I didn't do it all the way right. Of course the benchmark also releases everything at the end.

zack2012 · September 12, 2019, 2:08pm

Swift performance is bad; you must be very careful about using any Swift feature when you care about performance. Swift's family of unsafe APIs is also cumbersome.

Joe_Groff · September 12, 2019, 4:25pm

The tests were done with Swift 5.0 as well, and there are already substantial improvements to RC optimization in 5.1 (and even more in top-of-tree). Using Unmanaged is "cheating" in a sense because it's unsafe, and it should not be necessary in normal circumstances to get adequate performance from idiomatic code—if it's safe to use unmanaged references, then the compiler should know that and avoid reference counting in the first place. The current Swift implementation is still nowhere near representative of the performance that should be possible, since we've been primarily in the "make it work" phase of development, and are only now starting to get into the "make it fast" work.
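For readers who have not used it, this is roughly what opting out of automatic reference counting with the standard library's `Unmanaged` type looks like. It is only a sketch with a hypothetical `Node` class, not the code from the benchmark's "unmanaged" variant, and it shows why the approach is unsafe: the retain and release have to be balanced by hand.

```swift
// Hypothetical class used only for illustration.
final class Node {
    var value: Int
    init(value: Int) { self.value = value }
}

// passRetained takes a +1 retain that we are now responsible for balancing.
let handle = Unmanaged.passRetained(Node(value: 1))

// takeUnretainedValue borrows the object without consuming our +1.
handle.takeUnretainedValue().value += 41
let observed = handle.takeUnretainedValue().value

// Balance the earlier retain manually. Forgetting this leaks the object,
// and using `handle` after this point is undefined behaviour.
handle.release()

print(observed)
```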

ccashman · September 12, 2019, 4:30pm

and are only now starting to get into the "make it fast" work

Is there a document or charter that outlines the specifics of that work?

scanon · September 12, 2019, 4:31pm

Also the particular design choices here are very non-optimal for reference counting, and also not very idiomatic for Swift. An array-backed implementation, for example, would avoid the RC issues, and allow it to conform to Swift's protocols more easily. This is very much fighting against the currents of the language.
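As a rough illustration of what "array-backed" could mean here, the sketch below stores nodes as structs in a single array and links them by integer index rather than by class reference, so walking the tree involves no retain/release. It is a plain unbalanced binary search tree, not the benchmark's treap, and every name in it is hypothetical.

```swift
// Array-backed binary search tree: nodes are value types in one array,
// and children are referred to by index instead of by object reference.
struct IndexedTree {
    private struct Node {
        var key: Int
        var left: Int = -1    // index into `nodes`; -1 means "no child"
        var right: Int = -1
    }

    private var nodes: [Node] = []
    private var root = -1

    var count: Int { nodes.count }

    mutating func insert(_ key: Int) {
        let newIndex = nodes.count
        if root == -1 {
            nodes.append(Node(key: key))
            root = newIndex
            return
        }
        var current = root
        while true {
            if key == nodes[current].key { return }   // ignore duplicates
            if key < nodes[current].key {
                if nodes[current].left == -1 {
                    nodes.append(Node(key: key))
                    nodes[current].left = newIndex
                    return
                }
                current = nodes[current].left
            } else {
                if nodes[current].right == -1 {
                    nodes.append(Node(key: key))
                    nodes[current].right = newIndex
                    return
                }
                current = nodes[current].right
            }
        }
    }

    func contains(_ key: Int) -> Bool {
        var current = root
        while current != -1 {
            let node = nodes[current]
            if key == node.key { return true }
            current = key < node.key ? node.left : node.right
        }
        return false
    }
}

var tree = IndexedTree()
for key in [5, 3, 8, 3] { tree.insert(key) }
print(tree.contains(3), tree.contains(4), tree.count)
```

Because the whole structure is a value type wrapping an `Array`, it frees all of its nodes in one deallocation instead of one `deinit` per node.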

Joe_Groff · September 12, 2019, 4:37pm

That may be the case, but classes are probably still the easiest thing to reach for in the language today for this sort of thing, and there's a lot of slack in our handling of classes that could be tightened up. Value types define away a lot of the issues with class-heavy code, and will likely lead to the highest performance in the fullness of time, but they don't today because of their own implementation issues, and we still lack key bits of language ergonomics for solving basic design problems when using value types, such as building heterogeneous collections of them, or representing object graphs and relations between values in a systematic way.

jamesgh · September 12, 2019, 4:41pm

Sure, there are lots of ways you can make this test faster by doing things differently than how it's designed and end up with the same result, but the point is to make the implementations operate in the same manner. You could probably do an arena-allocated version that skips a lot of RC/allocs in the first place - and I think there are some optimized tests for other languages that do that (edit: i think this is probably what you meant, actually). But switching from the tree to an array would probably step outside the spirit of this specific benchmark.

I think it's important not to get too hung up on this particular benchmark, since the algorithm itself is intentionally weird. A better example of a real-world implementation with poor performance to look at might be the network driver.

Joe_Groff · September 12, 2019, 4:43pm

I don't know that we have a centralized document, but some of the bigger pieces of work include improvements to the SIL IR to make better optimization possible. Ownership SSA will provide a stronger model for representing relationships between values, approximately like the explicit ownership model in Rust, which should allow for a substantial reduction in the amount of ARC traffic and implicit copying of value types. Opaque value SIL will unify the representation of all types in SIL, allowing generic code to get the same level of optimization without relying on specialization. The runtime itself is also getting better optimized to reduce the cost of the runtime calls that remain.

jamesgh · September 12, 2019, 4:46pm

Thanks for the reply, I'm glad to hear that you guys have a roadmap for that.

Michael_Ilseman · September 12, 2019, 4:50pm

CC @johannesweiss, who has been involved in SwiftNIO. It's being used in production on the server, where performance translates pretty directly to cost savings, by companies that care very much about cost savings. They've actually been finding Swift performance to be a selling point. They did have to work through some performance issues at first, which has led to a lot of the recent improvements (and they are still uncovering others).

jamesgh · September 12, 2019, 4:54pm

I've been using SwiftNIO myself for a media server and am really happy with it, kudos to you guys for doing a good job on it. I really like Swift a lot on the server and I want to see it succeed - I would say it's a big productivity improvement coming from C++, so I apologize if I'm coming off harsh in this thread.

scanon · September 12, 2019, 4:56pm

FWIW, I don't think you're being overly harsh at all.

- There are a bunch of real performance issues to be fixed.
- This particular benchmark happens to exercise something of a worst-case for Swift.
- There are better ways it could be written, but we need to do a better job of guiding developers to those ways. We should make the preferred way also easier to implement.

AlexanderM · September 12, 2019, 5:18pm

I generally try to avoid "jumping in" on threads like this, where it looks like it's OP vs the world, but this excerpt broke my self-restraint haha

There's absolutely no chance in hell that the operations work even remotely the same. If you (generic you) implement the same program the same way in both Haskell and Java, then you very clearly either don't know Haskell, or you don't know Java. I think this mentality stems from people whose entire programming career has consisted of exposure to exclusively procedural/OO languages. That leads to the sort of mentality where the first question about a new language is "how do I spell a for loop?", as if all programming languages are roughly identical, short of the spelling of some language keywords, and naming of the standard library's types.

Take that mentality to a functional language like Haskell, or a logic programming language like Prolog, and your whole world will shatter instantly. What if I told you there was no for loop?

Additionally, even if you did have all the implementations have roughly the same design, you've still produced completely useless results. The fact of the matter is, even if you write Haskell like Java, no other Haskell programmer does. Your single data point on Haskell performance is not just bad because it's singular, but it's bad because it's not even representative of real Haskell programs.

Here's an example to consider:

- In Haskell, tree data structures would seem incredibly efficient, because of Haskell's topologically sortable memory graph, which makes garbage collection incredibly fast.
  - However, pointers to parents might be hard to implement (impossible?).
- In C, your full manual control over memory could allow you to write very optimized allocators that maximize locality and minimize cache misses.
  - If you get clever with unions, you could probably even remove a lot of references/pointer chasing, and only insert them into the data structure when they become necessary.
  - Naturally, this comes at a great deal of memory management headache, and a lot of complexity.
  - It's easy to get it wrong.
- In Java, tree data structures are incredibly easy to allocate, but too many mutations would produce a lot of garbage.
  - GCs are only performant when they have a lot of spare memory to play with (on the order of 4-6x what the app really needs).
  - When memory becomes tight (from a large data set, or from being on a restricted device like a smartphone), GC thrashing starts becoming a problem, and you may be forced to do things like implementing object pools and such to recycle nodes rather than allocating/GCing them.
- In Swift, trees are pretty easy to construct.
  - The only common complication is that the parentNode reference needs to be weak.
  - Swift has definite deinitialization guarantees that ARC must honour.
    - Pro: RAII. You can use classes to model other resources (threads, sockets, file handles, etc.), and have ARC automate the management of those other resources.
    - Pro: your app never keeps around unusable objects (garbage).
    - Con: ARC can cause delays when releasing large object graphs, because one deinit causes another deinit to run, which causes another... and so on. To uphold deterministic deinit guarantees, all of these deinits need to happen synchronously, blocking further progress of the program.
  - If deinit pauses start happening in your program, then you might need to sacrifice some of the perks by implementing a deinitialization pool (a term I just came up with, IDK what the common term for this is, but to be fair, it's quite rarely needed). Your deinitialization pool would have strong references to the objects in question, and would titrate their deinitializations on a background thread.
    - Because garbage lives longer than it otherwise would have, you use more memory than strictly necessary, and give up your deterministic deinit guarantee. A familiar situation to anyone who has dealt with a GC.
    - In a sense, ARC (minimal memory use, great at many small allocs/deallocs, bad at large clean ups) is the opposite of GC (really high memory requirements, bad at many small allocs/deallocs, great at large clean ups).
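The Swift points above can be sketched in a few lines: a tree node whose parent reference is weak, and a `deinit` that records when each node is torn down. The `TreeNode` type and the `deinitLog` array are hypothetical names made up for this illustration.

```swift
// Records the order in which nodes are deinitialized.
var deinitLog: [String] = []

final class TreeNode {
    let name: String
    // Weak, so a child does not keep its parent alive. A strong reference
    // here would form a cycle and none of the nodes would ever be freed.
    weak var parent: TreeNode?
    var children: [TreeNode] = []

    init(name: String) { self.name = name }

    func add(_ child: TreeNode) {
        child.parent = self
        children.append(child)
    }

    deinit { deinitLog.append(name) }
}

do {
    let root = TreeNode(name: "root")
    let child = TreeNode(name: "child")
    root.add(child)
    child.add(TreeNode(name: "leaf"))
}
// By this point the whole tree has been torn down synchronously: releasing
// the root released its children, which released theirs, and so on.
print(deinitLog)
```

With a tree of three nodes the cascade is instantaneous; the pause described above only becomes noticeable when the graph being released is very large.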

As you can see, there's an incredible amount of variability in something as simple as a tree, deeply influenced by the very large set of design choices each language has made. Naturally, this leads you to pick trees more frequently in some languages than in others. In Swift we reverse arrays all the time. In Haskell, you would almost always want to avoid reversing a list. Prescribing a "one size fits all" unified implementation design across a broad set of languages like this is complete nonsense.

The correct thing to do here, IMO, is to say "look programmers, here's the problem, here's the acceptance criteria, go solve it the best way you can", leaving each programmer the flexibility to think in the mindset that their language prefers.

jamesgh · September 12, 2019, 5:20pm

That's a fair criticism, thanks for the thoughtful reply.

AlexanderM · September 12, 2019, 5:26pm

Who is this replying to? You didn't reply to anyone or mention anyone's name

Lantua · September 12, 2019, 5:27pm

Likely the last person; this forum omits the reply-to indicator if you reply to the last person. Not sure if that's a bug or by design.