Swift Performance

I've written a small chat system using SwiftNIO (it has it all: web clients, SwiftUI clients, a server, chatbots); you can find it over here: GitHub - NozeIO/swift-nio-irc-server: An Internet Relay Chat (IRC) server for SwiftNIO. I'll demo it in this video: https://youtu.be/FPGf652O90Y?t=1254. Thought it might be interesting :woman_shrugging:

P.S.: With chat systems you often need to address scalability rather than raw performance (how many connections you can handle, not how fast you handle a single one). I think Go's default model (green threads) is a little worse here, while you will hardly run into issues with something like SwiftNIO (or Netty on Java, for that matter).
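
For anyone who hasn't used SwiftNIO: the reason it scales well is that a handful of event-loop threads multiplex thousands of connections, instead of dedicating a thread (or stack) to each one. A minimal sketch of that model (a bare echo server rather than the IRC server above; the port and handler here are just placeholders):

import NIO

// Handles one connection; runs on whichever event loop owns that channel.
final class EchoHandler: ChannelInboundHandler {
    typealias InboundIn = ByteBuffer

    func channelRead(context: ChannelHandlerContext, data: NIOAny) {
        context.write(data, promise: nil)   // echo the bytes straight back
    }
    func channelReadComplete(context: ChannelHandlerContext) {
        context.flush()
    }
}

// A few event-loop threads serve all connections; no thread per client.
let group = MultiThreadedEventLoopGroup(numberOfThreads: System.coreCount)
let bootstrap = ServerBootstrap(group: group)
    .serverChannelOption(ChannelOptions.backlog, value: 256)
    .childChannelInitializer { channel in
        channel.pipeline.addHandler(EchoHandler())
    }

let serverChannel = try bootstrap.bind(host: "127.0.0.1", port: 9999).wait()
try serverChannel.closeFuture.wait()

Per-connection state is just the channel and its handlers, so memory per connection stays small and the same handful of threads scales to a very large number of clients.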

Just curious: why is your system doing so much allocation? Heavy use of classes, perhaps?

2 Likes

Can we please try to put a cap on these abstract discussions of performance?

"I heard Go was fast ... what about Java? or Rust? or Swift?"

These discussions are near meaningless without a more grounded discussion of the system, its requirements, its constraints, and its bottlenecks.

10 Likes

Heavy use of classes would kill us ;-). Imagine a system where there is no heap allocation per unit of compute, where we engage all available cores on a host system for a whiff of 200ms, and where we asynchronously recognize tens to hundreds of millions of possible inputs in a dynamic "parse". We are doing probably 4 orders of magnitude more compute per second than you are imagining.

I’m not really imagining anything. Why so much allocation then?

1 Like

It's worth remembering Swift's implementation of ARC is still far from optimal. Swift 5.3's optimizer significantly reduces the number of ARC calls in optimized code; we've seen up to 2x improvements in hot parts of SwiftUI and Combine without code changes. There will continue to be ARC optimizer improvements as we propagate OSSA SIL through the optimizer pipeline as well. (However, the main benefit of ARC over other forms of GC will always be lower-overhead interop with non-GC-managed resources, such as the large volumes of C, C++, and growing amount of value-oriented Swift code that implements the lower level parts of the OS, rather than the performance of the heap management itself.)

17 Likes

Joe, happy to look at Instruments profiles with you sometime.

Think of the SwiftUI/Combine case you mentioned as an important but also lock-bound, high-message-count-overhead case, with not much actual compute. I was the chief architect of JavaFX, SwiftUI's great-granddad, and am aware of the case you are dealing with.

Think of Brighten AI, our system, as a wide-open concurrent compute system with all cores engaged for 200ms; unlike SwiftUI/Combine it doesn't have a bunch of threading constraints (SwiftUI and main), so compute can go wide open. What we found building it is that ARC tends to insidiously get in the way, whether because of variable-sized structs, collection use, ownership through stack pops, etc. To be clear, we also built new storage subsystems to manage our system's streaming inputs. SQL/CoreData/Realm/etc. all had the same stomp-on-the-allocator-and-locks performance issues that would have killed us.

We use actors for coarse-grained concurrency, and the ultra-high-perf stuff for our streaming recognition lives inside one of the actors.

1 Like

I'm late to the party, I know..

But all of this makes me think what a missed opportunity it was for Swift not to have used its own generics subsystem to support tagging objects, instead of baking it directly into the compiler as keywords.

If we had something like:

weak<X>
shared<X>
singleton<X>

etc., it would be much better for the community to come up with its own solutions to different scenarios and grow in an organic fashion.

Now we need to wait until things become keywords, and there's no way to fight against a compiler that will hardcode release/retain as a compiler pipeline step.

Imagine if, instead of having to wait for a 'strong' keyword, we could have designed a strong<> holder that would use simple alloc and free directives.
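
As a rough illustration (just a sketch, not a proposal): a "weak" holder is already expressible as a plain library generic today, which hints at what strong/shared/singleton holders could have looked like if the retain/release hooks had been exposed to libraries:

// A library-level weak holder, written with nothing but today's generics.
struct Weak<T: AnyObject> {
    weak var value: T?
    init(_ value: T) { self.value = value }
}

// Usage: a list of observers that does not keep them alive.
final class Observer { func fire() { print("tick") } }
var observers: [Weak<Observer>] = []
let o = Observer()
observers.append(Weak(o))
observers.forEach { $0.value?.fire() }

The strong/shared variants are exactly the part that can't be written as a library today, because only the compiler is allowed to emit (or elide) the retain/release calls.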

Swift is one of the most beautiful and cool languages to develop in out there; it's a pity that most of its design flaws are there because it had to be compatible with Objective-C.

1 Like

I'm not an expert on this, but I found that the Nim language recently started to use ARC, and they mention that this change has improved Nim's performance. The difference between the Swift and Nim ARC implementations is that Nim's doesn't use atomic reference counting. They explain all of this here. Maybe Swift, with a concurrency model, could make optimizations under the hood, perform ARC nonatomically like Nim, and improve language performance.

The last two posts are along the lines we are advocating: one way to do it is to replace ARC's global memory ideas and let engineering teams use API constructs to tell ARC how to act. We get a lot of bang out of generational allocation: an infant generation that looks a lot like stack allocation but is actually a mini heap. Importantly, because there is no lock or atomic involved, you get ultra-fast allocation, and you just graduate infants to another heap as you need to. And as Fabio says, let's just make it possible to get ARC out of the way, or to change its characteristics. Using Unsafe stuff works, but there are more sugary ways to do it.

There are some limited optimizations to turn refcounts into nonatomic operations when we know the object being referenced doesn't escape to other threads, but it's difficult to do much of that today because Swift as a whole does not have a strong notion of threads. However, recent Apple Silicon hardware, like the A14 CPU in the latest iPhones and iPads, now makes uncontended atomic operations almost as cheap as nonatomic accesses, making that kind of compile-time optimization less necessary.

7 Likes

Your link takes me to a page with no code on it, AFAICT.

Dave, the code on the left is an actor DSL, and the code on the right is a Swift knowledge-graph DSL.
At Swift | Brighten AI.

Joe, yeah, we have to add API to let engineering teams talk about threads. The actor model is a way to do that. Once in the actor, you have to be able to say "actor-local allocation". Then you can use an actor-private heap, with no lock.

What we do for fine-grained concurrency is make a set of "operations" that each have their own infant-generation heaps inside them, and when we "go wide" to many cores, we hand one to each core. Each core's operation parties in its own container, free of locks, and graduates results from that operation back onto the main ARC heap. This is a massive win, because there's tons of compute on each core, and for 1,000,000 inputs, and 10x that in temporaries, there are just a few outputs. All of that infant-generation garbage is high-perf, no-atomics, no-locks allocation.
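
To make the idea concrete, here is a minimal sketch of such a per-operation "infant generation", assuming each arena is owned by exactly one core at a time (this is an illustration, not the actual Brighten AI allocator; the names are made up):

// A single-owner bump allocator: no locks, no atomics, O(1) reclamation.
final class InfantArena {
    private let base: UnsafeMutableRawPointer
    private let capacity: Int
    private var offset = 0

    init(capacity: Int) {
        self.capacity = capacity
        self.base = UnsafeMutableRawPointer.allocate(
            byteCount: capacity,
            alignment: MemoryLayout<Int>.alignment)
    }

    deinit { base.deallocate() }

    // Bump-pointer allocation: advance an offset, nothing else.
    func allocate(byteCount: Int,
                  alignment: Int = MemoryLayout<Int>.alignment) -> UnsafeMutableRawPointer? {
        let aligned = (offset + alignment - 1) & ~(alignment - 1)
        guard aligned + byteCount <= capacity else { return nil }   // arena full
        offset = aligned + byteCount
        return base + aligned
    }

    // Copy ("graduate") any results out first, then drop all temporaries at once.
    func reset() { offset = 0 }
}

// Usage sketch: one arena per core-wide operation, reset after each burst.
let arena = InfantArena(capacity: 1 << 20)
if let raw = arena.allocate(byteCount: MemoryLayout<Double>.stride * 4) {
    let scratch = raw.bindMemory(to: Double.self, capacity: 4)
    scratch.initialize(repeating: 0, count: 4)
    // ... per-core work using scratch ...
    scratch.deinitialize(count: 4)
}
arena.reset()

Because freeing is just resetting an offset, the millions of temporaries cost nothing to reclaim; only the few results copied out of the arena ever touch the shared, atomically refcounted heap.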

Just to be clear, I'm not advocating against ref-counting per se.

It's a very reasonable GC when you need deterministic guarantees. Thanks for making us aware of how the compiler can make some optimizations by avoiding atomics in certain circumstances.

But let's imagine that Swift started with two small changes: struct destructors, and using something like

struct shared<T> {
    var t: T

    init(_ t: T) {
        self.t = t
        _refcounted_alloc(t)    // hypothetical intrinsic
    }
    deinit {                    // struct deinits don't exist today; that's the point
        _refcounted_dealloc(t)  // hypothetical intrinsic
    }
}

To do the exact same thing the compiler is doing now, it would be a matter of baking in recognition of '_refcounted_alloc' as a special function and making the compiler treat it as a no-op in the cases where it cleverly diagnoses that a ref-count is actually not needed.

I'm not a compiler expert, but I wonder what launching the language with struct destructors, the ability to move value types, and some sort of macro keyword for retain/release would have done for Swift.

What's in the 'ownership manifesto' could probably have been achieved by now, without taking off the training wheels for the majority of people who don't need to customize the language by that much.

We all want more performance from constructs generally built into the language, and Fabio, I see your point. I would add that we also still need to avoid touching the main heap, with its need for atomics and locks.

There's just no reason to destroy single-threaded compute perf that way.

And as a person who has studied performance a lot (I helped design Shark, back in the day), I can tell you, Swift is insidious in the way it inserts ARC cost into call chains even when you dodge class use, etc. Part of why we are talking about it here is that I don't think most compiler people have time to study ultra-high-performance compute projects, and so they don't see the effects of ARC the way we do.
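
As one small illustration of what "dodging class use" doesn't buy you (a toy example, not code from the systems discussed here): a struct that holds a String or an Array still owns refcounted copy-on-write storage, so plain struct copies in a hot loop can emit retain/release traffic unless the optimizer can prove it away:

// No class in sight, yet ARC can still show up: String's storage is a
// refcounted buffer, and copying the struct copies (retains) that buffer.
struct Token {
    var text: String
    var score: Double
}

func best(_ tokens: [Token]) -> Token? {
    var winner: Token? = nil
    for t in tokens {                       // element copy may retain t.text's buffer
        if winner == nil || t.score > winner!.score {
            winner = t                      // another struct copy, another retain
        }
    }
    return winner                           // matching releases as copies go away
}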

As I noted, when I visit that page (in Chrome) I see no code whatsoever. OK, I tried a different browser. Safari shows something.

Please do not presume; I can vouch for the HPC bona-fides of at least one key member of the Swift team.

Ah, I see. We use a PDF viewer plugin; good to know, apologies!

We use a really simple actor DSL; the idea is to generate a static actor so it's safe to call with async messaging, and to have a queue associated with it. We get a lot of simplicity out of that. Brighten AI is tens of megabytes of compiled code, so we need submodules to merrily do their work and talk to each other without having to worry about threading. This is the large-scale, coarse-grained concurrency stuff, and we would add that we also have a pipelining version of it (for things like decoding, with many steps and a producer/consumer motif). We would also advocate that actors need, sooner or later, to be able to say "I need these actors started before me". And I would also advocate allowing only Codables to be pumped through, for V1.
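
For readers who haven't seen this pattern, here is a guess at the general shape of such a DSL (not the actual Brighten AI code; all names are hypothetical): the core of a queue-backed "actor" can be as small as a private serial DispatchQueue that every message is funneled through.

import Dispatch

// Each "actor" owns a private serial queue; callers post messages and never
// have to think about threads.
class QueueActor {
    private let queue: DispatchQueue
    init(label: String) { self.queue = DispatchQueue(label: label) }

    func tell(_ message: @escaping () -> Void) {
        queue.async(execute: message)       // fire-and-forget async message
    }
}

// A submodule built on top of it: decoding runs on its own queue.
final class Decoder: QueueActor {
    init() { super.init(label: "decoder") }

    func decode(_ input: String, reply: @escaping (String) -> Void) {
        tell { reply(input.uppercased()) }  // stand-in for the real decode pipeline
    }
}

// The caller just posts a message and carries on.
let decoder = Decoder()
decoder.decode("hello") { print($0) }       // prints "HELLO" on the decoder's queue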