Future of Swift-NIO in light of Concurrency Roadmap

Your above example isn’t quite right… where the called “someOperation” method runs depends on where it is defined: is it on a class, on an actor by itself, etc.? In other words, does it already have some preference as to where it must run, or does it not care (in which case it could indeed run on the same executor, perhaps…)

But, yes, specifying where a specific task shall run is certainly one of the use cases, though we can’t really comment on “when” and “what” will enable this before we have a design and more hands-on experience with it. “Custom executors” can mean a few things, and it might take a phased approach to get it all.

Thank you for explicitly mentioning this use case!

1 Like

Yes, that’s true. For this example, assume someThing.someOperation is non-isolated, i.e. not on an actor or a GAIT (global-actor-isolated type), nor annotated with a global actor.

Thanks for your thoughts, @ktoso! I've updated the example so that what I'm trying to say is perhaps a little clearer. Essentially, a way to cascade the concurrency context to non-isolated callees might be handy.
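
For concreteness, here's a minimal sketch of the kind of call being discussed, assuming someThing.someOperation is a non-isolated async function (SomeThing, someOperation and caller are placeholder names, not real API):

    struct SomeThing {
        // Non-isolated async function: under the current semantics it runs
        // on the default (cooperative) executor, not on the caller's actor.
        func someOperation() async -> Int {
            42
        }
    }

    @MainActor
    func caller(_ someThing: SomeThing) async {
        // Hops off the main actor to the cooperative pool for someOperation,
        // then hops back to the main actor to continue.
        let value = await someThing.someOperation()
        print(value)
    }

Being able to mark someOperation (or the call site) so that it inherits the caller's executor would avoid those two hops.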

I don't think you'd want to use custom executors to try to make a task always execute on some specific actor's executor. It's pretty core to the concurrency design that we understand statically where a specific async function is supposed to be executing, and having actor execution "overhang" into code that isn't supposed to run there is actively undesirable.

Now, if what you want is an async function that inherits the executor context of its caller, that's a very useful feature that we've been thinking about. However, that's a statically-expressible relationship, and custom executors are not the right way to achieve it.

3 Likes

That's definitely something I'd be happy to see appear. I've seen the underscored @_inheritActorContext attribute, and I think a flavour of this will be a great thing to see included.

However, another scenario in which I find myself wanting a little control over executor inheritance is when working with AsyncSequence chains.

Official advice I've seen recommends avoiding unnecessary context switches. In fact, I think there's a WWDC video on structured concurrency performance that recommends 'avoiding context switches' and 'batching work together'.

This is pretty difficult to do when working with an AsyncSequence. With an AsyncSequence the cadence of element production is quite naturally fine-grained. Each element goes through a user-composed stack of async calls in order to arrive at its final destination in its final state.

As an example, say you have a GAIT-based (e.g. MainActor) source that produces a touch event on every frame, followed by some transformation (say a map) that is then consumed on the same GAIT to update some view:

GAITElementSource -> Map -> GAITConsumer

Or more concretely:

MainActorTouchEventSource -> Map -> MainActorTouchesView

AFAICT in this example, for every frame, the pipeline will receive a touch event on the main thread, hop to the cooperative thread pool to perform the map transformation, and then hop back to the main thread.
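
As a rough illustration of that pipeline (TouchEvent and consumeTouches are hypothetical names, and the hopping behaviour is as described above rather than guaranteed by any API contract):

    struct TouchEvent: Sendable {
        var x: Double
        var y: Double
    }

    @MainActor
    func consumeTouches(_ events: AsyncStream<TouchEvent>) async {
        // The map closure is non-isolated, so each element hops from the
        // main actor to the cooperative pool for the transformation and
        // then back to the main actor for the body of the loop.
        for await scaled in events.map({ TouchEvent(x: $0.x * 2, y: $0.y * 2) }) {
            print("update view with", scaled) // runs on the main actor
        }
    }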

This seems like a less than optimal situation that could add up to a lot of context switching each frame. Statically annotating the map transformation func with @_inheritActorContext probably wouldn't work, as the map transformation would then be less useful in a situation where you don't want actor context inheritance.

If you take the analogy that AsyncSequence chains are like pipes between actors, the situation right now is as if those pipes can only exist on the cooperative thread pool. Even if that pipe starts and ends on the same (potentially non-default executor) actor.

Perhaps this kind of use case isn't what is envisaged for Swift concurrency, but it seems a shame, as the syntax for consuming asynchronous sequences lends itself very well to this kind of thing.

Thank you @lukasa and @johannesweiss for taking the time to write down all of these details! Sorry for the slow reply… still digesting everything. Below are some of my thoughts.

Swift should own the event loop

This is going to sound obvious but… it occurred to me that I/O is fundamental to any and all programs. In a Swift Concurrency world, that means an event loop is fundamental to any and all programs since an event loop is needed to do non-blocking I/O.

Even after custom executors have been implemented, the default executor will not have an event loop and thus no optimal way to do I/O. This seems like a huge deficiency given that I/O is fundamental. Perhaps the default executor isn’t worthy of its “default” status if a custom executor is more suitable for I/O.

Instead, I think the default executor should have an event loop. Indeed, I think an event loop should be a hard requirement for any and all executors!

Swift Event Loop System Design

This is all very high level and lacking details but I have some thoughts on the design of a supposed event loop system for Swift.

I am thinking that Swift should offer a public, cross-platform, async/await API to read from and write to file descriptors. There are various ways to get a file descriptor, but what is essential for Swift Concurrency is to read/write in a non-blocking manner.
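
Purely as illustration, the rough shape such an API might take could be something like the following; the names are hypothetical and nothing like this exists in the standard library today:

    // Hypothetical sketch only; not a real or proposed API surface.
    public protocol AsyncReadableWritableDescriptor {
        // Suspends the calling task until data is available, rather than
        // blocking a thread on the underlying file descriptor.
        func read(upTo maxBytes: Int) async throws -> [UInt8]

        // Suspends until the descriptor is writable, then writes the bytes,
        // returning how many were written.
        func write(_ bytes: [UInt8]) async throws -> Int
    }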

It’s not clear whether or not these need to be owned by Swift, but… there could also be public, OS-specific APIs for less essential event sources (e.g. memory-pressure events), since those also need to plug into the event loop.

The OS-specific implementations of these APIs would use an event loop API (kqueue, epoll, etc.) that is optimal for that OS.

Finally, the default executor and custom executors would use these OS-specific components to run the event loop.


Let me know your thoughts and whether I’m totally misunderstanding something. :sweat_smile:

The language can’t own the event loop if the event loop is to be written in the language. :relaxed:

Today, executors on Darwin platforms are backed by a concurrency primitive vended by the operating system.

Hmm, maybe I’m misunderstanding, but wouldn’t the C event loop API(s) be used in a similar way in the Standard Library (which the default executor is a part of)?

Language → Standard Library → OS library

The default executor does have an event loop. What it doesn't have is an EventLoop, nor any way of registering custom I/O with it.

Generally speaking this is a good thing. NIO follows the same pattern: you can't hand NIO an arbitrary file descriptor and expect us to wait for its I/O. We'll only allow you to do it in cases where the file descriptor is of a kind we understand, and where the event loop allows us to abstract it.

This is a reasonable feature proposal. What's a bit more problematic is the reliance on file descriptors as the core spelling.

File descriptors are the standard I/O primitive on Linux and most other Unixes. On Apple's platforms, file descriptors lack the same generality: for example, Network.framework does not expose file descriptors to its users and may not have a file descriptor at all. Many things on Apple's platforms expose FDs, but far from all of them.

Windows is even further from this space. Encoding "file descriptor" as the spelling becomes very tricky in that model. This is part of why NIO has (with mixed success) tried to avoid having users spell "file descriptor" when what they mean is "connection".

The further you follow this line of thinking, the closer you get to reinventing NIO: you have a bundle of different "event loop" types, each of which is capable of different things on different platforms, but all of which expose some non-overlapping set of supported I/O primitives.

The problem with reinventing NIO here is that you also reinvent its core limitation, which is that "thou shalt not block the event loop". If you do CPU-heavy work on NIO's event loops, you introduce problems in your application: tail latencies spike, connections drop, throughput suffers, health checks fail, etc. The same is not true for the Swift concurrency pool: you're allowed to do CPU-heavy work there so long as you are making continual forward progress. It's not clear to me that we want to change that trade-off!

7 Likes

A late answer: I'd like to use SwiftNIO and do I/O without having to perform thread hops, i.e. run async functions as part of the NIO EventLoop thread.

Alternatively, I'd like hooks to implement something like NIO using async/await, but without the thread hops. Presumably that means exposing I/O primitives to Concurrency (following the idea of FlyingFox, but with proper concurrency support).

Not sure what makes more sense :slight_smile:

What if—after Swift Evolution review—we moved parts of NIO into Swift itself to support non-blocking I/O out of the box? (Again, since I/O is fundamental to any and all programs.)

Isn’t this what Task.yield() and TaskPriority are designed to help with? For example, the health check task would have the highest priority and CPU-heavy tasks would call Task.yield() periodically. That would allow health check tasks to be interspersed with portions of a CPU-heavy task.

It seems NIO does not currently have such a mechanism, so all CPU-heavy tasks need to be moved to a different thread pool. However, the Swift Concurrency design is able to accommodate this type of application in a single thread pool.
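
For example, a CPU-heavy task might be structured roughly like this (a sketch only; crunch and the yield cadence are made up for illustration):

    // Periodically yields so that higher-priority tasks (e.g. a health
    // check) scheduled on the same pool can be interleaved.
    func crunch(_ chunks: [[Int]]) async -> Int {
        var total = 0
        for (index, chunk) in chunks.enumerated() {
            total += chunk.reduce(0, +) // CPU-bound work
            if index % 100 == 0 {
                await Task.yield() // give other tasks a chance to run
            }
        }
        return total
    }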

1 Like

I'm inclined to say that moving NIO into Swift is a non-goal. Keeping the size of Swift small is something the community has been pretty vociferous about, and that I also support. Swift packages are easy to get hold of, and have a number of benefits from being kept separate from the implementation of the language.

More broadly, I think this also isn't SwiftNIO's responsibility. Ideally a more general-purpose package solves this issue, most likely Foundation.

The problem is that you can't know what higher priority work you have until you issue your network I/O, and you can't know whether you have network I/O to do until you issue your I/O system calls. This forces the task scheduler to periodically check whether new I/O is possible, even if there are outstanding tasks that could progress. It makes life a bit harder from the perspective of modelling priority.

4 Likes

Just the I/O scheduling hooks, not NIO at large.

4 Likes

That remains a pretty substantial chunk of functionality. It's also functionality that currently lacks any stable API, which was pretty intentional on our part. I'm not saying we shouldn't do it, but I am saying that defining its shape isn't entirely trivial.

I would add that when you get this plumbing in, it's a great time to add a per-Task allocator. You can get even more locality of reference and performance by letting stack-like temp allocations come from a local allocator. It's what high-performance Java and C++ systems have done for years, and NIO seems like a great place to do this kind of thing.

The idea is that objects that live long enough or cross a boundary into global allocation space are Sendable, and can then be copied into that global alloc space.

Admittedly, I always communicate the same general idea on these threads, but it's based on real experience. We are currently getting the best performance out of Swift by using classes, not structs; not using Tasks (we use GCD-based actors instead); carefully dodging as many ref-counting cases as we can; making our temp objects immortal (so we can manage reuse, because we can dodge a bunch of ref-counting costs that way); and then overriding the ref-counting machinery to not call into the atomic paths for our marked classes.

And we use NIO underneath your gRPC stuff (thanks for doing all of that!).

You could imagine that if there were a Task-specific allocator that wasn't atomic, then with the work you're proposing here for NIO, plus the object marking coming with the borrow work, we would get closer to the performance that high-performance C++ and Java systems have.

I think you are the right people to put some heft behind these ideas, because you are performance-savvy. And I would say that, just as locality of reference matters, not making things global (and therefore atomic) except when you need to also matters.

3 Likes

Swift concurrency already uses a per-task allocator :slight_smile:

2 Likes

We use a task-local allocator for stack allocations, but I think John is talking about doing general object allocation out of a task-local allocator. I'd be pretty skeptical about heap fragmentation for that.

Agreed. However, I do think there's some argument to be made for some support in Swift for zone allocators. This will probably be gated behind lifetime management because such an object would naturally be forbidden from escaping its scope, but NIO does incur some substantial pain in allocating/freeing ChannelHandlers, which are almost always per-Channel objects. Being able to free them in a single slab would be a nice win.

However, to @johnburkey's point, while NIO is a great place to investigate these strategies, we definitely need to co-ordinate with the Swift team to work out how such a feature would be expressed.

3 Likes

I've found that in many systems there is object chaff created during the production of results, and that chaff can be stored in mini collections which are themselves chaff, sometimes even whole trees, but all of it ends up being tossed away when the Task generating the results is finished. Sometimes a single result is picked from many. Sometimes the results are used for rendering and then discarded, but they are ephemeral and only seen by one thread. An entire "Task" can be an ephemeral task whose inputs and outputs are longer-lasting, but the temp chaff inside it can be large.

In graphics (I worked on Quartz at Apple in the late '90s and early 2000s) it's things like masks of shapes to be rendered (the outlines that were Bézier curves get converted to bitmaps). Back in the day we would generate several megabytes of these for complicated visual frames, but we didn't need them when the frame was over, so they were all ephemeral, temp. And we literally couldn't afford to use malloc etc., because back then the PowerPC was at least 2.5x slower than Intel for what we were doing, so we needed a win, so we did this. To win, our software needed to be better, faster, less compute per unit of work.

So for us it was those temp generations of bitmaps and things. In NIO it might be those ChannelHandlers. In AI it's things like statistical probabilities. In speech recognition (we have a Swift-based voice AI) it's things like phonetic probabilities from something akin to n-grams.

These are all temp/chaff things that are generated, and you want them in a high-level language (other ASRs use C++). And they have limited lifetimes that are either already obvious or can be made obvious with API: something between the API we have in Swift for extended lifetimes and stack scoping. For temporary objects there are no heap fragments (I have patents inside Apple for temp allocation in graphical 2D heaps that avoid heap fragments; I'm aware of the issue), because you sweep the temp heap at the end of the temporary allocation period. All objects are gone. You can do this non-atomically because it's local to a Task. And in debug modes you can mark the crap out of the objects so they don't get reused, assert, etc.

With this we get almost-free allocation for ephemeral objects: free because you do a pointer move to allocate, and sweep the heap with one pointer move at the end. That's of course MUCH cheaper than calling malloc, which is famously slow at Apple and used to be owned by Bertrand, with a smile and a wink. The argument made then, and that I make now, is that you shouldn't use malloc in perf-important code, instead using the zone-alloc stuff in the OS, stack allocation, etc. In Quartz we have a frame allocator that works like I'm talking about here.
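
Concretely, the scheme is just a bump/arena allocator; a minimal sketch (BumpArena is a made-up name for illustration, not a proposed API):

    final class BumpArena {
        private let base: UnsafeMutableRawPointer
        private let capacity: Int
        private var offset = 0

        init(capacity: Int) {
            self.capacity = capacity
            self.base = .allocate(byteCount: capacity, alignment: 16)
        }

        deinit { base.deallocate() }

        // Allocation is a single pointer bump: no locks, no atomics,
        // because the arena is only ever touched by one task.
        func allocate(byteCount: Int, alignment: Int = 16) -> UnsafeMutableRawPointer? {
            let aligned = (offset + alignment - 1) & ~(alignment - 1)
            guard aligned + byteCount <= capacity else { return nil }
            offset = aligned + byteCount
            return base + aligned
        }

        // "Sweeping" the heap is one pointer move: everything allocated
        // since the last reset is discarded at once.
        func reset() { offset = 0 }
    }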

If the borrow stuff doesn't work well enough, and you can see the local objects on your side of the wall, you can use a non-atomic retain/release, since you don't need atomics for non-global objects that are only seen in one thread. If you have API to designate these local regions, you can annotate the code like in the other cases and just switch the implementation to emit non-atomic ARC calls instead of atomic ones.

You could imagine then a "local-only array" that also doesn't do atomic retain/release and isn't Sendable. The main reason I was talking to Chris L about Sendable early on was this. Not being Sendable means you can do all of this "don't use the heavyweight OS primitives" stuff.

The data supporting this is there if you look. It is the future. We are moving to more and more cores, and atomics and malloc will just keep getting more and more expensive; the people I know on the hardware side say the unified-memory thing (which makes atomics briefly cheaper) is a temporary fix for M1/M2, not the future of the industry.

Love you all, and love Swift.

7 Likes