Why stackless (async/await) for Swift?

At Adobe we've been investigating concurrency paradigms that will support the future of our products, and stackful coroutines (or fibers) are starting to look like an increasingly attractive option, in contrast with async/await systems like the one Swift has adopted. In particular, async/await seems to raise the “What Color is Your Function?” problem, which limits interoperability and code reuse, and to emphasize marking possible reentrancy that can be ruled out by construction using value semantics. I keep thinking maybe I've missed some fatal flaw with the stackful approach, but Google isn't turning up much in the way of useful critiques, so I thought I'd ask my friends in the Swift community, who've recently gone through an analysis of approaches, for their thoughts. Was the stackful approach considered and rejected for cause? I'd be grateful for any insight you can offer.

Thanks in advance,
Dave

17 Likes

I suspect you've already seen this paper by Gor Nishanov about some of the issues with fibers:

http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p1364r0.pdf

Now, many of the overheads he cites can be mitigated by better control over the runtime environment, such as user-mode scheduling to avoid kernel overhead in context switches, using relocatable stacks to avoid the performance penalties of spaghetti stacks while not having to reserve large slabs of address space, and so on. In my mind, the primary thing that pushes Swift and many other environments to a "stackless" coroutine model is that they can't achieve that level of control over the runtime environment without giving up low-overhead access to the existing C runtime environment and OS kernels around them. You can't relocate stack pointers and still be interoperable with C.

Regardless, Swift's async tasks are still about as close to stackful as you can be without using the C stack—an async task shares a context structure, with stack discipline, among all active calls in the task, so you don't end up with a linked list of promise objects like you would with async/await in C# or JavaScript. We do use a "spaghetti stack" approach to growing the async context when it runs out of room, but we have a lot of power to mitigate that—the async call ABI allows callees to demand a particular async frame size from callers up front, and so when the call graph is known, the necessary total context size for a task can be computed when the task is spawned. Furthermore, our async code only needs to use the context to store values that need to live across a potential suspension; the straight-line bits of code between awaits can use the regular C stack. This puts less pressure on the async context than Go, which uses one stack for all code. Another benefit of this implementation model is that code that suspends only in the slow path is essentially as fast as C in the fast path, as @David_Smith described here:

A de novo Swift environment could perhaps implement async functions and regular functions the same way, using a common runtime implementation that allows for growable stacks and userspace-coordinated scheduling, in which case async and await could just become effect markers to indicate operations that may suspend the current task. Such an environment would however need to accept either higher-overhead C interop, or be able to recompile the C world using an ABI that was compatible with the hypothetical Swift's runtime environment. That was part of Go's original plan, I believe, but they ended up rewriting their universe in Go from scratch instead.

52 Likes

Joe, thanks so much for this very complete answer; there's a lot to dig into here… I'll follow up when it's digested (and no, I hadn't seen that WG21 paper).

Briefly, for anyone else reading along, I just found a response to that paper: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0866r0.pdf

more to read…

3 Likes

Joe, this post is amazing. Thank you.

Somebody who knows, please correct me, but:

It seems to me that Actors are the coroutine-shaped thing in the Swift concurrency model, that there’s a lurking almost-isomorphism between actors and stackful coroutines, and that one could probably even implement an actor Coroutine in existing Swift that makes this possible:

let coroutine = Coroutine.create {
   …
   let bar = await yield(foo)
   …
   let zog = await yield(baz)
   …
}

let foo = await coroutine.next()
…
let baz = await coroutine.next(bar)
…etc…
2 Likes

My naive ideal scenario for concurrency is that every pure function, no matter how small, has the ability to be asynchronous, and the compiler magically determines whether that overhead is justified and uses a synchronous version if it isn’t.

I’m pretty sure that would require solving the halting problem, but it’d be nice.

1 Like

To put it another way, I’d love to see asynchronous code become the rule rather than the exception in Swift, and let the compiler sort out overhead.

My favorite aspects of Swift have a similar experience: you can use efficient patterns like copy-on-write values and lazy method composition pretty much by default. That’s spectacular, and it makes the language much nicer to work with.

2 Likes

Hi again Joe,

Having read both the paper you cited and the response, I don't think a solid case has been made against userland fibers.

Sure, but non-relocating allocations with guard pages, as suggested by the response, seem like a perfectly suitable way to address that problem. Am I missing something?

Regardless, Swift's async tasks are still about as close to stackful as you can be without using the C stack—an async task shares a context structure, with stack discipline, among all active calls in the task, so you don't end up with a linked list of promise objects like you would with async/await in C# or JavaScript. We do use a "spaghetti stack" approach to growing the async context when it runs out of room, but we have a lot of power to mitigate that—the async call ABI allows callees to demand a particular async frame size from callers up front, and so when the call graph is known, the necessary total context size for a task can be computed when the task is spawned. Furthermore, our async code only needs to use the context to store values that need to live across a potential suspension; the straight-line bits of code between awaits can use the regular C stack.

Those are interesting optimizations. I guess I don't have a clear sense of how often they apply, though. The first one seems like it probably applies only in environments without resilience boundaries, and if the trend according to @David_Smith holds, doesn't more and more code that might be considered “straight line” start moving its local variable allocation to this async context?

A de novo Swift environment could perhaps implement async functions and regular functions the same way,

Hah, shades of Stackless Python! But in effect what you're saying is reinforcing my suspicion that (whatever the implementation details) the stackless programming model, where there are no red/blue functions, is more appropriate than the stackful one.

…in which case async and await could just become effect markers to indicate operations that may suspend the current task.

…which I suspect is unworthy of annotation in a language with the ability to limit shared state… in any system with pre-emptive multitasking, any call “may suspend the current ‘task’.” :wink:

1 Like

If you have enough address space, maybe. I get the impression most production systems have shied away from wanting to commit that much address space, but I honestly don't know firsthand why this isn't common. (At least in Apple land, there are still supported notionally-64-bit devices that only implement 30-something bits of virtual address space in practice, so such an approach would probably still not be viable on those devices.) Another place existing OS runtimes tend to intrude on such schemes is that, especially on Darwin and Windows, there are a lot of assumptions that the stack you're using is the same one that the kernel allocated for what it thinks is your current thread, so even if you allocate contiguous slabs, and context-switch among them in userspace running plain old C-style stack code, you'll run into heisenbugs because the kernel really wants you to be running on what it thinks is the right stack for your thread.

The fundamental limitation I see to us being able to exactly precompute context size would be in cases of unbounded recursion. Resilience doesn't have to be a barrier to the optimization, because anywhere we have an opaque reference to an async function, the context size is part of that reference along with the pointer to the entry point, so we could still compute the total allocation size as a sum of what the resilient callee needs plus the needs of the caller. The potential allocation cost is the only major overhead of using the async context over the C stack; spilling local state to the context shouldn't be much more expensive than spilling it to the stack across a normal function call once the space is there, and since the caller is responsible for allocating the space for you, you should be able to stash your own local state fairly cheaply.

Having a unified implementation gives you the option to have the debate, at least. I know that some people complain that it isn't readily apparent where context switches happen in Go code, though I don't know how prevalent that opinion is. And existing practice can interfere with platonic ideals here too; if you're writing Kit code for an Apple platform, you need to know when your main-thread-bound code might give up control of the main thread, but it's fine to be preempted because other threads aren't allowed to touch the UI anyway. There are also degrees to which the colors are different; in Bob's original post, the comparison was between regular functions and explicit callback-based APIs, whereas I would say the difference in Swift is more like having blue and azure functions; you don't have to turn your code inside-out to make it async, you need to add a few keywords.

5 Likes

the typical set of trade-offs associated with various thread implementations is:

(Edited to include iOS memory limitation)

  1. no multi CPU support out of the box for lightweight threads
  2. real threads don't scale (but that's not your concern with fibers)
  3. virtual address space issue for stacks (but that's not a problem on 64 bit systems)
  4. wired memory space is still an issue for stacks
  5. cache unfriendliness on context switches
  6. stack switching is easy to do in a library and C/C++ friendly. stack copying is a no-go for C/C++ interop, leads to more expensive stack switching, and needs compiler support.
  7. Edited: iOS VM limitation of 5GB (unless overridden on iOS15+ on supported devices, see com.apple.developer.kernel.increased-memory-limit for details)

let's only talk about switchable stacks implementations here.

let's say you allocate 2GB of VM space for stacks. questions arise: what is your stack size? do you manually control it upon fiber creation? can your function recurse enough to cause a stack overflow? remember that every time you call into the OS it uses an unspecified amount of stack which is hard to predict or guarantee, so better safe than sorry.

let's say you allocate 100K of VM space per stack and your functions won't overflow it - that gives you a hard limit of 20K stacks on a 32-bit platform. let's focus on 64-bit platforms only, so this aspect is not a problem: 2GB is nothing there. but you have to wire at least one page of RAM per fiber, probably more. page size was 4K in the past; now i believe it is 16K (on A9). let's optimistically assume that you only wire 16KB per fiber. how much wired memory can your app use for stacks, leaving some space for everything else? let's say 1GB for stacks (and perhaps 1GB for everything else). at 16K per fiber that's a 60K fiber limit.

all in all, if you only need a hundred, or 1K, or even 10K lightweight threads - fibers are totally fine. scaling to 100K is problematic. that's from someone who's known a thing or three about threads since the ThreadLib and Thread Manager days on System 7 :wink:

having said all that - have you considered going the opposite way, to promise-based solutions? no compiler support is needed and most of the above issues just do not exist. or is that a no-go due to C/C++ interop requirements?

1 Like

if you have time go through this thread to see what other people are doing.

there's also a response to the response.

If that's really true, it might be a stake in the heart of any fiber-like scheme.

Acknowledging what you've said about precomputing stack space, I do still wonder about this:

But now that I read his thread I see why maybe you didn't answer.
I was thinking David was describing a situation like this: I have a graph search algorithm parameterized by a visitor, à la BGL. I might want to support a visitor doing some async thing, so I make my algorithm async. It's not “actually asynchronous” unless you end up passing it a visitor that happens to suspend.

I assume this sort of thing is going to happen…

I know that some people complain that it isn't readily apparent where context switches happen in Go code,

Sure, but it can't be the fact that the CPU goes off to do something else that they care about, or they'd object to the other processes in the system, which cause context switches at arbitrary places anyway. Seems to me this has to be about reentrant access on shared data that they wrongly thought they were accessing exclusively.

Existing practice can interfere with platonic ideals here too; if you're writing Kit code for an Apple platform, you need to know when your main-thread-bound code might give up control of the main thread

Presumably because sharing isn't limited in Kit code?

in Bob's original post, the comparison was between regular functions and explicit callback-based APIs, whereas I would say the difference in Swift is more like having blue and azure functions; you don't have to turn your code inside-out to make it async, you need to add a few keywords.

It's not the scale of the change that I'm worried about, it's the prospect of wanting two versions of every function, identical except for those few keywords. Presumably if a Kit code programmer needs to know where their code might give up control of the thread, they also need to be able to do stuff without giving up that control…

Thanks; it's nice to see that the authors sorted out their differences. Ah, committee meetings in Kona, where everybody's agreeable :hibiscus::call_me_hand:t4::tropical_drink:. I remember those.

According to the paper I cited the 1M-thread-of-execution skynet benchmark runs fine with fibers. See page 7, FWIW.

ah, microbenchmarks ;) i do not see the boost fiber based implementation in there, nor do i see a reference to the hardware the test was run on. just to give you an example: "In later versions of iOS, A7- and A8-based systems expose 16-kilobyte pages to the 64-bit userspace backed by 4-kilobyte physical pages, while A9 systems expose 16-kilobyte pages backed by 16-kilobyte physical pages." here the relevant parameter is either 4K or 16K depending upon the hardware.

the closest to C i see in there is this, from "skynet.c":

/* 
 * Because of the way libmill schedules coroutines without
 * the following yield() more of the coroutines which start
 * additional coroutines will run before the ones which 
 * do not (size == 1). This would result in a lot of active 
 * coroutines each with a stack (default 256K although it 
 * can be changed to 16K without modifying libmill) and 
 * many mapped pages. Even with the yield you may find that 
 * you need to increase vm.max_map_count e.g.
 * 
 *     sudo sysctl -w vm.max_map_count=2000000
 */

obviously not something you'd want your users to do before they run your app.

microbenchmarks' results shall be taken with a grain of salt. if you are careful not to call into the OS, etc., you can construct a sample that uses a minimal amount of stack space, say 4K on a platform that uses 4K physical pages. 4K * 1M = 4GB, so if you have that amount of RAM for the app, that particular microbenchmark will run "ok".

calling newThread(userDefinedCallback, 32K) (let alone 4K) would be a minefield on iOS/macOS... don't you think? i mean, how can you be sure that userDefinedCallback itself uses less than 32K of stack? and that it doesn't make another call that allocates a lot on the stack, now or in the future? that might be an OS call - an OS call whose stack requirement might be fine now but increase in the next OS update. or there's some OS interrupt that uses a non-trivial amount of the current stack. it doesn't look safe or future-proof. to play it safe i'd say: "choose at least 100K and then pray".

updated: on current 64-bit desktop/mobile platforms (embedded platforms aside) it would be better to specify a safer amount of memory for the stack, say 1MB. the actual minimum physical amount of memory allocated will differ based on the page size of the platform, e.g. 4K on A8 and 16K on A9 - it is this number, together with the RAM you can afford for stack use (say half of RAM, which would give 1GB on an iPhone 6S), that is the actual limiter - hence a 64K-fiber hard limit there. before you commit to anything i'd recommend doing your own testing on the platforms you are going to support. and keep your eyes open to the recent trends: increasing physical page sizes, and more languages introducing async/await machinery instead of going the fibers direction. you may also sidestep this whole great stackful-vs-stackless debate and consider a less generic event-driven approach (see Nginx's design as an example).

update to update: you can do your own "fiber testing" without fibers: just malloc a bunch (say a million) of megabyte sized blocks (one malloc per each block or one giant malloc for all blocks) and write the first 1K in each block. i won't be surprised if half of current popular platforms/devices fail this test.

another update: the mentioned mini test app is below. it runs fine in both modes on a 6-year-old mac that presumably uses small physical page sizes. it fails very early on an iPhone XS Max in "many small blocks" mode, at about 6800 blocks, which would correspond to the maximum fiber count if these were real fibers; and it fails straight away in "one giant block" mode on this iPhone. it would be interesting to see how it runs on recent mac / iPad / iPhone hardware, which presumably uses a bigger physical page size but also has more RAM.

'to fiber or not to fiber' test app
//-------------
// to_fiber_or_not_to_fiber.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define blockCount 1000000L
#define blockSize (1024*1024L)
#define blockUsedSize (1024L)

int to_fiber_or_not_to_fiber(int mode) {
    printf("to_fiber_or_not_to_fiber started in mode: %s\n", mode ? "one giant block" : "many small blocks");
    if (mode) { // one giant block
        char* area = malloc(blockCount * blockSize);
        if (!area) {
            printf("malloc failure, exit\n");
            return -1;
        }
        for (long i = 0; i < blockCount; i++) {
            char* blockStart = &area[i * blockSize];
            memset(blockStart, 1, blockUsedSize);
            if ((i % (blockCount/10)) == 0) {
                printf("done %.0f%% blocks so far\n", 100.0 * i / blockCount);
            }
        }
        printf("done 100%%\n");
        printf("deallocating one giant block\n");
        free(area);
    } else { // many small blocks
        char** blocks = (char**)malloc(blockCount * sizeof(char*));
        if (!blocks) {
            printf("malloc failure, exit\n");
            return -1;
        }
        for (long i = 0; i < blockCount; i++) {
            blocks[i] = malloc(blockSize);
            if (!blocks[i]) {
                printf("malloc failure, exit\n");
                return -1;
            }
            memset(blocks[i], 1, blockUsedSize);
            if ((i % (blockCount/10)) == 0) {
                printf("done %.0f%% blocks so far\n", 100.0 * i / blockCount);
            }
        }
        printf("done 100%%\n");
        printf("deallocating many small blocks\n");
        for (long i = 0; i < blockCount; i++) {
            free(blocks[i]);
        }
        free(blocks);
    }
    printf("to_fiber_or_not_to_fiber finished ok\n\n");
    return 0;
}

//-------------
// to_fiber_or_not_to_fiber.h
int to_fiber_or_not_to_fiber(int mode);

//-------------
// Bridging-Header.h
#include "to_fiber_or_not_to_fiber.h"

//-------------
// testApp.swift

import SwiftUI

@main
struct testApp: App {
    init() {
        DispatchQueue.global().async {
            to_fiber_or_not_to_fiber(0)
            to_fiber_or_not_to_fiber(1)
        }
    }
    
    var body: some Scene {
        WindowGroup {
            ContentView()
        }
    }
}

struct ContentView: View {
    var body: some View {
        Text("Hello, world!").padding()
    }
}

trust me, i don't think that swift's async/await implementation is ideal - or that any custom promise based implementation is. it's just that the fiber way has its own gotchas and limitations, and is hardly a clear winner, if a winner at all.

i hope there's yet another better way to go that we just don't see yet!

2 Likes

We'll hopefully eventually have reasync, akin to rethrows, to allow functions to parameterize their async-ness on that of their inputs. Since the performance of async functions that don't suspend is reasonably close to not being async at all, that hopefully puts less pressure on API design to try to factor APIs into parallel sync and async versions, or separate sync-then-async phases, solely for performance's sake too.

3 Likes

It's a start. I think we probably need something like reasync at a protocol granularity, so that the async-ness of my BundleOfCallbacks instance can determine the async-ness of the algorithm using that bundle. But we'll still be left with a lot of meaningless syntax overhead.

Sure, but as noted by the Boost.Fiber people, it's a fallacy that you need to allocate 1M for fiber stacks, in general. There are lots of interesting points in the design space.

i just updated the sample app; now it is easy to play with different stack sizes. try 100K - it doesn't get me too far on my iPhone (to about 60K fibers) but maybe you'll have better results on your platforms/devices.

am i missing something? i'm not allocating 1MB of real memory, just VM space, or so i think... and i only wire a small portion of that memory: 1K rounded up to the nearest physical page size. i could have used 10MB instead for VM, and so long as 1M blocks x 10MB per block = 10TB doesn't overflow the VM space allowed (which is what on 64-bit platforms?) the test should give the same results... yet somehow changing the block size from 1MB to 100K changes the result ten-fold, as if the whole block is wired immediately... i don't quite understand that behaviour yet. @dabrahams @Joe_Groff

found these which shed some light on it:

https://developer.apple.com/documentation/bundleresources/entitlements/com_apple_developer_kernel_increased-memory-limit

indeed i can't allocate more than 5GB in the app on my iphone, and this looks like a VM size limitation. with 100K stacks that translates into a 50K-fiber limit if all memory is used for their stacks, which explains the test results i'm getting.

On a related note (helpful for digesting the discussion here), a short question: is there any good reference documentation on the lower-level interactions between sync and async functions, to understand when and how things are stored split between the task context, the heap, and the stack when using async/await/tasks/actors? I have a decent mental model when writing C code and would like to bring my Swift concurrency understanding up to par, so I don't end up with badly performing code unnecessarily. Understood the implementation can change…

Understood the implementation can change…

Can it really substantially change at this point? I thought that being released in 5.5 makes it part of the ABI now.

2 Likes