Notes on the coroutines ABI

Even if we don't do coroutines as a user feature in Swift 5, we want to make sure we implement coroutine-esque features (like materializeForSet) with an ABI that we're satisfied with. Joe Groff and I were talking about the low-level structure of that ABI, and I wanted to make sure that it was available for general interest + debate.

Apologies for the format. We'll write up a proper summary before implementing anything.

John.

rjmccall
3:47 jgroff: Hey, so I heard you were thinking that coroutines might not be necessary for ABI stability.
3:47 I'd be interested in hearing you out on that.

Joe Groff
3:57 it seems to me like we can get the ABI we want without the language feature. like, if we literally had no time we could in principle accept materializeForSet as is, and quickly hack a transition from get to materializeForGet
3:58 and we can pile on addressor-like one offs for the stdlib to do what it needs to with dictionaries
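
For reference, here's a minimal hand-rolled analog of the materializeForSet pattern. Hypothetical names throughout: the real entry point is compiler-emitted and uses Builtin types and a SIL-level calling convention.

```swift
// A hand-rolled analog of materializeForSet for a computed property: copy
// the value into caller-provided scratch, return its address, and hand back
// an optional writeback callback to run when the access is finished.
struct Angle {
    private var radians = 0.0
    var degrees: Double {
        get { return radians * 180 / .pi }
        set { radians = newValue * .pi / 180 }
    }

    mutating func materializeDegreesForSet(
        scratch: UnsafeMutableRawPointer
    ) -> (address: UnsafeMutablePointer<Double>,
          writeback: ((inout Angle) -> Void)?) {
        let addr = scratch.bindMemory(to: Double.self, capacity: 1)
        addr.initialize(to: degrees)
        // Computed property, so the caller must invoke the writeback to push
        // the modified value back through the setter.
        return (addr, { $0.degrees = addr.pointee })
    }
}

// Caller side: provide scratch, mutate in place, run the writeback.
var angle = Angle()
let scratch = UnsafeMutableRawPointer.allocate(byteCount: 16, alignment: 16)
defer { scratch.deallocate() }
let (address, writeback) = angle.materializeDegreesForSet(scratch: scratch)
address.pointee = 90
writeback?(&angle)
```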

rjmccall
3:59 Ah. Yes, I agree that we could probably make coroutines conform to the materializeForSet ABI if we needed to.

Joe Groff
3:59 generators are maybe a bit more interesting; at minimum though you could use the interface of IteratorProtocol as their abi if we needed to
3:59 at least, for consuming generators
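
The interface in question, for reference, from the standard library:

```swift
// IteratorProtocol's single requirement: a consuming generator is just
// repeated calls to next() until it returns nil.
protocol IteratorProtocol {
    associatedtype Element
    mutating func next() -> Element?
}
```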

rjmccall
4:01 Hmm. Possibly true.

Joe Groff
4:01 it seems to me like we could also tweak the ABI of our existing interfaces to be what we want for generator coroutines there too

rjmccall
4:02 Which interfaces?

Joe Groff
4:02 like, slap an attribute on IteratorProtocol saying "this will always be a one-method protocol that looks like a coroutine"

rjmccall
4:03 Oh, I see, to special-case it in the ABI so that we can get the ideal layout.
4:06 Yeah, I guess I don't disagree that on a primitive ABI level we could make future coroutine work conform to whatever ABI we roll with in Swift 5.

Joe Groff
4:06 yeah, modulo borrowing concerns

rjmccall
4:06 Right, which are significant, of course.

Joe Groff
4:07 it's a fair question whether doing special-case work to get the ABIs right without the feature saves time over doing the feature properly

rjmccall
4:07 Yeah, that's what I was about to say.

jordan_rose
4:07 arclite borrowing? “if I am on Swift 5 stdlib then I can reach into Array like this”

Joe Groff
4:08 our existing accessor ABI seems to me like it's pretty close to what we want, and we could refine it within the existing model without too much churn

rjmccall
4:08 It feels like we'd end up where we did with addressors, iterating because we realize that we need more and more in the library.
4:09 Well, the big thing with accessors is that I think we want to replace a getter with a materializeForGet.

Joe Groff
4:09 right

rjmccall
4:09 And I'm really hesitant to roll out more stuff on the materializeForSet pattern.
4:09 But you're right that we could.

Joe Groff
4:09 i was trying to say that we don't necessarily need coroutines to do that; we could do compiler-driven mangling of the current source-level property model

rjmccall
4:10 Right, like we do with materializeForSet.

Joe Groff
4:21 from our previous discussions, i recall two general possible ABIs for accessors—could go with materializeForSet-style returning a callback, or pass down a nested function. we ought to be able to make materializeFor[GS]et work either way today
4:21 independent of language work
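
A rough sketch of those two shapes, with hypothetical names and a stored Int standing in for the property being accessed:

```swift
// Backing storage for the sketch, allocated so its address is stable.
let storage = UnsafeMutablePointer<Int>.allocate(capacity: 1)
storage.initialize(to: 0)

// (1) materializeForSet style: return the address plus an optional
// continuation for the caller to invoke when the access ends (nil here
// because a stored property needs no writeback).
func callbackStyleAccess() -> (address: UnsafeMutablePointer<Int>,
                               continuation: (() -> Void)?) {
    return (storage, nil)
}

// (2) nested-function style: the caller passes the access body down, and
// the accessor brackets it with its own setup and teardown.
func nestedFunctionStyleAccess(_ body: (UnsafeMutablePointer<Int>) -> Void) {
    // setup would go here
    body(storage)
    // teardown would go here
}
```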

rjmccall
4:22 Yeah. I think the callback ABI doesn't work if we need to support accessor usage in other coroutines, though.

Joe Groff
4:23 yeah, certainly not compiler-inverted ones at least

rjmccall
4:23 Right, obviously it's fine if somehow we can avoid compiler inversion.
4:25 Unless you really think there's something promising there, though, I think we need to assume that something more like the current ABI is what we're going to roll with.

Joe Groff
4:25 i can't think of anything else we could realistically do this year

rjmccall
4:27 Well, if we can base async functions on it, we can do that later. My assumption has been that there's not really anything we can do there at all that wouldn't be extremely invasive.
4:27 i.e. it's not just a because-we're-crammed-this-year problem.
4:27 Not sure that made any sense; let me rephrase.

Joe Groff
4:28 i think i get it

rjmccall
4:28 If there's an alternative way to lower async functions that doesn't require compiler inversion, and it's something we can pursue (presumably) next year, then we can assume we'll have that.
4:29 And we can base the accessor ABI around that assumption, e.g. by using coroutines.
4:29 But I haven't heard anything that makes me think that there's actually a feasible alternative there that we can pursue next year.
4:31 Okay. If we accept that coroutines require some sort of control inversion, then we should design what we think that ABI should look like.

Joe Groff
4:32 well, given infinite time, we could keep a coroutine's stack in some relocatable buffer, and compile async functions to burn a couple context registers to track the base of the buffer and sp relative to it, and keep all "addresses" of stuff on that stack relative. beginAsync allocates the buffer; suspendAsync gives you a refcounted handle to the buffer; pushing onto the stack potentially reallocs
4:32 but i don't think that's realistic
4:34 if nothing else we really wouldn't want to malloc a context for every opaque access
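
A sketch of the relocatable-buffer idea, assuming hypothetical names; the key point is that push hands out offsets rather than addresses, precisely because growing the stack may move the whole buffer:

```swift
// A relocatable "coroutine stack": every address into it is an offset from
// the current base, so the buffer can be realloc'd when it grows (or moved
// into a refcounted box when the continuation escapes).
final class RelocatableStack {
    private var buffer: UnsafeMutableRawPointer
    private var capacity: Int
    private var sp = 0   // "stack pointer", relative to the base

    init(capacity: Int = 64) {
        self.capacity = capacity
        self.buffer = UnsafeMutableRawPointer.allocate(byteCount: capacity,
                                                       alignment: 16)
    }
    deinit { buffer.deallocate() }

    // Push returns a relative offset, never an absolute address.
    func push(byteCount: Int) -> Int {
        if sp + byteCount > capacity {
            let newCapacity = max(capacity * 2, sp + byteCount)
            let newBuffer = UnsafeMutableRawPointer.allocate(byteCount: newCapacity,
                                                             alignment: 16)
            newBuffer.copyMemory(from: buffer, byteCount: sp)
            buffer.deallocate()
            buffer = newBuffer
            capacity = newCapacity
        }
        defer { sp += byteCount }
        return sp
    }

    // Resolve an offset to an address, valid only until the next push.
    func address(at offset: Int) -> UnsafeMutableRawPointer {
        return buffer + offset
    }
}
```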

rjmccall
4:35 Yeah, so I definitely don't think we can convince the backend to generate code that way.

Joe Groff
4:35 right
4:35 that would be a year-long llvm project in itself

rjmccall
4:35 Yeah, at least.
4:36 We can maybe make something like that the basis of a higher-level lowering, though.
4:37 i.e. that's how you get local memory that persists across a save point.
4:37 Er, a yield point.
4:38 I think we'd probably want to ensure that a coroutine can save a certain amount of state without needing to interact with the runtime.
4:39 Just for the benefit of really small coroutines.

Joe Groff
4:39 something like the caller-provided buffer for materializeForSet?

rjmccall
4:39 Right.
4:39 Like maybe the stack always has at least 4 words of free space?

Joe Groff
4:40 yeah, it'd be nice to avoid going to the heap if continuations never escape, which would be the common case for accesses and for loops

rjmccall
4:42 And something starting a stack, e.g. calling materializeForSet, can just allocate some fixed amount of space within their own stack and do some cheap initialization, and that's good enough for both the fast path and the slow path.
4:43 Maybe… pass down two pointers, a scratch space and a stack context?
4:44 Scratch space is [K x i8*], stack context is some control structure which can just be zero-initialized.
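
Roughly the shape being described, with illustrative names and sizes; nothing here is a settled ABI, and the two pieces would really be passed down as two separate pointers rather than bundled into one struct:

```swift
// The two pointers handed to every coroutine in this scheme.
struct CoroutineCallerState {
    // Fixed-size scratch, [K x i8*]-style: caller-allocated, movable, and
    // big enough that small coroutines never touch the runtime.
    var scratch: (UnsafeMutableRawPointer?, UnsafeMutableRawPointer?,
                  UnsafeMutableRawPointer?, UnsafeMutableRawPointer?)
    // Slow-path allocation context; all zeros means "nothing allocated yet",
    // so the originator can set it up with a cheap zero-initialization.
    var stackContext: UnsafeMutableRawPointer?
}
```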

Joe Groff
4:44 if the caller knows it's going to store the yielded context for later, can it emplace the scratch space in an escapable object?
4:44 what's the stack context for?

rjmccall
4:46 Yeah, we don't necessarily even need to promise that the scratch space stays at the same address. If the coroutine needs stable memory, it can allocate it.

Joe Groff
4:46 so anything stored there would have to be memcpyable?
4:46 makes sense

rjmccall
4:47 I mean, we could go either way on that.
4:47 But yeah, I think it might be useful to allow it to be memcpy'ed. Allocate it locally and then move it aside if you need to.
4:48 RE: stack context: I was thinking basically a pool of memory, so that repeated requests to allocate could resolve faster.
4:49 It's a trade-off, though, because its existence means that the originator needs to clean it up.
4:49 Which is presumably a runtime call.
4:50 Maybe one that could be elided in a fast path, but still, code size.
4:50 If you don't have it, then coroutines need to malloc every time.
4:51 (Unless they fit in scratch.)

Joe Groff
4:53 did anything come of the alloca-into-caller talk? you could theoretically let the coroutine push as much context as it needs on the stack, and make the caller move it if it needs to

rjmccall
4:53 Another benefit of a scratch space: it could potentially be pre-allocated to the known-right size in the original caller to make the system dynamically allocation-free.

Joe Groff
4:54 how would you get the known-right size?

rjmccall
4:54 Experimentation. It's not a great solution.
4:55 Or just picking a large-enough buffer in the same way that people pick large-enough stack sizes.
4:56 Alloca-into-caller still has the problem of requiring the memory to be movable, which gets into that big-LLVM-project space.
4:57 Well, okay, maybe less big.
4:57 We'd be filling this in some sort of coroutine lowering pass and deciding what to put there.

Joe Groff
4:58 yeah, seems a bit more tractable

rjmccall
4:58 Maybe tough at the LLVM level.

Joe Groff
4:58 across a yield point i think you have to reload all your local state out of a state union of some sort in any compiler transform

rjmccall
4:59 I guess we could add annotations.
4:59 Right.

Joe Groff
5:01 the state storage type you derive as part of the coroutine transform could be what you alloca-into-caller and leave for the caller to either move for later or immediately restart
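
A hand-worked illustration of that transform, with hypothetical names; conceptually this is the lowering of a generator body like `for i in 0..<n { yield base + i }`:

```swift
// Locals that live across a yield get spilled into a compiler-derived state
// struct; this struct is what could be alloca'd into the caller and either
// moved for later or immediately restarted.
struct GeneratorState {
    var base: Int
    var i: Int
    var n: Int
}

// The ramp initializes the state in caller-provided memory...
func countRamp(base: Int, n: Int, state: UnsafeMutablePointer<GeneratorState>) {
    state.initialize(to: GeneratorState(base: base, i: 0, n: n))
}

// ...and each resume reloads it, produces the next value, and stores the
// updated state back, returning nil when the loop is exhausted.
func countResume(state: UnsafeMutablePointer<GeneratorState>) -> Int? {
    guard state.pointee.i < state.pointee.n else { return nil }
    let value = state.pointee.base + state.pointee.i
    state.pointee.i += 1
    return value
}
```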

rjmccall
5:02 Yeah. So you only have to malloc if you actually need to persist something that has to stay at a fixed address.

Joe Groff
5:02 yeah
5:03 alternatively, we could also put an ABI cap on the size of the struct, so it always fits into a size the caller can preallocate

rjmccall
5:03 Now I guess you'd potentially be doing quadratic amounts of copying if you had coroutines forwarding to each other.

Joe Groff
5:03 yeah
5:04 so maybe it's better to let the caller emplace the state

rjmccall
5:05 Yeah. I think we can assume that accessors will usually be shallow (depth 1, overwhelmingly) and that async would need to get copied anyway.
5:06 Generators I'm not as sure about.

Joe Groff
5:07 it seems to me like generators would still primarily be relatively shallow and not escaped
5:07 either because you're digesting the whole thing in a for loop or in some slightly more complex, but still local, control flow

rjmccall
5:08 Yeah. And relatively shallow probably still means depth=1.
5:09 Especially after inlining.
5:09 Okay. So… some kind of fixed-size scratch space that can be memcpy'ed by the caller between invocations.
5:10 For accessors, ramp function returns value pointer + optional continuation function pointer.

Joe Groff
5:11 is it better to return an optional pointer, or a pointer to a no-op function?

rjmccall
5:11 Optional? Maybe better for code size to just return a non-optional pointer and… yeah.
5:11 Typical time/code size trade-off, I think.

Joe Groff
5:12 i guess the no-op is always two branches vs one or three with the null check, but the return branch ought to be predictable at least

rjmccall
5:13 Yeah. And the same no-op function can be reused a lot.
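
The no-op variant in miniature, with a hypothetical name:

```swift
// One copy of this can serve every accessor that has no writeback work, so
// callers can invoke the returned continuation unconditionally instead of
// branching on nil.
func sharedNoopContinuation(_ scratch: UnsafeMutableRawPointer) {
    // nothing to clean up
}
```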

rjmccall
5:14 For generators, I think ramp + continuation functions return value/address + optional continuation function pointer, with null meaning done + no value.
5:17 Async I guess would be the same way.
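
A sketch of that generator shape, again with hypothetical names and types: ramp and continuation both return the yielded value's address plus the next continuation, with nil meaning "done, no value".

```swift
struct GeneratorStep {
    var value: UnsafePointer<Int>                            // yielded value
    var resume: (UnsafeMutableRawPointer) -> GeneratorStep?  // scratch in
}

// A counting generator lowered by hand onto that shape, keeping all of its
// state in the caller-provided scratch buffer.
func rangeRamp(upTo n: Int, scratch: UnsafeMutableRawPointer) -> GeneratorStep? {
    let state = scratch.bindMemory(to: Int.self, capacity: 3)
    state[0] = 0   // next value to yield
    state[1] = n   // limit
    return rangeResume(scratch: scratch)
}

func rangeResume(scratch: UnsafeMutableRawPointer) -> GeneratorStep? {
    let state = scratch.assumingMemoryBound(to: Int.self)
    guard state[0] < state[1] else { return nil }   // done, no value
    state[2] = state[0]   // copy the yielded value into its own slot
    state[0] += 1
    return GeneratorStep(value: UnsafePointer(state + 2), resume: rangeResume)
}

// Driving it: the loop is just "call, use the value, call the continuation".
let buffer = UnsafeMutableRawPointer.allocate(byteCount: 24, alignment: 8)
defer { buffer.deallocate() }
var step = rangeRamp(upTo: 3, scratch: buffer)
while let s = step {
    print(s.value.pointee)   // 0, 1, 2
    step = s.resume(buffer)
}
```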

Joe Groff
5:17 yeah, generators and async ought to look mostly the same at the implementation level

spestov
5:17 I wonder how incredibly painful it would be to try to implement multi-shot continuations with reference counting

rjmccall
5:18 Er, actually, is that true? Maybe async actually just returns void.

Joe Groff
5:18 an async function is in a sense a generator that yields void

rjmccall
5:18 No, I mean maybe the continuation is handled privately.

Joe Groff
5:18 or, potentially, some context information about what it's waiting on
5:19 ah, you mean, make it the callee's responsibility to schedule the continuation and just return void to the sync context?

rjmccall
5:19 Right.
5:20 Like, what would the caller do with the continuation function?

Joe Groff
5:20 sure, that might also be nice for retrofitting into async void-returning ABIs if needed
5:21 in the universe of possible designs, maybe async yields a list of event fds back to the caller's event loop for queuing, instead of queuing itself
5:21 with chris's proposal though it's up to the callee

rjmccall
5:21 Ah, sure, if you had a universal event system to wait on, that would work.
5:22 Just return an event handle + a continuation to notify upon that event.
5:22 But yeah, it's hard to retrofit that.
5:24 Okay, if async functions schedule themselves, they don't have scratch buffers, either.
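
A sketch of that convention, with hypothetical names and predating any concrete design: the async entry point returns void to its synchronous caller and is responsible for getting its own continuation scheduled.

```swift
// The continuation is handed to some scheduling hook instead of being
// returned to the caller.
func asyncEntryPoint(enqueue: (() -> Void) -> Void) {
    // ...run synchronously until the first suspension point...
    enqueue {
        // ...the continuation resumes here when the scheduler runs it...
        print("resumed")
    }
    // nothing is returned to the synchronous caller
}

// A toy scheduler that runs work immediately, just to drive the sketch.
asyncEntryPoint(enqueue: { work in work() })
```
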
5:25 Thoughts on scratch size for accessors/generators? I guess they can vary separately.
5:26 Generators probably want more scratch space overall by default.
5:26 And there's some benefit to that, since it means generators can call accessors.
5:26 Maybe 4 pointers and 8 pointers?
5:26 16-byte aligned in both cases, just because.
5:27 Or maybe not, maybe just pointer-aligned is fine.

Joe Groff
5:28 how big is a typical lazy transformer struct

rjmccall
5:28 You mean like a lazy collection?

Joe Groff
5:28 yeah
5:29 a closure is two words

rjmccall
5:29 There's going to at least be a transform function, and then I guess the underlying collection.

Joe Groff
5:29 an array might be one or two words

rjmccall
5:29 Yeah.

Joe Groff
5:30 8 pointers gives you a collection and three closures, i guess that's ok

rjmccall
5:31 Yeah, or a collection, a closure, and an accessor call.
5:31 Hmm, not an accessor call, because you also have to save the continuation.
5:33 Other than wanting accessors to fit cleanly within generators, I don't think we need to be stingy here.

Joe Groff
5:36 yeah, maybe a bit more space for generators, since it's going to be common to yield an access, and there's more interesting state that could accumulate

rjmccall
5:36 12? 16?
5:36 12 is probably fine.
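
For what it's worth, the word counts quoted above are easy to check:

```swift
// On a 64-bit platform: a closure is two words (function pointer + context)
// and an Array is one word (a reference to its buffer).
let wordSize = MemoryLayout<UnsafeRawPointer>.size   // 8 bytes
print(MemoryLayout<(Int) -> Int>.size / wordSize)    // 2 words
print(MemoryLayout<[Int]>.size / wordSize)           // 1 word
```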
