Efficient yielding accessor ABI

Since the non-underscored version of yielding accessors has not yet been shipped, maybe it’s not too late for this to make a difference…

Our tests of Swift’s yielding accessors in the Hylo project have shown them to have major performance problems. In particular, IIUC, the coroutines incur dynamic allocations. In Hylo we have worked out a compilation model that uses ordinary function calls instead of coroutines and supports non-lexical lifetimes. If this model is of any interest to the implementation team, I’m happy to share more.

Regards,
Dave

22 Likes

Hi Dave, thanks for starting a discussion about this. Although I’m far from an expert on coroutines, could you elaborate on the alternative implementation you had in mind? Also, how significant are the performance impacts you are seeing?

2 Likes

Joe tells me there is a new ABI for yielding accessors that’s in development which avoids these heap allocations, but it’s not enabled yet, so that might explain the overhead you’re seeing with the current implementation.

I’m always happy to see more discussion of implementation techniques here, so why not? ;)

12 Likes

The overhead can be pretty significant when one starts chaining projections (e.g., in expressions of the form `foo[a].bar = ham`) while crossing a resilience boundary.

Sadly I don’t have a reproducer handy and I think we fixed the most problematic use cases in the Hylo code base. What I remember from profiling data was that using `_read` or `_modify` would sometimes create needless dynamic allocations.

2 Likes

The scheme has a pretty simple basis, but well… you'll see. To set the stage, here are some notes about how Hylo differs from Swift so you can read the code:

  • In Hylo, `let` means “immutable borrow,” like what Swift calls `borrowing`.
  • Hylo uses the `let` and `inout` keywords for the equivalent of Swift’s `_read` and `_modify` accessors, respectively.
  • Hylo supports named `inout` and `let` bindings to projection results.
  • Hylo guarantees that lifetimes end at the first opportunity.
  • Pointers in Hylo are not unsafe; only access to their pointees is.

So, you can have code like:

fun f(x: Array<Int>, y: inout Array<Array<String>>) {
  // The next three lines are equivalent to y[1][x[1]] = "Hello World".
  let x1 = x[1]
  inout y1 = &y[1]
  y1[x1] = "Hello World"
  // lifetime of x1 projection ends here.
  print(y1)
  // lifetime of y1 projection ends here.
}

Array element access will be done with addressors for efficiency, but we need to support this pattern for yielding accessors, e.g. for types like Swift’s `Dictionary`, whose `_modify` has to construct an `Optional<Value>`, yield it, and then take action after the yield based on the `Optional`'s value to update the `Dictionary`. We call the accessor’s code before the yield its “ramp” and the code after the yield its “slide.”
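In Swift terms, that pattern looks something like the following minimal sketch, written with today’s underscored syntax (the `Table` type and its `storage` dictionary are illustrative, not from the standard library):

struct Table {
  private var storage: [String: Int] = [:]

  subscript(key: String) -> Int? {
    get { storage[key] }
    _modify {
      var q = storage[key]     // ramp: construct the Optional<Value>
      yield &q                 // the caller mutates q in place
      if let v = q {           // slide: inspect the Optional after the
        storage[key] = v       // yield and update the table accordingly
      } else {
        storage.removeValue(forKey: key)
      }
    }
  }
}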

The model starts with the realization that for most use cases an `inout` or `let` accessor can be simply transformed into a higher-order function, so internally you could rewrite:

subscript(j: Int): String {
  inout {
    var q: String = ramp_code(j)
    yield &q
    slide_code(q)
  }
  ...
}

into

fun subscript_inout(j: Int, access: (inout String)->Void) {
  var q: String = ramp_code(j)
  access(&q)
  slide_code(q)
}

The body of the accessor is just transplanted into the function and the yield is replaced by the call to access. At the call site you wrap the code that uses the yielded value in a closure, and pass that to the `subscript_inout` function as its `access` parameter.
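Rendered in Swift, the whole rewrite might look like this (a hypothetical sketch continuing the `Table` example above; `Table2` and its `subscript_inout` method are illustrative, not compiler output):

struct Table2 {
  private var storage: [String: Int] = [:]

  // The accessor body, transplanted into an ordinary function.
  mutating func subscript_inout(_ key: String, _ access: (inout Int?) -> Void) {
    var q = storage[key]                   // ramp
    access(&q)                             // replaces `yield &q`
    if let v = q { storage[key] = v }      // slide
    else { storage.removeValue(forKey: key) }
  }
}

var t = Table2()
// A mutation through the subscript, e.g. incrementing an entry, becomes:
t.subscript_inout("apples") { q in q = (q ?? 0) + 1 }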

However, this doesn’t cover cases shaped like our first example, where the ramps and slides don’t nest: you have x1's ramp, y1's ramp, x1's slide, and finally y1's slide. To handle that, you need to change the transformation slightly:
(I hope I didn’t make too many errors in this, I didn’t pass it through a real compiler and this is the first time I’ve tried to put it in code.)

fun subscript_inout<Caller>(
  j: Int, 
  continue: InoutScopeContinuation<String, Caller>,
  caller_locals: Locals<Caller>
) {
  var q: String = ramp_code(j)

  continue(
    &q,
    fun (my_locals: SlideLocals)->Void {
      slide_code(&Locals<MyLocals>(my_locals).unsafe_pointee.q)
    },
    my_locals, // <=== synthesized pointer to local stack frame.
    caller_locals)
}

where:

/// Captured access to a function's local variables, where
/// `X` has the layout of its parameters and local variables.
type Locals<X> = MutablePointer<X>

/// Contextually defined in each function as the layout of its
/// parameters and local variables.
type MyLocals = ...

/// Type-erased `Locals<Slide>` where `Slide` is `MyLocals` for some
/// slide function.
type SlideLocals = RawPointer

/// Post-yield code of an accessor projecting `inout T`
type InoutSlide<Accessor> = (Locals<Accessor>)->Void

/// The rest of a caller up to the end of the enclosing block.
type InoutScopeContinuation<T, Caller> 
  = (projected: inout T, 
     slide: InoutSlide<T>, slideLocals: RawPointer,
     caller_locals: Locals<Caller>)->Void

The transformation in the caller is mostly implied by the names and comments above, but I’ll spell it out (a sketch follows the list):

  • The remainder of the enclosing block containing the access is represented as an `InoutScopeContinuation<String, MyLocals>`.
  • Since the caller knows exactly where the slide needs to execute, a call to the slide parameter of that continuation is made at that point, passing slideLocals.
  • If there is another projection inside the continuation, the transformation is repeated.
  • All the local variables stay accessible because the chain of continuations doesn’t return until the end of scope.
  • You don’t actually need to use the end of the scope in most cases; you can trivially compute where a chain of overlapping projections ends and return from continuations at that point.
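Here is that caller-side shape rendered as a Swift sketch (illustrative only: `letAccess` and `inoutAccess` stand in for the generated accessor functions, and Swift closures stand in for the capture-free continuations plus `Locals` pointer described above):

final class Box<T> { var value: T; init(_ v: T) { value = v } }

// Generated form of a `let` accessor: ramp, then run the continuation,
// handing it the projected value together with this access's slide.
func letAccess<T>(_ box: Box<[T]>, _ j: Int,
                  _ k: (_ projected: T, _ slide: () -> Void) -> Void) {
  let q = box.value[j]                      // ramp
  k(q) { /* slide: nothing to do for a simple read */ }
}

// Generated form of an `inout` accessor: ramp, continuation, write-back slide.
func inoutAccess<T>(_ box: Box<[T]>, _ j: Int,
                    _ k: (_ projected: Box<T>, _ slide: () -> Void) -> Void) {
  let q = Box(box.value[j])                 // ramp
  k(q) { box.value[j] = q.value }           // slide: write back
}

// The first example, rewritten: x1's slide runs before y1's slide even
// though x1's ramp ran first, and neither continuation returns until the
// end of the enclosing scope.
let x = Box([0, 1, 2])
let y = Box([["a"], ["b", "c"]])
letAccess(x, 1) { x1, endX1 in
  inoutAccess(y, 1) { y1, endY1 in
    y1.value[x1] = "Hello World"
    endX1()               // lifetime of x1 projection ends here
    print(y1.value)       // prints ["b", "Hello World"]
    endY1()               // lifetime of y1 projection ends here
  }
}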

That’s it; LMK if you have any questions. I’m interested to hear what @Joe_Groff may be doing in this space.

16 Likes

@Joe_Groff @slava Is any of this useful for Swift, or have you already been down this road and decided on something better?

3 Likes

@Slava_Pestov hey, gentle ping

That looks similar to the coroutine splitting transformation that we let LLVM do for us with the existing ABI.

The new ABI we're using takes advantage of the primary use of accessor coroutines still having stack discipline in a different way: the coroutine is allowed to use the callstack to store its state, and yield is implemented by "returning" but still leaving context unpopped on the callstack, avoiding the need for a predefined fixed allocation or fallback to dynamic allocation. When the caller resumes the coroutine, it picks up with the context it left on the stack, and pops it when it makes its final return.

Mechanically, the coroutine still gets split into two almost-traditional functions at the LLVM level, but with the trick that the first half gets to leave live values on the stack for the second half to pick up again. The caller also has the ability to redirect context allocations to an async task's context when the caller is async, or, for future-proofing, it can still fall back to malloc in case we ever introduce a reified coroutine access type.

7 Likes

@Joe_Groff : can you please share links to where this is implemented?

I would like to understand:

  • how can the caller use the stack after calling into the subscript (before returning control back to the slide)
  • how is the stack cleaned up
  • how will this work if the call of the coroutine happens in a loop

Thank you very much

1 Like

@Joe_Groff it seems to me that the scheme I outlined doesn't require dynamic allocations even with a reified subscript. What's the advantage of doing it the way you are pursuing?

Unless I'm missing something, your InoutScopeContinuation is a closure, which would generally dynamically allocate space to store the state needed to resume the continuation. You could avoid that by generating a per-coroutine type to store the continuation info specific to a coroutine implementation, but you would then rely on generics to handle the interface between callers and the coroutine.

1 Like

The first half of the coroutine leaves the stack pointer pointing below its own context, so the caller can continue to use any stack space it had used before initiating the coroutine, and if it wants to allocate more, it can push the stack pointer further downward. This is OK because uses of coroutine accessors in Swift today are strictly scoped, so any additional local state the caller accumulates during an access would naturally be popped before ending the coroutine.

The coroutine pops its context off the stack after it completes.

As noted above, the scope discipline of coroutine accesses prevents this from happening. If we introduce the ability to have more ad-hoc access patterns, we would still have to fall back to heap allocations in cases where stack discipline can't be maintained. In order to accumulate a collection of coroutine accesses across a loop, that would require collecting the reified coroutine accesses as heap-allocated contexts under this scheme.

2 Likes

It's a lambda with no captures (and even if it had captures, it doesn't escape), and Hylo lambdas don't do dynamic allocation; they're more like C++ or Rust lambdas. So I don't know where any dynamic allocation would come from. This one is essentially a C function pointer. I should have written this to show the lack of captures:

/// Post-yield code of an accessor projecting `inout T`
type InoutSlide<Accessor> = [Void](Locals<Accessor>)->Void

/// The rest of a caller up to the end of the enclosing block.
type InoutScopeContinuation<T, Caller> 
  = [Void](projected: inout T, 
     slide: InoutSlide<T>, slideLocals: RawPointer,
     caller_locals: Locals<Caller>)->Void

It seems to me that if you’re really committed to the scope discipline (which I assume means not just lexical scoping but actual nesting of projection lifetimes), you could use the simple higher-order function transformation (introduced at the beginning), and coroutines are merely a convenience for you, if that. We found the documentation of LLVM’s coroutine primitives much too vague to feel confident continuing to use them, which is why we developed this other scheme.

3 Likes

@Joe_Groff I'm still looking for some advantage to what you're planning to do. If there is one, Hylo might want to consider doing that instead. Inquiring minds want to know!

3 Likes

Extracting the code in the caller’s access scope as a callback is a viable approach under certain assumptions, chiefly that you never need the scope of the access to be more dynamic than that, which happens to include static scopes that just don’t match whatever your callback implementation assumes. Because Swift uses coroutines instead, it can e.g. perform abstract accesses across async suspensions, despite the fact that we didn’t know how async functions would work when we designed those coroutines. But if you’re willing to give that sort of thing up, callback extraction is quite straightforward.

2 Likes

I’ve spent some time analyzing what the LLVM-generated code looks like. It seems that the callee declares the size of the coroutine frame, and the caller allocates that on the stack before invoking the callee coroutine entry. The allocated frame is passed as the first parameter to the generated ramp function (and to the resume function).

The thing that confuses me is that the second argument to the coroutine entry function is @_swift_coro_malloc_allocator. This doesn’t seem to be used anywhere by the callee.

@Joe_Groff : why do we have this argument? Is it for legacy reasons?

Thank you very much!

It’s for cases where you need the coroutine to allocate something dynamically sized that can’t be preallocated in the frame.

And note that the new coroutine implementation is only used with the new SE-0474 syntax, which is not yet enabled by default, and currently only when targeting Apple arm64 platforms. For ABI compatibility reasons, the existing `_read` and `_modify` syntax will continue to use the fixed-size buffer implementation.

I think @lucteo's point is that he doesn't see that dynamic allocation happening in any actually-generated code. Is there a particular pattern that causes it?

Are you saying that, once you've bought into stackless coroutines (or—equivalently—you haven't yet ruled them out, as when you introduced _modify), it might be necessary to use coroutines for yielding accessors? It sounds plausible at first, but when I think about it carefully I can't understand why it would be true.

It seems to me we have two rewrite systems: the one I outlined, which rewrites yielding accessors as regular functions, and the stackless coroutine transformation Swift uses for async functions, which rewrites async-adorned functions as structs storing the locals that persist across the yield. Even if you want stackless coroutines, you could apply one rewrite and then the other, no?

A local variable of dynamically-sized type should do it, or `withUnsafeTemporaryAllocation`.
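For instance (my own illustrative sketch, not from the thread), a shape like this inside a coroutine accessor has a frame requirement that depends on a runtime value, so, per the reply above, it can't live in a fixed-size coroutine frame:

struct Samples {
  var count: Int

  var checksum: Int {
    _read {
      // The temporary buffer's size depends on `count`, so it can't be
      // preallocated in a fixed-size frame.
      let r = withUnsafeTemporaryAllocation(
        of: Int.self, capacity: count
      ) { (buffer) -> Int in
        buffer.initialize(repeating: 1)
        return buffer.reduce(0, +)
      }
      yield r
    }
  }
}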

If you can rewrite the yielding accessor as both an async function taking an async callback and a sync function taking a sync callback, sure.

Edit: and you're willing to say that that covers everything and you will never have to use accessors in a situation with inversion-of-control problems.
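For concreteness, a minimal Swift sketch of that dual rewrite (hypothetical names, assuming the simple callback transformation from earlier in the thread):

struct Cell {
  private var storage: Int = 0

  // Sync form: the accessor body with the yield replaced by a sync callback.
  mutating func access(_ body: (inout Int) -> Void) {
    var q = storage            // ramp
    body(&q)
    storage = q                // slide
  }

  // Async form: the same body, but the "yield" may now span suspensions.
  mutating func access(_ body: (inout Int) async -> Void) async {
    var q = storage            // ramp
    await body(&q)
    storage = q                // slide
  }
}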