Continuation-based versus nested-function-based materializeForSet


(Joe Groff) #1

We currently abstract over mutable property accesses using what I’ll call a continuation-based model: the materializeForSet accessor is called before an inout access, and returns a continuation callback that must be called when the inout access is completed. I think a nested-function-based model, if designed correctly, has the potential to be more efficient and easier to implement.

As background for anyone not intimately familiar with the implementation, Swift’s language model for mutable properties requires a reconciliation step after a mutation occurs. For example, in an inout access like this:

foo(&x.y.z)
the y or z property may be computed, requiring a getter call before the call to foo and a setter call afterward, or either property may have willSet or didSet observers that fire after foo finishes. If the base x is a class, protocol, or generic value, the implementation of y can't be statically known, so we need an abstract interface for mutating the property: one that projects out the value at the start of the access and commits the mutated value back at the end. Ideally, the interface should accommodate efficient access to both stored and computed properties, avoiding unnecessary copies of stored properties that would induce CoW copies or, in the near future, violate the constraints of move-only types or the borrow model. There are two broad approaches to doing this:
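
To make the reconciliation concrete, here is a minimal sketch of what foo(&x.y.z) conceptually requires when both properties are computed (not the actual ABI; getZ and setZ are hypothetical analogues on Y of the getY/setY accessors below):

void callFooOnXDotYDotZ_naive(X *x) {
  Y y = getY(x);    // read y out of x with its getter
  Z z = getZ(&y);   // read z out of the temporary copy of y
  foo(&z);          // the inout access mutates z in place
  setZ(&y, z);      // write the mutated z back into y
  setY(x, y);       // write the mutated y back into x
}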

Continuation-based access

The accessor that begins the formal access can return a continuation closure to call when the access is completed. This is the approach Swift currently uses with its materializeForSet accessor pattern. A property under this model compiles to something like this C code:

struct X {
  // getter for Y
  Y (*getY)(const X *self);
  // setter for Y
  void (*setY)(X *self, Y newValue);
  // materializeForSet for Y
  struct MaterializeForSetReturn {
    // The projected field
    Y *y;
    // The function that finishes the access, or null
    void (*finish)(X *self, void *buffer);
  };
  struct MaterializeForSetReturn (*materializeForSettingY)(X *self, void *buffer);
};

void callFooOnXDotYDotZ(X *x) {
  // To compile:
  // foo(&x.y.z)
  //
  // - call the materializeForSet accessor to begin the access to y, giving it a buffer
  // to store a computed value or capture state to pass to the end of the access
  char alignas(Y) bufferY[sizeof(Y) + 3*sizeof(void*)];
  auto materializedY = x->materializeForSettingY(x, bufferY);
  // - call materializeForSet to begin access to z
  char alignas(Z) bufferZ[sizeof(Z) + 3*sizeof(void*)];
  auto materializedZ = materializedY.y->materializeForSettingZ(materializedY.y, bufferZ);
  // - perform the access
  foo(materializedZ.z);
  // - finish the accesses, if we were given a finishing callback
  if (materializedZ.finish)
    materializedZ.finish(materializedY.y, bufferZ);
  if (materializedY.finish)
    materializedY.finish(x, bufferY);
}
A stored property can be exposed through this interface by trivially returning the projected address of the subobject, with no continuation:

struct MaterializeForSetReturn
materializeForSettingStoredY(X *self, void *buffer) {
  return (struct MaterializeForSetReturn){&self->y, NULL};
}
whereas a computed property can call the getter, store the computed value into the buffer, and return a continuation function that calls the setter:

struct MaterializeForSetReturn
materializeForSettingComputedY(X *self, void *buffer) {
  Y oldValue = getY(self);
  memcpy(buffer, &oldValue, sizeof(Y));
  return (struct MaterializeForSetReturn){(Y *)buffer, finishSettingComputedY};
}
void finishSettingComputedY(X *self, void *buffer) {
  Y newValue;
  memcpy(&newValue, buffer, sizeof(Y));
  setY(self, newValue);
}
A benefit of this approach is that it maintains the integrity of the source function in the compiled code. On the other hand, in order for the property implementation to transfer state from the start to the completion of the access, the caller must preallocate some stack space; if it’s too much, the space is wasted, and if it’s not enough, the property implementation must use malloc to get more space for itself. (It’s theoretically possible to avoid this with some clever stack pointer accounting.) There’s also a code size impact on the caller, which needs to bracket the inout access with a call on each side. The underlying CPU has to execute three or five branches per segment of the access path (calling and returning from materializeForSet, testing for null, and potentially calling and returning from the continuation if there is one).

Nested functions

Alternatively, we can implement the formal access pattern as a series of nested functions. A property under this model compiles to something like this C code:

struct X {
  // getter for Y
  Y (*getY)(const X *self);
  // setter for Y
  void (*setY)(X *self, Y newValue);
  // mutator for Y
  void *(*mutateY)(X *self, void *context, void *(*innerMutation)(Y *y, void *context));
};

void callFooOnXDotYDotZ(X *x) {
  // To compile:
  // foo(&x.y.z)
  // - call the mutate accessor for y, with the inner access as a
  // function whose pointer gets passed in
  x->mutateY(x, NULL, callFooOnXDotYDotZ_inner_1);
}
void *callFooOnXDotYDotZ_inner_1(Y *y, void *context) {
  // - call the mutate accessor for z
  return y->mutateZ(y, context, callFooOnXDotYDotZ_inner_2);
}
void *callFooOnXDotYDotZ_inner_2(Z *z, void *context) {
  foo(z);
  return NULL;
}
A stored property can implement mutate by projecting the physical address of the subobject, tail-calling the subsequent inner function:

void *mutateStoredY(X *self, void *context, void *(*innerMutation)(Y *y, void *context)) {
  /*tail*/ return innerMutation(&self->y, context);
}
and a computed property can bracket the inner mutation in getter and setter calls:

void *mutateComputedY(X *self, void *context, void *(*innerMutation)(Y *y, void *context)) {
  Y value = getY(self);
  void *result = innerMutation(&value, context);
  setY(self, value);
  return result;
}
This makes things easier for the property implementation, which can easily save as much state as it needs to on the stack instead of relying on a preallocated buffer handed down from the caller. There’s one fewer branch necessary per segment of the access path compared to the continuation-based model, either three if the accessor has no reconciliation necessary and can tail-call the inner function, or four if it does a normal call. On the other hand, the caller function has to be broken up into many smaller subfunctions, potentially one for every segment of an access path.

Letting nested functions walk the access path

We can reduce the number of nested functions that are needed for a nested access path by designing a convention that allows each mutator to directly call the mutator for the next segment of access. For example, the caller could construct the entire access path as an array of pointers to mutation functions:

typedef void *(*MutatorFunction)(void *base, MutatorFunction *accessPath, void *context);
void callFooOnXDotYDotZDotW(X *x) {
  // foo(&x.y.z.w)
  MutatorFunction accessPath[] = {
    (MutatorFunction)mutateZ,
    (MutatorFunction)mutateW,
    (MutatorFunction)callFooOnXDotYDotZDotW_inner
  };
  mutateY(x, accessPath, NULL);
}
Each property accessor function then loads the next step of the access out of the path, and advances the pointer forward for the nested access:

void *mutateY(X *self, MutatorFunction *accessPath, void *context) {
  // Project out y
  Y *y = &self->y;
  // Call the next step of the access path, passing the remainder of the access path
  return (*accessPath)(y, accessPath + 1, context);
}
And the inner function at the end of the access path ignores the access path parameter and performs the access:

void *callFooOnXDotYDotZDotW_inner(W *w, MutatorFunction *_ignored, void *context) {
  foo(w);
  return NULL;
}
In this model, only one inner function needs to be broken out of the caller for the innermost inout access, at least for simple property projections. There are only two branches needed per segment, or one for segments that can tail-call the inner access. Parameterized segments such as subscripts can be supported by storing the parameters in the access path buffer as well, and having the subscript accessor advance past the parameters when it consumes them. For example:

void callFooOnXDotYSubIDotZ(X *x, intptr_t i) {
  // foo(&x.y[i].z)
  uintptr_t accessPath[] = {
    (uintptr_t)mutateYSubscript,
    (uintptr_t)i,
    (uintptr_t)mutateZ,
    (uintptr_t)callFooOnXDotYSubIDotZ_inner
  };
  
  mutateY(x, accessPath, NULL);
}

void *mutateYSubscript(Y *y, uintptr_t *accessPath, void *context) {
  // Get the subscript index out of the access path
  intptr_t i = (intptr_t)*accessPath++;
  // Get the next step of the access out of the access path
  MutatorFunction next = (MutatorFunction)*accessPath++;
  
  return next(&y->elements[i], accessPath, context);
}
On the other hand, this does push the inner access functions and subscript arguments onto the stack, which is a small amount of overhead. Furthermore, the caller function has to push any state it needs to pass from outside the access to inside on the stack as well, if it can't otherwise cram it into a single context pointer argument.
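
As a minimal sketch of that last point (the scaleW operation and the context struct are hypothetical), outer state needed by the innermost access can be bundled behind the single context pointer:

struct ScaleContext { int scale; };

void *scaleW_inner(W *w, MutatorFunction *_ignored, void *context) {
  struct ScaleContext *ctx = context;
  scaleW(w, ctx->scale);   // the actual inout work, using state from the caller
  return NULL;
}

void scaleXDotYDotZDotW(X *x, int scale) {
  // scaleW(&x.y.z.w, scale)
  struct ScaleContext ctx = { scale };
  MutatorFunction accessPath[] = {
    (MutatorFunction)mutateZ,
    (MutatorFunction)mutateW,
    (MutatorFunction)scaleW_inner
  };
  mutateY(x, accessPath, &ctx);
}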

Machine-level calling convention tricks

To be able to feed a return value from the inner access back to the outer caller, nested mutators would need to preserve all of the return registers potentially used by the inner access on the way out of the access. To minimize the impact on the outer caller, the convention for mutators could also specify that they need to preserve the contents of callee-save registers when they call down to nested accesses. This should allow as much state to be kept in registers between the outer caller and inner access as if they were naturally in the same function, with at most a function call in between.

With these sorts of optimizations, I think the nested function approach has the potential to be competitive with, or even more efficient than, our current continuation-based materializeForSet design.

-Joe


(Jordan Rose) #2

I like the idea; it makes more sense to me than our current model (which feels more like a plain callback than a continuation to me). Some things that occurred to me when reading this:

- This seems like it'll be much simpler to check for invalid concurrent access to the same location (inout violations) in debug builds. (I don't think it's actually any more or less possible, but it is simpler.)

- This will look funny but work fine for multiple inout parameters: foo(&x, &y) (sketched below)

- OTOH, this does break up basic blocks in the caller context, making LLVM analysis less effective. Since the materializeForSet calls were already either inlined or opaque, though, we're probably fine there.

- Does this help us with the nested dictionary CoW problem? `foo["bar"]["baz"] += 1`
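
A minimal sketch of how such a two-parameter call might lower under the nested-function model, taking a hypothetical foo2(&x1.y, &x2.y) as a stand-in (names are illustrative; the first projected address rides along as the context of the second access):

void callFooOnBothYs(X *x1, X *x2) {
  // foo2(&x1.y, &x2.y): begin the access to x1.y first...
  x1->mutateY(x1, x2, beginSecondAccess);
}
void *beginSecondAccess(Y *y1, void *context) {
  // ...then nest the access to x2.y inside it, smuggling &x1.y through
  // the second access's context
  X *x2 = context;
  return x2->mutateY(x2, y1, performBothAccesses);
}
void *performBothAccesses(Y *y2, void *context) {
  foo2((Y *)context, y2);  // both projected addresses are now in hand
  return NULL;
}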

Jordan



(John McCall) #3

Well, I'm not totally convinced about either of those two points, but it's definitely something we need to talk about.

Let me try to break this down. I think I'm going to repeat you a lot in different words.

We have a coroutine-esque problem:
  0. Outer is executing.
  1. Outer calls Inner.
  2. Inner executes for a while.
  3. Inner passes control back to Outer without abandoning its execution context.
  4. Outer executes for a while.
  5. Outer passes control back to Inner, potentially having changed its own execution context.
  6. Inner executes for a while.
  7. Inner finally abandons its execution context and passes control back to Outer.
  8. Outer is executing.

Note that there might be multiple static points for (3) and (5), each requiring a different resumption point (and creating different context changes) at (5) and (7) respectively.

(I'm going to use the terms "standard spilling", "subfunction spilling", and "transfer spilling" without defining them. I'll come back to this.)

I. Lowerings

Currently, as you say, we do this with a sort of CPS conversion where Inner is split into multiple sub-functions. The initial function is passed a (necessarily fixed-size, opaque to Outer) buffer; at (3), it saves its context into that buffer and returns a sub-function that Outer will invoke (passing the same buffer) at point (5); at (5), that sub-function restores its context from that buffer; and finally the sub-function returns normally at (7). This has two advantages: Outer can reason locally about the data flow from (0) to (4) to (8), and Inner can return a null pointer at (3) to indicate that there's no work to do in (6) and thus save an indirect branch (at the cost of a local conditional branch). However, if Inner has interesting context to save at (3), that requires extra code to manage that data flow, and the amount of data to save can overflow the buffer and therefore require malloc. Total branching cost: the original call at (1), the easily-predicted return at (3), the conditional branch at (5), the possible indirect call at (5), and the easily-predicted return at (7). Total save/restore cost: standard spilling at (1-3) and (5-7), transfer spilling at (3-5), possible malloc.
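
Mapped onto the materializeForSet-style code from earlier in the thread, the continuation lowering looks roughly like this (a rough sketch; the buffer size and names are only illustrative):

void Outer_continuation(X *x) {
  // 0. Outer is executing.
  char alignas(Y) buffer[sizeof(Y) + 3*sizeof(void*)];
  // 1. Outer calls Inner; 2-3. Inner runs, saves its context into the buffer,
  // and returns a projected address plus an optional continuation.
  struct MaterializeForSetReturn r = x->materializeForSettingY(x, buffer);
  // 4. Outer executes for a while, using the projected address.
  foo(r.y);
  // 5. Outer passes control back to Inner, if there is anything left to do;
  // 6-7. the continuation restores Inner's context, reconciles, and returns.
  if (r.finish)
    r.finish(x, buffer);
  // 8. Outer is executing again.
}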

The simplest alternative proposal is a callback conversion, extracting (4) from Outer as a sub-function which is passed to Inner. Inner now gets an ordinary stack frame to save its context, so its data flow and control flow are easier for the backend to reason about, and malloc is not required. However, the data/control flow in Outer is significantly obscured; most importantly, different static points for (5) may need to resume to different instructions at (7) (which the continuation lowering implicitly handles by duplicating the call in (5)), and Outer's context will be modified at various points from (1-7). Total branching cost: the original call at (1), the indirect call at (3), the easily-predicted return at (5), the easily-predicted return at (7), the possible switch-out at (7). Total save/restore cost: subfunction spilling at (1-7), standard spilling at (3-5), only stack allocations.
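
And a rough sketch of the same steps under the callback conversion, reusing the mutateY shape from earlier (again, names are illustrative):

void Outer_callback(X *x) {
  // 0. Outer is executing.
  // 1. Outer calls Inner, handing it the code for step 4 as a sub-function.
  x->mutateY(x, NULL, Outer_callback_step4);
  // 8. Outer is executing again (7 was Inner's ordinary return).
}
void *Outer_callback_step4(Y *y, void *context) {
  // 2-3 happened inside mutateY before it invoked this sub-function.
  // 4. Outer's extracted code executes here.
  foo(y);
  // 5. Returning passes control back to Inner, which then runs 6 and 7.
  return NULL;
}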

Note that the call at (5) needs to be duplicated for continuation lowering in exactly those cases where we need a switch-out at (7) in callback conversion.

Your observation is that, if Outer needs to perform a nested access with this sequence at (4), a naive callback conversion will have a lot of branches. Let the outer access be (A), with its steps being (A.0-8), and the inner access be (B), with its steps being (B.0-8), so we have (A.4) := (B.0-8). If the nesting is simple — i.e. there is no code in B.0 or B.8 — and the initiation sequences are compatible — i.e. you can convince (A.3) to directly make the original call for (B.1) — then it is unnecessary to extract a sub-function for (A.4), and instead (A.1) can just set up the chain of calls. Total branching cost: the original call at (A.1), the indirect call at (A.3 / B.1), the indirect call at (B.3), the easily-predicted return at (B.5), the easily-predicted return at (B.7 / A.5), the easily-predicted return at (A.7), the possible switch-out at (A.7). Total save/restore cost: subfunction spilling at (A.1-7), standard spilling at (A.3-5), standard spilling at (B.3-5), only stack allocations. But it does have some additional set-up overhead, and it expects a lot from the indirect call at (3); in particular, to get this to work well in practice, we really have to change all entrypoints for (1) so that they're callable in exactly the same way, i.e. taking a context argument, a "next callback" pointer, and an array of extra argument data for the types, indices, whatever.

II. Spilling

"Standard spilling" means ordinary compiler behavior to save values across a call; it includes putting values in callee-save registers. The compiler has a lot of intelligence about this.

"Subfunction spilling" is like standard spilling except the values must be accessible to a sub-function. The compiler has some very limited intelligence for this; I'm confident that that intelligence can grow over time, but it's not much right now. This precludes using callee-save registers unless, like you suggest, we promise to preserve them when making the callback; and note that "preserve" here really means "expect our caller and our callee to have a private contract about these registers that we shouldn't disturb" — in particular, our caller might set the register, our callee might expect to see that value *and change it*, and our caller might expect to see that changed value. This is completely implementable, but it's not how the backend is really engineered right now.

"Transfer spelling" means saving values across a shift in execution contexts, given only a buffer. The compiler is not well-engineered for this; we'd probably always be lowering to sub-functions early on, and doing so will block a lot of value optimizations in LLVM.

III. Implementation

I think SIL should clearly work with structured, unified Outer and Inner functions here, freely able to move operations between them without having to worry about what goes where. That means that we have to do some sort of lowering post-SIL (or at least in some low-level late lowering, rather than in SILGen). I expect that that alone will greatly improve our current implementation.

In order to do this late lowering, we'll need some structural restrictions on SIL that should be in line with many of the other restrictions we're imposing with Semantic SIL. Given that, I don't think that it's particularly harder to extract a continuation vs. extracting an inner sub-function.

IV. Summary

So to sum up:

Our current continuation-style lowering is good when:
  - there's a lot of data flow from 0 to 4 or from 4 to 8,
  - there's complex control flow from 4 to 8, or
  - the inner function has no code in 6, which is common for stored/addressed accesses.

A callback-style lowering is good when:
  - there's a lot of data flow from 2 to 6 or
  - a zero-allocation implementation is required.

A chained callback-style lowering is good when:
  - callback-style lowerings are good and
  - simply nested accesses are common enough to pay for the slight additional overhead.

Honestly, I have a hard time seeing the callback trade-offs as being the right ones in general. Removing the malloc is required for true systems programming, but when that guarantee isn't needed, I think we can provide a large enough buffer to make allocation basically a non-factor.

John.



(Joe Groff) #4


- OTOH, this does break up basic blocks in the caller context, making LLVM analysis less effective. Since the materializeForSet calls were already either inlined or opaque, though, we're probably fine there.

True. John mentioned that LLVM has grown some support for subfunction extraction in support of C++1z coroutines; we might be able to leverage that so that the IR still looks materializeForSet-ish even though we lower things differently.

- Does this help us with the nested dictionary CoW problem? `foo["bar"]["baz"] += 1`

I think the dictionary problem can be addressed in either implementation model, if we open up the language to express generalized "mutators" instead of just special-cased addressors. With strict enforcement of `inout`, the dictionary implementation would have the worst-case freedom to move an indexed value into a temporary without copying it, and move it back at the end of the access, if we're unable to otherwise represent it entirely as an in-place operation.

-Joe



(Slava Pestov) #5

- Does this help us with the nested dictionary CoW problem? `foo["bar"]["baz"] += 1`

My understanding is that this problem arises because we don’t have ‘optional addressors’. A dictionary lookup might return nil. If addressors had a way to return an optional wrapping a pointer, and optional evaluation supported this, we could do in-place updates of this kind. For Apple people: I have a radar that I think describes this problem (rdar://17509619).
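
A minimal sketch of that idea in the C pseudocode style of this thread (Dictionary, Key, and V are hypothetical names):

// Optional addressor: returns the address of the stored value, or NULL
// when the key is absent, so the caller can fall back to the get/set path.
V *addressOfValueForKey(Dictionary *self, Key key);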

Slava



(Joe Groff) #6

OK, if you're not convinced that chained callbacks are a win, then it's kind of a wash. I agree that we'd want to represent the caller and callee as independent coroutines in SIL either way, so the tradeoffs come down to how we lower to readily-available LLVM abstractions. Either implementation approach's tradeoffs could be mitigated with enough implementation effort: We could avoid mallocing in the continuation-style implementation by manipulating the stack pointer across calls to save space for extra state, and on the other hand, for the callback-style lowering, we could increase the 0-to-4-to-8 bandwidth by saving more registers when switching into the callee's context. One possibly small advantage I think the callback style still holds is the ability for (3) to be a tail call when the callee has no code in 6 (the stored/addressed case), which saves a test and branch the caller needs to perform if we represent that as a nullable continuation return in the continuation-style model.

-Joe



(Joe Groff) #7

I think our long-term goal here is to let the property definition fully control the `inout` access with a coroutine-like definition, something like:

var foo: T {
  get { return /*value for read-only access*/ }
  set { _foo = /*write-only access*/ }
  mutate {
    var tmp = prepareForMutation()
    yield &tmp // continue nested inout access
    reconcileMutation(tmp)
  }
}

This should let the dictionary implementation do whatever it needs to present a moved Optional value to the chained access. If Dictionary can't store Optionals directly inline in all cases, we ought to still be able to rely on enforced inout uniqueness to get away with moving in and out of a temporary.
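
Translated into the C mutator pseudocode from earlier in the thread, a sketch of that Dictionary mutate accessor might look like this (Dictionary, Key, OptionalValue, and the move helpers are all hypothetical):

void *mutateDictionaryEntry(Dictionary *self, Key key, void *context,
                            void *(*innerMutation)(OptionalValue *v, void *context)) {
  // Move the (possibly absent) value out of storage without copying it...
  OptionalValue tmp = moveValueOut(self, key);
  // ...yield it to the nested inout access...
  void *result = innerMutation(&tmp, context);
  // ...and move the result back in, relying on enforced inout uniqueness
  // to guarantee nobody observed the dictionary in the meantime.
  moveValueBackIn(self, key, tmp);
  return result;
}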

-Joe



(John McCall) #8

It's tricky, because if Dictionary is going to support non-movable values, then we also want the non-mutating accessor to be able to return a Borrowed<Optional<V>>, which it can't do if that representation has to be a pointer to an Optional<V>. It can return an Optional<Borrowed<V>> just fine, of course.

This relates to the earlier conversation we had about borrowed tuples. We may have to make the representation of a Borrowed<T> potentially different from just a T*, and that really stinks.

John.



(John McCall) #9


It's tricky, because if Dictionary is going to support non-movable values,

Sorry, I obviously meant move-only (i.e. non-copyable) values here. I keep making this mistake.

John.



(Joe Groff) #10

Being able to model an access as Optional<borrowed T> or Optional<inout T> might fit better with the move-only types model for dictionaries, and would be useful for other things like optional protocol members and properties found on AnyObject. Dictionaries do have a "delete" operation that can be shoehorned into the `inout Optional<T>` model, giving `foo["bar"] = nil` a meaning, but there are other mutation interfaces for which that isn't true. (Even for dictionaries, I've always felt it was a bit of a hack.)

-Joe
