Value ownership when reading from a storage declaration

John_McCall · August 8, 2018, 12:31am

As part of preparing the Swift ABI for getting locked down in Swift 5, I am changing some aspects of the implementation strategy for storage accesses in Swift 5. A lot of this is purely implementation-level and has no effect on the language, or indeed on programmers at all except for small changes in performance (expected to be minor and generally for the better). But there are two ways that it will surface as new features that can be used to achieve better performance in particular situations. In keeping with Swift's general principle of "progressive disclosure", my expectation is that most programmers will never need to use or even know about these features; nonetheless, they (eventually) need to be pitched and undergo the normal evolution process.

Storage Abstraction

What?

All of this relates to the problem of storage abstraction, i.e. hiding the details of how a storage declaration (a var, let, or subscript) is implemented.

By the implementation of a storage declaration, I mean information like:

whether the storage is backed by memory,
the set of accessor functions defined by the storage, and
the function bodies of those accessors.

Why?

There are currently three reasons why Swift might need to abstract over how storage is implemented:

The storage might be a protocol requirement, and so Swift has no static knowledge about how it's implemented by the conforming type.
The storage might be an overridable class member, and so Swift has to assume that the base object is an instance of a subclass which has overridden the storage to use a substantially different implementation.
The storage might be defined in a binary framework which the current code maintains a stable binary interface to, and so Swift has to assume that the storage's implementation might be different in the version of the library actually present at execution time. (That is, the storage might be "resilient".)

In the future, we may add more reasons to abstract over storage implementations, e.g. by allowing arbitrary storage to be dynamic.

How?

The traditional way of abstracting over storage, familiar from many different languages, is to define a getter and a setter (if mutable). These functions can be synthesized automatically for any reasonable storage implementation, and Swift does this when the accessors are required; for example, if the storage is a simple stored property, the getter returns a copy of the current value of the property and the setter writes its argument value into the property.

However, a getter and a setter aren't very efficient if the storage declaration is actually backed by memory, which is very common. The problem is not that a function call is required: that's a relatively small amount of overhead, and besides, it's essentially unavoidable if we're going to allow the underlying implementation to be an arbitrary computed property. The problem is that forcing the storage to be accessed through this interface may create a large amount of extra work just to satisfy the interface, and the impact is particularly bad if the storage is backed by memory. For example:

Calling a getter will always copy the value, but the caller may be able to complete its work without needing a separate copy.
If the storage is of an aggregate value, calling a getter will force the entire value to be copied, but the caller may only wish to copy a small portion of it.
If the storage is of an aggregate value, calling a setter will always replace the entire value, but the caller may only wish to replace a small portion of it.
If the caller wishes to read and then modify the current value of the storage (e.g. passing it as an inout argument), it must do the modification on a copy of the current value; there is no way to modify it "in place". This is particularly bad if the value is a copy-on-write structure.
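
As a rough illustration of the last point (a sketch, not code from the original post; `Library`, `Counters`, and `titles` are hypothetical names), a computed property with only a getter and a setter turns an in-place-looking mutation into a full get, a mutation of the temporary, and a full set:

```swift
// Hypothetical counter used only to observe which accessors run.
final class Counters {
    var gets = 0
    var sets = 0
}

struct Library {
    let counters = Counters()
    private var storage: [String] = ["a", "b"]

    // Storage abstracted behind a plain getter/setter pair.
    var titles: [String] {
        get {
            counters.gets += 1
            return storage      // hands the caller its own copy of the array value
        }
        set {
            counters.sets += 1
            storage = newValue  // replaces the entire value
        }
    }
}

var lib = Library()
// Looks like an in-place append, but goes through the getter,
// mutates a temporary, and writes the whole array back through the setter.
lib.titles.append("c")
```

Because `storage` still holds a reference to the array buffer while the temporary is mutated, the append cannot reuse that buffer, which is the copy-on-write cost mentioned above.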

In an effort to address some of these issues --- particularly the last one --- previous versions of Swift have synthesized a third accessor for mutable storage declarations. This accessor is called materializeForSet, and it is essentially a hacked-in coroutine that yields a mutable pointer. When the storage is just backed by memory, its materializeForSet returns a pointer to that memory. When the storage is instead computed, materializeForSet calls the getter, writes the value into a temporary variable, and yields the address of that variable; it then calls the setter when resumed.
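
The computed-storage case can be approximated in ordinary Swift. This is only a sketch of the behaviour described above, not the real ABI: `withMaterialized` is a hypothetical helper, and the closure call stands in for the point where the coroutine yields.

```swift
// A property whose storage is computed rather than backed by memory.
struct Counter {
    var raw = 0
    var doubled: Int {
        get { return raw * 2 }
        set { raw = newValue / 2 }
    }
}

// Hypothetical helper (not a Swift API) mimicking what materializeForSet
// does for computed storage.
func withMaterialized<Base, Value>(
    of keyPath: WritableKeyPath<Base, Value>,
    in base: inout Base,
    _ body: (inout Value) -> Void
) {
    var temporary = base[keyPath: keyPath]  // call the getter
    body(&temporary)                        // "yield" the temporary's address
    base[keyPath: keyPath] = temporary      // call the setter on resume
}

var counter = Counter()
withMaterialized(of: \Counter.doubled, in: &counter) { $0 += 10 }
```

When the storage is backed by memory, no temporary is needed; the real accessor simply hands out the address of that memory.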

materializeForSet addresses a lot of the biggest problems with pure getter-setter abstraction, but it's still got some major flaws. A relatively small flaw (at least, for evolution purposes) is that it's pretty hacked-in: it's awkward to generate code for it, and it uses an odd, unsystematic ABI that introduces a fair amount of code bloat and doesn't really fit with some of the structural things we try to do in SIL. The bigger flaw is that it's just for modifications and doesn't really help with the performance issues I mentioned about getters.

For these reasons, I am changing the set of basic accessors used to access abstracted storage. The first change is to replace materializeForSet with a modify coroutine that yields a mutable reference to storage, which is really just an implementation-level improvement. The second change is to conditionally replace the getter with a read coroutine that yields a borrowed value (i.e. a value taken from storage without copying it). These changes give rise to the two new features I mentioned at the top:

Generalized Accessors

The first feature is called Generalized Accessors, and I'm not ready to fully pitch it yet because we're still figuring some things out. Suffice it to say for now that the idea is to allow Swift programmers to directly define the read and modify accessors (materializeForSet has never been implementable in Swift code). This is discussed in some detail in the ownership manifesto.
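
For a feel of what user-defined accessors of this kind look like, here is a minimal sketch using the underscored `_read` and `_modify` spellings that Swift 5-era compilers accept. These are unofficial and may change, and `Wrapper`, `AccessLog`, and `values` are hypothetical names rather than anything from the pitch.

```swift
// Hypothetical log used only to observe which accessor runs.
final class AccessLog {
    var events: [String] = []
}

struct Wrapper {
    let log = AccessLog()
    private var storage: [Int] = [1, 2, 3]

    var values: [Int] {
        // Yields a borrowed value straight from storage, without copying it.
        _read {
            log.events.append("read")
            yield storage
        }
        // Yields a mutable reference so callers can mutate storage in place.
        _modify {
            log.events.append("modify")
            yield &storage
        }
    }
}

var wrapper = Wrapper()
let total = wrapper.values.reduce(0, +)  // reads through the borrowed value
wrapper.values.append(4)                 // mutates storage in place
```

Code after the `yield` runs when the caller finishes its access, which is what lets one accessor cover both the "before" and "after" halves of an access.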

Ownership of Read Values

The second feature is to allow a storage declaration to explicitly control the ownership of a value that's been read out of the storage. For example, does an abstracted access to base.property always produce an owned value or can it produce a borrowed value? On the implementation level, this means: are reads from the abstracted storage implemented by calling a getter or by calling a read coroutine?

There are two reasons why this is important:

The first reason is that it's semantically critical for move-only types. A var of move-only type that's actually backed by memory cannot be accessed by a getter because the getter, in order to return an owned value, would need to move the value out of the backing memory, leaving it uninitialized. On the other hand, a var of move-only type that's actually implemented with a getter (necessarily creating a new value on every access, which would be strange but not unimaginable) should not generally be accessed with a read coroutine because ownership of the returned value will be irrevocably lost, which is likely contrary to the intent of such an API.
The second reason is that it matters for performance even when the storage type is copyable. A read coroutine can help avoid a copy, but otherwise it's more expensive to use than a getter because of the need to support the separate phases of a coroutine; this may be worthwhile to avoid copying an Array, but it's overkill to avoid the supposed expense of copying an Int. Also, if the storage is actually implemented with a getter, a read coroutine can't forward ownership out; if the caller really does need its own copy, it'll be forced to copy the yielded borrowed value. And it's quite common for callers to need an independent copy of the value; if they do, and the caller has to perform that copy itself, that's generally worse for code size because most declarations have more call sites than implementations. On the other hand, the caller does generally have more information than the callee (especially with generic code) and can perform the copy more cheaply.

The possible solutions I see here are:

We can include both a getter and a read coroutine in the set of accessors synthesized for abstracted storage, and then pick one or the other based on how we use the value. To me, this seems untenable because of the code-size impact.
We can have one accessor and make the decision dynamic by passing an owned-vs.-borrowed flag. I don't think this would really help code size much, if at all, and it'd cause significant problems in SIL.
We can choose one accessor or the other statically based on the declaration and allow the decision to be overridden with an attribute. For this to be used for resilience, the decision needs to be independent of what accessors are actually defined by the source code.

As reflected in the name of this section, I'm leaning towards the third option but I'm not sure what the right default rule is:

In the abstract, I think producing a borrowed value is the best default rule, and that we should have some type-based heuristic for deciding that a type is trivial enough that it should always be returned with a getter.
But in practice I'm worried about the impact of using read coroutines, especially on code size, and especially on properties that are implemented with getters.

The syntax I'm currently leaning towards for declaring the ownership of the returned value is to put it after the colon or (in a subscript) arrow, e.g.:

  var title: __owned String { get set }
  subscript (index: Int) -> __shared Element { get set }

(Note that these are the stand-in, underscored keywords currently used by the parameter-ownership annotations; that proposal also needs to move forward eventually.)
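
For comparison, this is roughly how those stand-in keywords read on parameters today. It is a sketch that assumes the toolchain still accepts the underscored `__shared` and `__owned` spellings; `totalLength` and `Archive` are hypothetical names.

```swift
// The callee only borrows `words`; it never needs its own copy.
func totalLength(of words: __shared [String]) -> Int {
    return words.reduce(0) { $0 + $1.count }
}

final class Archive {
    private(set) var entries: [[String]] = []

    // The callee takes ownership of `words`, so it can keep the value
    // without making a further copy.
    func store(_ words: __owned [String]) {
        entries.append(words)
    }
}

let archive = Archive()
archive.store(["value", "ownership"])
let length = totalLength(of: ["ab", "cde"])
```

The storage-declaration syntax above would apply the same owned-versus-borrowed distinction to the value produced by a read, rather than to an argument passed in.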

Another idea would be to make this explicit in the accessors list of a protocol requirement:

  var title: String { get set }
  subscript (index: Int) -> Element { read set }

But that idea only works for protocol requirements, and it's pretty subtle.

Much like parameter ownership, the language design aspects of this don't actually need to be resolved by Swift 5; we can always change the spelling later. I just want to move towards the right semantic model.

jrose · August 8, 2018, 1:33am

It's a hard problem, and I don't have an answer, but I want to note that I appreciate how clearly you've spelled things out here.

I agree that "borrow" (read) is the safest default, and that maybe it's worth having some notion of triviality to prefer get for some types (trivially copyable && small, maybe?). But I don't know how to apply this in practice.

Given your choices, I'm honestly leaning towards explicit accessors as the mechanism for declaring this, including for concrete subscripts and computed properties. It seems unlikely that a base class would make a bad choice between get and read that all the subclasses can't deal with. Not impossible (abstract base classes, subclasses that are using a move-only type), but possibly not common enough to be worth designing a feature around. The only tricky case is stored properties, but those almost always prefer read except when the type is simple anyway, so maybe following the default is fine as long as it's deterministic.

I happen not to like the ownership annotations in those positions because they're not really part of the type. That is, if you have a KeyPath<Foo, String>, it's not different between a property that uses read and one that uses get. So the part after the colon in the declaration shouldn't be different either. This seems closer to weak or private(set) that affects the implementation and access rules but not the type itself once you've done the load. But that's probably not the top criterion in this decision.

michelf · August 8, 2018, 1:53am

Let's say I have a protocol that declares a property as "owned" and a class that declares the same property as "shared". Then in an extension I make the class conform to the protocol.

protocol P {
    var s: __owned String { get set }
}
class C {
    var s: __shared String
}
extension C: P {}

Should that cause a compile-time error or will the missing read or get accessor be auto-generated?
In the latter case, wouldn't that "seems untenable because of the code-size impact"?
Could this happen in some situations even without an explicit attribute? (maybe by mixing associated types or generics)

John_McCall · August 8, 2018, 2:01am

We definitely wouldn't make this a compile-time error; we don't want ownership annotations to cause semantic problems outside of move-only types. And the code-size impact wouldn't be excessive because we have to generate different functions for the class and protocol anyway.

A type-based heuristic could definitely lead to this problem even without explicit attributes; suppose that one protocol declares the property as Int (a small trivial type that we should always use a getter for) and another declares it as an associated type.

hlovatt · August 8, 2018, 7:40am

I really dislike more annotations on types; it is spoiling an elegant language. You say that most people won't use them, but I don't buy this. Once the standard library uses the annotations they will pop up everywhere. They will confuse people and people will think they are necessary, because they see them in the standard library.

I therefore suggest a compiler heuristic, something like @jrose 's "trivially copyable && small", and forget about chasing the extra performance.

michelf · August 8, 2018, 11:56am

One question is how predictable the heuristic will be. Case in point: resilient structs. You don't know their size until runtime, so they're more costly to copy for code not in the same module (or resilience domain) as the type itself. Perhaps the heuristic should take that into account by favoring returning a copy in these cases (unless the struct is known to be costly to copy), whereas getters implemented on the outside (not seeing through the type) would default to a borrowed read.

But if the compiler can choose based on characteristics of the type that aren't known outside the module, it'll need to annotate the generated interface file for the module accordingly. So in the end an annotation will be needed regardless. I guess it could be allowed only inside an interface file.

I think the semantics are clearer when different accessor names are used instead of an annotation:

var first: Element { read set }
subscript (index: Int) -> Element { read set }

The get and set accessors are currently declared in protocols, computed properties, and will be part of interface files. We could allow read in these three places too. If you really need your stored property to be { read set }, then you can wrap it in a { read set } computed property. Otherwise the compiler decides for you.

jawbroken · August 9, 2018, 2:51am

This doesn't seem true to me, given that there are already a ton of annotations used in the standard library that I almost never see in other Swift code (e.g. @_frozen, @_fixed_layout, @inline(__always), @inlinable, @usableFromInline, @_specialize(…), @_semantics(…)). A few performance-focused libraries use some of these, but they haven't become viral or cargo-culted in the way that you suggest.

John_McCall · August 9, 2018, 3:48am

Internal interfaces that know that a publicly-resilient type is actually trivial can definitely be optimized to use the best convention available. The nice thing about internal interfaces is that you can always heuristically improve them later.

For public interfaces, it's different. In your example, we're compiling some library A that's using a resilient struct S and knows its implementation details. This necessarily entails that library A is exposing a resilient binary interface to some of its clients, because is-within-the-resilience-domain-of is a transitive relation. So A's public interface should always use conventions that are conservatively reasonable for any implementation of S; in other words, it should pretend it doesn't know about S's implementation when setting up its public interface. In this case, that means always using a read accessor because there might someday be an implementation of S which would really benefit from it.

I can understand why you think that {read set} is clearer, but there are three problems I see with it, even ignoring the use cases outside of protocol requirements:

The first is that I think it's the wrong "default". Remember that all existing protocols are going to say {get} or {get set} instead of using {read} or {read set}, which would now be presumptively requesting the use of a getter. So if the property is actually stored, we're necessarily accessing it less efficiently if it's not trivial enough for a getter to be better.
The second is that, while I'm not sure I accept the argument that people cargo-cult arbitrary attributes out of the standard library implementation, I would worry about people doing that to something like {read set}, especially if it appeared in interface descriptions. It raises the prominence of read by quite a lot, which worries me — it seems like it undermines progressive disclosure.
The third is that I think it says too much, as if the storage was enumerating its exact set of required accessors. {get set} currently means that the storage is both readable and modifiable; it does not mean that the accessors are exactly a getter and a setter. There's already an implicit third accessor (materializeForSet, today, but soon to be modify) and that's not going to change.
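
To make that third point concrete, here is a small sketch in today's Swift (hypothetical names throughout): two types satisfy the same `{ get set }` requirement with different implementations, and generic code mutating through the protocol neither knows nor cares which accessors back it.

```swift
protocol Titled {
    // Says only that `title` is readable and modifiable.
    var title: String { get set }
}

// Satisfies the requirement with plain stored memory.
struct StoredTitle: Titled {
    var title: String
}

// Satisfies the same requirement with a computed getter and setter.
struct ComputedTitle: Titled {
    private var words: [String]

    init(words: [String]) {
        self.words = words
    }

    var title: String {
        get { return words.joined(separator: " ") }
        set { words = newValue.split(separator: " ").map(String.init) }
    }
}

// Abstracted access: the compiler picks whichever accessor the
// conformance provides for an in-place modification.
func shout<T: Titled>(_ value: inout T) {
    value.title += "!"
}

var stored = StoredTitle(title: "Hi")
var computed = ComputedTitle(words: ["Hello", "world"])
shout(&stored)
shout(&computed)
```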

John_McCall · August 9, 2018, 4:09am

Your point about class properties is a really good one: optimistically assuming that the property isn't overridden is almost certainly the best approach. That's especially true because they're probably not really touching the getter.

But... for protocols I'm not sure that being explicit about accessors gives good results, as mentioned right above, and for resilience I think it just doesn't work.

Torust · August 9, 2018, 5:19am

One part that's unclear to me is why a read coroutine has a greater impact on code size than a getter, and by how much; is it possible to test e.g. performance and the resulting code size increase in the standard library by making everything non-trivial use read coroutines?

Given that this is specifically for resilient libraries, and resilient libraries are in general loaded dynamically between many applications (particularly the standard library), it feels like erring on the side of performance is more appropriate (although I admit to being ignorant of the full tradeoffs here).

While it doesn't address the second point, it's possible to work around this by saying that get simply means whatever the compiler infers to be the most efficient way to get the value (i.e. not strictly a getter) and having the keywords be e.g. copy and read. I think it's probably best if these more-specific variants use keywords not currently used by the language.

As for property declarations on types: computed properties can already expose get and set methods which these keywords would replace. I'd be in favour of extending that to also apply to stored properties so that you could do:

class SomeClass {

    // One possibility, as per michelf's suggestion.
    var storedProperty : [Int] = [1, 2, 3, 4] { copy modify }

    // Alternatively, as per jrose's suggestion.
    get(copy) set(modify) var storedProperty : [Int] = [1, 2, 3, 4]

}

I personally don't know which is better: I like the consistency of the first approach and the readability and non-intrusiveness of the second.

John_McCall · August 9, 2018, 6:58am

I definitely intend to get that sort of code-size measurement. The basic trade-off is that the coroutine just needs a little more set-up, basically on the order of taking two or three extra arguments; and in exchange it can avoid copying something, which for almost every type (except the tiny trivial ones like Int and Float) is a code-size win for the read coroutine but not necessarily for its caller, if the caller needs a copy; and for the most part, moving code from the callee to the caller is worse for code size because most functions have more callers than implementations, unless they're not used at all.

The idea of changing the keywords completely is interesting, as is that last alternative proposal. I think adding a trailing {copy modify} clause to stored properties after the initializer expression would be a major parsing ambiguity — although, come to think of it, we might have had to deal with this for some other feature (property behaviors?).

jrose · August 9, 2018, 5:03pm

We already have it for observing accessors, so doing something similar with new accessor keywords might not be terrible.
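
For reference, the existing form looks like this: a stored property with an initializer expression followed by an accessor-like block. `Settings` and `changeCount` are hypothetical names used for illustration.

```swift
struct Settings {
    private(set) var changeCount = 0

    // A braced block after the initializer expression is already legal
    // when it contains observing accessors.
    var values: [Int] = [1, 2, 3] {
        didSet { changeCount += 1 }
    }
}

var settings = Settings()
settings.values.append(4)
```

A trailing `{ copy modify }` clause would sit in the same position, so the parser already has to tell an accessor block apart from a trailing closure there.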

hlovatt · August 10, 2018, 12:21am

This need for additional annotation is coming about because of the static linking and static optimisation not interacting well with an OS upgrade. An alternative, that would eliminate the annotations, is to re-link and re-optimise after an OS upgrade.

Judging by the number of threads wanting more and more annotations, this option of re-linking and re-optimising after an OS upgrade would appear to have broad applicability. Is it time for a thread to discuss this?

John_McCall · August 10, 2018, 7:55am

That is quite the proposal. What you're really suggesting is that Swift abandon binary machine-code distribution and instead depend on recompiling all of the Swift code in a program whenever any of the dependencies changes. The compiler wouldn't have to start from source code — it could start from some stable abstract representation of that code, like a Java class file, or hypothetically like SIL — but there would be no stable machine-code interfaces and the model would rely on late compilation to make interoperation work. So let's unpack that.

It's certainly not ridiculous to consider changing Swift's code-distribution model. The main implementation is focused on a specific binary-compatibility model, but there's nothing inherent about that in the language. Different code-distribution models have advantages and disadvantages that make them more or less well-suited for different kinds of programs and environments. Trade-offs that are good for quick-running processes can be bad for long-running processes. Trade-offs that are good for monolithic processes can be bad for running many small processes in parallel. I personally think it's inevitable that we'll explore using different distribution models for different kinds of programs.

However, I want to be very clear about this: a late-compilation model that requires rebuilding code on every OS upgrade is not acceptable to Apple for app distribution. Even if it were acceptable in the abstract, we are committed to using a binary-interoperation model in Swift 5 and therefore in the stable ABI for apps on Apple platforms. Thus, Swift will always need to support this code-distribution model in the language, and there is no point in having a discussion about changing the model in order to define away the problems it makes.

We are trying very hard to ensure that these annotations are deep in the progressive-disclosure sequence. The goal is that programmers should not need to care about them unless they have very specific performance requirements, and even then, usually only if they're also making a stable binary interface. If you feel that a design is failing to meet those goals, that is very important feedback. We know there are places where that's true — for example, @inlinable can be important for cross-library performance even for source libraries — and we consider them to be serious defects that need to be fixed, precisely because they violate these goals of progressive disclosure.

It is an unfortunate consequence of Swift 5's focus on reaching a stable ABI that we have to think about a lot of these problems now. This is an abnormal release in that way; I don't think there's going to be an ever-increasing flood of these annotations.

gregtitus · August 11, 2018, 5:39pm

I think you are right that providing both copy and read everywhere is overkill, and making { get } do whatever seems best statically is the way to go.

But in order to help tune callee vs caller code size could you add a little bit of your possible solution #1 into the mix and allow declarations to ask for both { read copy }? (Presumably for larger values that nevertheless the developer expects a lot of callers to copy.) This could generate a getter thunk that calls the read and makes the copy and makes that thunk part of the binary interface for callers, thus putting the code size increase back in the callee.

John_McCall · August 11, 2018, 6:14pm

I'd tossed around the idea of doing that implicitly in the ABI, but I hadn't considered allowing users to opt in to it. That's an interesting idea.

I may just need to punt on the spelling of this in the short term.

michelf · August 11, 2018, 7:46pm

Taken to its logical conclusion, this could be allowed: { read copy modify replace }.

hlovatt · August 12, 2018, 3:17am

Hopefully you are right and I am wrong.

As an aside: I was thinking of LLVM, rather than SIL, as the intermediate language, so that other languages that compile to LLVM could play along.

regexident · November 13, 2018, 5:40pm

Another argument against …

… apart from computer parsing issues is the problem of human parsing:

It's not uncommon to have code that looks like this:

var storedProperty : [Int] = [
    0, 1, 7, 2,
    5, 8, 16, 3,
    19, 6, 14, 9,
    9, 17, 17, 4,
    12, 20, 20, 7,
    7, 15, 15, 10,
    23, 10, 111, 18,
    18, 18, 106, 5,
    26, 13, 13, 21,
    21, 21, 34, 8,
    109, 8, 29, 16,
    16, 16, 104, 11,
    24, 24,
] { copy modify }

dhoepfl · November 18, 2018, 5:34pm

Regarding the syntax, I'd like to mention SE-0030 (Property Behaviors) here. I think if Swift had property behaviors (or a general annotation concept), this would be a good way to express this kind of information.

Basics

Based on SE-0030,

var property: String

would be a shortcut for:

var [get=(copy), set=(replace)] property: String

With some more shortcuts to support current behavior, all of the following would describe the same:

var property: String
var property: String { get set }

var [get, set] property: String
var [get=(copy), set] property: String
var [get, set=(replace)] property: String
var [get=(copy), set=(replace)] property: String

Now one could write:

var [get=(read, copy), set=(modify)] property: String

Read-only

If get is present but set is not, the variable cannot be set (but might be computed or otherwise volatile).

let property: String

would be an alias for

var [const, get=(copy)] property: String

You could also have

let [get=(read, copy)] property: String         // same as ...
var [const, get=(read, copy)] property: String

(Or the other way round, I do not know if the current let supports read)

Custom implementations

Custom implementations of get/set/didSet/willSet would be implemented as they are now, after the type, "subclassing" the property. If you think about the custom implementations as "subclassing the property", copy and replace internally just autocreate methods called get and set (called like that for compatibility) which get overridden by custom get/set. The new access types read and modify work alike.

Once there are method-keypaths, you could use these, too. (The example uses one for didSet, which is currently not possible if you have a custom set, but that could change one day too; it's just to show that the syntax is future-proof.)

var [didSet=\Self.refresh] property : String {
    var propertyBackingStore: String = ""

    // Having get/set implemented changes the declaration to
    // "var[get=(copy), set=(replace), …]"
    get { return propertyBackingStore; }
    set { /* implicit:
             newValue: String,
             willSet: (String) -> Void,
             didSet: (String) -> Void in
             // didSet is a wrapper that adds KeyPath!
          */
          if propertyBackingStore != newValue {
             willSet(newValue)
             defer { didSet(propertyBackingStore) }

             propertyBackingStore = newValue
          }
    }
    // Here you could implement `modify`, too, this would make it
    // "var[get=(copy), set=(replace, modify), …]"
}

// ...

// Note that the keyPath is not given by the caller above.
func refresh(_ property: KeyPath<Self, String>, oldValue: String) {
    // ...
}

Naming

Maybe a better naming would be getters=(…) and setters=(…) instead of get=(…) and set=(…) but on the other hand, there is func instead of function, too.