SE-0336: Distributed actor isolation

ktoso · December 14, 2021, 1:12pm

Thank you very much for reading through and the questions, Chris!

Let's go one by one:

We're trying to slice the problem into a few pieces and it feels like we have enough to discuss here already without mixing the initializers into this proposal as is.

You're right that this proposal does not touch initializers; this is on purpose as SE-0327: On Actors and Initialization has not been set in stone yet, so we're going step by step here, and the method and other isolation pieces of this proposal are separate enough so we can discuss them already.

That said, we have thought through and designed initializers as well, and this is going to be explained in depth in the second pitch/proposal associated with distributed actors. You can get a sneak peek at it already in this draft PR on swift-evolution: [WIP] Distributed Actors Runtime Proposal #1498. I was just polishing this up and will post a pitch thread so people can give it a look over the break if they want to. It won't be up for review until we're done with initializers and this isolation proposal, so that would be in 2022.

Needless to say, I collaborated with @kavon on the initializer's proposal (SE-0327: On Actors and Initialization), so we're in sync about the semantics (and the WIP 2nd proposal discusses these in depth). On that note, if we want to discuss initializers we can do a separate thread for them right now.

In a way there always is some local distributed actor "somewhere", and other remote distributed actor references pointing towards it.

I don't think "distributable" really helps the understanding here. The mental model truly should be "okay, this is distributed, I have no idea if it is local or not", and that's the mindset one has to be in when developing with those types. The moment we know whether one is local or remote is the "breaking through location transparency" part, and is a very rare thing.

I'll preface this discussion of the distributed keyword by saying that every time someone new reviews this work this comes up, and after a few weeks (or months) of entertaining the idea we come back to square one: the keyword is both beneficial and necessary.

As this is the official review, I'll definitely re-explain the necessity and tradeoffs of the additional keyword again, although this is what the Alternatives Considered: Implicitly distributed methods / "opt-out of distribution" really did a deep dive into, so I hoped we won't have to re-re-visit this again...

The protocol approach is probably useful to avoid having to implement the DistributedActor abstraction mechanism to constrain things in the generic system. See below on the distributed declmodifier.

actor Player : DistributableActor {

It is going to have its own proposal indeed (the aforementioned [WIP] Distributed Actors Runtime Proposal #1498 that I just finished preparing and will pitch maybe even tomorrow).

There is no need to deep-dive into how the actor system works because all of that is "runtime" concerns, while this proposal is only about the isolation and type-checking aspects of this work. This proposal does mention the type because of its implications to initializers (the necessity to accept it as argument in initializers) and because it is where the SerializationRequirement comes from.

For this review, let's keep it at that; all of the detailed semantics of why and how the system is involved in making remote calls, assigning IDs and more is going to be discussed in the runtime proposal. So the next proposal is now available for you to read, but I don't think it has any impact on this proposal's review other than "there is a DistributedActorSystem and it has a SerializationRequirement associated type".

Hah yeah, so this is indeed where we started off from quite some time ago: DistributedActor was just Codable, and had an implementation provided via extension and that was it. An ID was always required to be Sendable, Hashable, and Codable so that'd be that.

As we discussed this feature with potential adopters one use-case that came up, and would improve upon existing solutions people work with today, was the ability to help developers in a type-safe way with understanding which "services" can be freely shared between processes and which not.

For example, in XPC there are "endpoints" which may be freely shared, and "connections" which might not. But due to the existence of xpc_connection_create_from_endpoint this distinction isn't as clear and it can become messy.

ANONYMOUS CONNECTIONS
The recipient of that message will then be able to create a connection from that endpoint using xpc_connection_create_from_endpoint(). // from man xpc_connection_create

This is very much like an actor's resolve; however, actors operate on a higher level and we want to hide this from users, while at the same time helping them to only "send around" actors which were actually intended for this pattern. In other words, we would want to give actors either an EndpointID or an ID that isn't Codable, and then naturally developers would be restricted from "accidentally" passing around one that was never intended to be passed around.

So this implicit Codable conformance is to support this pattern we were asked for, hope this explains a bit of the background here.

It is a little unfortunate that we cannot just™ write it as

// (not possible)
extension DistributedActor: Codable where ID: Codable { ... }

but that'd be a huge type-system feature I'm told... so we ended up with synthesis for this case, as Codable being specialized isn't unheard of.
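To illustrate the effect being aimed for, here is a minimal sketch using a concrete generic type, where conditional conformance is expressible in today's Swift. ActorRef, EndpointID, ConnectionID, and share are hypothetical names made up for this illustration; they are not part of the proposal or of the Distributed module.

```swift
import Foundation

// Hypothetical ID kinds, loosely mirroring XPC "endpoints" vs "connections".
struct EndpointID: Hashable, Sendable, Codable { let name: String }
struct ConnectionID: Hashable, Sendable { let descriptor: Int32 } // deliberately NOT Codable

// Hypothetical stand-in for "a reference to an actor identified by ID".
struct ActorRef<ID: Hashable & Sendable>: Sendable { let id: ID }

// Conditional conformance is allowed for a concrete generic type
// (unlike for the DistributedActor protocol itself).
extension ActorRef: Codable where ID: Codable {}

// Hypothetical "send it to another process" entry point: only accepts Codable values.
func share<Value: Codable>(_ value: Value) throws -> Data {
    try JSONEncoder().encode(value)
}

let shareable = ActorRef(id: EndpointID(name: "greeter"))
let payload = try share(shareable)
let decoded = try JSONDecoder().decode(ActorRef<EndpointID>.self, from: payload)

let confined = ActorRef(id: ConnectionID(descriptor: 3))
// try share(confined) // does not compile: ActorRef<ConnectionID> is not Codable
```

The point of the sketch is only that the choice of ID type decides, at compile time, whether a reference can be passed around at all.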

Note also that under present Swift no end user is actually able to implement init(from decoder) for a distributed actor at all (!), because of the immutable self in such initializers. I believe I called that out in the proposal and it's something we'd like to lift in the future, so that people could actually implement this themselves if they wanted to...

I really hoped the discussion of this in Alternatives Considered: Implicitly distributed methods / "opt-out of distribution" explained this well enough. I (really really) know that at first it seems similar enough and why-dont-we-just-infer-it but it truly breaks down terribly the deeper one looks at not marking distributed funcs.

Before we dive in here... I have revisited this "can we do it without distributed func...?" question multiple times over the last years; the last time was just before writing this proposal, having spent almost two weeks thinking of all kinds of ways it maybe could work, and I truly don't think it works out. It always seems fun at first, but ends up breaking down soon enough.

Let's discuss it more though, since it is the official review thread after all:

The problem with the 3rd option from the alternatives considered section–in which the checks are made "as similar as possible as Sendable checking" in that they're emitted lazily and depend on call-sites–is that it means that all functions must emit metadata to be invoked, so every time we compile we need to emit all thunks associated with remote calls, for all functions (and get-only computed properties) of any distributed actor.

For some of these it will not be possible to derive the thunk implementations, because e.g. the parameters don't conform to the SerializationRequirement... so we'd fail the synthesis for some of them. So far okay... this is just* emitting a lot of thunks and metadata for functions which perhaps were never intended to be distributed...

* "just too much metadata" – this is actually a deal breaker for some use-cases we're interested in already IMHO, but let's keep digging.

There are two primary issues to focus on:

- We need to pessimistically emit all metadata and thunks for all functions (public, internal, and private (!)) of a distributed actor, because they MIGHT be called remotely and we have no idea if they will be or not.
  - There is no way to optimize out "not used" distributed thunks, because the entire purpose of those is to be cross-process, so we lose our ability to optimize away anything "not used".
- Worse user experience for call errors:
  - Callers, by conforming some other type to Codable, would suddenly think "hey, that remote call should work, but it didn't resolve the function, why is that?" Well, the remote peer perhaps did not have the param X as Codable, so we never emitted the metadata for it, so we'd never even attempt decoding, leading to a bad user experience in such rollout scenarios.

And last but not least, the prime concern I have with this "inversion of annotation necessity": distribution just isn't the same as sendability checks. Sendability checks are performed within the same program. But distribution, i.e. emitting the "distributed method accessor thunks", means that any such function is remotely callable and COULD be subject to exploitation. We truly do not want to make a system where accidentally making things publicly and remotely invocable is the norm; this would be a terrible design from a security and API boundaries standpoint. Access control does not help here at all: if an "internal" func is distributed, it really is "as if public", because one can just pretend to be a remote call from the same module.

Consider, under the "implicit distribution" semantics, an actor that has a method that computes some secure key for a transaction. We'd write:

distributed actor A { }
// ... 
extension A {
  // not intended to be called remotely:
  func getAuthKey() -> AuthKey { ... } // AuthKey is Codable
  // under "implicit distributed" rules, we have to emit distributed thunks
  // for this func, and it becomes effectively public and remotely callable.
  // 
  // It is very hard to notice we just added this as such distributed func
  // since we're even in an extension, and maybe even in another file...!
}

It would be a terrifying world in which I just exposed this API that I thought was internal and local to the entire world to try to poke at and exploit. Of course, there are many other layers of protecting a call, like mutual TLS, or other Kernel level capability mechanisms to prevent access to the connection/endpoint, but still -- we truly should not make a design where the norm is making mistakes and accidentally opening up holes in our applications. We are working with teams focused on closing down such holes, and they have been rather welcoming to the distributed actor efforts so far, and I'd hate to come back to them saying we're adding more areas for accidentally slipping up and making things remotely callable that should not have been.
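For contrast, here is roughly how the same shape looks with the explicit keyword. This is a sketch that assumes the Distributed module and LocalTestingDistributedActorSystem as they later shipped (Swift 5.7 and newer), which may differ in details from what is under review here; ping and the contents of AuthKey are made up for illustration.

```swift
import Distributed

struct AuthKey: Codable, Sendable { let raw: String }

distributed actor A {
    // Assumption: the local-testing system that ships with the Distributed module.
    typealias ActorSystem = LocalTestingDistributedActorSystem

    // Explicitly opted in: may be invoked on a potentially remote reference.
    distributed func ping() -> String { "pong" }
}

extension A {
    // Not marked `distributed`: usable from inside the actor only,
    // even though AuthKey happens to be Codable.
    func getAuthKey() -> AuthKey { AuthKey(raw: "secret") }
}

let system = LocalTestingDistributedActorSystem()
let a = A(actorSystem: system)

// Calls across the (potential) distribution boundary are async and throwing.
let reply = try await a.ping()

// let key = try await a.getAuthKey() // rejected by the compiler: not a distributed func
```

With the keyword, adding getAuthKey in an extension, or in another file, does not change what is reachable from outside the process.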

Sidenote, on why even private methods may need to be callable remotely: the short version is that a nonisolated private func ...() async throws could potentially call them. So we'd have to emit the thunks even for private functions, and I'm sure people would not expect that, and it'd become an attack vector.

There is the issue of auditability too, but that is the least troublesome of them all, though to me personally a compelling one as well.

Oh, absolutely. We do this already; local calls incur no transport/serialization overhead.

Again, as this is a runtime concern it is not discussed in this proposal but instead will be covered by the runtime proposal/pitch that I linked above. If you want to read ahead this is covered in the runtime proposal's (NOT THIS PROPOSAL) Invoking Distributed Methods on Remote Instances section. When it is unknown at compile time if the instance is remote or not, we invoke a thunk which checks, and if the instance is local invokes the local func directly. There are no additional suspension points or any other serialization overheads in this case. If it was remote after all then we invoke the remote call infrastructure.
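The shape of that check can be sketched in plain Swift. None of the types below are the real runtime ones: Transport, RecordingTransport, LocalGreeter, Location, and GreeterRef are hypothetical stand-ins, and the actual mechanism is defined in the runtime proposal, not here.

```swift
// Hypothetical stand-in for whatever performs remote calls.
protocol Transport: Sendable {
    func call(target: String, method: String, arguments: [String]) async throws -> String
}

// Records how often the remote path was taken.
actor RecordingTransport: Transport {
    private(set) var remoteCalls = 0

    func call(target: String, method: String, arguments: [String]) async throws -> String {
        remoteCalls += 1
        return "remote reply from \(target) to \(method)"
    }
}

// Hypothetical local implementation of the method.
struct LocalGreeter: Sendable {
    func hello(name: String) -> String { "Hello, \(name)!" }
}

// A reference either points at a local instance or at a remote identity.
enum Location: Sendable {
    case local(LocalGreeter)
    case remote(id: String, transport: any Transport)
}

struct GreeterRef: Sendable {
    let location: Location

    // Plays the role of the thunk: check the location, then dispatch.
    func hello(name: String) async throws -> String {
        switch location {
        case .local(let greeter):
            // Local: call the function directly, nothing is serialized.
            return greeter.hello(name: name)
        case .remote(let id, let transport):
            // Remote: hand the call over to the remote call infrastructure.
            return try await transport.call(target: id, method: "hello(name:)", arguments: [name])
        }
    }
}

let transport = RecordingTransport()
let localRef = GreeterRef(location: .local(LocalGreeter()))
let remoteRef = GreeterRef(location: .remote(id: "greeter-1", transport: transport))

let localReply = try await localRef.hello(name: "Chris")
let callsAfterLocal = await transport.remoteCalls
let remoteReply = try await remoteRef.hello(name: "Chris")
let callsAfterRemote = await transport.remoteCalls
```

In this sketch the local branch contains no await and never touches the transport, which is the property described above.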

Initializers are always local. There is no such thing as "initialize this actor over there". As such, there is no "transfer this state", there is always just messaging (distributed methods).

Sidenote: we had attempted this at some point in Akka and it was called "remote deployment" and it was a terrible mess and bad idea. Though there mostly because of the semantics associated with waiting for initialization, as well as versioning associated with this.

This is the same as how other runtimes deal with this. It is important for a distributed actor to be able to hold state that cannot be serialized and sent around. They are like exposed endpoints that manage such state contained to a node, after all (e.g. connections, file handles etc).

If you wanted to initialize a worker on a remote node, you do it through another actor that serves as a factory:

distributed actor GreeterMaker { 
  distributed func makeMeAGreeter(something: Something) -> Greeter {
    Greeter(something: something) // init is "local"
  }
}

let remote = try GreeterMaker.resolve(...)
let greeterOnRemote: Greeter = try await remote.makeMeAGreeter(something: something)

try await greeterOnRemote.hello()

"Remote" initializers would open all kinds of cans of worms that we don't want to deal with as well, most notably: what would be the lifetime of such a "remotely created" actor? Since Swift relies on reference counting, and the init is returning "the only" (at first) reference to an object... what would that even mean for a remotely "initialized" actor? We'd either have to build in a way to manage lifecycles associated with them into the initialization somehow, or make other promises about the lifetime -- neither of which are things I want the language to get into.

With plain old distributed (factory) functions it is simple: the init is always local, and it is up to the function to either store or otherwise manage the lifetime of the distributed actor it returned and is about to return to a remote peer.
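A minimal sketch of a factory that takes on that lifetime responsibility by storing what it creates. It assumes the Distributed module and LocalTestingDistributedActorSystem as they later shipped (Swift 5.7 and newer); the made array, madeCount, and the greeting parameter are made up for illustration, and a locally created maker stands in for a resolved remote one.

```swift
import Distributed

distributed actor Greeter {
    typealias ActorSystem = LocalTestingDistributedActorSystem

    private let greeting: String

    // The initializer is always local; it only needs the actor system.
    init(greeting: String, actorSystem: ActorSystem) {
        self.actorSystem = actorSystem
        self.greeting = greeting
    }

    distributed func hello() -> String { greeting }
}

distributed actor GreeterMaker {
    typealias ActorSystem = LocalTestingDistributedActorSystem

    // Keeps created greeters alive for as long as the factory lives.
    private var made: [Greeter] = []

    distributed func makeMeAGreeter(greeting: String) -> Greeter {
        let greeter = Greeter(greeting: greeting, actorSystem: actorSystem)
        made.append(greeter) // the factory decides the lifetime, not the caller
        return greeter
    }

    distributed func madeCount() -> Int { made.count }
}

let system = LocalTestingDistributedActorSystem()
let maker = GreeterMaker(actorSystem: system) // stand-in for a resolved remote factory
let greeter = try await maker.makeMeAGreeter(greeting: "hello")
let reply = try await greeter.hello()
let count = try await maker.madeCount()
```

Whether the factory stores, caches, or hands ownership elsewhere is ordinary application logic here, which is the point: nothing about lifetimes has to be built into initialization.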

I remain convinced that the "remote deployment" path is not something we should pursue and, most importantly, it is not necessary for any specific use-case, since everything it achieves we can do without it, and cleaner. Introducing it would add a lot of complexity to already complex initializers, and I'd be very worried about supporting them in various actor systems based on past experience, and on Swift's unique problems with how actor lifecycles work (tied to refcounting, and no, we should not implement distributed ref-counting).

Let's take the last two together:

Yeah, this could be fair to tackle separately perhaps. I thought it was important to outline this capability as it is fairly important for some use-cases, but as we now have real runtime support for invoking distributed methods, I think we don't need it for just implementing the cluster and other libraries.

Sidenote: We actually had to provide this during our initial port of the distributed actors cluster library because without it the source generation based implementation would have been blocked somewhat. But now that we're working on the distributed method invocation support in the language, I think we could survive without this for the time being.

I'd love to write the "right" whenLocal, but to do that we'd need the local in the type-system. We discussed it a little bit with @Douglas_Gregor but should probably revisit how hard and how far out doing the "right" thing here would be. It is true that it is quite similar to what happened with nonisolated and isolated parameters - that we generalized them some more.

Uff! I hope I covered all questions, even though a few of them really are asking about runtime concerns which are outside of this proposal and defined in the next.

Maybe if we want to keep digging into the runtime details, we can use the thread I just made for that side of things (and associated pitch): [Pitch] Distributed Actor Runtime? We'll see how the discussions flow I guess.

Thanks again for all the feedback and let's keep it coming; I'm sure we'll arrive at a satisfactory design in the end.