SE-0336: Distributed actor isolation

Sorry for the delay in replying here! Back to it with full attention now though :slightly_smiling_face:

I use those terms to mean very specific things, but you're right that the proposal didn't define them. That's my bad, and it's fair to ask for definitions (I'm happy to add explicit definitions to the proposal based on the clarifications below).

I did define these terms in more detail in the talk ([Video] Distributed Actors announced at Scale by the Bay), but watching it should not be required; the proposal should define everything.

Let's define them here, and I can add clarifications to the proposal text if necessary:

  • a distributed actor is what we interact with in the source language all the time
  • a "local" distributed actor, means that while in source we treat it as distributed, we actually have a local instance in hand;
    • we don't know if we have a local instance in hand (e.g. in a variable) unless we use the whenLocal facilities
    • this is what we talk about when we say "known to be local distributed actor" but the short form of that is just "local distributed actor"
  • a "remote" distributed actor, is the "proxy", it has no storage allocated other than for the id and actorSystem
    • we don't know if we have a remote reference in hand (e.g. in a variable) unless we use the whenLocal facilities
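
For example, here is a minimal sketch of what I mean by the whenLocal facility. The exact whenLocal signature and LocalTestingDistributedActorSystem belong to the companion runtime work, so treat those names as assumptions for illustration:

import Distributed

distributed actor Greeter {
  typealias ActorSystem = LocalTestingDistributedActorSystem // assumption: any concrete system works here

  // Not `distributed`: callable only when we *know* the instance is local.
  func localGreeting() -> String { "hello from a local instance" }
}

func describe(_ greeter: Greeter) async {
  // whenLocal runs the closure only if this reference is backed by a local
  // instance in the current process; for a remote proxy it yields nil.
  let text = await greeter.whenLocal { local in local.localGreeting() }
  print(text ?? "remote greeter, id: \(greeter.id)")
}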

I don't view the "distributed local" phrasing as weird; it is just that distributed means either of two instance kinds at runtime: local or remote. At compile time we concern ourselves mostly with the fact that an actor is distributed, so we don't know which kind we have and must assume the remote case. It would be wrong to call such actors remote though; they may or may not be.

I was referring to the issue that not all distributable actor instances are remote. If they are local and in the same address space, then you avoid all the coding overhead etc, and they are not "distributed" at all.

Sure, but statically you don't know (except via the "breaking through location transparency" mechanisms) and are not supposed to know whether you're programming with a local or a remote one.

This is a core principle of programming in such systems: I write my algorithm against "some worker"; when I run it in my tests, all the workers are actually "local", but when I run it in the target environment, all (or just some) of them are remote actors. We program "with distributed actors", and "remote/local" is a runtime property that, for what it's worth, can even change depending on the configuration of the system.

The fact that, from the perspective of one process, a specific actor is never actually remote does not make it any less distributed: distribution means that we could have shared it with other nodes/processes, and therefore they may attempt to invoke things on it, so the distribution aspect always matters. It's truly better to always think "distributed" when working with these types.

It's true, the distributed marker means a call can be made remotely. But that's exactly how we use those words: distributed means "I don't know if it is local or remote", and the word remote is reserved for "I know it is remote". This is in line with how the term is used in many other distributed actor implementations out there.

I don't want to dwell on the naming too much right now (let's move on to the more specific design questions), but one last thing to note: if we were to follow that "-able" logic, we should have named async "asyncable", because an async func may never suspend or do anything asynchronous at all.

It's the same idea: async means "may be asynchronous, so treat it as such (but you may get lucky and it returns quickly without suspending after all)", and distributed means "may be remote, so treat it as such (but you may get lucky and it was local after all)".

Let's continue to the other questions though; there's a lot here other than naming still to look through:


This again is a very meaningful distinction, and comes back to not allowing the isolation model to be broken.

If actor Worker: DistributedActor were the way to mark these, and we implemented it with normal Swift rules (so the type is now an Actor & DistributedActor), we could end up with the following isolation violation:

extension Actor {
  func f() -> SomethingSendable { ... }
}
// and then...
func g<A: Actor>(a: A) async {
  print(await a.f())
}
// and then...
actor MA: DistributedActor { // also an Actor implicitly, because of `actor` semantics
}
func h(ma: MA) async {
  await g(a: ma) // allowed, because MA is an Actor, yet it cannot actually work at runtime if `ma` is remote
}
}

// there are other examples that end up showing the same pattern

So... it really isn't an Actor. It is a DistributedActor. They share implementation details (specifically, a "local" distributed actor has the same internal runtime representation as an Actor), but for isolation-checking purposes they are not related types.

This is also why we have protocol Actor: AnyActor {} and protocol DistributedActor: AnyActor {}: it lets us keep the same top-level parent protocol for both.

We can of course argue about a keyword vs. a "special magic protocol that removes the Actor conformance" etc., but to me the latter seems rather hacky and unprecedented, whereas acknowledging that they're not the same by declaring them differently feels closer to reality. (We did look into these relationships for a long time; DistributedActor refining Actor, or the other way around, does not make sense because of the conflicting demands on isolation checking.)
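
As a simplified sketch of that shape (the real declarations live in the standard library and carry more detail), and of why it prevents the violation above:

protocol AnyActor: AnyObject, Sendable {}
protocol Actor: AnyActor {}             // "plain" local-only actors
protocol DistributedActor: AnyActor {}  // distributed actors; *not* a refinement of Actor

// With this shape, MA from the example above conforms to DistributedActor and
// AnyActor, but not to Actor, so the unsound generic call `g(a: ma)` never
// type-checks in the first place.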


I'm not sure how to address this other than by saying "it would be one mega proposal", which we found reviewers could not realistically stomach; hence the split into the "isolation" (this) and "runtime" proposals.


Totally agreed that it would be a simpler model, and in clustering that's indeed what worked well for us. But as I said: this serves a specific request we received from the XPC and security teams involved in reviewing this work. We found a few things that are not so great today, and we'd like to use this effort to improve them. Specifically, rather than merely discouraging misuses of those APIs, we prevent them: the actor types are simply not Codable, and thus cannot be sent to other distributed actors "by accident".

Agreed that adding this as an alternative considered makes sense; I'll add it to my "additions" PR in the morning and notify here. I think my writeup from the previous post could be used for that section.


Sure, that's how async thunks work, but... at the time a distributed actor is compiled we have no knowledge of the caller side at all (!).

The purpose of distributed actors is to cross process boundaries. A "service" will be compiled entirely separately from the "clients" that might end up calling it, so in this case we're crossing module boundaries, and anything public would have to become implicitly distributed (in a distributed-keyword-less design). That by itself is troublesome IMHO, because we wouldn't notice that we forgot to make an API actually callable, e.g. because we forgot to make some parameter Codable. We perhaps would not even notice it in our testing, if we kept using "known to be local" actors in our tests (today this is not possible, but the introduction of local would make it so).

That was the trivial scenario though. The more interesting, and very important, one is peer-to-peer systems, where we are not crossing module boundaries. I might declare a distributed actor with an internal func futureFeature() that I'm not using anywhere in version 1 of my app. The feature exists though, and is perhaps even fully implemented. Since the func is not used... you'd propose to optimize away emitting the thunks for it, so it cannot be invoked. Now we roll out version 2 of our app, announcing the "futureFeature", and actually even v1 processes can support it; it was just one of those "on December 24th, download an update and it unlocks the feature!" situations. The new version of the same app knows that it has the func futureFeature(), we know it is implemented and ready, and distributed isolation under this implicit model allows us to call this func... and yet... v1 will never invoke the target, because it is missing the thunks to do so, because the func was "never used" before! :scream: This is a terrible pitfall: we allowed compiling things which look like they would work, and should work, and yet they won't.
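
To make that scenario concrete, here's a sketch (the names are just the hypothetical feature from above):

// Shipped in v1 of the peer-to-peer app: implemented, but not yet called anywhere.
distributed actor FeatureHost {
  distributed func futureFeature() -> String { "unlocked!" }
}

// Because the func is explicitly `distributed`, the v1 binary must carry the
// receiving thunk for it. When a v2 peer starts calling futureFeature(), v1
// peers can still serve the invocation. In an implicit model that pruned
// thunks for "unused" funcs, the call from v2 would fail at runtime even
// though both versions contain the declaration.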

So... to avoid such nightmarish pitfalls (as if there weren't enough pain points in evolving distributed protocols already), we're forced into emitting thunks for every single func of a distributed actor. This isn't something we can work around in the implicit model. The only way out would be "developers must mark everything that is definitely not distributed as local", and approaching the annotation problem from that side is really putting the cart before the horse. No one would go around auditing their code and adding local to all functions "just to be safe", while there is a lot of incentive to mark methods distributed: if I forget to do so, what I'm working on right now won't work at all, even in my tests, so I am very much led to do the right thing here. A nice bonus is getting the codability checks at declaration sites too, but that's just the cherry on top of the right semantic model.
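
And a tiny sketch of that declaration-site check (Job is a hypothetical type here, assuming a Codable SerializationRequirement):

struct Job { var title: String } // note: not Codable

distributed actor Scheduler {
  distributed func schedule(_ job: Job) { }
  // ^ rejected right at the declaration: Job does not conform to the actor
  //   system's SerializationRequirement (e.g. Codable), so we find out long
  //   before any test or deployment exercises the call.
}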

I hope this example explains why we'd be forced into emitting far too much metadata and thunks.


The difference is quite large, because we're talking about separate processes talking to one another without any prior knowledge of each other beyond the type declarations. The example above goes through this step by step, but to summarize:

This isn't about the type checking; it is about the ability of the callee to actually receive and invoke an incoming invocation. In the implicit distributed func model, we are forced to be pessimistic about it, because everything might be invoked. That is not a desirable design, either from the perspective of expressing developer intent or from a metadata-footprint perspective.

This isn't so much about "the decl modifier solves it" as it is about "an opt-in model is safer by default, smaller in footprint, and expresses developer intent much more clearly".

Absolutely agreed on tools; we'd build those on top of the emitted metadata, and that metadata exists regardless of which model (implicit / "opt-out" or explicit / "opt-in") is adopted. I will say though, it is much easier to audit a set of explicitly marked distributed methods than "well, basically every func of any distributed actor".


I hope the definitions at the beginning of this reply help a bit, but we can dive deeper into this, because I sense there is still a misunderstanding here:

There are no SerializationRequirement checks applied to initializer arguments; only the usual sendability checks proposed by the currently-under-review Actor Initializers: Sendability proposal apply.

To illustrate this with an example:

struct DatabaseConnection: Sendable {...} // NOT Codable
distributed actor Worker {
  let db: DatabaseConnection
  init(db: DatabaseConnection, system: ActorSystem) {
    self.db = db
    self.actorSystem = system // store the system this instance belongs to
  }
}

We absolutely want to be able to have distributed workers which accept and store non-Codable arguments in their initializers. An initializer is always "local", in the sense that the process that executes Worker(...) is where that instance resides. There is no "initialize a worker on another node" performed by initializers.
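
In other words, something like this is perfectly fine (a sketch reusing the types above; ActorSystem stands in for whatever concrete system type Worker uses):

func makeLocalWorker(db: DatabaseConnection, system: ActorSystem) -> Worker {
  // Construction always happens in the *current* process; `db` is never
  // serialized and nothing crosses a process boundary here.
  Worker(db: db, system: system)
}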

We could, though, if we wanted to, have an actor that accepts such "give me a Worker" requests (from remote peers) and returns a worker, like this:

distributed actor Service { 
  let db: DatabaseConnection
  // ...
  distributed func getWorker(id: ...) -> Worker { 
    self.cachedWorkers[id] ?? Worker(db: db, system: self.actorSystem)
    // simplified, we'd store the new one etc...
  }
}

// === meanwhile on a different process -----
actor Logic { // anything really
  let service: Service
  func calculate(...) async throws {
    // oh no, heavy calculation, let's get a remote worker
    let worker = try await service.getWorker(id: ...)
    try await worker.work(...) // worker is remote in our example
  }
}

So, we're able to "create" a worker on the remote side-on demand, but with collaboration of a Service that creates them however it sees fit. And most importantly, the workers have non-serializable internal state -- we never ship the state to the Logic on the client; the client only got a remote reference to the worker after all.


Thank you very much for diving into all these topics! It is excellent to bounce those ideas and I'm sure we'll get to a great design we're all comfortable with :slight_smile:
