[Pitch] Distributed Actors

Saklad5 · October 6, 2021, 7:44pm

Ideally gRPC could be implemented as a distributed actor transport.

johnburkey · October 6, 2021, 7:49pm

agree, that would be great! Although as noted earlier, in addition to the semantics of streaming, and keeping connections open, we would also need the storage on the wire be the protobufs stuff most people use in that case, instead of Codable. Hence if we ripped out our internal GRPC stuff, we would still prefer streaming functionality, even if we didn't get full GRPC compatibility. Personally I think it would be "cool" if protobuf support was moved into Swift, and you at least had an option via some kind of protocol markup to use protobuf storage..

Douglas_Gregor · October 14, 2021, 12:36am

I've thought about this a bit more, and I no longer think this is a good idea. The distributed tag indicates what the surface area of the distributed actor that cannot be local, and I think it's key to Resolving DistributedActor bound protocols.

I've also been thinking about source generation or, rather, how to remove it.

`ActorTransport-based` remote calls

We can extend the ActorTransport protocol with the ability to perform a remote call, such that each distributed function can be compiled into a remote call through the transport, eliminating the need for code generation on the sender side. For the receiver side, each distributed function will be recorded in compiled metadata that can be queried to match a remote call to its type information and the actual function to call.

ActorTransport can be extended with a remote call with a signature like the following:

protocol ActorTransport {
  // ... existing requirements ...

  /// Perform a remote call to the function.
  ///
  /// - parameters:
  ///   - function: the identifier of the function, assigned by the implementation and used as a lookup key for the
  ///     runtime lookup of distributed functions.
  ///   - actor: identity for the actor that we're calling.
  ///   - arguments: the set of arguments provided to the call.
  ///   - resultType: the type of the result
  ///
  /// - returns: the result of the remote call
  func remoteCall<Result: Codable>(to function: String, on actor: DistributedActor, arguments: [Codable], returning resultType: Result.Type) async throws -> Result
}

When a distributed function is compiled, the compiler creates a distributed "thunk" that will either call the actual function (if the actor is local) or form a remote call invocation (if the actor is remote). For example, given:

distributed actor Greeter {
  distributed func greet(name: String) -> String { ... }
}

the compiler will produce a distributed thunk that looks like this:

extension Greeter {
  nonisolated func greet_$thunk(name: String) async throws -> String {
    if isRemoteActor(self) {
      // Remote actor, perform a remote call
      return try await actorTransport.remoteCall(to: "unique string for 'greet'", on: self, arguments: [name], returning: String.self)
    } else {
      // Local actor, just perform the call locally
      return await greet(name: name)
    }
  }
}

The transport will have to implement remoteCall(to:on:arguments:returning:), and is responsible for forming the message envelope containing the function identifier and arguments and sending it to the remote host, then suspending until a reply is received or some failure has been encountered.

Responding to remote calls

When a message envelope for a remote call is received, it will need to be matched to a local operation that can be called. The function identifier is the key to accessing this information, which will be provided by the distributed actor runtime through the following API:

struct DistributedFunction {
  /// The type of the distributed actor on which this function can be invoked.
  let actorType: DistributedActor.Type
  
  /// The types of the parameters to the function.
  let parameterTypes: [Codable.Type]
  
  /// The result type of the function.
  let resultType: Codable.Type
  
  /// The error type of the function, which will currently be either `Error` or `Never`.
  let errorType: Error.Type
  
  /// The actual (local) function that can be invoked given the distributed actor (whose type is `actorType`), arguments
  /// (whose types correspond to those in `parameterTypes`) and producing either a result (of type `resultType`) or throwing
  /// an error (of type `errorType`).
  let function: (DistributedActor, [Codable]) async throws -> Codable
  
  /// Try to resolve the given function name into a distributed function, or produce `nil` if the name cannot be resolved.
  init?(resolving function: String)
}

The function property of DistributedFunction is most interesting, because it is a type-erased value that allows calling the local function without any specific knowledge of the types that the function types. The other properties (actorType, parameterTypes, etc.) describe the types that function expects to work with. Together, they provide enough information to decode a message.

The initializer will perform a lookup into runtime metadata that maps from the string to these operations. The compiler will be responsible for emitting suitable metadata, which includes a type-erased "thunk" to call the local function. For the greet operation in the prior section, that would look something like this:

func Greeter_greet_$erased(actor: DistributedActor, arguments: [Codable]) async throws -> Codable
  let concreteActor = actor as! Greeter   // known to be local
  let concreteName = arguments[0] as! String
  return await concreteActor.greet(name: concreteName)
}

The design here is not unlike what we do with protocol conformances: the compiler emits information about the conformances declared within your module into a special section in the binary, and the runtime looks into that section when it needs to determine whether a given type conforms to a protocol dynamically (e.g., to evaluate value as? SomeProtocol). The approach allows, for example, a module different from the one that defined a distributed actor to provide a distributed function on that distributed actor, and we'll still be able to find it.

Doug

Douglas_Gregor · October 14, 2021, 12:52am

I've thought more about this, too, and I believe I understand the shape of the solution. The basic idea is for the ActorTransport to gain knowledge of the serialization constraints that it places on the parameter and result types of distributed functions using that transport. This could default to Codable (which is shorthand for Encodable & Decodable), but for example a protobuf implementation could instead specify SwiftProtobuf.Message. For example, ActorTransport would gain an associated type like the following:

protocol ActorTransport {
  associatedtype SerializedRequirementType = Codable
}

A protobuf transport could specify SwiftProtobuf.Message here, such that any distributed actors using the protobuf transport serialize data through its message protocol. Now, this does require that distributed actors know how they serialize data, so at the very least we need DistributedActor to know the serialization requirement that applies to its distributed functions:

protocol DistributedActor {
  associatedtype SerializationRequirementType = Codable
  
  static func resolve<Identity, ActorTransportType>(id identity: Identity, using transport: ActorTransportType) throws -> Self?
    where Identity: ActorIdentity, ActorTransportType: ActorTransport, 
          ActorTransportType.SerializationRequirementType == SerializationRequirementType
}

In truth, a distributed actor is likely to want to know more about its transport than that, so we might just want to jump to having the transport be part of the DistributedActor protocol:

protocol DistributedActor {
  associatedtype ActorTransportType: ActorTransport
  
  static func resolve<Identity>(id identity: Identity, using transport: ActorTransportType) throws -> Self?
    where Identity: ActorIdentity
}

I'll leave that for a separate discussion, and assume the first version of DistributedActor. Given that, let's reconsider our Greeter distributed actor from before:

distributed actor Greeter {
  distributed func greet(name: String) -> String { ... }
}

The SerializationRequirement of Greeter will default to Codable. Therefore, the parameter and result types for the greet operation must satisfy the requirements of Codable. If instead we had a protobuf-based greeter, like the following:

distributed actor ProtobufGreeter {
  typealias SerializationRequirement = SwiftProtobuf.Message
  
  distributed func greet(name: String) -> String { ... }
}

Now, the parameter and result types for the greet operation (both String) would have to conform to the SwiftProtobuf.Message protocol, and all distributed operations on ProtobufGreeter would use protobuf messages. This affects how the code is type-checked and the shape of the distributed thunks that can perform the actual remote procedure calls (whether using source generation or my approach to eliminating source generation), but not the fundamental way that distributed actors behave.

This approach does introduce restrictions on distributed functions that aren't on a concrete actor type. For example, we could not define a distributed function in an extension of DistributedActor because there is no way to ensure type safety:

extension DistributedActor {
  distributed func getPoint(index: Int) -> Point { ... } // should Int and Point conform to Codable? SwiftProtobuf.Message?
}

To address this, we would have to require distributed functions on protocol extensions to specify their serialization requirement type explicitly:

extension DistributedActor where SerializationRequirementType == Codable {
  distributed func getPoint(index: Int) -> Point { ... } // Int and Point must conform to Codable
}

This entire design requires SerializationRequirementType to be dealt with fairly concretely. One cannot, for example, parameterize a distributed actor over its serialization requirement:

distributed actor SneakyGreeter<SR> {
  typealias SerializationRequirementType = SR // error: we cannot express that SR is an existential type that can be a generic requirement 

  distributed func greet(name: String) -> String { ... } // cannot guarantee that `String` conforms to `SR`
}

While one could imagine advanced type system features that would allow the above to be expressible, such a change would be far out of scope as part of distributed actors. I think we should accept that as a limitation of this particular approach, and it could be lifted in the future if the type systems eventually supports it.

This can layer on top of my approach to eliminating source generation by effectively deleting all of the Codable mentions from DistributedFunction, remoteCall, etc. Instead, use Any or no requirements at all, and left it to the type checker + transport to ensure that a distributed function cannot get called with the wrong serialization requirements.

Doug

ktoso · October 14, 2021, 9:43am

Thanks for the writeups, Doug!

As we've worked through this before I very much agree with all of this and more than happy to take the proposal to the next level with this! Unless others will have much feedback here, I think I'll work this into our second version of the proposal as we work through implementing some of this in the coming weeks.

For additional context a few comments:

The `DistributedFunction` type

Yes, that is perfect!

And we'd have that function(...) crash hard if misused, in order to prevent any weird misuse. It is the transport developers responsibility to make quadruple sure things are being invoked on the right actors and with the right types.

We'll have to work out what exactly it has to contain to support a few other advanced features (like for tracing we'd want to have the "#function" name so a trace span could use that as a nice default for the name of the trace). By lucky accident, recently there was a talk by the Reuben from the Orleans team, describing their source generator (they use source gen), and it basically is exactly that shape as well (including the pretty names for debugging/introspection), so that was a nice accidental confirmation of our design -- as I've seen it only after we came up with this actually (Though to be fair, the requirements for such systems are all the same, so yeah, similar solutions arise).

Transport or SerializationRequirement typealiases

You mentioned that it's perhaps left for another discussion, but I'll quickly say that for many reasons, this is exactly what we'd want anyway, and were aiming for eventually anyway:

protocol DistributedActor {
  associatedtype ActorTransportType: ActorTransport
}

This has a small rippling effect on some of the other synthesis (type of transport accepted by initializer), but that's pretty simple and something we wanted anyway.

In practice I think we'd default to something like CodableTransport -- by the way, perhaps time to rename this as ActorMessageTransport and then CodableMessageTransport which just refines it with the typealias SerializedRequirementType = Codable

Overall looking great and I think it'll work well.

I am not too concerned about the limitation on extensions -- this is fine IMHO. What matters the most is that a protocol itself does not have to specify this, and we can validate there is a specific type provided either at the concrete actor or at the time of resolving. Either being static compile time checks

In any case, this all sounds great and I'll work those into the next revision of the proposal

Douglas_Gregor · October 15, 2021, 4:40pm

At an implementation level, I was thinking I'd use the mangled name here, since it has pretty much all of the information we need and the implementation guarantees uniqueness already. We could expose that to the user in a more structured manner so they could get, e.g., the actor type name, function name (greet(string:)), module name, and so on.

ktoso:

Transport or SerializationRequirement typealiases

You mentioned that it's perhaps left for another discussion, but I'll quickly say that for many reasons, this is exactly what we'd want anyway, and were aiming for eventually anyway:
protocol DistributedActor {
  associatedtype ActorTransportType: ActorTransport
}
This has a small rippling effect on some of the other synthesis (type of transport accepted by initializer), but that's pretty simple and something we wanted anyway.

Yeah, I didn't want to pre-suppose, but I like this direction. It also means we could eliminate the existential AnyActorTransport storage from distributed actors to make them smaller and more concrete. De-existentialing the interfaces to distributed actors seems like a generally good thing.

Doug

Saklad5 · October 15, 2021, 5:21pm

I agree, it’s almost always best to avoid existentials when possible.

hooman · October 15, 2021, 5:22pm

ktoso:

Transport or SerializationRequirement typealiases

You mentioned that it's perhaps left for another discussion, but I'll quickly say that for many reasons, this is exactly what we'd want anyway, and were aiming for eventually anyway:
protocol DistributedActor {
  associatedtype ActorTransportType: ActorTransport
}
This has a small rippling effect on some of the other synthesis (type of transport accepted by initializer), but that's pretty simple and something we wanted anyway.

How would someone switch an existing code base to a different transport/serialization? Say we start with Codable, and then decide to switch to ProtoBuf for performance reasons? Also, is it going to be possible to have different transport and serialization for the same remote actor based on where the remote actor happens to be located? (Say a different process on the same physical machine vs a remote machine)

Saklad5 · October 15, 2021, 5:24pm

Wouldn’t you just rewrite the impacted code? I don’t see any reasonable way around that.

In theory, you could write some sort of adaptor that decodes one and encodes the other, but there’d be little benefit in doing so.

hooman · October 15, 2021, 5:32pm

I am aware of these options. The reason I am asking is further exploring the design space to see if we can do better to reduce dependency on concrete transport and serialization format at the source code level. Maybe something like some Transport and some Serializable, if you will. Something that would keep the ultimate generated code as concrete and statically optimized as possible, while accommodating for the cases that I mentioned.

Douglas_Gregor · October 20, 2021, 4:54am

If you want to switch a given distributed actor to a different transport, you would replace your

typealias ActorTransportType = <whatever transport you are using>

with a type alias to a different transport. You can centralize that definition in your code base, perhaps, or we could introduce some special defaulting rule. For example, imagine that if there were a type alias like this visible:

typealias DefaultActorTransportType = MyFavoriteActorTransportType

Then, any distributed actor that doesn't explicitly define an ActorTransportType would get the visible default. Perhaps libraries that define transports could also provide a public DefaultActorTransportType so you wouldn't have to do so yourself in most cases: import the library that has a transport, and your distributed actors use that transport by default.

Changing transports statically would mean changing the typealias(es). If you want to change the transport dynamically, we could define a type-erased transport like, say, AnyActorTransport<SerializationType>, and you can choose dynamically.

In the design I proposed for abstracting away Codable, we could not abstract away the serialization mechanism quite so easily, because we need a static type against which we can check all distributed functions. See the end of my post on this design, where we talk about needing the same-type constraints when extending DistributedActor with new distributed functions.

Doug

Tino · October 29, 2021, 9:06am

I am a little worried to see history repeating itself — especially as Distributed Objects are not even mentioned in the whole text.

Distributed Objects also started as cool and promising, but ended as a bunch of keywords which most developers don't even know.

However, despite not referencing it, this try seems to address some issues of the failed approach, so let's hope for more luck this time.

hassila · October 29, 2021, 11:02am

As someone who used DO (NeXTSTEP) as well as PDO (2.x, 3.x and 4.x on both HP-UX & Solaris...) extensively during a few years in the mid 90:s when it was relevant tech I can understand that concern, although not sharing it (yet) - the major problems with DO/PDO was robustness and performance (the "machd" emulation on Solaris did give us major pain), the actual productivity and ease-of-use was quite ok. Seeing that hooks seems to be in place for transport implementations and message serialisation (or lack thereof) is promising I think.

I definitely think there is room for a great solution for this design space which could greatly accelerate building distributed Swift on the server solutions - just hope it will get time to bake enough to ensure all the pieces of the design are there.

hassila · October 29, 2021, 11:50am

Also, one use case that I've encountered is where you have a distributed system that basically has a number of clusters running on different sites and they all together form a supercluster.

Just mentioning this use case as it may have implications for actor resolving (you might prefer locality of reference) and other things.

liuliu · October 29, 2021, 4:41pm

Because every distributed actor conforms to DistributedActor protocol, would it be sufficient to just have:

actor Greeter: DistributedActor {
}

Like any other protocols we do compiler-level code synthesis? (Codable / Equatable etc).

Paul_Cantrell · October 29, 2021, 5:22pm

This is really cool, a nicely conceived vision, and I have nothing to useful to say about the details that others haven’t said better.

I have to ask the big picture question that’s been nagging at me since I first heard rumors of this work: What’s the driving use case here? Hasn’t the industry shifted pretty hard away from stateful RPC since the “distributed objects” craze of the 80s? This is definitely one hell of an improvement over Stateful Session Beans or CORBA (shudder). But does the world need a better CORBA?

I phrase these questions rhetorically, but I do mean them seriously! What’s the motivation here? What’s the application space?

I take it this is meant for tightly connected systems (LAN clusters, XPC, and the like), and not for the kind of “loose interop over the Internet” stuff that sunk SOAP…. (Or should people actually attempt to use this, say, as a wrapper for REST APIs, as they will doubtless attempt to do?)

A clearer sense of what’s driving it and how it’s going to be used would help focus feedback.

hooman · October 29, 2021, 6:02pm

Yes, I also see it that way. It is the next logical step for a highly concurrent (single) actor system to scale beyond a single host. It is also important in separating subsystems into individual processes for security purposes. (XPC stuff).

nikitamounier · October 29, 2021, 10:50pm

I think you have the right reasoning here – the microchip industry is moving towards custom accelerators (such as the Mac Pro's afterburner card, which is now in the M1 Pro / Max) which are all asynchronous, and which are much easier to work with when modelled as distributed actors (accelerators are mentioned in @Chris_Lattner3 's Concurrency Manifesto). However, I don't see why Swift's distributed actors working for XPC-like systems means that they couldn't also work for more large-scale "Internet stuff", especially on the cloud (think microservices or game lobbies).

I definitely don't think that the two are mutually exclusive – the genius of the distributed actor design is that the transport is completely abstracted away into some ActorTransport, meaning that you can have a really nice swifty api for both XPC-like systems and large-scale worldwide distributed systems.

jrose · October 30, 2021, 6:13am

I’m coming into this late, and without much expertise in this kind of server space, especially not compared to Konrad and others’ years working with Akka. But…I’m still not convinced this needs to be a language feature.

A story from the pre-1.0 days: we were wondering whether constructing new objects should have special syntax (new UIView(…)), or just look like a normal function call. We ended up going with plain old call syntax, as everyone knows, and one of the primary reasons was that any method can allocate and return objects, whether they’re initializers, factory methods, or just plain old instance methods on another object. Initializers are special in a number of ways, but calling them isn’t one of them. That’s how I feel about distributed: as soon as you wrap the method in a helper, it just looks like a regular async method to the rest of the program. To me, that neutralizes the idea that we’re marking the potentially-more-expensive methods to call.

But like initializers, the implementation of distributed methods is different. So that isn’t a reason not to do this on its own. I just can’t shake the feeling that while from inside this is a nicely integrated way to do distributed computing, from outside it’s a restricted dialect of Swift with a lot of magic in it.

I know there’s experience from Akka about a library-only solution, but it feels like this feature could be implemented with a code generation tool rather than compiler integration. From the perspective of other code, a distributed actor is something like this:

enum FooActor<Transport: ActorTransport> {
  case local(LocalFooActor)
  case remote(ProxyFooActor<Transport>)

  func doTheThing(_ params: FooParams) async throws -> FooOutput { … }
}

Actually using a representation like this (though probably not this exactly) has a handful of benefits, particularly that the escape hatch for local access is actually typed now. You could do this generation using the compiler, but you could also do it using SourceKit—though it’s true that error handling would be worse if, say, a parameter wasn’t Codable.

I guess I just don’t think this is worth putting into the language. That’s a big step, with compatibility implications of years. With a separate library and code generation tool, we can still make breaking-change releases, and they aren’t tied to other updates to the language. And I don’t think you’d actually lose much expressive power at all that way.

Paul_Cantrell · October 30, 2021, 2:16pm

Right, clearly one can. My concern is about whether one should. Developers reached a pretty strong consensus between the mid-80s and the mid-00s that RPC in general and stateful objects in particular are a really problematic model for widely distributed systems, and industry seems to be continuing even now in that “coarse-grained state transfer via stateless request” direction (REST → GraphQL, for example, is such a step) rather than hungering for something with the shape of this proposal. One could use HTTP as a transport for distributed actors, but…if that were a design goal of this proposal, then I have serious concerns about its current design.

OTOH, your custom accelerator example makes tons of sense to me the way this proposal is shaped, even some OS services, along with previously mentioned examples of XPC and LAN clusters. So it would be nice to know from the proposal authors what particular problem spaces they’re intending to target.

[Pitch] Distributed Actors

ActorTransport-based remote calls

Responding to remote calls

The DistributedFunction type

Transport or SerializationRequirement typealiases

`ActorTransport-based` remote calls

The `DistributedFunction` type