[Pitch] Distributed Actors

I'm strongly advocating to fixing this in general, because it already is causing issues with plain local actors, somewhat undermining their usefulness as synchronization tool in some situations: Task: Is order of task execution deterministic? - #2 by ktoso

One possible solution would be to allow send actor.message() which would be the "don't wait for reply; but DO wait until the message was enqueued (in the local sense, or 'sent' in the distributed sense)". This would also solve the issue of non-deterministic message order that is highlighted in the above forums thread.

It's not yet clear if the team would be willing to accept this language extension though; we don't have specific plans to share about this as of now.

If we had such uni directional calls, I'm not sure but we still may consider them throwing, just because the local part of a message send could still throw I suppose (i.e. serialization failure).

// edit 1: typo, plan -> plain

4 Likes

Thanks John! Going through the comments then:

Thanks, I'll clean that up as we go through implementing this function. It has some minor mistakes elsewhere as well.

I'm leaning towards having it a free function because it makes it harder to discover and not just "randomly" use it everywhere since it ends up in auto-completion -- that would be bad.

You're right that failable inits make this a bit harder, and these semantics need more docs. I have not implemented all the edge cases here. The exact semantics of when which function is called will indeed be dependent on throws and probably where exactly the throw happened -- i.e. before all properties were initialized or after etc.

  • Note to self to follow up on exact lifecycle call semantics in face of failures.

The missing bit here is perhaps more lifecycle monitoring things, I have this worked out but it'd be a library thing -- I'll add a note to myself to follow up on this, but in general I don't think it's a big issue.

Practically speaking when you need a "keep this alive because others will refer to it" you'd register it with some "registry" that would strongly hold it, and upon deinit you'd remove yourself from it. Though you're right to call out perhaps more discussion of this will be needed as we mature the project and discover appropriate patterns for it. :thinking:

Yeah the use-case for it is pretty weak but it would allow making sure that some protocol is implemented "as an actor", say "Worker & AnyActor". It doesn't really gain us much though... the entire protocol is still just Worker after all.

But it does feel pretty "right" to allow expressing this need in any case... We'll see if we end up with this marker protocol or not :thinking:

:+1:


Thanks again for the comments, I added some followups about this for when I flesh out the proposal's next revision, thanks!

1 Like

Thanks Konrad, all makes sense; agree the best way to figure out the refcount-type things will be to build some software with this.

1 Like

First of all: Thank you for this pitch. I think this will be a cool feature.

Just in case anyone is interested: I was inspired by this pitch to create a framework that does something very similar for client-server communication: SwiftyBridges. You basically write a type that handles the logic on the server and the framework creates a matching type for the client plus all necessary communication code. This means you don't need to worry about building requests and responses or creating an API client.

First, I wanted to implement this framework as a transport for the feature proposed here. However, I decided against it. Maybe my reasons are valuable to someone:

  • The server API type very probably relies on libraries not available on the client. Therefore, the API type would not compile on the client-side. This would require the future direction 'Resolving DistributedActor bound protocols' to be implemented.
  • The current version of the feature relies on code generation. If the framework needs to generate code anyway, then generating the whole type is not much more work than generating the dynamic method replacements. This means that the proposed feature would have meant almost no benefit to me as a transport implementor.
  • I do not think that an actor is a good fit for the server API type:
    • If the actor was responsible for multiple clients, then it would only be able to respond to one request at a time. It would also pose this question: How does the actor receive the identity of the client or the user? Would every method need a parameter clientID or userID? That would mean that the client code would need to add this parameter to every method call.
    • If one actor was just responsible for one client, then keeping all actors in memory would be infeasible on a server with many users.
      Therefore, I opted to model server API types as structs and for every request, a new instance is created.
  • Modeling the client API type as an actor would mean that only one method could be called at a time. This is an unnecessary restriction.

To make it clear: These reasons do not mean that the proposed feature has no value. It just means that, in my opinion, it does not fit this application, at least not in its current form.

Hi Konrad,

Many thanks for the exhaustive reply, I finally got time around to re-read and follow up.

this proposal won't address these questions at all because it is designed to completely leave it all up to specific transports. There is no wire protocol in this language proposal, and as such, it cannot answer questions about the wire protocol.

That's fair enough, my main concern is perhaps not so much the wire format (and we may end up using e.g. SBE and/or protobuf there depending on use cases as you suggest), it is more about the higher level questions - like how to perform discovery of possibly concurrent different versions of distributed actors (so we can pick the right one) - it's a bit unclear how the resolve mechanisms would allow for doing this at a first glance (but quite possibly I just miss it).

As the distributed protocol definitions fundamentally are defined in Swift, rather than an external schema (which I think is super nice!), there is a bootstrap question to consider on how to perform the resolving (should there be a standard protocol for this?) - if there are two concurrent running implementations of a distributed actors, I (may) need to be able to choose between them. I guess we can implement such a protocol ourselves and just never change it, but perhaps it would be of common utility?

• The more we move "into the language", the more the language has to "lock in" features and schemes. This is not necessarily a good thing.
Agreed.

◦ Long term, to do a really really great job at this, we'd need more powerful meta-programming to annotate "favored" versions, or even keep same functions with "old version" annotated etc... We don't have this expressivity today, and we should take this one-step-at-a-time:
Same as the Structured Concurrency and Actors proposals... this work is not going to be "finished" with this proposal alone. There will be many more steps polishing, and adding missing features as we mature this work over the coming months and years even.

For sure, it's another (like concurrency) too big chunk to bite off in one go.

I expect we should do some "transport implementation guidelines" and I think I should get to that eventually. I think it is fair to expect this should happen before the proposal is seriously to be considered for acceptance

I think this might be the key point, that I'm simply not seeing the full end-to-end picture without some more guidance...

Looking forward a lot to see this evolve - as exciting as the concurrency work has been really - super for Swift on Server.

2 Likes

Also note that promise to have stable identity of live proxy objects is not the same as to promise of stable ObjectIdentifier’s produced out of them.

The first one is nice to have and probably could be implemented without unbound memory growth using weak references. Simple implementation might use a dictionary with weak references as values, and prune dead keys from time to time. In this case you are promising clients that if they have a live instance, it’s identity is stable. And if clients don’t have a live instance, they should not care if it’s stable or not.

The second one is not needed, IMO. Even with non-actor classes, you can observe memory address being recycled for object of the same or different class. It is already not safe to let ObjectIdentifier escape, unless you are taking additional measures to ensure that instance stays alive.

There is also identity of weak references (address of the side table). There are some applications where this is needed, but currently it does not really work in Swift anyway (see Hashing Weak Variables). But if it would work, stability of weak references cannot be implemented exactly for the reasons you mentioned.

For most purposes, it is best to ignore ObjectIdentifier in favor of the Hashable implementation when determining whether something is functionally "the same” anyway. By the same token, determining whether something refers to “the same thing” should be done using Identifiable (whose documentation explicitly encourages providing a domain-specific implementation rather than using the default for classes).

Distributed actors seems to implicitly have the necessary requirements to fault tolerance (since they are running in a distributed mode). Would it be technically possible to run full fault tolerant distributed actors within the same process in production? (As in if the code fatalErrors, it only takes out the one actor where it fatal errored. )

fatalError(_:file:line:) simply returns Never. Since you can’t make a value of an uninhabited type, that means control flow never leaves that scope.

If you want a non-fatal error, don’t use that.

So yeah in general (in-process) crash handling is closely associated with some actor implementations out there. This is really a property of the "local" actors. What would potentially enable this is actually distributed actors stronger isolation requirements (banned stored property access).

Long story short: we are very interested in this topic, but do not currently have any work happening towards this. Improving the crash handling story indeed could be one of the future work with actors and distributed actors, but right now it is not part of the pitch or plan of record.

We can implement failure handling of entire nodes right now with distributed actors of course -- so if you were to implement "a bunch of actors in this other process" and the entire process dies because of a fatal error, we can indeed notice and notify about the crash of all the distributed actors located in that process. We could provide some helpers for that, but that would be a linux/server solution. (Basically like a multi-process transport). In a way this is similar to what XPC can do -- bunch of actors in other process, if process dies we can easily notice that.

3 Likes

Was there any change on whether computed properties will be permitted, by the way?

Yeah, I agreed on that in principle up-thread – we'll evaluate that.

In the next iteration of the proposal (probably a few weeks out) I'd like to explore getting rid of source generation, so that adds quite a bit of complexity and functionality to the language to support that, it'd also impact if and how we can support computed properties.

3 Likes

Hey, any thought to having "streaming" funcs built in, like GRPC? Its nice to keep an HTTP2 connection open, and actually the semantics of some interactions like being setup that way.. In GRPC we have client, server, and bidi streaming.

Oh, and in our case we use GRPC for both internal and external to Brighten RPC, and there is an additional advantage to us to use GRPC for external to customers API (so they can bind to our services with any language) - so when this goes lives we would probably still keep that, but if Swift distributed RPC is about as performant we will toss the GRPC internal use in the name of simplicity ...

Ideally gRPC could be implemented as a distributed actor transport.

agree, that would be great! Although as noted earlier, in addition to the semantics of streaming, and keeping connections open, we would also need the storage on the wire be the protobufs stuff most people use in that case, instead of Codable. Hence if we ripped out our internal GRPC stuff, we would still prefer streaming functionality, even if we didn't get full GRPC compatibility. Personally I think it would be "cool" if protobuf support was moved into Swift, and you at least had an option via some kind of protocol markup to use protobuf storage..

I've thought about this a bit more, and I no longer think this is a good idea. The distributed tag indicates what the surface area of the distributed actor that cannot be local, and I think it's key to Resolving DistributedActor bound protocols.

I've also been thinking about source generation or, rather, how to remove it.

ActorTransport-based remote calls

We can extend the ActorTransport protocol with the ability to perform a remote call, such that each distributed function can be compiled into a remote call through the transport, eliminating the need for code generation on the sender side. For the receiver side, each distributed function will be recorded in compiled metadata that can be queried to match a remote call to its type information and the actual function to call.

ActorTransport can be extended with a remote call with a signature like the following:

protocol ActorTransport {
  // ... existing requirements ...

  /// Perform a remote call to the function.
  ///
  /// - parameters:
  ///   - function: the identifier of the function, assigned by the implementation and used as a lookup key for the
  ///     runtime lookup of distributed functions.
  ///   - actor: identity for the actor that we're calling.
  ///   - arguments: the set of arguments provided to the call.
  ///   - resultType: the type of the result
  ///
  /// - returns: the result of the remote call
  func remoteCall<Result: Codable>(to function: String, on actor: DistributedActor, arguments: [Codable], returning resultType: Result.Type) async throws -> Result
}

When a distributed function is compiled, the compiler creates a distributed "thunk" that will either call the actual function (if the actor is local) or form a remote call invocation (if the actor is remote). For example, given:

distributed actor Greeter {
  distributed func greet(name: String) -> String { ... }
}

the compiler will produce a distributed thunk that looks like this:

extension Greeter {
  nonisolated func greet_$thunk(name: String) async throws -> String {
    if isRemoteActor(self) {
      // Remote actor, perform a remote call
      return try await actorTransport.remoteCall(to: "unique string for 'greet'", on: self, arguments: [name], returning: String.self)
    } else {
      // Local actor, just perform the call locally
      return await greet(name: name)
    }
  }
}

The transport will have to implement remoteCall(to:on:arguments:returning:), and is responsible for forming the message envelope containing the function identifier and arguments and sending it to the remote host, then suspending until a reply is received or some failure has been encountered.

Responding to remote calls

When a message envelope for a remote call is received, it will need to be matched to a local operation that can be called. The function identifier is the key to accessing this information, which will be provided by the distributed actor runtime through the following API:

struct DistributedFunction {
  /// The type of the distributed actor on which this function can be invoked.
  let actorType: DistributedActor.Type
  
  /// The types of the parameters to the function.
  let parameterTypes: [Codable.Type]
  
  /// The result type of the function.
  let resultType: Codable.Type
  
  /// The error type of the function, which will currently be either `Error` or `Never`.
  let errorType: Error.Type
  
  /// The actual (local) function that can be invoked given the distributed actor (whose type is `actorType`), arguments
  /// (whose types correspond to those in `parameterTypes`) and producing either a result (of type `resultType`) or throwing
  /// an error (of type `errorType`).
  let function: (DistributedActor, [Codable]) async throws -> Codable
  
  /// Try to resolve the given function name into a distributed function, or produce `nil` if the name cannot be resolved.
  init?(resolving function: String)
}

The function property of DistributedFunction is most interesting, because it is a type-erased value that allows calling the local function without any specific knowledge of the types that the function types. The other properties (actorType, parameterTypes, etc.) describe the types that function expects to work with. Together, they provide enough information to decode a message.

The initializer will perform a lookup into runtime metadata that maps from the string to these operations. The compiler will be responsible for emitting suitable metadata, which includes a type-erased "thunk" to call the local function. For the greet operation in the prior section, that would look something like this:

func Greeter_greet_$erased(actor: DistributedActor, arguments: [Codable]) async throws -> Codable
  let concreteActor = actor as! Greeter   // known to be local
  let concreteName = arguments[0] as! String
  return await concreteActor.greet(name: concreteName)
}

The design here is not unlike what we do with protocol conformances: the compiler emits information about the conformances declared within your module into a special section in the binary, and the runtime looks into that section when it needs to determine whether a given type conforms to a protocol dynamically (e.g., to evaluate value as? SomeProtocol). The approach allows, for example, a module different from the one that defined a distributed actor to provide a distributed function on that distributed actor, and we'll still be able to find it.

Doug

14 Likes

I've thought more about this, too, and I believe I understand the shape of the solution. The basic idea is for the ActorTransport to gain knowledge of the serialization constraints that it places on the parameter and result types of distributed functions using that transport. This could default to Codable (which is shorthand for Encodable & Decodable), but for example a protobuf implementation could instead specify SwiftProtobuf.Message. For example, ActorTransport would gain an associated type like the following:

protocol ActorTransport {
  associatedtype SerializedRequirementType = Codable
}

A protobuf transport could specify SwiftProtobuf.Message here, such that any distributed actors using the protobuf transport serialize data through its message protocol. Now, this does require that distributed actors know how they serialize data, so at the very least we need DistributedActor to know the serialization requirement that applies to its distributed functions:

protocol DistributedActor {
  associatedtype SerializationRequirementType = Codable
  
  static func resolve<Identity, ActorTransportType>(id identity: Identity, using transport: ActorTransportType) throws -> Self?
    where Identity: ActorIdentity, ActorTransportType: ActorTransport, 
          ActorTransportType.SerializationRequirementType == SerializationRequirementType
}

In truth, a distributed actor is likely to want to know more about its transport than that, so we might just want to jump to having the transport be part of the DistributedActor protocol:

protocol DistributedActor {
  associatedtype ActorTransportType: ActorTransport
  
  static func resolve<Identity>(id identity: Identity, using transport: ActorTransportType) throws -> Self?
    where Identity: ActorIdentity
}

I'll leave that for a separate discussion, and assume the first version of DistributedActor. Given that, let's reconsider our Greeter distributed actor from before:

distributed actor Greeter {
  distributed func greet(name: String) -> String { ... }
}

The SerializationRequirement of Greeter will default to Codable. Therefore, the parameter and result types for the greet operation must satisfy the requirements of Codable. If instead we had a protobuf-based greeter, like the following:

distributed actor ProtobufGreeter {
  typealias SerializationRequirement = SwiftProtobuf.Message
  
  distributed func greet(name: String) -> String { ... }
}

Now, the parameter and result types for the greet operation (both String) would have to conform to the SwiftProtobuf.Message protocol, and all distributed operations on ProtobufGreeter would use protobuf messages. This affects how the code is type-checked and the shape of the distributed thunks that can perform the actual remote procedure calls (whether using source generation or my approach to eliminating source generation), but not the fundamental way that distributed actors behave.

This approach does introduce restrictions on distributed functions that aren't on a concrete actor type. For example, we could not define a distributed function in an extension of DistributedActor because there is no way to ensure type safety:

extension DistributedActor {
  distributed func getPoint(index: Int) -> Point { ... } // should Int and Point conform to Codable? SwiftProtobuf.Message?
}

To address this, we would have to require distributed functions on protocol extensions to specify their serialization requirement type explicitly:

extension DistributedActor where SerializationRequirementType == Codable {
  distributed func getPoint(index: Int) -> Point { ... } // Int and Point must conform to Codable
}

This entire design requires SerializationRequirementType to be dealt with fairly concretely. One cannot, for example, parameterize a distributed actor over its serialization requirement:

distributed actor SneakyGreeter<SR> {
  typealias SerializationRequirementType = SR // error: we cannot express that SR is an existential type that can be a generic requirement 

  distributed func greet(name: String) -> String { ... } // cannot guarantee that `String` conforms to `SR`
}

While one could imagine advanced type system features that would allow the above to be expressible, such a change would be far out of scope as part of distributed actors. I think we should accept that as a limitation of this particular approach, and it could be lifted in the future if the type systems eventually supports it.

This can layer on top of my approach to eliminating source generation by effectively deleting all of the Codable mentions from DistributedFunction, remoteCall, etc. Instead, use Any or no requirements at all, and left it to the type checker + transport to ensure that a distributed function cannot get called with the wrong serialization requirements.

Doug

10 Likes

Thanks for the writeups, Doug!

As we've worked through this before I very much agree with all of this and more than happy to take the proposal to the next level with this! :100: Unless others will have much feedback here, I think I'll work this into our second version of the proposal as we work through implementing some of this in the coming weeks.


For additional context a few comments:

The DistributedFunction type

Yes, that is perfect! :slight_smile:

And we'd have that function(...) crash hard if misused, in order to prevent any weird misuse. It is the transport developers responsibility to make quadruple sure things are being invoked on the right actors and with the right types.

We'll have to work out what exactly it has to contain to support a few other advanced features (like for tracing we'd want to have the "#function" name so a trace span could use that as a nice default for the name of the trace). By lucky accident, recently there was a talk by the Reuben from the Orleans team, describing their source generator (they use source gen), and it basically is exactly that shape as well (including the pretty names for debugging/introspection), so that was a nice accidental confirmation of our design -- as I've seen it only after we came up with this actually :smiley: (Though to be fair, the requirements for such systems are all the same, so yeah, similar solutions arise).

Transport or SerializationRequirement typealiases

You mentioned that it's perhaps left for another discussion, but I'll quickly say that for many reasons, this is exactly what we'd want anyway, and were aiming for eventually anyway:

protocol DistributedActor {
  associatedtype ActorTransportType: ActorTransport
}

This has a small rippling effect on some of the other synthesis (type of transport accepted by initializer), but that's pretty simple and something we wanted anyway.

In practice I think we'd default to something like CodableTransport -- by the way, perhaps time to rename this as ActorMessageTransport and then CodableMessageTransport which just refines it with the typealias SerializedRequirementType = Codable :thinking:

Overall looking great and I think it'll work well.

I am not too concerned about the limitation on extensions -- this is fine IMHO. What matters the most is that a protocol itself does not have to specify this, and we can validate there is a specific type provided either at the concrete actor or at the time of resolving. Either being static compile time checks :+1:

In any case, this all sounds great and I'll work those into the next revision of the proposal :+1:

8 Likes

At an implementation level, I was thinking I'd use the mangled name here, since it has pretty much all of the information we need and the implementation guarantees uniqueness already. We could expose that to the user in a more structured manner so they could get, e.g., the actor type name, function name (greet(string:)), module name, and so on.

Yeah, I didn't want to pre-suppose, but I like this direction. It also means we could eliminate the existential AnyActorTransport storage from distributed actors to make them smaller and more concrete. De-existentialing the interfaces to distributed actors seems like a generally good thing.

Doug

2 Likes