[Pitch] Distributed Actors

I think the main problem with batching is that performing work on the other side may have side effects, or at least perform unnecessary work. For instance, if you do try await (a(), b()) and a() throws an error, b() is not supposed to run. But in await (try? a(), try? b()), b() needs to run regardless of what happens with a().

So to avoid unintended side effects or unnecessary work, batching would need to define the dependencies between tasks and have the other side honor the dependency graph.

b() would still not be run on the remote side. The calls are just being made in a regular, sequential way on the remote actor. It's merely how they're being presented to the actor to be sent over the wire that I think could be different. The only redundant work that would be done in that case would be encoding the b() call for transport, which is almost sure to be dwarfed by the cost of the extra remote call in all but pathological cases (e.g. a() always throws).

Edit: Anyway, all I'm trying to nail down here is:

  • Is this a theoretically viable alternative?
  • Does it have serious implications for distributed actor performance? (i.e. the presented design appears to never allow for this optimisation)
  • Is the performance improvement worth the added complexity to both the compiler and the actor transport definitions?

That would mean batching would not support the await (try? a(), try? b()) case. I guess we could if batches were trees instead of sequences of dependent calls.

Call tree examples

(1) This:

(try? a(), try? b())

creates a sequence of calls to perform (independent tasks):

  • a()
  • b()

(2) While this:

try (a(), b())

creates a tree of calls to perform (dependent tasks):

  • a()
    • b() // dependent on a

(3) And this:

  try? ( a(), try? b(), c() ),
  try? ( d(), e() )

creates this tree:

  • a()
    • b() // dependent on a
    • c() // dependent on a (not on b)
  • d()
    • e() // dependent on d

So you could feed the transport with this work tree and let it decide what to do. At some point it starts to feel like you're sending a closure to the other side. I have no idea if this complexity is worth it, though. Perhaps, as you suggest, only #2 (sequence of dependent calls) is worth supporting.
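
To make the work-tree idea concrete, here is a minimal sketch of how a receiving side might walk such a tree. Everything in it (CallNode, CallFailed, run) is a hypothetical name made up for illustration, not something from the pitch; it only models the rule "a node's dependents run only if the node itself succeeds", using example (3) with b() and d() failing.

```swift
// Hypothetical types for illustration only; not part of the pitch.
struct CallFailed: Error {}

struct CallNode {
    let name: String
    let dependents: [CallNode]
    let body: () throws -> Void

    init(_ name: String, dependents: [CallNode] = [], body: @escaping () throws -> Void = {}) {
        self.name = name
        self.dependents = dependents
        self.body = body
    }
}

/// Runs sibling nodes in order. A node's dependents are run only if the node
/// itself succeeds; a failure does not affect the node's siblings.
func run(_ nodes: [CallNode], trace: inout [String]) {
    for node in nodes {
        do {
            try node.body()
            trace.append(node.name)
            run(node.dependents, trace: &trace)
        } catch {
            trace.append("\(node.name) failed")
        }
    }
}

// Example (3) from above, assuming b() and d() throw.
let tree = [
    CallNode("a", dependents: [
        CallNode("b") { throw CallFailed() },
        CallNode("c"),
    ]),
    CallNode("d", dependents: [CallNode("e")]) { throw CallFailed() },
]

var trace: [String] = []
run(tree, trace: &trace)
print(trace) // ["a", "b failed", "c", "d failed"]
```

Note that c() still runs after b() fails (it depends only on a), while e() is skipped because d() failed.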

I think a more interesting model for batching would be to have something more explicit. For instance, you could use a result builder to create a "closure" that can be serialized and sent to the other side. It would also avoid inflating the transport with semantics of what to run when.
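
As a rough sketch of what "more explicit" could look like, a result builder can collect calls into a plain value that is trivially serializable. RemoteCall, BatchBuilder and batch are hypothetical names invented here, and describing a call as a method name plus string arguments is a simplifying assumption, not how the pitch represents messages.

```swift
import Foundation

// Hypothetical description of one remote call; illustration only.
struct RemoteCall: Codable, Equatable {
    let method: String
    let arguments: [String]
}

// Collects the listed calls, in order, into an array.
@resultBuilder
enum BatchBuilder {
    static func buildBlock(_ calls: RemoteCall...) -> [RemoteCall] { calls }
}

func batch(@BatchBuilder _ build: () -> [RemoteCall]) -> [RemoteCall] {
    build()
}

let calls = batch {
    RemoteCall(method: "a", arguments: [])
    RemoteCall(method: "b", arguments: ["42"])
}

// The batch is ordinary data, so it can be encoded and sent as one message.
let payload = try? JSONEncoder().encode(calls)
let decoded = payload.flatMap { try? JSONDecoder().decode([RemoteCall].self, from: $0) }
print(decoded == calls)
```

The transport then only ever sees a list of call descriptions; what to run when is decided by whoever wrote the builder block.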


There’s a bit too much speculation over performance here, to be honest… even IF such a batch were to exist, it means that an array needs to be allocated. Is that worth it? Etc etc etc.

This is a potential optimization which we could look into some day. I don’t agree that the design prevents any such future experiments. It certainly is not necessary for writing a high-performance transport, because other runtimes are able to do so without such weird tricks (and I worked on one of them).

Mind if we leave this optimization idea at rest for now, or separate it into a different thread? I don’t mean this in a dismissive way; just that I feel there are a lot of other questions to tackle first.

The feedback/idea is taken and noted, thank you for sharing it. I’ll think more about it.


Sure, I didn't understand the suggestions that it was either impossible or improper, but if it's more complexity than you want to take on for the initial implementation then no problem. It gives the opportunity to gather statistics about how many such calls could be theoretically batched once people start to use it in practice.


For just a small side note - I really like the choice of going with Codable for the initial release, for the purposes of making the interactions incredibly simple and easy for any new developer to step into. The ease of accessibility to new developers, and the fore-thought for making distributed systems easily testable, are two of my personal highlights of this pitch.

I also agree that the ability to optimize this is important, although I don't see how making Codable a requirement would reduce anyone's ability to optimize a transport layer. Maybe I'm wrong there, but I think the ease of initial accessibility is easily worth it.


That's sort of what I'm advocating here. Of course, being able to customize the transport and the message codec opens the door to all sorts of implementations, but on the other hand it makes assumptions about compatibility hard.
I'm all for the transport being open and custom, given that this could be used for IPC, TCP, UDP, HTTP, etc.

In practice this means that you would have problems interfacing with channels from other applications, given that no assumption about compatibility can be made. And this sort of defeats the purpose of distributed communication between nodes.

Of course, if one party has control of every node, let's say in a cloud computing farm, this will not be an issue. But it would become harder to adopt in scenarios where nodes are out of your control; since no compatibility assumption can be made in that case, people might not choose it rather than deal with this kind of problem.

It's not that the message codec should not be customizable, but having a couple of default implementations already there makes it more likely that the channel will be compatible when you want to use it as a client and the transport goes beyond IPC.

For the IPC transport case everything is simpler, of course, but beyond that compatibility will be an issue, and in that case people would probably stick with something like gRPC because every node speaks the same language.

Will there be anything in language to help transport and service discovery layers identify incompatible endpoints as code changes over time? I'm just wondering how rollout of non-backwards compatible distributed type changes will generally work and how much manual effort will be required when making such changes ...

Say for example a server changes the name of a distributed method and rollout to clients with the needed update will take place over some longish or uncontrolled timeline ... I guess deploying this change naively to all servers would lead to runtime failures on clients that haven't been updated?

I wonder if the language could maybe compute some kind of hash to identify all the distributed types involved in a module/transport and make that value available at runtime in such a way that a transport/service discovery layer could be implemented so as to ensure that connections are only negotiated between compatible peers?

A feature along these lines could maybe help enable fully automatic service lifecycle management processes to keep older "abi versions" of an actor-service running in parallel with newer "abi versions" until all peers have been updated or until some sunset trigger has been hit ... (*note I'm using the phrase "abi versions" in quotes to reference the concept of compatibility across the distributed type surface area -- not sure what the right term for that compatibility would actually be)


We should be focused on keeping language-level features as orthogonal as possible.

I don’t think it is possible to coherently automate name changes. Current best practice says that you should release a minor version with the deprecated old declaration and the new declaration, using Swift’s existing @available attribute to express that relationship. Then you tear out the deprecated references in the next major update.
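
For reference, the deprecate-then-remove pattern described above looks roughly like this on an ordinary type. Greeter and its methods are hypothetical names for illustration; the @available attribute with deprecated and renamed: is standard Swift.

```swift
// Hypothetical type; shown as a plain struct to keep the sketch self-contained.
struct Greeter {
    // New name, introduced in a minor release.
    func greet(name: String) -> String {
        "Hello, \(name)!"
    }

    // Old name, kept for one release cycle and forwarded to the new one.
    // Callers get a deprecation warning with a fix-it to the new name.
    @available(*, deprecated, renamed: "greet(name:)")
    func hello(name: String) -> String {
        greet(name: name)
    }
}

print(Greeter().greet(name: "Swift")) // Hello, Swift!
```

The deprecated declaration is then deleted in the next major version, once clients have had a chance to migrate.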

I do think there’s an argument to be made for allowing multiple module versions (that is, different modules with overlapping names) simultaneously. I’m not sure it’d be worth the complexity, though: to be useful, you’d have to negotiate competing protocol witnesses, dynamic module names, and numerous other major problems. It’s far beyond the scope of this discussion, at any rate.


Perhaps this discussion is out of scope and I don't want to be noisy -- I'll just reply with hidden details in case anyone might see a utility in reading my thoughts on the process of deploying code with changes to the distributed type surface area. I'm guessing this topic has already been considered by the proposal authors in more detail than this discussion anyway with their thoughts just perhaps not yet surfaced in this proposal ...


My comment wasn't a suggestion aimed at automating name changes ... Changing the name of a method was just a simple change to the 'distributed type surface area' that I was using as an example. If such a change was made and then deployed naively to all servers and deployed via a slow rollout to all clients then there would be a period of time where there would exist clients with a type incompatible model of the distributed actor which would probably result in runtime errors if the client connects and sends messages to the newer server implementations.

There are obviously a variety of approaches to handle this problem -- the fully manual approach you mentioned being one of them.

Another simple approach, as my comment pondered, would be to simply synthesize hashes for the distributed type surface area and then make these hashes available to transport/service discovery implementations. It seems like this would be easily accomplished within the compiler machinery involved in synthesizing type stubs, and if this stable identifier for the distributed type surface area was exposed to transport/service layer implementations it would be easy for the ecosystem to then arrange transparently that only peers with the same distributed type surface area ever communicate with each other.
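
A hand-rolled approximation of such a fingerprint, just to show the shape of the idea: hash a canonical (sorted) list of signature strings with a hash that is stable across processes. The signature strings and the surfaceFingerprint name are assumptions for illustration; a real version would presumably be synthesized by the compiler from the actual declarations. FNV-1a is used here only because Swift's built-in Hasher is randomly seeded per process and therefore cannot be compared between peers.

```swift
/// 64-bit FNV-1a over a byte sequence; stable across processes and machines.
func fnv1a64<S: Sequence>(_ bytes: S) -> UInt64 where S.Element == UInt8 {
    var hash: UInt64 = 0xcbf2_9ce4_8422_2325 // FNV offset basis
    for byte in bytes {
        hash ^= UInt64(byte)
        hash = hash &* 0x0000_0100_0000_01b3 // FNV prime, wrapping multiply
    }
    return hash
}

/// Hypothetical fingerprint of a distributed surface: sorted so that
/// declaration order does not matter.
func surfaceFingerprint(of signatures: [String]) -> UInt64 {
    fnv1a64(signatures.sorted().joined(separator: "\n").utf8)
}

let server = surfaceFingerprint(of: ["greet(name: String) -> String", "ping()"])
let sameClient = surfaceFingerprint(of: ["ping()", "greet(name: String) -> String"])
let renamedClient = surfaceFingerprint(of: ["hello(name: String) -> String", "ping()"])

print(server == sameClient)    // same surface, different declaration order
print(server == renamedClient) // surface differs after the rename
```

A transport or service discovery layer could exchange this value during connection setup and refuse peers whose fingerprint differs.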

Another more advanced (perhaps better?) feature could involve something like the Elm language ecosystem's automated semver enforcement for packages. In the Elm ecosystem, the compiler enforces, across the public API of a package, that any release containing type changes that could lead to a compile error in any client code must be released with a major version bump. If there are type changes which only expand the set of valid client programs, then only a minor version bump is needed. This level of feature requires some notion of code history -- which could be provided by a file synthesized by the compiler that is expected to persist and live in source control, or by some manner of integration with project source control itself. The implementation either way would make the feature significantly more complicated -- which is why I pondered just the simpler thing -- and to be honest, in the service context I'd probably personally prefer that 'additive' changes to the distributed type surface area produce peers that don't communicate, even if such communication might technically be possible at the type level. To my mind, in the single-organization context, these distributed type features enable tighter coupling across distributed peers through an overall reduction in coupling friction -- and as a result we should expect tighter implementation coupling across peers than we might be used to in other distributed systems. So from an overall risk management perspective, I think there is an argument for preferring management processes that encourage communicating peers to be as close as possible to each other in version space, and tooling that enforces that communicating peers share the exact same distributed type specification would be one way to do that.

Anyway, external tooling could certainly be used to accomplish either of these features -- but I think part of the hope with the distributed type feature is that required code generation steps will be minimized, or at least not require the integration of external tooling. So I'd think, especially for the common case where all peers deploy from the same codebase, that including support for a mechanism to enable simple ways of deploying breaking changes to the distributed type API without manual effort seems worthwhile ...

I love this.

  1. It seems I don't understand how one gets a unique actor on the remote end. Or is that all up to the transport? In a multiple worker pool/cluster, does my local instance of the distributed actor care about which actor I am picking? In a way the local distributed instance is like a proxy to the remote actor. Are the remote actors more or less like global actors, where there is only one instance of them? I guess I am stuck in the mental model of remote workers.

  2. I love that we are requiring distributed actor methods to be async throwing. I don't like that this would be implicit. It should stay explicit. Perhaps we could introduce a new distributed effect that is equivalent to async throws.

  3. I think Distributed actors should all have a default deadline.

  4. I agree with not including properties in local instances of remote actors.

Deadlines are generally only something that makes sense in the context of unreliable transports like networks. If we were to use distributed actors for inter-process rather than inter-machine communication, having deadlines/timeouts would likely not be correct.


The alternative is a method that awaits forever, no? For the inter-process case, if the process gets killed while I am awaiting on it, how would I know the process is gone?

I would expect that to produce an error of some sort, or at least that's what XPC does

Answering the question of “is a node down or just slow” is actually pretty tricky in distributed systems :wink:

Thankfully that’s why we developed: Swift.org - Introducing Swift Cluster Membership

Distributed failure detector algorithms are able to increase the confidence with which we declare a node down, and could fail calls to actors located on it.

This usually though goes hand in hand with plain old timeouts for single requests — in networked transports.

For IPC there’s different tools to use to detect such issues.

In XPC we get errors from the underlying mechanisms, and we’d want to carry deadlines not in wall-clock time but in monotonic time, etc.

All these are topics we’re aware of and working on :slight_smile: it can be summarized as “it depends on the transport, but no call should wait forever”.
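
For the plain per-request timeout case, a common pattern in today's Swift concurrency is to race the call against a sleeping task. This is only a sketch under assumptions: withDeadline and DeadlineExceeded are hypothetical names, nothing here is part of the pitch, and a real transport would pick its own clock (for example a monotonic one, as mentioned above) and its own failure signals.

```swift
// Hypothetical error type for illustration.
struct DeadlineExceeded: Error {}

/// Runs `operation`, but throws `DeadlineExceeded` if it has not finished
/// after roughly `nanoseconds`. Whichever child task finishes first wins,
/// and the other one is cancelled.
func withDeadline<T: Sendable>(
    nanoseconds: UInt64,
    _ operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self, returning: T.self) { group in
        // Child 1: the actual work.
        group.addTask { try await operation() }
        // Child 2: the timer. Task.sleep throws if cancelled, so this child
        // ends promptly when the work finishes first.
        group.addTask {
            try await Task.sleep(nanoseconds: nanoseconds)
            throw DeadlineExceeded()
        }
        defer { group.cancelAll() }
        // Two tasks were added, so next() cannot return nil here.
        return try await group.next()!
    }
}
```

One caveat of this pattern: the task group still waits for the losing child before returning, so the operation has to cooperate with cancellation for the deadline to take effect promptly.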

PS: I’m on a trip right now so sorry for not getting back to all the bigger questions -- I’ll do so next week, replying in depth to many of the topics up thread :thread:


The alternative is having actors and/or transports handle that at their own discretion, rather than adding a language-level requirement.

The transport needs to specify which types it can deal with. And ideally this would be checked at compile time. What happens if my transport allows only a fixed subset of Codable types and I try to use it for an actor that needs unsupported types? I fear that with the approach that everything must be Codable, this will only be possible to detect at runtime.

Or you can imagine a transport that allows some specific types that cannot reasonably implement Codable. For example, using XPC or Unix Domain Sockets you can pass file descriptors to other processes. If I wanted to build a transport and actor that uses that feature, I would be forced to implement Codable for my file descriptor type, even though there is no other encoder that could encode it in a reasonable way.

That, of course, doesn’t mean I’m completely against Codable. I think it would be a very good idea to have a default transport available that relies on it. But other custom transports should be able to decide on their own which types they handle.


Okay then, back to it! I'll go top-to-bottom from the last bigger topic I responded to.

Thanks again for the great review @Douglas_Gregor, I'll deep dive into the topics one by one.

Codable requirement

I'm definitely on board with dropping this "hard" requirement and making it only the "convenient default" for a bunch of transports -- let's move it into the initial proposal rather than keeping it as a future direction.

I think it's important we define such "codable message transport" protocol, and a generic mechanism so people can define other requirements, and I think it could look like this:

// previously known as `ActorTransport`
protocol MessageTransport { 
  // or maybe DistributedActorMessageTransport... we'll see

  // The type requirement that should be checked against all distributed
  // members, including all function parameters and return types of
  // functions and properties. 
  // E.g. setting this to `Codable` means that all `distributed` member 
  // types must be `Codable`. 
  // Types must also be `Sendable` but it is not 
  // necessary to specify this here, since such members are sent cross-actor
  associatedtype MessageTypeRequirement
  // ... all funcs the same as today's ActorTransport ...
}

protocol CodableMessageTransport: MessageTransport {}
extension CodableMessageTransport { 
  typealias MessageTypeRequirement = Codable
}

// allowing us to do:
protocol AsRawBytesMessageTransport: MessageTransport {
  // protocol AsRawBytes { func asRawBytes() -> [UInt8] }
  // ...  
}

resulting in the optional checking for Codable messages if the codable transport protocol is used:

distributed actor X { 
  // way 1
  typealias Transport = CodableMessageTransport // TODO: decide name for alias
  // 1) causes synthesized init to be:
  //      init(transport: CodableMessageTransport)  
  // way 2
  init(transport: CodableMessageTransport) {}
}

We would support both of those ways. Specifying at least one user-defined designated initializer automatically causes the Transport type to be inferred from that initializer's transport parameter. All initializers must use the same transport type.

extension X {
  // either cause checks to be for Codable, like today's proposal:
  distributed func hi(n: NotCodable) -> NotCodable 
  // error: (X.Transport (CodableMessageTransport)).MessageTypeRequirement
  //        requires that parameter 'n: NotCodable' conforms to 'Codable'
  // error: (X.Transport (CodableMessageTransport)).MessageTypeRequirement
  //        requires that return type 'NotCodable' conforms to 'Codable'
}

If one wanted to, one might use a specific transport there as well -- though that's discouraged since it makes testing harder.

The important bit here is that we really want to support knowing at compile time whether I'm allowed to pass certain types to a distributed actor or not. E.g. a SpecialTransport could accept only SpecialType-conforming parameters and return types, and it might then audit and implement specific types to conform to this SpecialType.

Or it might actually be a concrete type allowing transports to enforce a "closed world" (i.e. we know EXACTLY what is allowed to be passed in a message) – this is an important use-case for a few potential adopters of this technology we're working with.
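
The "closed world" flavor can be approximated with today's generics, which may help to picture it. This is a sketch under assumptions: ClosedWorldTransport, AllowedMessage and RecordingTransport are hypothetical names, and an associated type fixed to a concrete enum is a stand-in for the MessageTypeRequirement mechanism discussed above, not that mechanism itself.

```swift
// Hypothetical closed set of messages this transport is willing to carry.
enum AllowedMessage: Equatable {
    case ping
    case greet(name: String)
}

// A transport that states, in its type, exactly what it can send.
protocol ClosedWorldTransport {
    associatedtype Message
    mutating func send(_ message: Message)
}

// Toy transport that just records what it was asked to send.
struct RecordingTransport: ClosedWorldTransport {
    var sent: [AllowedMessage] = []

    mutating func send(_ message: AllowedMessage) {
        sent.append(message)
    }
}

var transport = RecordingTransport()
transport.send(.ping)
transport.send(.greet(name: "Caplin"))
// transport.send("raw string") // does not compile: String is not AllowedMessage

print(transport.sent.count) // 2
```

The point is only that the set of sendable types is known and enforced at compile time, so nothing outside it can reach the wire.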

distributed properties

Yes, that's a fair observation and I think I'm okay with lifting the prior restriction.

The ban on nonisolated stored properties is a fundamental issue and we cannot support them. I'll expand the wording in the proposal to explain why that is so, though :+1:

The restriction on whether or not they may be distributed is indeed more of a policy, and may be overstepping a bit into stopping users from doing it because "usually, it's a bad idea". But I agree this is perhaps too weak an argument, and in practice "getter funcs" are equally bad an idea, so we didn't truly solve anything with this ban.

Conclusion: agreed, let's lift this restriction, it has the following implications:

  • stored and computed read-only properties are allowed to be distributed, e.g. distributed let x = 1
    • they appear implicitly async and throws from the outside of the actor; same as distributed funcs
  • stored mutable properties (var x = 2) cannot be distributed, because we cannot have "async setters" currently
    • if we really had to, and Swift gained the ability for effectful setters (throwing and async), we could add this in the future, same as for normal actors, though I think it's a bad idea to expose setters like that, so I'd prefer not to (ever).
  • any such distributed property, would be subject to all the same return type checks as a distributed func is (if the transport wants Codable messages, the property must be Codable, etc.)

We'll have to figure out a way to identify such properties "next to" functions, but that's doable.

Distributed methods

I'm generally interested in getting distributed actors as "close to" normal actor semantics as reasonably possible, however there are a number of issues with just removing the distributed func marking like that. The last topic here is the most interesting I think, but let's go one by one.

1) Extensions and cross actor calls

Correct, however I don't think this is a problem?
They're not stuck stuck, they can define a distributed func df() { self.f() } if they really need to fix this inside module B, and cannot update module A.

2) distributed message type requirements

I don't think it's just an outcome of the current (temporary) source generation based implementation that we need those distributed markers on functions. Specifically, it was driven more by the type-checking aspects where we want to help developers write correct distributed actors on "first try" such that "when it compiles, it will work distributed" :slightly_smiling_face:

So currently this means that the "extra checks" are applied only to those distributed functions, this allows us to "error early" during the development of such actor. A developer writing a distributed func, that accepts e.g. a closure, will be immediately notified this will never work and they can fix it right away.

So that was the driving idea here. Let's explore the implications of what you are proposing though:

struct Thing: Sendable {} // NOT CODABLE

public distributed actor DA { 
  typealias Transport = CodableMessageTransport // see above

  public let getMe: Thing
  public func hi(thing: Thing) {}
}

func ext(da: DA) async throws { 
  _ = try await da.getMe // should error: Thing is not Codable
  try await da.hi(thing: Thing()) // should error
  // reason: Thing is not Codable, DA.Transport requires Codable
  // result: hi(thing:) can never be called cross-actor
}

This is a bit odd, but let's discuss the tradeoffs:

We lost:

  • errors at implementation-time of the distributed actor
  • clear denotation which API is the "intended to be called cross (distributed) actor"

We traded it for:

  • use-site errors, so one would only notice when trying to use such function or property cross actor
    • It feels a little bit "accidental", but realistically if a function never was invoked cross-actor... wasn't there even a test for it etc...? So perhaps this tradeoff is sneakily smart?
    • This is interesting since technically it'd even support a generic transport :thinking: Though in practice that sounds too annoying to use for real; I'd expect most implementations to stick with e.g. the CodableMessageTransport

And that's without getting into how we'd even implement this (it gets complicated, but I think we could make it happen with enough mangling, special synthesis, and extension points).

This trade-off is... definitely interesting. As far as the type checking goes I think it's workable -- it'll be hard to implement I feel but perhaps we can make it happen.

Resolving non-distributed actor constrained protocols

IF we went with this... how would a resolve work, though? It feels pretty weird/dangerous to allow a resolve of any protocol.

Currently the resolve is defined on distributed actors, so a distributed actor constrained protocol has it and thus we'd be able to call it - try Worker.resolve(..., using: ...), given a protocol Worker: DistributedActor.

If we went with this "Worker, happens to be implemented by an actor somewhere" we'd need to allow:

protocol Worker { 
  func work() async throws // OK! Can be implemented by dist actor
}

let worker: Worker = try transport.resolve(knownID, as: Worker.self)

This trivially works, since the work function is async and throwing...

Problem case 1) but what if resolve is passed a protocol that cannot be implemented using a distributed actor, such as:

protocol CantBeDist { 
  func work() // not async, not throwing... can't be a dist actor
}

try transport.resolve(knownID, as: CantBeDist.self) // should throw

We really should make this throw, because the implementation is "impossible" - we can't do the thunks and "make" the protocol suddenly be implicitly async and throwing.

The implementation of a resolve is never going to be possible in "plain Swift" anyway, so perhaps we make it powerful enough that it inspects the type metadata it is passed, checks whether it is able to resolve the type, and throws if not?

That would be reasonable I guess... resolves are not frequent, and almost never called explicitly by end-users (usually frameworks do it internally).

Following this train of thought, I would really want to also allow, but not require, defining a DistributedActor constrained protocol, like this:

protocol Greeter: DistributedActor { 
  // implicitly has address/transport nonisolated properties, as if 'dist actor'
  func work() // implicitly throws, async; as if defined in 'dist actor'
}

I feel this is useful/practical in a system where I specifically know that those are actors, and I want to use their identity for things (storing them, watching them, etc), it's fairly important in cluster settings where users are intended to implement such "greeters" and the framework does things with them.

With such protocols, if we take the message type requirement story from the beginning of the post, we could write:

protocol CodableGreeter: DistributedActor {
  typealias Transport = CodableMessageTransport
  func hello(checked: Checked) 
  // 'Checked' is required to be 'Transport.MessageTypeRequirement' (Codable)
}

This would bring back the notion of "when I'm defining my protocol, I want to know for sure it's usable with my transport" at declaration time of my interface.

I think this is quite nice:

  • when just working with explicit distributed actor declarations
    • we have the least friction and checks are similar to isolation checks, during the "wrong use" at a call-site
  • when declaring protocols, and we specifically opt-into knowing this should only ever be used with a specific distributed actor transport (family, e.g. the codable ones), we get the checks early, because declaring anything that would violate these requirements cannot be implemented, so we can error early
    • this helps use-cases where we want to be "API first" which some use-cases have been asking for
    • users of such protocol also exactly know what they can expect from it and all the requirements are clear.
    • any mis-use at use-site would result in the same errors as-if they worked with a specific distributed actor (great, this is what we want!)

Thank you very much for the ideas here, it is very promising, though we'll have to figure out quite a bit of implementation tricks to make it happen I feel - seems doable though.

Uff, this ended up pretty long and I didn't yet get to identities...
I'll post this for now and write about the default protocol conformances next (they're crucial).



Haha yeah -- we definitely need it, but it should feel "yuck!" to use it :wink: I like what we arrived at, and should thank @kavon here for bouncing a number of ideas around on it recently.

It also works well with the "distributed func"-less model you are proposing; it does the same thing -- does not apply any of the checks. And since the checks have moved to call-sites mostly, rather than declarations, this composes very nicely:

// future direction, if/when we drop the `distributed func`
distributed actor Worker { 
  typealias Transport = CodableMessageTransport
  func take(closure: @Sendable () -> String) {}
}

func test(w: Worker) async throws { 
  try await w.take { ... } // under the new rules should fail, param not Codable 
  try await w.whenLocal { ww in try await ww.take { ... } } // OK!
}

So this composes very nicely even with the new rules if we'll get them.

Thanks for the Sendable note, that's an oversight I think!

Equatable and Hashable conformances

This one is non obvious yet super crucial! I'm glad you're asking, since it seems the rationale in the proposal isn't clear enough, I'll add to it what I'm about to explain below.

Specifically it's driven by issues wrt. "ensuring uniqueness", like you mention:

On the other hand, ensuring uniqueness for remote actor instances means you probably have a big old lock in the actor runtime around actor resolution.

however in a very different way. If we said that every resolve returns the exact same proxy instance (exact same ObjectIdentifier for given resolved remote address), it effectively causes us to retain and grow memory of looked up actors forever (!), this is not acceptable in long-running server systems which are resolving an unbounded number of actors (they can be short-lived after all!) during such node's lifetime.

Problem scenario, showcasing why object identity cannot be used as "remote actor identity"

Consider the following problem scenario that arises if we used object identity as the identity of remote (resolved) actors:

  • we get "some" ID1 from network
  • we resolve it, it is a remote address, we have to create a proxy instance for it

IF we said that every resolve of ID1 returns the same proxy, we would have to remember the mapping of ID1 -> this proxy's ObjectIdentifier, because...

  • eventually the proxy was used and dropped; ref count drops to zero, the proxy object is released... (good!)
  • we get the same ID1 over network...
  • we resolve it, it is a remote address...

IF we guaranteed that every resolve results in a stable === identity for the resolved proxy... we must have some dictionary to remember that ID1 -> ObjectIdentifier mapping... Okay, fine, say we had such a dictionary: we do the lookup and return (magically, ignore whether this is even doable) "the same" proxy...

Notice that we'll never be able to clear this dictionary! :scream: :fire: Relying on object identity causes us to practically speaking accumulate and "leak" memory forever, even if all those resolved references are long dead and will never be used again, there's no way to know that.

The same scenario, with relying on ActorIdentity for equality and identity

  • we get "some" ID1 from network
  • it is remote, return a proxy, the proxy contains ID1
    • its Equatable and Hashable conformances are implemented in terms of ID1
    • any lookups in dictionaries that users have, or anywhere else, just work; including "the same actor came back" etc.
  • ...time passes...
  • we get "some" ID1 from network
  • we don't care at all if we've ever seen it before, we create a new proxy -- proxies are cheap.

Yes, the object identities of those two proxies will be different, but that isn't an issue for any semantically meaningful things. The only thing the proxies do is forward messages to the remote instance; as such, it does not matter "on which proxy" we call a function, the result is the same (no impact on message ordering etc).

And we solved the potential infinite memory growth issue - we don't have to care if we've already seen a remote address or not.
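
A small sketch of the second scenario, using plain classes to stand in for proxies. ActorID, Proxy and resolve are hypothetical names for illustration (the real identity would be whatever ActorIdentity the transport defines); the sketch only shows that identity-based Equatable/Hashable works without any proxy cache.

```swift
// Hypothetical identity: which node, and which instance on that node.
struct ActorID: Hashable {
    let node: String
    let instance: UInt64
}

// Stand-in for a remote actor proxy. Equality and hashing are defined purely
// in terms of the identity it carries, never in terms of object identity.
final class Proxy: Hashable {
    let id: ActorID
    init(id: ActorID) { self.id = id }

    static func == (lhs: Proxy, rhs: Proxy) -> Bool { lhs.id == rhs.id }
    func hash(into hasher: inout Hasher) { hasher.combine(id) }
}

// No cache and no dictionary: every resolve simply makes a fresh, cheap proxy.
func resolve(_ id: ActorID) -> Proxy {
    Proxy(id: id)
}

let id1 = ActorID(node: "nodeA", instance: 1)
let first = resolve(id1)
let second = resolve(id1) // "...time passes...", same ID arrives again

print(first === second) // false: two distinct proxy objects
print(first == second)  // true: same distributed identity

let greeted: Set<Proxy> = [first]
print(greeted.contains(second)) // true: "the same actor came back"
```

Since nothing has to remember which IDs were seen before, dropped proxies are simply released and memory does not grow with the number of resolves.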


It would not be enough to share just the "remote object's ObjectIdentity" in general, for risk of conflicts with other nodes that might just happen to have allocated an object at the same identity.

A distributed identity must contain the remote node's identity (unique ID of a node/process), and the identity of the specific instance (that could actually be ObjectIdentity if we really wanted to I guess, though it sounds like a bad idea security wise to expose those?).

For IPC a bunch of these things can be simplified, that's why our ActorIdentity is a protocol, but in general we can't make too rough assumptions here.

Note also that it should be true that:

  • node A creates DA at ID1
  • node B resolves ID1

In unit tests, you can run two nodes in the same process, and write tests like "is the actor that was greeted the right one?" etc, and you can just compare the actor that node B received with the original.

So... for these reasons, the identity and equality/hashability is tremendously important for distributed actors. There's a world of pain and complexity avoided by the flexibility we gain by delegating identity to the specific ActorIdentity rather than treating those the same as local-only actors.

I hope this clarifies this necessity for the special treatment of identity of distributed actors. Happy to discuss more in depth if necessary of course!


Yeah that indeed is another of the use-cases. You could do such "tree of processes" which could help isolate "hard crashes" even without any actual crash handling in Swift itself -- the entire process does go down still on a hard crash, however the supervisor process just notices that, and same as with distributed systems can bring up a replacement. I have implemented such example transports while prototyping a long while back, so it definitely is viable :slight_smile:

Yeah client to server scenarios are very interesting for this as well, though we don't have anything specific about that yet to be honest. Practically speaking this language feature allows you to do "whatever you want" really, so all those ideas are doable.