[Pitch] Distributed Actor Runtime

Hello everyone,

Please note that the first distributed actor proposal is in Swift Evolution review right now: SE-0336: Distributed actor isolation
We'd really appreciate your input on that one. It covers the distributed actor isolation semantics and how SerializationRequirements are enforced on distributed methods.

Since many questions have been coming up about the runtime details of the distributed actor feature, which the first proposal does not touch upon, we decided to pitch this second proposal already, so those curious can spend time understanding the entirety of the feature, and how all the type-checking works in tandem with the runtime.

Reading this pitch can be helpful in gaining a deeper end-to-end understanding of distributed actors; however, it dives quite deep into the implementation details of distributed actor systems and how their remote calls are actually implemented.

This pitch can also be viewed as an implementation guide for potential distributed actor system authors. At the same time, it is probably too low-level for day-to-day users of distributed actors, who may be better served by the first, type-system focused proposal.

Having said that, I hope you'll enjoy the read, and us sharing the entirety of our design so far.


This proposal won't go to review for quite some time still, especially with the holiday break, but for those interested, please enjoy and let us know any feedback you might have. Most importantly, though, we are interested in feedback about the type-system aspects in the SE-0336: Distributed actor isolation review right now, as it runs until December 22nd.


As usual, thanks a lot in advance for your comments.

Editorial fixes and comments, like typos, please submit directly to the Swift Evolution PR: [WIP] Distributed Actors Runtime Proposal by ktoso · Pull Request #1498 · apple/swift-evolution · GitHub - thanks!

[edit: fixed the links]

Thank you for posting this early. It makes it much easier to understand the interactions and relative importance of some of the aspects of the current proposals under evaluation with these additional interconnects visible.

Thanks for sharing this sooner than expected ^^

One thing I'm not clear about yet is how a "local" ActorSystem exposes the IDs of alive actors to other "remote" ActorSystems. Since the register calls are sync I don't see how that can transmit the data across the wire. But then... I guess that's done outside the runtime? as in with a "framework" on top like the Receptionist pattern (which I'm still trying to grasp)?

What is fascinating to me is how this system provides a sort of runtime reflection in Swift that reminds me of what we could do in Objective-C (invocations, selectors...). I wonder if these capabilities have utility outside the distributed actor feature and could be used in other parts of the language: being able to implement proxy objects that delegate protocol witnesses or straight method calls to other instances, etc.
On that note, I see that "summoning types" is not part of the system, but I'm curious whether Swift already provides a way of doing that. I know and agree that since this is for system implementors it can be complicated, and that is fine, but I was wondering if it would make sense to provide facilities for it. Maybe they could also be used in other parts of the language, like summoning dynamic types when decoding.

One thing I really liked is how forward thinking the registration of the invocation is made, including generic substitutions and even error types, and how performance has been present during the entire proposal. Kudos!

Right, what you're asking about is effectively "where is service actor discovery implemented?"

Indeed it is not the Swift runtime's job to implement any of that. Though maybe some day as we gain enough confidence we could solidify the Receptionist "pattern" into a specific protocol that actor systems are asked to implement... but realistically: all actor systems implement these very differently.

You'll likely enjoy this talk I did recently: [Video] Distributed Actors announced at Scale by the Bay where we announced our distributed actor cluster implementation. In there I mention a little bit about how the receptionist is implemented. If you're really adventurous, you could even read the implementation, over here: swift-distributed-actors/OperationLogDistributedReceptionist.swift at main · apple/swift-distributed-actors · GitHub


Yeah, the mechanism to "register functions and look them up" is general and not tied to distributed actors per se. We currently only surface this feature for distributed funcs, but the implementation is not tied to actors specifically. It is not a goal of this proposal to make this into a general mechanism, but we're doing our best to not hardcode it to anything actor-specific.

Oh, summon can take many forms which is why I skipped over it.

The simplest thing could really be:

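// Returns nil for IDs that were never registered.
func summonType(id: Int) -> Any.Type? {
  let knownTypes: [Int: Any.Type] = [1: Int.self, 2: String.self]
  return knownTypes[id]
}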

:wink: Which may seem silly but is actually very powerful! Thanks to known type registries like that, a system is able to survive type renames as well as avoid sending serialized mangled names over the wire for "well known types". This is extensible to user types, so not only Swift types can be sent around as IDs. This style does require registering all types you intend to trust in messages though. Which honestly sometimes is a good thing, even if annoying :slight_smile:

The other approach, which is what we'll be doing for now in our impls at least, is sending around mangled names. And yes, there is a function to get a type from a mangled name, though it is not supported: you can pass a mangled name to _typeByName and get an Any.Type back... To actually make use of it statically there is a bunch of other trickery one can do, none of which is really supported -- which is another reason I did not want to explain it in the proposal.

So these are the two techniques: mangled names (very convenient, not super great for security), or type registration (great for security and performance).
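To make the type registration technique a bit more concrete, here is a minimal sketch of a two-way registry. The `TypeRegistry` name, its methods, and the numeric wire IDs are all made up for illustration and are not part of the proposal; a real actor system would pick its own ID scheme and registration API.

```swift
/// Hypothetical allow-list of types that are trusted to appear in messages.
/// Only registered types can be summoned, and only small IDs go over the wire.
struct TypeRegistry {
    private var typesByID: [Int: Any.Type] = [:]
    private var idsByType: [ObjectIdentifier: Int] = [:]

    /// Registers `type` under a stable wire ID. Because the ID is independent
    /// of the type's name, renaming the type does not break the wire format.
    mutating func register(_ type: Any.Type, id: Int) {
        precondition(typesByID[id] == nil, "wire ID \(id) is already registered")
        typesByID[id] = type
        idsByType[ObjectIdentifier(type)] = id
    }

    /// Receiving side: maps a wire ID back to a type, or nil if it is unknown.
    func summonType(id: Int) -> Any.Type? {
        typesByID[id]
    }

    /// Sending side: finds the wire ID for a type, or nil if it is not registered.
    func wireID(for type: Any.Type) -> Int? {
        idsByType[ObjectIdentifier(type)]
    }
}

var registry = TypeRegistry()
registry.register(Int.self, id: 1)
registry.register(String.self, id: 2)
```

An unknown ID simply yields nil here, which is where a system could reject the message instead of attempting to decode an untrusted type.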

--

Thanks for reading the early proposal and the comments already, cheers!

Cross referencing this thread with the updated Clock, Instant, and Duration thread, I'm curious about how deadlines are expected to be handled since they came up in the original pitch and review threads for that API.

Should the DistributedActorSystem protocol have a specific Clock implementation as an associated type so that users of the runtime know how to tell and measure time across the entire system?

I think it is reasonable to have one of two potential implementations: either a distributed actor system has its own definition of a clock, as you suggest, or the deadline is based upon a continuous clock and some sort of conversion is sent over the wire.

I would expect that either way would be a negotiated value of a reference point in time (e.g. an agreed upon "epoch"). For example the shared epoch could be the initial point of connection and all frames of reference are based upon continuous derived intervals from that initial connection.
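As a rough sketch of the shared-epoch idea, each side could remember the instant at which the connection was established and express deadlines on the wire as offsets from it. The `ConnectionEpoch` type and its method names are hypothetical, and the sketch assumes both sides captured the epoch at close enough to the same moment, ignoring connection latency and clock drift.

```swift
/// Hypothetical per-connection reference point, captured when the connection
/// is established. Each peer stores its own local reading of that moment.
struct ConnectionEpoch {
    let localStart: ContinuousClock.Instant

    /// Sending side: express a local deadline as an offset from the epoch.
    func wireOffset(for deadline: ContinuousClock.Instant) -> Duration {
        localStart.duration(to: deadline)
    }

    /// Receiving side: turn an offset from the wire back into a local instant.
    func localInstant(fromWireOffset offset: Duration) -> ContinuousClock.Instant {
        localStart.advanced(by: offset)
    }
}

let epoch = ConnectionEpoch(localStart: ContinuousClock.now)
let deadline = epoch.localStart.advanced(by: .seconds(5))
let offset = epoch.wireOffset(for: deadline)      // what would be serialized
let restored = epoch.localInstant(fromWireOffset: offset)
```

Only the offset crosses the wire, so neither side needs to agree on wall-clock time.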

These ideas are definitely not the only way to implement that strategy; I am quite sure there are other ways to accomplish it. The clock protocol and friends should be flexible enough to accommodate any of these approaches.

I think we should leave this to specific implementations; I don't think we should add more constraints and associated types onto the DistributedActorSystem other than the bare minimum it needs to perform messaging (so the serialization requirements and everything related to it).

As you'd be using a specific system implementation, it is free to define a MySystem.Clock and use it for everything it wants, including deadlines of calls etc. I have to re-review the latest clocks proposal, but from skimming it recently it was shaping up rather well.

I totally agree with this part…

…but I'm less convinced that the bare minimum consists solely of serialization. In my experience, remote invocation implementations need to be deadline aware in their core networking code to fully support them.

It's a huge proposal and I haven't fully read it yet, but it seems to me that the remoteCall method would be a natural place to implement deadline support, either as an additional method parameter or possibly even via the DistributedTargetInvocation protocol. That would allow the deadline to not only be serialized to the remote system's runtime, but also (and just as importantly) propagated to the underlying transport.

Deadlines we'd really want to implement as task local values, because setting timeouts one by one on every single call is not what we want. Instead we want a grand overall deadline on the operations trying to achieve some task (pun intended), and that overall task shall have some deadline. So if I'm making three distributed calls, like so:

// withDeadline is a hypothetical API here, not something that exists today
try await withDeadline(in: .seconds(5)) { // deadline == now + 5 seconds
  try await worker.hello()
  try await boss.hello()

  let one = Task {
    try await withThrowingTaskGroup(...) { g in
      g.addTask { try await worker.work(1) }
      g.addTask { try await worker.work(2) }
      g.addTask { try await worker.work(3) }
      for try await res in g { ...combine results... }
      return 1
    }
  }

  let two = Task { try await boss.otherWork(4) }

  return try await one.value + two.value
}

What matters for each of those calls is the remaining amount of time I'm willing to wait; I really don't care about setting individual timeouts for the individual calls, although I could do so as well -- by doing another withDeadline around the specific calls.

As far as the transport layer is concerned (the actual "write some request onto the wire"), it is implementation dependent how and if to propagate the deadline there. But since this is all within the same task-call hierarchy, such a transport impl can pick up the deadline, check remaining = theDeadline - system.now() and write that as "well, it seems we have 200ms remaining".

Basically, what I'm saying is that we don't need to pollute all APIs with timeout parameters to achieve this :+1:
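For illustration, a task-local deadline along these lines could be sketched as follows. This is not an existing API: `Deadline`, `withDeadline(in:_:)` and `remainingTime()` are hypothetical names, and the sketch assumes a toolchain with `ContinuousClock` and `Duration` (Swift 5.7 or later).

```swift
/// Hypothetical task-local storage for the deadline of the current task tree.
enum Deadline {
    @TaskLocal static var current: ContinuousClock.Instant?
}

/// Runs `body` with a deadline of now + `duration`. A nested call can only
/// shorten the deadline it inherits, never extend it.
func withDeadline<T>(
    in duration: Duration,
    _ body: () async throws -> T
) async rethrows -> T {
    let proposed = ContinuousClock.now.advanced(by: duration)
    let effective = min(proposed, Deadline.current ?? proposed)
    return try await Deadline.$current.withValue(effective, operation: body)
}

/// What a transport could call right before writing a request to the wire.
/// Returns nil when no deadline is set; the value is negative once expired.
func remainingTime() -> Duration? {
    guard let deadline = Deadline.current else { return nil }
    return ContinuousClock.now.duration(to: deadline)
}
```

Because task-local values are inherited by child tasks and by unstructured `Task { }` creations, every call made inside the `withDeadline` closure sees the same deadline without any extra parameters. Note that this sketch only records the deadline; enforcing it (for example by cancelling work once it passes) is left to the system.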

It is important to remember that we're interested in deadlines in swift concurrency itself, so whatever exact shape that'd take, we'd interoperate with it.

PS: Yeah, in practice there's likely to be some default "timeout" applied to remote calls, just to avoid things hanging unnecessarily -- though this can be done either by failure detectors, or by just a minimum timeout applied to calls. To be honest, this is usually a system-wide default configuration value, and tuned further with the pattern I showed above.

Right, I forgot about task locals and instead misremembered that the compiler would pass deadlines to the distributed runtime or through the local async runtime where necessary. Thanks for the reminder and the details!

And this gets back to my initial question. Now that the Clock protocol is proposed to have both instants and intervals as associated types that conform to the relevant protocols, how is withDeadline(in:) expected to compose with distributed runtimes when there's a mismatch in clock/instant/interval types?

In concrete terms, if I set an async deadline of 5 frames from now based on a GPU clock, what happens if I make a call to a distributed actor that only knows how to propagate deadlines in UTC?

Aside on GPU deadlines

Being able to express an async deadline in terms of GPU frames is, I think, a really cool idea that shows that the various async pitches are definitely going in the right direction. Kudos to everyone working on them!

An option for cases where intervals aren't convertible (like in my example of GPU frames to seconds) would be to not propagate a deadline (or propagate a conservative estimate) to the remote end and instead deal with the timeout purely locally, ignoring late results.
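A purely local timeout of that kind could be sketched by racing the call against a timer. `withLocalTimeout` and `TimedOut` are hypothetical names, and for simplicity the sketch expresses the timeout as a `Duration` rather than in GPU frames. One caveat: a task group waits for its cancelled children before returning, so the late result is only ignored promptly if the operation cooperates with cancellation.

```swift
/// Hypothetical error thrown when the local timer wins the race.
struct TimedOut: Error {}

/// Races `operation` against a local timer. Nothing about the deadline is
/// propagated to the remote side; whichever child finishes first wins.
func withLocalTimeout<T: Sendable>(
    _ timeout: Duration,
    _ operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        // Cancel whichever child task is still running once we have a winner.
        defer { group.cancelAll() }

        group.addTask { try await operation() }
        group.addTask {
            try await Task.sleep(for: timeout)
            throw TimedOut()
        }

        // The first child to finish decides the outcome: a value or an error.
        guard let first = try await group.next() else { throw TimedOut() }
        return first
    }
}
```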

Sure, might be the only option for some things. Point being, either way, it is implementable using the current design and we've been preparing task local values, and distributed instrumentation / tracing all this time with such flexibility in mind.

While we can't predict everything, we can make sure to be flexible enough to support all kinds of weird scenarios -- even though I don't think the GPU one is an area we're targeting: IPC and networks are. But yea, even if we ventured into other weird cases, I believe the design is flexible enough to support "whatever is appropriate for the system/transport".

Hello everyone,
just a heads up: the isolation-focused proposal ("proposal 1 of 2"), [Accepted] SE-0336: Distributed Actor Isolation, has been accepted, and we've been actively working on the runtime (re-)implementation.

I'll be updating this pitch to a proper proposal shortly and hope to put it up for review sometime soon, when the core team has time for it. There are quite a few (good!) changes in the runtime that I hope people will enjoy reading about :slight_smile:
