[Pitch] Distributed Actors

hassila · September 20, 2021, 6:53am

I’m still interested in thoughts on ‘protocol evolution’ and compatibility / best practices to handle that - is it fair to say that the current pitch expects all participants in a system to be built with identical definitions? Many wire formats allow for adding additional fields to payloads without breaking things, and also well defined handshakes to verify compatibility during the connection phase, how are the thoughts here?

ktoso · September 20, 2021, 7:51am

Hi Joakim,
So while I absolutely understand where you're coming from with the question and indeed all those compatibility and rollout questions are the hard bits of designing a robust distributed runtime (well, transport in our wording)... this proposal won't address these questions at all because it is designed to completely leave it all up to specific transports. There is no wire protocol in this language proposal, and as such, it cannot answer questions about the wire protocol.

There's a strong and real tension between "define and synthesize more in the language" and "leave it up to transports." The current design, on purpose, strongly leans towards the latter design strategy. Some of @Douglas_Gregor's points and ideas would push it towards defining more and more in the language, including some aspects of serialization (i.e. mangling "distributed method handles" etc), but it is not yet proven it actually is implementable with supporting all the flexibility we need from this language feature.

To dig into the root of your question though: will runtimes take care of wire compatibility and allow things like rolling deploys of new versions (of nodes) in a cluster (with zero downtime) etc etc... Yes, this is all doable, but again, up to specific transports.

I understand this may sound like hand-waving the problem away, but I assure you it is not. If you are concerned about wether we have enough experience with such systems to care about wire / forward / backward compatibility of clusters and wire protocols... I can assure you I spent years worrying about this and everything we do has this accounted for that we can do the right thing. Even if it is not specified in the language itself (because IMHO it should not be specified by the language).

If you are still concerned, you can refer to some of my prior work in this domain, Akka's rolling upgrades documentation which documents a few cases, and long term (even through persisted messages) schema and message evolution in actor runtimes.

Bottom line is:

In today's design, this is completely up to transports.
- Because message "dispatch" (from wire to function call) is implemented by transports.
The more we move "into the language", the more the language has to "lock in" features and schemes. This is not necessarily a good thing.
- Long term, to do a really really great job at this, we'd need more powerful meta-programming to annotate "favored" versions, or even keep same functions with "old version" annotated etc... We don't have this expressivity today, and we should take this one-step-at-a-time:

Same as the Structured Concurrency and Actors proposals... this work is not going to be "finished" with this proposal alone. There will be many more steps polishing, and adding missing features as we mature this work over the coming months and years even.

Yes, this is absolutely expected that transports do such handshakes. It feels wrong to specify this in the language proposal though.

I expect we should do some "transport implementation guidelines" and I think I should get to that eventually. I think it is fair to expect this should happen before the proposal is seriously to be considered for acceptance

Sure, if your transport uses a format which supports that (json, bson, SBE (which I like a lot), protobuf), can all support this.

Not really, there can be differences. One would expect a message send to a remote peer which "does not have this method" to jsut fail and get back a failure that "method did not exist" pretty much like a 404

Saklad5 · September 20, 2021, 8:01am

I’m very much in favor of doing this right rather than doing this quickly.

Implementations always have trade-offs, and Swift is at its best when you can avoid making any by way of clever abstraction. If one transport becomes sufficiently popular to be dominant, akin to the language-level wire protocols people have mentioned, so be it. The ability to choose something else for edge cases will remain useful.

johnfairh · September 20, 2021, 5:53pm

This does a great job of carving out a decently-sized piece of the problem with good boundaries.

withLocalDistributedActor The doc says this is a DistributedActor method, but the types read as though it’s a free function. It feels like it would work OK as a method?

Actor-Transport lifecycle I got confused here, but I think the semantics of resignIdentity() are that the ID is definitely one from assignIdentity() but because of failable init(), actorReady() may or may not have been called?

Passing actor references Is there a conflict between location transparency and Swift reference counting? It feels like I could need to know whether to keep a strong ref to an actor method parameter because it is a proxy, or a weak ref because it is just another object in this process that already references me. As I understand the design, resolve() returns proxy actors with refcount of 1 (transport will not keep them alive) but instances with refcount of at least 2 (bcoz resignIdentity() is still owing).

Code gen I understand the doc says that this will be embedded before an un-feature-flagged release. It feels like all transports during this bringup phase will need the Analysis part and maybe some of the SourceGen - it would probably help transport developers to pull out that common piece into a separate high-quality module for reuse, if only to avoid multiple people struggling through the same swift-syntax etc. issues.

AnyActor It would make part of my brain happy to join up the types like this; otoh I'm not sure what I'd do with it in practice. The doc says it would be useful at least in the ability to signal that a type "definitely is (any) actor" -- could you say a bit more? Useful in what way?

Separate client/server I thought the protocols-as-IDL thing came out very cutely and immediately useful for a client-server (or at least heterogenous-node) setup to avoid shipping server modules on clients. I have worked on systems where [closed] software licensing has made this an actual issue, far from the common case though. Being able to share a programming model with ‘true’ distributed actors as so far as method restrictions and interactions with the transport go seems very convenient for programmers.

I agree with Doug upthread that if dactors are purely for same-code-everywhere clusters then this feature doesn’t make sense, but I’m missing the argument for why we’d restrict the feature like this.

Saklad5 · September 20, 2021, 6:01pm

I think the AnyType naming convention should be reserved for type-erasing containers, so you know to avoid them if possible. Sort of like Unsafe.

Ideally, the “root” protocol should read like a noun: Actor. Since renaming the existing Actor to LocalActor is apparently off the table, I vote we name it ActorProtocol along the lines of StringProtocol.

I know naming is less consistent than it could be (and I really think the API Design Guidelines should be fleshed out for things like this), but I think the meanings I described are more common than not.

benlings · September 20, 2021, 8:47pm

I'll leave others to comment on the specifics of the pitch, but will just say that I'm excited to see the details of this.

Will there be a way to do a one-way message send to a distributed Actor? I've mentioned this during the 'normal' actors proposal, and know it was something being considered then, but wanted to bring it up again because I think it's probably even more important for distributed actors where there is additional latency involved in a round trip. I'm not sure how this would interact with wanting to know if the message was successfully sent - would it still always need try? Or would it even need await (I guess this would require some other way to report on transport errors)

ktoso · September 21, 2021, 12:25am

I'm strongly advocating to fixing this in general, because it already is causing issues with plain local actors, somewhat undermining their usefulness as synchronization tool in some situations: Task: Is order of task execution deterministic? - #2 by ktoso

One possible solution would be to allow send actor.message() which would be the "don't wait for reply; but DO wait until the message was enqueued (in the local sense, or 'sent' in the distributed sense)". This would also solve the issue of non-deterministic message order that is highlighted in the above forums thread.

It's not yet clear if the team would be willing to accept this language extension though; we don't have specific plans to share about this as of now.

If we had such uni directional calls, I'm not sure but we still may consider them throwing, just because the local part of a message send could still throw I suppose (i.e. serialization failure).

// edit 1: typo, plan -> plain

ktoso · September 21, 2021, 12:36am

Thanks John! Going through the comments then:

Thanks, I'll clean that up as we go through implementing this function. It has some minor mistakes elsewhere as well.

I'm leaning towards having it a free function because it makes it harder to discover and not just "randomly" use it everywhere since it ends up in auto-completion -- that would be bad.

You're right that failable inits make this a bit harder, and these semantics need more docs. I have not implemented all the edge cases here. The exact semantics of when which function is called will indeed be dependent on throws and probably where exactly the throw happened -- i.e. before all properties were initialized or after etc.

Note to self to follow up on exact lifecycle call semantics in face of failures.

The missing bit here is perhaps more lifecycle monitoring things, I have this worked out but it'd be a library thing -- I'll add a note to myself to follow up on this, but in general I don't think it's a big issue.

Practically speaking when you need a "keep this alive because others will refer to it" you'd register it with some "registry" that would strongly hold it, and upon deinit you'd remove yourself from it. Though you're right to call out perhaps more discussion of this will be needed as we mature the project and discover appropriate patterns for it.

Yeah the use-case for it is pretty weak but it would allow making sure that some protocol is implemented "as an actor", say "Worker & AnyActor". It doesn't really gain us much though... the entire protocol is still just Worker after all.

But it does feel pretty "right" to allow expressing this need in any case... We'll see if we end up with this marker protocol or not

Thanks again for the comments, I added some followups about this for when I flesh out the proposal's next revision, thanks!

johnfairh · September 21, 2021, 8:34am

Thanks Konrad, all makes sense; agree the best way to figure out the refcount-type things will be to build some software with this.

s-k · September 21, 2021, 12:10pm

First of all: Thank you for this pitch. I think this will be a cool feature.

Just in case anyone is interested: I was inspired by this pitch to create a framework that does something very similar for client-server communication: SwiftyBridges. You basically write a type that handles the logic on the server and the framework creates a matching type for the client plus all necessary communication code. This means you don't need to worry about building requests and responses or creating an API client.

First, I wanted to implement this framework as a transport for the feature proposed here. However, I decided against it. Maybe my reasons are valuable to someone:

The server API type very probably relies on libraries not available on the client. Therefore, the API type would not compile on the client-side. This would require the future direction 'Resolving DistributedActor bound protocols' to be implemented.
The current version of the feature relies on code generation. If the framework needs to generate code anyway, then generating the whole type is not much more work than generating the dynamic method replacements. This means that the proposed feature would have meant almost no benefit to me as a transport implementor.
I do not think that an actor is a good fit for the server API type:
- If the actor was responsible for multiple clients, then it would only be able to respond to one request at a time. It would also pose this question: How does the actor receive the identity of the client or the user? Would every method need a parameter clientID or userID? That would mean that the client code would need to add this parameter to every method call.
- If one actor was just responsible for one client, then keeping all actors in memory would be infeasible on a server with many users.
  Therefore, I opted to model server API types as structs and for every request, a new instance is created.
Modeling the client API type as an actor would mean that only one method could be called at a time. This is an unnecessary restriction.

To make it clear: These reasons do not mean that the proposed feature has no value. It just means that, in my opinion, it does not fit this application, at least not in its current form.

hassila · September 28, 2021, 1:26pm

Hi Konrad,

Many thanks for the exhaustive reply, I finally got time around to re-read and follow up.

this proposal won't address these questions at all because it is designed to completely leave it all up to specific transports. There is no wire protocol in this language proposal, and as such, it cannot answer questions about the wire protocol.

That's fair enough, my main concern is perhaps not so much the wire format (and we may end up using e.g. SBE and/or protobuf there depending on use cases as you suggest), it is more about the higher level questions - like how to perform discovery of possibly concurrent different versions of distributed actors (so we can pick the right one) - it's a bit unclear how the resolve mechanisms would allow for doing this at a first glance (but quite possibly I just miss it).

As the distributed protocol definitions fundamentally are defined in Swift, rather than an external schema (which I think is super nice!), there is a bootstrap question to consider on how to perform the resolving (should there be a standard protocol for this?) - if there are two concurrent running implementations of a distributed actors, I (may) need to be able to choose between them. I guess we can implement such a protocol ourselves and just never change it, but perhaps it would be of common utility?

• The more we move "into the language", the more the language has to "lock in" features and schemes. This is not necessarily a good thing.
Agreed.

◦ Long term, to do a really really great job at this, we'd need more powerful meta-programming to annotate "favored" versions, or even keep same functions with "old version" annotated etc... We don't have this expressivity today, and we should take this one-step-at-a-time:
Same as the Structured Concurrency and Actors proposals... this work is not going to be "finished" with this proposal alone. There will be many more steps polishing, and adding missing features as we mature this work over the coming months and years even.

For sure, it's another (like concurrency) too big chunk to bite off in one go.

I expect we should do some "transport implementation guidelines" and I think I should get to that eventually. I think it is fair to expect this should happen before the proposal is seriously to be considered for acceptance

I think this might be the key point, that I'm simply not seeing the full end-to-end picture without some more guidance...

Looking forward a lot to see this evolve - as exciting as the concurrency work has been really - super for Swift on Server.

Nickolas_Pohilets · October 3, 2021, 3:45pm

Also note that promise to have stable identity of live proxy objects is not the same as to promise of stable ObjectIdentifier’s produced out of them.

The first one is nice to have and probably could be implemented without unbound memory growth using weak references. Simple implementation might use a dictionary with weak references as values, and prune dead keys from time to time. In this case you are promising clients that if they have a live instance, it’s identity is stable. And if clients don’t have a live instance, they should not care if it’s stable or not.

The second one is not needed, IMO. Even with non-actor classes, you can observe memory address being recycled for object of the same or different class. It is already not safe to let ObjectIdentifier escape, unless you are taking additional measures to ensure that instance stays alive.

There is also identity of weak references (address of the side table). There are some applications where this is needed, but currently it does not really work in Swift anyway (see Hashing Weak Variables). But if it would work, stability of weak references cannot be implemented exactly for the reasons you mentioned.

Saklad5 · October 3, 2021, 5:13pm

For most purposes, it is best to ignore ObjectIdentifier in favor of the Hashable implementation when determining whether something is functionally "the same” anyway. By the same token, determining whether something refers to “the same thing” should be done using Identifiable (whose documentation explicitly encourages providing a domain-specific implementation rather than using the default for classes).

masters3d · October 4, 2021, 6:55pm

Distributed actors seems to implicitly have the necessary requirements to fault tolerance (since they are running in a distributed mode). Would it be technically possible to run full fault tolerant distributed actors within the same process in production? (As in if the code fatalErrors, it only takes out the one actor where it fatal errored. )

Saklad5 · October 4, 2021, 9:01pm

fatalError(_:file:line:) simply returns Never. Since you can’t make a value of an uninhabited type, that means control flow never leaves that scope.

If you want a non-fatal error, don’t use that.

ktoso · October 4, 2021, 9:16pm

So yeah in general (in-process) crash handling is closely associated with some actor implementations out there. This is really a property of the "local" actors. What would potentially enable this is actually distributed actors stronger isolation requirements (banned stored property access).

Long story short: we are very interested in this topic, but do not currently have any work happening towards this. Improving the crash handling story indeed could be one of the future work with actors and distributed actors, but right now it is not part of the pitch or plan of record.

We can implement failure handling of entire nodes right now with distributed actors of course -- so if you were to implement "a bunch of actors in this other process" and the entire process dies because of a fatal error, we can indeed notice and notify about the crash of all the distributed actors located in that process. We could provide some helpers for that, but that would be a linux/server solution. (Basically like a multi-process transport). In a way this is similar to what XPC can do -- bunch of actors in other process, if process dies we can easily notice that.

Saklad5 · October 5, 2021, 12:27am

Was there any change on whether computed properties will be permitted, by the way?

ktoso · October 5, 2021, 2:28am

Yeah, I agreed on that in principle up-thread – we'll evaluate that.

In the next iteration of the proposal (probably a few weeks out) I'd like to explore getting rid of source generation, so that adds quite a bit of complexity and functionality to the language to support that, it'd also impact if and how we can support computed properties.

johnburkey · October 6, 2021, 7:23pm

Hey, any thought to having "streaming" funcs built in, like GRPC? Its nice to keep an HTTP2 connection open, and actually the semantics of some interactions like being setup that way.. In GRPC we have client, server, and bidi streaming.

johnburkey · October 6, 2021, 7:43pm

Oh, and in our case we use GRPC for both internal and external to Brighten RPC, and there is an additional advantage to us to use GRPC for external to customers API (so they can bind to our services with any language) - so when this goes lives we would probably still keep that, but if Swift distributed RPC is about as performant we will toss the GRPC internal use in the name of simplicity ...