Retain counting for the remote actor instance

Hi there!

We are playing with distributed actors (DA), trying to use them in the product we are building, and it is not clear to us how to manage reference counting for remote actor instances.

When we instantiate a distributed actor in the local actor system, the Swift runtime manages references to the actor instance, and it is deallocated when the last reference is released.

When we instantiate a distributed actor in a remote actor system, it is not clear how we could (or should) manage its lifetime. Common sense suggests the DA transport layer should propagate retain/release invocations from the client actor system to the remote one: when the proxy object is retained, the remote distributed actor instance should be retained, and likewise for release. In that case the remote distributed actor instance would be deallocated when the last proxy instance is deallocated, so a remote distributed actor would have exactly the same lifetime semantics as a local one. But we can't see any way to do that.

Does anybody know how this issue is supposed to be solved for remote actor instances?

I don’t think there’s any need to directly propagate every reference count operation to the remote actor. The remote actor library should keep the remote actor object alive as long as the connection is open, and the local actor object should close the connection when all the references to it go away.
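
To make that a bit more concrete, here's a minimal sketch of that shape. None of these names (Connection, ConnectionScope, RemoteServiceHandle) come from the actual runtime; it just illustrates the idea that the receiving side pins its actors for as long as the connection is open, and the calling side closes the connection from a deinit once the last local reference goes away:

import Distributed

// Hypothetical transport connection; only here to make the sketch self-contained.
protocol Connection {
    func close()
}

// Receiving side: strong references keep the distributed actor instances alive
// for as long as the connection they were created for stays open.
final class ConnectionScope {
    private var actors: [any DistributedActor] = []

    func keepAlive(_ actor: some DistributedActor) {
        actors.append(actor)
    }

    func connectionClosed() {
        actors.removeAll() // dropping the last strong references lets them deinit
    }
}

// Calling side: a small handle owns the connection; when the last local reference
// to the handle goes away, ARC runs deinit, the connection closes, and the remote
// side can tear down its ConnectionScope in response.
final class RemoteServiceHandle<Act: DistributedActor> {
    let remote: Act                      // the resolved remote reference
    private let connection: any Connection

    init(remote: Act, connection: any Connection) {
        self.remote = remote
        self.connection = connection
    }

    deinit {
        connection.close()
    }
}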

3 Likes

Yeah, you wouldn't want to go into distributed refcounting territory; that's quite a headache and not actually employed by any of the widely known distributed actor systems (Akka, Orleans, distributed Erlang, and their various ports and the things they inspired). As such, the Swift distributed cluster library also doesn't do distributed refcounting; instead you're tasked with figuring out what suits your usage patterns and handling it yourself.

You may be wondering if that's just punting a very hard problem to end users, and the answer is: not really, because the usage patterns driving distributed actor systems usually wouldn't really benefit from such transparent counting IMHO anyway.

For example, the most popular and powerful way of using distributed actors (I really like linking Hoop's talk from 2014 to drive that point across, since he focuses nicely on the business and productivity side of what this programming model enabled them to do: https://www.youtube.com/watch?v=I91ZU8tEJkU&t=800s) is in a model where you "just assume the remote instance is always there", and leave it up to the cluster to either find the existing instance or activate a fresh instance for a given id (and potentially re-hydrate it with prior state). This pattern is called "virtual actors" (paper), as coined by the Orleans project, or cluster sharding + persistence in Akka. We've not yet gotten around to baking it into the distributed cluster library in Swift, but it'd certainly be a great addition (issue here), as IMHO this is the most useful distributed actors pattern out there.
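
To give a feel for the shape of that pattern, here's a rough sketch. Nothing like this exists in the cluster library yet; GameInstance and GameRegistry are made-up names, I'm assuming ClusterSystem as the actor system, and I'm leaving imports out for brevity. The point is that callers only ever ask for a stable id, and the registry hands back the existing instance or activates a fresh one:

distributed actor GameInstance {
    typealias ActorSystem = ClusterSystem

    let gameID: String

    init(gameID: String, actorSystem: ActorSystem) {
        self.gameID = gameID
        self.actorSystem = actorSystem
    }

    distributed func makeMove(_ move: String) {
        // game logic...
    }
}

distributed actor GameRegistry {
    typealias ActorSystem = ClusterSystem

    var games: [String: GameInstance] = [:]

    init(actorSystem: ActorSystem) {
        self.actorSystem = actorSystem
    }

    // Hand back the existing instance for this id, or activate a fresh one.
    distributed func game(for id: String) -> GameInstance {
        if let existing = games[id] {
            return existing
        }
        let fresh = GameInstance(gameID: id, actorSystem: actorSystem)
        // a real implementation would re-hydrate `fresh` from persisted state here
        games[id] = fresh // the registry's strong reference keeps it alive, not callers' refcounts
        return fresh
    }
}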

Imagine this example: you implemented a chess or card game using a distributed actor cluster. The "game instance" (the state of the game) is stored inside a distributed actor in the server-side cluster. There are two clients playing as opponents, and maybe some spectators too. Now the players' connections drop, but the game is still in a valid state and ongoing... Here are two ways to think about what to do:

  • these are actual connection failures, so a cluster (like the one we offer) can detect them, and you could indeed stop that game; but it's a business-level decision of "yeah, they're gone, let's kill this game, maybe persist the state before we do so" etc;
  • alternatively, and this is what Akka and Orleans systems are great at and you could actually build with our cluster system as well (!), the game can keep going: notice that the remotes failed, kick off a timer, and if they don't come back and make a move within 5 minutes... you actually kill that game instance (sketched after the code example below).

So, IMHO, with distribution things are rarely as simple as just shutting them down based on refcounts. Rather, the lifecycles are often more interesting than that. I'm sure there are cases where this simple counting may be desirable, but I'd suggest not doubling down on it and instead adopting an explicit way of communicating "usage".

Maybe also worth noting: what is often employed by all those distributed runtimes is "death pacts" or "lifecycle monitoring", which boils down to "if that actor [a, on some node] dies, I should die too" styles of behaviors. In the distributed actor cluster this is exposed as the LifecycleWatch protocol, which allows you to be notified about failures of others, but keeping an instance alive is still bound to local refcounting - so e.g. you'd have some "KeepAlive" container and could remove yourself from it when you see no one will be using you anymore, etc.

Specifically, in our cluster's case, simplified code would end up looking like this:

distributed actor Player { ... } // those exist on other nodes

distributed actor CardGame: LifecycleWatch {
    func startGame(player one: Player, player two: Player) async {
        // store the players, start the game etc...
        watchTermination(of: one)
        watchTermination(of: two)
    }

    // called when a watched actor (here: a player) terminates or its node goes down
    func terminated(actor id: ActorID) async {
        print("Oh no! The player \(id) is dead! What should we do...")
        // decide what to do... (kill the game, wait for the player?)
    }
}
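
And if you wanted the second option from the bullets above, keep the game alive for a grace period and only then kill it, terminated could kick off a timer. A variant of the CardGame above, roughly (just a sketch; endGame and reconnected are made-up names, not library API):

distributed actor CardGame: LifecycleWatch {
    var waitingFor: Set<ActorID> = []

    func terminated(actor id: ActorID) async {
        waitingFor.insert(id)
        // give the player 5 minutes to come back before we give up on the game
        Task {
            try? await Task.sleep(for: .seconds(5 * 60))
            if waitingFor.contains(id) {
                endGame(reason: "player \(id) never came back")
            }
        }
    }

    distributed func reconnected(player: Player) {
        waitingFor.remove(player.id)
        watchTermination(of: player)
    }

    func endGame(reason: String) {
        // persist the final state, notify spectators, and remove ourselves from
        // whatever keep-alive container holds our last strong reference...
    }
}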

If you really wanted to, you could build some "I'm using you / I've stopped using you" messages, but that'd be done as part of your high-level protocol, and not by propagating the raw refcounting traffic itself.
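
For example, something along these lines (again just a sketch, SharedService and the method names are made up): users explicitly check in and out, crashed users are caught via LifecycleWatch, and the service cleans itself up once nobody is using it anymore.

distributed actor SharedService: LifecycleWatch {
    var users: Set<ActorID> = []

    distributed func checkIn(_ user: Player) {
        users.insert(user.id)
        watchTermination(of: user) // a crashed user counts as "stopped using us"
    }

    distributed func checkOut(_ user: Player) {
        stoppedUsing(user.id)
    }

    func terminated(actor id: ActorID) async {
        stoppedUsing(id)
    }

    func stoppedUsing(_ id: ActorID) {
        users.remove(id)
        if users.isEmpty {
            // nobody is using us anymore: persist state if needed, then remove
            // ourselves from the "KeepAlive" container holding our last strong
            // reference, so plain local refcounting can free the instance
        }
    }
}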


I think these lifecycle discussions are best done on concrete examples actually, so if you have a specific distributed actor system (transport) and use-case in mind, I'd be happy to help design how I'd deal with your requirements there :slight_smile:

3 Likes

Thank you very much for the reply.

Yes, whether reference counting is needed depends on the semantics of the DA interface.
And the approach where we "just assume the remote instance is always there" works fine as long as the actor instance isn't tied to a particular caller and its context. With a plain class we could clean up that context in the deinitializer, and that also works with a DA when the instance is local.

Not sure if you're agreeing, or not agreeing and unsure how to implement some specific use case you have in mind :slight_smile: If you'd like to explore a specific use case, I'm happy to dive into it.

We are trying to build a cluster with many services (tens of them) in each worker process, where each service can serve many clients, keeping in mind the possibility of scaling by adding more processes. We thought DA could let us hide most of the IPC details from the end user. As of now it is not clear how we can work around the reference counting issue.
I just need some time to gather my thoughts and put them in text.
Let me come back a little bit later with a particular example :)

1 Like

I see, cool, that sounds like a good use case :+1: I've supported many such clusters in my "previous life" while working on Akka, so I'm happy to assist with tips, or perhaps add small things here and there to the cluster impl to make your life easier. Keep in touch please! :+1:

1 Like