Distributed Worker Pool Example - Support for Specific Hardware

Hello,

The more I read about distributed actors, the more excited I get! I am curious whether distributed actors could support specific hardware requirements.

Expanding a bit...

In the Worker Pool example given here:

// **** APIS AND SYNTAX ARE WORK IN PROGRESS / PENDING SWIFT EVOLUTION ****

extension Reception.Key {
  static var workers: Self<Worker> { "workers" }
}

distributed actor WorkerPool {
  var workers: Set<Worker> = []

  init(transport: ActorSystem) async {
    Task {
      for try await worker in transport.receptionist.subscribe(.workers) {
        workers.insert(worker)
        watchTermination(of: worker) {
          workers.remove($0) // thread-safe!
        }
      }
    }
  }

  distributed func submit(work item: WorkItem) async throws -> Result {
    guard let worker = workers.shuffled().first else {
      throw NoWorkersAvailable()
    }
    return try await worker.work(on: item)
  }
}
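
For context, a caller might use the pool roughly like this (same work-in-progress caveat as above; `transport` and `item` here are just placeholder values):

let pool = await WorkerPool(transport: transport)
let result = try await pool.submit(work: item)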

A worker can have work submitted to it. What if this work item had to be run on specific hardware? Could this design support that type of requirement? A real-world example could be wanting work to be run on a GPU rather than a CPU, if supported.

Hi there, great question :)

That is indeed a very typical use case in such systems. There are two ways to approach it. Needing to run on some specific set of hosts is one scenario; another silly one I hit in the past was needing to run only on specific nodes because only they were licensed to run some 3rd-party software, etc.

  1. The first approach is to simply have different Reception.Keys for the different workers. As a worker starts, it knows what node it is on, and registers itself with the "high-gpu-workers" key rather than just the "workers" key. The nodes which care about these workers subscribe to that key and find them that way (see the sketch after this list). You can also register an actor using multiple keys, if you need to.

  2. The second approach is one we don't have in our Swift cluster yet, but which I think would be nice to add: "node roles". It's a thing we did in the past in Akka, whose node roles are just strings you can attach to nodes (see the Swift distributed actor cluster library's UniqueNode; we'd add it there). Then, through cluster membership, you'd know the capabilities of specific nodes and could decide which node and actor to use (since each distributed actor has an ActorAddress as its ID, and an address contains a UniqueNode, you know what capabilities the node the actor is located on has).
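
To make the first approach concrete, here's a minimal sketch (same work-in-progress API caveat as the example above; the GPUWorker type and its WorkItem/Result are placeholders made up for illustration). A node that knows it has the right hardware registers its worker under the capability-specific key:

extension Reception.Key {
  static var gpuWorkers: Self<GPUWorker> { "high-gpu-workers" }
}

distributed actor GPUWorker {
  init(transport: ActorSystem) async {
    // Only create this actor on nodes that actually have the GPU;
    // registering under the capability-specific key makes it
    // discoverable by anyone subscribing to .gpuWorkers:
    transport.receptionist.register(self, withKey: .gpuWorkers)
  }

  distributed func work(on item: WorkItem) async throws -> Result {
    // ... run the GPU-accelerated computation and return its result ...
  }
}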

Hope this helps :)


This is very helpful. Thank you!

Thinking about this a little more: in the GPU example, do you have an idea of how that could be implemented?

Specifically, how would the actor know how to use the GPU? If I'm following what you are saying, we would create an actor type that has a specific Reception.Key. In doing so, we could give that actor type an initializer that takes in some GPU-specific transport information, which tells the actor how to communicate with the GPU when dispatching work.

Does that make sense? Or am I missing something?

I'm not quite sure what you're asking with "how would the actor know how to use the GPU?"

I mean, you write the code in the actor, so it's up to you to put whatever "this code needs GPU stuff" into an actor that will be run on a node that "has the GPU" (whatever that specifically means, e.g. a high-GPU instance on EC2 or something else).

This is perhaps either weirdly phrased, or misunderstands what actors are doing (or I don't understand the sentence)? Actors are not going to "use the GPU" in some magical way. Distributed actors are just a communication mechanism: whatever code you already have that requires or makes use of GPU acceleration would just be sitting there as usual ("in the actor"), and the actor only serves as a nice way to discover and communicate with any such actor(s).

In practice the actor code is just:

if isHighGPUInstance { // placeholder: however you detect this node's hardware
  ... = HighGPUWorker(system: cluster, ...)
} else {
  // normal node... don't spawn high gpu workers
}

distributed actor HighGPUWorker {
  init(system: ActorSystem, ...) {
    // register under the GPU-specific key so others can discover this worker:
    system.receptionist.register(self, withKey: .highGPUWorkers)
  }
}

// others are listening for .highGPUWorkers
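
The listening side mirrors the WorkerPool from the top of the thread; a minimal sketch, again with the work-in-progress APIs, defining the key and a pool subscribed to it:

extension Reception.Key {
  static var highGPUWorkers: Self<HighGPUWorker> { "high-gpu-workers" }
}

distributed actor HighGPUWorkerPool {
  var workers: Set<HighGPUWorker> = []

  init(transport: ActorSystem) async {
    Task {
      // Collect only workers registered under the GPU-specific key:
      for try await worker in transport.receptionist.subscribe(.highGPUWorkers) {
        workers.insert(worker)
        watchTermination(of: worker) {
          workers.remove($0)
        }
      }
    }
  }
}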

You may want to watch the talk we did about this recently: [Video] Distributed Actors announced at Scale by the Bay. Maybe that'll help with wrapping your head around the usage patterns :)


I think there was a misunderstanding on my part; the example implementation cleared it up.

Thank you for the thoughtful response!
