Building Durable Workflows in Swift with Distributed Actors

As mentioned in this post last week I made a small presentation at our Swift community here in Munich, it was about showcasing distributed actors with examples of what you can build. One of them was Durable workflows. There were some questions about implementation details, so I decide to make a separate post explaining how exactly distributed actors are involved here.

The Examples directory has a full working version with a web frontend for anyone interested to check:


There is a class of problem that keeps appearing in backend systems: you need to run a multi-step process where each step calls a different external service, e.g. charge a card, book a flight, send a confirmation—and you need this to be reliable. Not just retryable, but truly durable.

The first thought is usually a database transaction—but that doesn't work for operations that take minutes, call external APIs, or span multiple services. You need each step to commit immediately and independently. That means if step 3 fails after steps 1 and 2 already succeeded, there is no rollback—you have to explicitly undo what was done: call the flight API to cancel, release the funds hold.

But there is another problem: what happens if the process crashes in the middle of the compensation itself? Or if it restarts and runs the whole thing from scratch, double-charging the user? You need the execution to be durable—able to survive process restarts and continue exactly where it stopped.

Systems like Temporal and AWS Step Functions exist to solve this. I wanted to explore whether Swift's distributed actors ecosystem already has the right primitives to build something similar without an external workflow server.


The Core Idea: Workflows as Replayed Event Logs

The approach, borrowed from event-sourced systems and Temporal's design, is to make the event log the source of truth. Current state is never stored directly—it is always reconstructed by replaying events. On restart, replay the log and arrive at the same state deterministically. When replaying, skip re-executing steps that already completed—return their cached results instead.

This means your workflow function is plain sequential Swift code:

@Workflow
public struct TravelBookingWorkflow {
    public typealias Activities = TravelBookingActivities

    public func run(input: TravelBookingRequest, context: WorkflowContext) async throws -> BookingResult {
        try await context.executeActivity(Activities.ReserveFunds.self, input: ...)
        let flightId = try await context.executeActivity(Activities.ReserveFlight.self, input: ...)
        let hotelId = try await context.executeActivity(Activities.ReserveHotel.self, input: ...)
        try await context.executeActivity(Activities.CaptureFunds.self, input: ...)
        return BookingResult(status: "Confirmed", ...)
    }
}

On crash and restart, the same function runs again. The first two executeActivity calls find their results in the event log and return immediately. The third call picks up fresh. No external state machine, no manual checkpointing—just a function that the runtime knows how to replay.


API Design: Inspired by the Swift Temporal SDK

The API design here is directly inspired by the Swift Temporal SDK. If you have seen it before, the WorkflowContext, executeActivity, and the separation between workflow definition and activity implementation will look familiar.

The Swift Temporal SDK also uses macros to generate type-safe wrappers around activity functions. This project follows the same approach. The @ActivityContainer and @Workflow macros expand at compile time and produce everything the runtime needs to dispatch and replay activities without any manual boilerplate.

Concretely, @ActivityContainer expands this:

@ActivityContainer
public struct TravelBookingActivities {
    @Activity
    public func reserveFlight(input: ReserveFlightRequest, context: ActivityContext) async throws -> String {
        // just a regular async function
    }
}

Into:

  • A handle(invocation:on:) function that switches on the activity name, decodes input, calls the right function, and encodes output
  • A nested Activities enum with one case per @Activity method, each conforming to ActivityReference with typed Input and Output

The result at the call site is fully type-safe:

// compiler enforces input type and return type
let flightId = try await context.executeActivity(
    TravelBookingActivities.Activities.ReserveFlight.self,
    input: ReserveFlightRequest(...)
)

@Workflow generates WorkflowProtocol conformance and derives a stable string identifier from the type name (TravelBookingWorkflow → "travel-booking") used for journal persistence and registry tracking.


Why Distributed Actors Fit Here

Isolation. Each workflow execution is its own actor—no shared state, no locks, no coordination between concurrent workflows.

Location transparency. Whether a WorkflowActor runs on this node or another does not affect how you call it.

Logical identity. With the VirtualActors plugin, actors are addressed by a logical ID, not a physical address. You ask for a workflow by ID and it is created on demand, anywhere in the cluster.

Persistence. The @EventSourced macro gives each actor a durable event log. State is reconstructed by replaying events, which is what makes crash recovery possible without coordination.

Combined, these four give you durability: a workflow actor can disappear—process crash, node restart, rebalancing—and come back on any node, replay its history, and continue as if nothing happened. No external orchestrator needed.

Observers are just actors. Because distributed actors are Codable and Sendable by default, you can pass any actor reference directly as part of an activity's input. It serializes as the actor's identity and resolves back into a live reference on the worker side. This means an activity can call back into domain state—update a balance, push a notification—just by calling a method on the actor it received as input.

@Activity
public func reserveFunds(input: BalanceRequest, context: ActivityContext) async throws {
    try await input.user.holdFunds(workflowID: context.workflowID, amount: input.amountCents)
    try await input.user.notifyWorkflowUpdate(id: context.workflowID)
}

AFAIK in Temporal this is not directly possible because activities are plain functions with no reference to cluster state. The classic solution is signals—string-named fire-and-forget messages sent into a running workflow, which requires designing a signal protocol, handling delivery ordering, and polling the workflow state from the observer side to know what happened. Temporal Updates improve on this by adding a request-response pattern, but it still means extra boilerplate and an explicit protocol on top of the framework.

Here, the observer is already a distributed actor in the cluster. You hand it to the activity as a typed parameter. The activity calls it directly when something happens. No signals, no polling, no extra protocol—it falls out naturally from the model.

You might wonder why not just call user.notify(...) directly in the workflow body between activities. The workflow body re-executes on every replay, and only executeActivity calls are served from cache—a direct actor call there would fire again on each replay, producing duplicates. Side effects belong inside activities.

The execution model maps cleanly:

  • Each workflow instance → one WorkflowActor<WorkflowType> (virtual, event-sourced)
  • Activity execution → DurableActivityDispatchWorker<WorkflowType> (distributed, in receptionist)
  • Cluster-wide tracking → WorkflowRegistry (cluster singleton, event-sourced)

The Architecture

WorkflowActor—the orchestrator

This is the core of the system. It is a virtual actor, so it is spawned on demand by ID and its state is fully persisted via event sourcing. Its job is to run the workflow function and record everything.

@EventSourced
public distributed actor WorkflowActor<WorkflowType: WorkflowProtocol>: VirtualActor {
    struct State: Sendable {
        var status: WorkflowStatus = .idle
        var inputData: Data?
        var activityOutcomes: [Int: ActivityOutcomeRecord] = [:]
        var events: [WorkflowEvent] = []
    }
}

activityOutcomes is the replay cache—it maps activity execution index to the stored result. First run: empty, every activity dispatches to a real worker. After a crash and restart: already-completed activities are served from cache instantly, the next pending one is dispatched fresh.

Events emitted:

public enum WorkflowEvent: Codable, Sendable {
    case executionStarted(inputData: Data)
    case activitySucceeded(index: Int, name: String, outputData: Data)
    case activityFailed(index: Int, name: String, failure: ActivityFailurePayload)
    case executionCompleted(outputData: Data)
    case executionCancelled
    case executionFailed(message: String)
}

Every state change is an event, persisted immediately. On actor init, the journal replays them through handleEvent, reconstructing activityOutcomes and status deterministically. If status comes back as .running, the actor resumes automatically.

ActivityExecutionCursor—replay position

When _run() starts, it creates a cursor seeded with the cached outcomes:

let cursor = ActivityExecutionCursor(cachedOutcomes: self.state.activityOutcomes)

Every context.executeActivity call goes through the cursor:

func nextCall() -> (index: Int, cached: ActivityOutcomeRecord?) {
    let index = self.nextIndex
    self.nextIndex += 1
    return (index, self.cachedOutcomes[index])
}

The workflow function always calls activities in the same order. The cursor hands out sequential indices. Cached at that index → return immediately. Not cached → dispatch for real. The cursor is a separate actor (not a struct) because concurrent activity calls from the workflow need safe access to the counter.

WorkflowContext—what the workflow sees

This is passed to workflow.run(input:context:) and exposes two things:

public func executeActivity<ActivityType: ActivityReference>(
    _ activity: ActivityType.Type,
    options: ActivityOptions = .init(),
    input: ActivityType.Input
) async throws -> ActivityType.Output

public func getActor<ActorType: VirtualActor>(
    identifiedBy id: VirtualActorID,
    dependency: any Sendable & Codable
) async throws -> ActorType

executeActivity checks the cursor, returns cached or dispatches and emits an event on completion. getActor lets workflows resolve virtual actors and pass them as typed inputs to activities—which is how activities can call back into domain state without going through a shared service layer.

DurableActivityDispatchWorker—the activity executor

Workers are distributed actors registered in the receptionist under a key scoped to their workflow type. The WorkflowActor holds a WorkerPool that discovers them dynamically and dispatches in round-robin.

distributed public func submit(work: ActivityInvocation) async throws -> ActivityInvocationResult {
    do {
        let output = try await container.handle(invocation: work, on: self.actorSystem)
        return .success(outputData: output)
    } catch let applicationError as ApplicationError {
        // typed errors are preserved across the wire as ActivityFailurePayload
        ...
    }
}

The container is the macro-generated struct that routes by activity name. All the dispatch boilerplate is generated, nothing written by hand.

One nice property: when WorkerPool picks a worker on the same node as the WorkflowActor, ClusterSystem detects the call is local and bypasses network transport entirely.

When a worker node restarts, new workers observe the cluster membership event for their own node coming up and call recoverAll()—re-instantiating any WorkflowActors that were running when the node went down. Those actors replay their journals and self-resume.

WorkflowRegistry—the recovery index

A cluster singleton, also event-sourced, storing a map of workflow ID → workflow type name. It exists for one reason: so that on recovery you can ask "what was running?" and re-instantiate the right virtual actors. It is not the source of truth for workflow state—that lives in each WorkflowActor's own event log. The registry is just a lookup index.

@EventSourced
public distributed actor WorkflowRegistry: ClusterSingleton {
    struct State: Sendable {
        var running: [String: String] = [:]  // id → typeName
    }
}

Writing Your Own Plugin

The ClusterPlugin protocol is what ties everything together. A plugin gets access to the ClusterSystem at startup and can register singletons, virtual actors, or any other infrastructure. It is the right place for cross-cutting concerns that need to be set up once and accessed everywhere.

public actor DurableWorkflowsPlugin: Plugin {
    public func start(_ system: ClusterSystem) async throws {
        self.actorSystem = system
        self.registry = try await system.singleton.host(name: "workflow-registry") { actorSystem in
            try await WorkflowRegistry(actorSystem: actorSystem)
        }
    }
}

The plugin hosts the WorkflowRegistry as a cluster singleton—exactly one instance cluster-wide, migrates automatically if its node goes down. It exposes the public API (execute, getStatus, cancel, recoverAll) as a facade over virtual actor resolution.

Registering it:

settings.plugins.install(plugin: ClusterSingletonPlugin())
settings.plugins.install(plugin: ClusterVirtualActorsPlugin())
settings.plugins.install(plugin: ClusterJournalPlugin { _ in FileEventStore(...) })
settings.plugins.install(plugin: DurableWorkflowsPlugin())

And a ClusterSystem extension makes it accessible everywhere without passing it around:

extension ClusterSystem {
    public var workflows: DurableWorkflowsPlugin { ... }
}

What I found nice about the plugin system is that it keeps infrastructure setup out of application code. If your app needs a workflow runtime, it declares it in settings. The combination of Plugin + ClusterSingleton + VirtualActors maps well to the layers of concern: plugin owns the singleton infrastructure (registry), virtual actors own per-entity state (each workflow), plugin is the facade that ties them together.


Usage

Starting a workflow:

let bookingId: UUID = ...
let result = try await system.workflows.execute(
    type: TravelBookingWorkflow.self,
    options: WorkflowOptions(id: "booking-\(bookingId)"),
    input: TravelBookingRequest(...)
)

Cancelling:

try await system.workflows.cancel(type: TravelBookingWorkflow.self, options: options)

Querying status:

let info = try await system.workflows.getStatus(type: TravelBookingWorkflow.self, options: options)
// info.status: .running / .completed / .cancelled / .failed
// info.events: full activity history

What's Next

  • Snapshotting. Long-running workflows accumulate large event logs. A room for improvement for event-sourcing plugin, and since this belongs there, workflow implementations would stay completely unchanged—a good example of why separating these concerns pays off.
  • Timeouts and timers. ActivityOptions.startToCloseTimeoutMillis is already in the API but not enforced yet. More importantly, durable timers—like Temporal's workflow.sleep(for: .days(7))—where the delay is recorded in the event log and survives restarts, explaining simply Task.sleep will be lost on crash and a durable timer won't.
  • Cancellation. Deliberately kept simple—task.cancel() propagates through Swift's structured concurrency. This simplified the implementation significantly and leaves room for improvement.

What This Shows

The point is that Swift already has the right primitives. Distributed actors give you isolated, location-transparent execution. Event sourcing gives you durability without coordination. Virtual actors give you lifecycle management without explicit spawning. Cluster singletons give you coordination points that survive node failures. Each primitive pulls its weight.

And the user-facing API ends up being just a struct with a run function and some methods annotated with @Activity. The entire framework—plugin, runtime, registry, macros—is roughly ~1k lines of Swift. That covers crash recovery, distributed activity dispatch, deterministic replay, and type-safe macro-generated boilerplate. Event sourcing and virtual actors, which do the heavy lifting underneath, are not large libraries either. It is a good demonstration of what the right primitives can do for you.

This is a research experiment for presentation and obviously has not been tested in production. The framework provides at-least-once guarantees for activity execution, though, which is reasonable for a lot of real use cases. I plan to try it exactly in that context—as part of the distributed Swift chat application described in this thread.


Thanks for reading, see you in the next one! :upside_down_face:

17 Likes