This is the second part of a series about building distributed systems in Swift. The previous post covered durable workflows. This one started from a different question: can you build a framework where LLM execution happens locally or remotely based on some criteria, while completely hiding that complexity from the caller? That idea is also where the name comes from—Seamless.
We'll see again how distributed actors transform what looks like a quite complex task—multi-client streaming, local/remote routing, session persistence, transcript sync—into something that is actually fun to build.
The full code is on GitHub. The Examples/Conversation directory has a full working version with a native SwiftUI app alongside a web frontend, both connected to the same backend:
FoundationModels
WWDC 2024 introduced Apple Intelligence; WWDC 2025 followed up with a developer-facing framework called FoundationModels. Starting with iOS 26 / macOS 26, you can run the same on-device model that powers Writing Tools and Siri suggestions directly from your app.
The API is honestly quite good. You annotate a struct with @Generable and the compiler generates everything needed to stream structured output token by token:
@Generable
struct TripPlan {
    @Guide(description: "An exciting name for the trip.")
    let title: String

    @Guide(.count(3), description: "Three day-by-day plans.")
    let days: [TripDay]
}

let session = LanguageModelSession(instructions: "Plan a 3-day trip from this request")

for try await partial in session.streamResponse(to: prompt, generating: TripPlan.self) {
    render(partial) // partial.days might already have 2 items while title is still being generated
}
The big limitation—and it is a real one—is that FoundationModels only runs on Apple platforms. No Linux, no server, no browser. If you want a web frontend for the same conversation, or you want to support clients that do not have the model available, you need something else.
When I started thinking about this framework, I hadn't really considered that limitation; it only became obvious later. But it turned out to be solved naturally just by going with this architecture.
LLMs Have No Memory
The model itself remembers nothing—every generation call is stateless. What looks like "memory" is just the app replaying the full transcript on every call. This works fine for one client on one device, but the moment you have multiple clients, server restarts, or users coming back later, you have a real state management problem.
There are many solutions: sliding window, RAG, dedicated memory layers like Mem0, managed services. But I was curious whether event sourcing could make this a natural consequence of the architecture rather than a separate concern—and give you history replay, crash recovery, and late-joining clients all from the same mechanism.
Sessions as Event-Sourced Actors
The core idea: model a conversation as an event-sourced distributed actor. Every user message and every completed AI response is an event. The conversation history is the event log. A client that connects late just replays the same log. A restart is just a replay from the journal.
@EventSourced
distributed actor Session {
    typealias ActorSystem = ClusterSystem
    typealias Event = StoreEvent

    private var recentMessages: [StreamMessage] = []
    private let engine: SeamlessEngine
}
The event type is small:
enum StoreEvent: Codable, Sendable {
    case message(StreamMessage)
    case transcript(id: String, Transcript)
}
message captures both user input and completed responses. transcript is the interesting one—it records the raw model transcript so that if the session gets evicted and re-instantiated on a different node, the new engine instance does not just have a list of message strings to show users; it has the actual conversation context to continue generating from. That distinction matters: replaying messages gives you history; replaying the transcript gives the model its memory back.
distributed func handleEvent(_ event: Event) {
    switch event {
    case .message(let message):
        self.recentMessages.append(message)
    case .transcript(let id, let transcript):
        Task { await self.engine.updateSession(with: id, transcript: transcript) }
    }
}
When a new client connects, the session broadcasts recentMessages. History is free—it is just what the event log already has.
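A minimal sketch of what that connect path could look like. `ClientRef`, `receive`, and `connectedClients` are illustrative names, not the framework's actual API:

```swift
// Hypothetical sketch; `ClientRef`, `receive`, and `connectedClients`
// are assumed names, not the framework's actual API.
distributed func connect(_ client: ClientRef) async throws {
    // "History" is nothing extra: the new client simply replays the
    // messages the event log already produced.
    for message in recentMessages {
        try await client.receive(message)
    }
    connectedClients.append(client)   // assumed broadcast list
}
```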
Virtual Actors Handle the Lifecycle
Conversations are addressed by ID. The virtual actors plugin takes care of the rest:
self.session = try await actorSystem.virtualActors.getActor(
    identifiedBy: .init(rawValue: "seamless-session-\(conversationID)"),
    dependency: Session.Dependency(id: conversationID)
)
If the session does not exist yet, it gets created. If it is already running on another node, you get a reference to it. If its node went down, it comes back on another node, replays the journal, and continues—with both the message history and the model's transcript context restored.
And because Session is a distributed actor, different sessions can run on different nodes in the cluster. That opens up interesting possibilities: route a session to a node with a GPU, or to one with a domain-specific fine-tuned model, or just spread load across machines with different hardware. The session API stays the same regardless of where it ends up running.
Closing the app and coming back an hour later just works. Not because of reconnection logic, but because that is what an event-sourced virtual actor does by default.
Bridging Apple Platforms and the Web
Since FoundationModels only runs on Apple platforms, the backend here runs on macOS too—no Linux cluster. The architecture would support a mixed cluster if you had nodes that could run inference, but right now it is just Mac. Web clients connect to the same Mac backend over HTTP/WebSocket.
SeamlessClient handles routing:
public enum ExecutionTarget {
    case local                 // FoundationModels, no network
    case localAndRemote(URL)   // on-device for simple, cluster for complex
    case remote(URL)           // cluster only
}
The native app uses .localAndRemote. The routing decision is made by a small local classifier—itself using FoundationModels—that labels the incoming prompt as easy or hard. Easy goes on-device. Hard goes to the cluster. If the cluster call fails, it falls back to local.
@Generable
fileprivate enum PromptComplexity: String {
    case easy
    case hard
}
Using a model to route to a model is a bit circular, but in practice the classification call is cheap and fast, and it means the routing decision is based on the actual meaning of the prompt rather than just counting tokens. This is a quick implementation—in a real system the routing strategy could be more deterministic, rule-based, or fully configurable per use case.
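Put together, the .localAndRemote flow could look roughly like this. `classifierSession`, `runLocal`, and `runRemote` are assumed names standing in for the actual implementation:

```swift
// Hedged sketch of classifier-based routing in .localAndRemote mode;
// `classifierSession`, `runLocal`, and `runRemote` are assumptions.
func dispatch(_ prompt: String) async {
    // Cheap local classification call using the on-device model.
    let verdict = try? await classifierSession.respond(
        to: prompt,
        generating: PromptComplexity.self
    )
    switch verdict?.content ?? .easy {
    case .easy:
        await runLocal(prompt)              // on-device FoundationModels
    case .hard:
        do { try await runRemote(prompt) }  // route to the cluster
        catch { await runLocal(prompt) }    // cluster failed: fall back
    }
}
```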
The web app uses .remote. It connects to the same session as the native app, sees the same history, and can participate in the same conversation. The session does not know or care what transport is on the other end.
Typed Schemas Across the Whole System
Every output type is described by a SeamlessSchema:
public protocol SeamlessSchema: Codable, Sendable {
    static var identifier: String { get }
    static var instructions: String? { get }
}
On Apple platforms, schema types are also annotated with @Generable. The identifier travels with every message through the network so the right type can be reconstructed at any point in the pipeline. Partial results are encoded as Data for transport and decoded back to S.PartiallyGenerated on the receiving end—so the native app streams a live structured value field by field while the web app just gets text.
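A sketch of that wire format under the assumption of a JSON encoding; `WirePartial` and its field names are illustrative, not the framework's actual types:

```swift
import Foundation

// Illustrative wire envelope: the schema identifier travels alongside
// an opaque payload so the receiver knows what to decode.
struct WirePartial: Codable, Sendable {
    let schemaIdentifier: String   // e.g. "chat.emoji.reaction.v1"
    let payload: Data              // an encoded partial value
}

// The receiver checks the identifier, then decodes the payload back
// into the concrete partial type (which must be Codable, matching the
// manual conformance the framework currently requires).
func decodePartial<P: Decodable>(
    _ wire: WirePartial,
    as type: P.Type,
    expecting identifier: String
) throws -> P {
    guard wire.schemaIdentifier == identifier else {
        throw DecodingError.dataCorrupted(.init(
            codingPath: [],
            debugDescription: "Schema mismatch: \(wire.schemaIdentifier)"
        ))
    }
    return try JSONDecoder().decode(P.self, from: wire.payload)
}
```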
Since SeamlessSchema builds on top of @Generable, you just conform your existing types—no extra boilerplate. EmojiReaction is a good example:
@Generable
public struct EmojiReaction: SeamlessSchema {
    public static let identifier = "chat.emoji.reaction.v1"
    public static let instructions: String? = "Return exactly three emoji reactions"

    @Guide(.count(3))
    public let emojis: [String]
}
Not every LLM call needs to be a streaming conversation. For a single request-response—generate and return—there is respond(to:):
let reaction: EmojiReaction = try await client.respond(to: message)
Same routing rules apply. On the backend, one-shot requests go through a pool of ResponseWorker actors, separate from the session. The web app emoji button uses this—no WebSocket or JSONL, just a regular HTTP call that returns a typed result.
One thing that is easy to miss: stream() returns AsyncThrowingStream<SeamlessMessage<S>, Error>—fully generic, fully type-safe—even when S is being generated on a remote node. The type flows end-to-end: from the schema definition, through the network, back to the caller. In distributed systems this is not something you get for free, and it is one of the places where Swift's type system really shows its strength.
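From the caller's side, that end-to-end typing reduces to a plain for-await loop. The `stream` argument labels and the `partial` accessor below are assumptions based on the signature quoted above:

```swift
// Illustrative consumption of the typed stream; `stream(_:prompt:)`
// and `message.partial` are assumed names, not the confirmed API.
for try await message in client.stream(EmojiReaction.self, prompt: text) {
    if let partial = message.partial {
        // The partial may hold one or two emojis while the third is
        // still being generated on whichever node ran the model.
        render(partial.emojis)
    }
}
```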
There is one rough edge right now—you have to manually add conformances for the generated partial type:
extension EmojiReaction.PartiallyGenerated: Codable, Sendable {}
This is boilerplate that should not be the user's problem. SeamlessSchema will probably become a macro at some point to hide this completely.
The Web App
To test whether this actually works with non-Apple clients, I built a small web frontend using Elementary and HTMX over WebSocket. The server renders HTML fragments and pushes them as WebSocket text frames. HTMX swaps them into the DOM—almost no browser-side JavaScript.
The architecture supports the native app and web app being open at the same time on the same conversation—both connected to the same session actor, session broadcasting to all of them. Multi-client sync still needs some work to be fully reliable, but the foundation is there.
What Is Still Missing
Snapshotting. Long conversations accumulate large event logs. Same problem as in the durable workflows project, same answer: it belongs in the event sourcing layer.
Context window handling. Right now the engine prunes the transcript when it hits the limit, which is lossy. A summarization step before pruning is probably a better approach.
Multi-model routing. One session, one engine. Routing different schemas to different models, or mixing local and remote within a single session, is something the schema identifier could support but does not yet.
Multi-client sync. The session broadcasts to all connected clients, but clients currently have no stable identity—so the session cannot track per-client state, know who has received what, or handle reconnects correctly.
Remote reconnection. In .localAndRemote mode, if the remote stream is not established at startup or drops during a session, the client does not attempt to reconnect. Resuming the remote stream transparently, without interrupting the local fallback, is still missing.
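The summarize-before-prune idea from the context window point could be sketched like this, with the transcript simplified to strings and `summarize` standing in for an LLM call. None of this exists in the framework today:

```swift
// Hedged sketch of summarize-before-prune; `summarize` stands in for
// an LLM call and the transcript shape is simplified to strings.
func compact(
    _ transcript: [String],
    keepingLast n: Int,
    summarize: ([String]) -> String
) -> [String] {
    guard transcript.count > n else { return transcript }
    let pruned = Array(transcript.dropLast(n))
    // Replace the pruned prefix with a summary so the model keeps
    // long-range context instead of losing it outright.
    return ["Earlier conversation (summary): \(summarize(pruned))"]
        + transcript.suffix(n)
}
```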
The hard parts here—session durability, history replay, multi-client broadcasting, crash recovery—are not specific to LLMs. They are distributed systems problems, and they have clean solutions in the distributed actors ecosystem. Framing a conversation as an event-sourced virtual actor means the memory problem is solved at the infrastructure level, not in application code.
Where This Could Go
Beyond chat. The schema-based approach means LLM features in an app do not have to look like a chat interface. Any type conforming to SeamlessSchema is streamed through the same session, persisted in the same event log, and delivered to all connected clients. One session could generate emoji reactions, summaries, structured annotations, and conversational replies—all from the same actor, all typed end to end. The UX does not have to be a message bubble list. And since all of this—streaming, one-shot responses, multi-client sync, local and remote routing—flows through the same primitives, you get it all just by defining a type.
Different models. My next step is to try AnyLanguageModel from HuggingFace—a drop-in replacement for FoundationModels that keeps the same API (@Generable, @Guide, LanguageModelSession) while adding support for other backends. Since SeamlessSchema is built on top of these primitives, switching the underlying model should not require touching the session layer at all. Also FoundationModels supports custom adapters trained with LoRA, so there is room to fine-tune the on-device model for a specific domain without changing anything in the infrastructure.
A Note on OOP
Alan Kay, who coined the term, was asked on Quora what he thinks about Joe Armstrong claiming Erlang might be the only object-oriented language. His response:
I love Joe Armstrong — we lost a great man when he recently left us. And, he might be right. Erlang is much closer to the original ideas I had about 'objects' and how to use them.
Swift's distributed actors are explicitly inspired by Erlang and follow the same idea. Each actor has private state that you can only reach by sending a message; the runtime handles where the actor lives and what happens when a node goes down. A conversation maps naturally onto this: it is an isolated entity with its own context, and everything that happens to it comes in as a message. It turns out that is also a reasonable way to build LLM sessions.
IMHO a lot of what I see online around LLM infrastructure feels like reinventing things that already exist—it is just OOP, the way it was originally meant.
Thanks for reading!
There is also a small showcase of the let-it-crash philosophy I built along the way—I'll post it on GitHub, but I don't think it's worth a separate post. As for what's next, I'd rather focus on migrating the cluster system to Swift structured concurrency and picking up where I left off on the distributed actors chat implementation.
