Awaiting readiness of a service (swift-service-lifecycle)

I am in the process of kicking a few more of my legacy modules into beautiful swift-6-fully-structured-concurrency territory, and working with ServiceGroups and graceful shutdown handling is generally very nice.

The biggest thing that has been missing for me was a clean way to know whether a service has "set up" its stuff and is truly ready for business. So I have been playing around, and I think I like the general direction. It is still baking, so I felt a forum discussion might be more appropriate than a GitHub PR. (see code below)

A few thoughts from my side:

  • could be used to compose groups/trees of "stateful" services to have combined "readiness" and a more defined start-up sequence
  • "StatefulService" may be a bit thick, since there is only readiness
  • but: I think the readiness is the only truly "standardizable" one
  • trying to be flexible over more statuses increases complexity by a lot, and you'll always race with the run anyway
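
To make the first bullet concrete, here is a rough, self-contained sketch of composed readiness. `HasReadiness` and `ReadyGroup` are made-up names just to show the shape, not proposed API:

```swift
// Sketch only: a parent whose readiness means "all children are ready".
// HasReadiness stands in for the StatefulService idea below, without the
// ServiceLifecycle dependency.
protocol HasReadiness: Sendable {
    var readiness: Void { get async throws }
}

struct ReadyGroup: HasReadiness {
    let children: [any HasReadiness]

    var readiness: Void {
        get async throws {
            try await withThrowingTaskGroup(of: Void.self) { group in
                for child in children {
                    group.addTask { try await child.readiness }
                }
                // resolves once every child is ready;
                // the first child failure propagates out
                try await group.waitForAll()
            }
        }
    }
}
```

Since `ReadyGroup` itself conforms, groups nest into trees, which also gives you a more defined start-up sequence if you await readiness level by level.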

So, here is what I got so far:

import ServiceLifecycle

public protocol StatefulService: Service {
    var readiness: Void { get async throws }
}

actor MyService: StatefulService {
    private let readinessSource = ReadinessSignalSource()

    var readiness: Void {
        get async throws { // this is a bit awkward maybe?
            try await readinessSource.waitUntilSignaled()
        }
    }

    func run() async throws {
        try await readinessSource.withSignal { signalReadiness in
            // connecting, preparing, making sure we are open for business
            await Task.yield()

            signalReadiness()

            // run until shutdown
            try await gracefulShutdown()
        }
    }
}

import Synchronization

public struct ReadinessSignalSource: Sendable, ~Copyable {
    typealias Suspension = CheckedContinuation<Void, any Error>

    enum State {
        case waiting([Int: Suspension])
        case readinessSignaled
        case failed(any Error)
    }

    private let criticalState = Mutex<State>(.waiting([:]))
    private let _id = Atomic<Int>(0)

    public init() {}

    public func waitUntilSignaled() async throws {
        let id = nextId()

        try await withTaskCancellationHandler {
            try await withCheckedThrowingContinuation { continuation in
                criticalState.withLock { state in
                    switch state {
                    case var .waiting(continuations):
                        continuations[id] = continuation
                        state = .waiting(continuations)
                    case .readinessSignaled:
                        continuation.resume()
                    case let .failed(error):
                        continuation.resume(throwing: error)
                    }
                }
            }
        } onCancel: {
            criticalState.withLock { state in
                switch state {
                case var .waiting(continuations):
                    let continuation = continuations.removeValue(forKey: id)
                    continuation?.resume(throwing: CancellationError())
                    state = .waiting(continuations)
                case .readinessSignaled, .failed:
                    break // continuation already resumed
                }
            }
        }
    }

    public func withSignal(isolation: isolated (any Actor)? = #isolation, _ operation: (_ signalReadiness: borrowing () -> Void) async throws -> Void) async rethrows {
        do {
            try await operation { self.signalReadiness() }
            signalReadiness() // TODO: should a completed body be its own status, and maybe throw when awaited?
        } catch {
            signalFailure(error)
            throw error // propagate so the caller (e.g. the ServiceGroup) still sees the failure
        }
    }

    func signalReadiness() {
        criticalState.withLock { state in
            switch state {
            case let .waiting(continuations):
                for cont in continuations.values {
                    cont.resume()
                }
                state = .readinessSignaled
            case .readinessSignaled:
                break // this is fine
            case .failed:
                preconditionFailure("Readiness signaled after failure")
            }
        }
    }

    func signalFailure(_ error: any Error) {
        criticalState.withLock { state in
            switch state {
            case let .waiting(continuations):
                for cont in continuations.values {
                    cont.resume(throwing: error)
                }
                state = .failed(error)
            case .readinessSignaled:
                break // maybe not ideal?
            case .failed:
                break // keep the first failure
            }
        }
    }

    private func nextId() -> Int {
        _id.add(1, ordering: .relaxed).newValue
    }
}
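
For illustration, here is a condensed, self-contained variant of the same Mutex-plus-continuation pattern (one-shot only, no failure or cancellation handling), to show the intended semantics in isolation:

```swift
import Synchronization

// Condensed sketch of the one-shot gate idea: waiters that arrive before
// the signal suspend; anyone waiting after it returns immediately.
final class OneShotGate: Sendable {
    private enum State {
        case waiting([CheckedContinuation<Void, Never>])
        case signaled
    }

    private let criticalState = Mutex<State>(.waiting([]))

    func wait() async {
        await withCheckedContinuation { continuation in
            criticalState.withLock { state in
                switch state {
                case var .waiting(continuations):
                    continuations.append(continuation)
                    state = .waiting(continuations)
                case .signaled:
                    continuation.resume()
                }
            }
        }
    }

    func signal() {
        criticalState.withLock { state in
            if case let .waiting(continuations) = state {
                for continuation in continuations {
                    continuation.resume()
                }
            }
            state = .signaled
        }
    }
}
```

The real ReadinessSignalSource above adds failure propagation and per-waiter cancellation on top of exactly this skeleton.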

// This is especially nice for testing;
// I am playing with an extension like this, for example:
public extension Service {
    func whileRunningAndReady<T>(_ body: () async throws -> T) async throws -> T {
        let group = ServiceGroup(services: [self], logger: .init(label: "whileRunning"))
        async let serviceRun: () = group.run()

        if let stateful = self as? any StatefulService {
            try await stateful.readiness
        }

        let result = await Result { try await body() }

        await group.triggerGracefulShutdown()

        switch result {
        case let .success(value):
            try await serviceRun
            return value
        case let .failure(error):
            try? await serviceRun // TODO: maybe log shutdown error
            throw error
        }
    }
}

// used like this
func myTest() async throws {
    let service = MyService()
    try await service.whileRunningAndReady {
        let result = try await service.doSomething()
        #expect(result == "foo")
    }
}


@FranzBusch since we have chatted about something like this many a time before ;) WDYT?
I wonder if this is something we could see in the swift-service-lifecycle package directly.

I am not very familiar with the design thinking behind ServiceLifecycle, so sorry if this is a naive question:

If you want to extend ServiceLifecycle with a mechanism to handle service readiness, shouldn't run() only be called after a service has become ready?

With what you are suggesting, readiness is part of / mixed into run(), if I understand correctly.

It's all about that gosh-darn structured concurrency - with its inescapable cleanliness and order, calling to you in your dreams ^^

In all seriousness:
A big issue with "splitting" prepare from run is that you have no task to run anything on in the meantime: when the scope of prepare ends, nothing can keep executing. Everything you prepare would then have to be "offline-storable" - or you would be forced to run an unstructured Task in the background and manage it somehow, at which point you are outside of structured concurrency.

Trust me, I have been there and back and wondered the same as you. But it is pretty clear where the ecosystem is headed...

I see. I was vaguely comparing this against OSGi's bundle lifecycle, where installed and resolved are different states, and a bundle can only start once it's in the resolved ("ready") state.

But it looks like this comparison doesn't make much sense.

I thought about a more "traditional" state-transition mechanism as well.

You could have a "StateTransitionSource" or similar that vends AsyncSequences to the outside world to inform you about the current and future states the service goes through (all while running).
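
For reference, a rough self-contained sketch of that idea ("StateTransitionSource" and the state names are hypothetical):

```swift
// Hypothetical state-transition source: vend an AsyncStream of states
// instead of a single readiness signal.
enum ServiceState: Equatable, Sendable {
    case starting
    case ready
    case shuttingDown
}

struct StateTransitionSource: Sendable {
    let states: AsyncStream<ServiceState>
    private let continuation: AsyncStream<ServiceState>.Continuation

    init() {
        (states, continuation) = AsyncStream.makeStream(of: ServiceState.self)
    }

    func transition(to state: ServiceState) {
        continuation.yield(state)
    }

    func finish() {
        continuation.finish()
    }
}

// Even the "prime use case" - just waiting until ready - now needs a loop:
func waitUntilReady(_ source: StateTransitionSource) async {
    for await state in source.states where state == .ready {
        return
    }
}
```

Note that an AsyncStream only supports a single consumer, so multiple waiters would already need extra machinery here - which is part of why I think plain readiness is the better trade-off.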

But then the IMHO "prime use case" of just waiting until the internal plumbing is set up gets a lot more involved, and all the other status transitions are kind of meaningless since graceful shutdown and cancellation are already handled in a structured way.

That is why I believe "readiness" is the only signal worth vending.