[Pitch] Retry & Backoff

Hello everyone!

I've been working on a proposal to add retry functionality with backoff strategies to Swift Async Algorithms, and I'd like to pitch it.

Retry logic with backoff is a common requirement in asynchronous programming, especially for operations subject to transient failures such as network requests. Today, developers must reimplement retry loops manually, leading to fragmented and error-prone solutions across the ecosystem.

This proposal introduces a standardized retry function that handles these scenarios cleanly.

nonisolated(nonsending) func retry<Result, ErrorType, ClockType>(
  maxAttempts: Int,
  tolerance: ClockType.Instant.Duration? = nil,
  clock: ClockType = ContinuousClock(),
  operation: () async throws(ErrorType) -> Result,
  strategy: (ErrorType) -> RetryAction<ClockType.Instant.Duration> = { _ in .backoff(.zero) }
) async throws -> Result where ClockType: Clock, ErrorType: Error

Here are the two main use cases:

When you control the retry timing:

let rng = SystemRandomNumberGenerator() // or a seeded RNG for unit tests
var backoff = Backoff
  .exponential(factor: 2, initial: .milliseconds(100))
  .maximum(.seconds(10))
  .fullJitter(using: rng)

let response = try await retry(maxAttempts: 5) {
  try await URLSession.shared.data(from: url)
} strategy: { error in
  return .backoff(backoff.nextDuration())
}

When a remote system controls the retry timing:

let response = try await retry(maxAttempts: 5) {
  let (data, response) = try await URLSession.shared.data(from: url)
  if
    let response = response as? HTTPURLResponse,
    response.statusCode == 429,
    let retryAfter = response.value(forHTTPHeaderField: "Retry-After"),
    let seconds = Double(retryAfter)
  {
    throw TooManyRequestsError(retryAfter: seconds)
  }
  return (data, response)
} strategy: { error in
  if let error = error as? TooManyRequestsError {
    return .backoff(.seconds(error.retryAfter))
  } else {
    return .stop
  }
}

The design provides error-driven retry decisions and composable backoff strategies (constant, linear, exponential, decorrelated jitter). It also includes jitter support to prevent thundering-herd problems when multiple clients retry simultaneously.
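To make the jitter idea concrete, here is a rough sketch of "full jitter" as described in the AWS article the pitch draws on: grow the delay exponentially, cap it, then pick a uniformly random delay below that cap. The function name and the seconds-as-Double representation are illustrative only, not the pitched API.

```swift
// Illustrative sketch of full jitter; not the proposal's implementation.
func fullJitterDelay(
  attempt: Int,                 // 0-based retry attempt (kept small in practice)
  base: Double = 0.1,           // initial delay in seconds (100 ms)
  cap: Double = 10.0,           // never wait longer than this
  using generator: inout some RandomNumberGenerator
) -> Double {
  let exponential = base * Double(1 << attempt)   // base * 2^attempt
  let capped = min(cap, exponential)
  // Randomizing over the whole [0, capped] range spreads out clients that
  // would otherwise all retry at the same instant (thundering herd).
  return Double.random(in: 0...capped, using: &generator)
}
```

The key property is that two clients with the same failure history still pick different delays, which is exactly what breaks the herd.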

Please read the full proposal here.

Thank you.

17 Likes

I've recently updated this proposal since I initially pitched it. This is the change log:

1. Dropping support for DurationProtocol in linear and exponential backoff

Due to a flaw found by GitHub user bobergj, I made the decision to drop support for DurationProtocol in the linear and exponential backoff algorithms and use Duration instead. This also had the side effect that I had to bump the overall availability of this proposal's algorithms to AsyncAlgorithms 1.1.

Why?

If you consider this backoff:

 Backoff.exponential(factor: 2, initial: .seconds(5)).maximum(.seconds(120))

... it might seem like the calculation cannot overflow because it is capped at 120 seconds. However, the exponential function silently keeps multiplying on each retry, which will eventually overflow Duration (in the given example, after 64 retries).

One solution to this problem would be to stop multiplying once the maximum has been reached. However, this would only work if we baked the concept of "maximum" into the exponential backoff strategy, with the consequence that every "top level" backoff strategy would have to implement its own version of maximum.

The other solution would be to keep multiplying in exponential backoffs, but detect that the result has overflowed (using multipliedReportingOverflow) and then stop multiplying. This has the consequence that, as stated previously, DurationProtocol is not supported, because only Duration exposes such functionality.
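The clamp-on-overflow idea can be sketched on plain fixed-width integers, since multipliedReportingOverflow is what FixedWidthInteger (and hence Duration's underlying representation) provides. This is an illustrative sketch with hypothetical names, not the proposal's code:

```swift
// Exponential growth that detects overflow and saturates instead of trapping.
struct OverflowSafeExponential {
  private var current: Int64     // current delay, e.g. in milliseconds
  private let factor: Int64
  private var saturated = false

  init(initial: Int64, factor: Int64) {
    self.current = initial
    self.factor = factor
  }

  mutating func next() -> Int64 {
    let result = current
    if !saturated {
      let (value, overflow) = current.multipliedReportingOverflow(by: factor)
      if overflow {
        saturated = true         // stop multiplying; keep returning the last value
      } else {
        current = value
      }
    }
    return result
  }
}
```

Once saturated, the iterator simply keeps returning its last valid duration, which is the behavior the revised proposal describes.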

2. Dropping decorrelatedJitter

I might've been a bit overambitious in adding backoff strategies. I initially used this resource: Exponential Backoff And Jitter | AWS Architecture Blog as a reference for jitter variants. However, many retry frameworks in other programming languages do not support such a wide variety of backoff algorithms. Most support constant, exponential, and some form of randomization.

This is why I am dropping support for decorrelatedJitter and also considering dropping equalJitter (but keeping fullJitter) and linear (since exponential backoff seems to be the industry standard for retry scenarios). I am curious, though, what people think about this. Please provide feedback on which timing functions or backoff variants you'd like to use or are currently using.

Since there is no precedent for standalone retry-with-backoff in either Apple frameworks or Swift-related frameworks, this feels like quite a balancing act between not overdoing it and confusing adopters with too many options, while still providing enough that most use cases can be solved.

2 Likes

Just want to mention although I have not had the time to go through the proposal, I think the issue that you're trying to solve is very valid.

I had it on my long TODO list to create such a package that handles different kinds of backoffs, as I had to manually implement backoff strategies in DiscordBM and as I'll very soon need such backoff strategies in swift-dns's dns resolver.

As long as new strategies can be introduced in the future, I’m fine with going with a smaller set of strategies that are more common.

1 Like

I haven’t seen too many packages that contain proper backoff implementations in my day-to-day server-side use. My own DiscordBM is one of them; another is @adam-fowler's AWSClient in Soto.

I know he's on vacation, but I thought I'd mention him here just in case he has any special opinions, whenever he can catch up.

Soto is a very high quality package, and has proven itself in practice over the years, so I think Adam's opinion would be of value.

1 Like

Yes, this is definitely possible. I believe it'd probably require some sort of formal amendment or proposal, though.

1 Like

This is great. I’ve written so many versions of this over the years.

As far as I can tell, given BackoffStrategy is stateful, I would need to create a new instance of it each time I want to run a process using backoff. As a library author if I wanted to provide a method for users to define the backoff strategy parameters I would have to create separate types to define these and then create a BackoffStrategy from this new type. It might be useful for this proposal to include these defining types.

Interesting, which parameters would you like to see included? The timing function (i.e., only the BackoffStrategy) or also the number of attempts?

Possibly everything. I hadn't thought about max attempts, though, as this isn't defined in the BackoffStrategy. Out of interest, is there any reason this is separate from the BackoffStrategy?

Perhaps you could have something like this

let rng = SystemRandomNumberGenerator() // or a seeded RNG for unit tests
let backoffType = Backoff
  .exponential(factor: 2, initial: .milliseconds(100))
  .maximum(.seconds(10))
  .fullJitter(using: rng)
var backoff = backoffType.instantiate()

let response = try await retry(maxAttempts: 5) {
  try await URLSession.shared.data(from: url)
} strategy: { error in
  return .backoff(backoff.nextDuration())
}

Where backoffType is a Sendable type that conforms to BackoffStrategyType which is defined as

protocol BackoffStrategyType {
    associatedtype Strategy: BackoffStrategy
    func instantiate() -> Strategy
}
3 Likes

Mainly because maxAttempts does not fit BackoffStrategy. I thought a backoff strategy should not really be concerned with how often it computes a timing; that is related to retrying rather than to backoff.

If this were added to the configuration type, it should probably be called RetryStrategy rather than BackoffStrategy.

This reminds me of IteratorProtocol. I wonder if it'd make sense to imitate Sequence and IteratorProtocol like this:

public protocol BackoffStrategy<Duration> {
  associatedtype Iterator: BackoffIterator
  associatedtype Duration: DurationProtocol where Self.Duration == Self.Iterator.Duration
  func makeIterator() -> Iterator
}

public protocol BackoffIterator {
  associatedtype Duration: DurationProtocol
  mutating func nextDuration() -> Duration
}

Library authors could make their API accept BackoffStrategy where concrete strategies can easily be made Sendable. The iterators will be stateful, while the backoff strategy itself will be stateless. Strategies could still be composed like this:

let backoffType = Backoff
  .exponential(factor: 2, initial: .milliseconds(100))
  .maximum(.seconds(10))
  .fullJitter(using: rng)

... and it even shares similarities with already existing concepts within Swift, like (lazy) sequences or asynchronous sequences.

Very valuable input. Thank you, @adam-fowler.

1 Like

Thanks again for the feedback, @adam-fowler. I've revised the design based on our discussion. Here's what changed:

BackoffStrategy / BackoffIterator Split

Before: BackoffStrategy was a stateful protocol with a single mutating func nextDuration() method. Each call mutated the strategy to track progression (attempt count, previous duration, etc.). This meant strategies couldn't be Sendable since they needed to be mutated during use.

After: Following the Sequence/IteratorProtocol pattern, BackoffStrategy is now an immutable configuration type that creates a BackoffIterator via makeIterator(). The strategy itself holds no mutable state, all state lives in the iterator:

public protocol BackoffStrategy<Duration> {
  associatedtype Iterator: BackoffIterator
  associatedtype Duration: DurationProtocol where Duration == Iterator.Duration
  func makeIterator() -> Iterator
}

public protocol BackoffIterator {
  associatedtype Duration: DurationProtocol
  mutating func nextDuration() -> Duration
}

This means strategies are now Sendable and can be stored, shared, and reused. You create a fresh iterator each time you need to generate a sequence of delays.
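A minimal conformance under this split might look like the following. The two protocols are repeated from the proposal so the sketch is self-contained; LinearBackoff itself is a hypothetical example, not part of the pitch:

```swift
// Protocols as pitched: stateless strategy, stateful iterator.
public protocol BackoffStrategy<Duration> {
  associatedtype Iterator: BackoffIterator
  associatedtype Duration: DurationProtocol where Duration == Iterator.Duration
  func makeIterator() -> Iterator
}

public protocol BackoffIterator {
  associatedtype Duration: DurationProtocol
  mutating func nextDuration() -> Duration
}

// The strategy is an immutable (and trivially Sendable) configuration...
struct LinearBackoff: BackoffStrategy, Sendable {
  let increment: Swift.Duration

  func makeIterator() -> Iterator { Iterator(increment: increment) }

  // ...while all progression state lives in the iterator.
  struct Iterator: BackoffIterator {
    let increment: Swift.Duration
    var current: Swift.Duration = .zero

    mutating func nextDuration() -> Swift.Duration {
      current += increment   // 100 ms, 200 ms, 300 ms, ... for a 100 ms increment
      return current
    }
  }
}
```

Because the strategy is a value type holding only immutable configuration, it can be stored in a library's configuration object and a fresh iterator spun up per retry loop.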

RandomNumberGenerator Handling

Before: Jitter strategies accepted a RandomNumberGenerator in their initializer and stored a copy internally. This was problematic because copying an RNG means both the original and the copy could potentially produce the same sequence of random numbers, which defeats the purpose of randomization.

After: The RNG is no longer stored. Instead, BackoffIterator has an overload that accepts the generator at call time:

  public protocol BackoffIterator {
    mutating func nextDuration() -> Duration
    mutating func nextDuration(using generator: inout some RandomNumberGenerator) -> Duration
  }

The default implementation of nextDuration(using:) ignores the generator and calls nextDuration(); this is useful for "top level" backoff iterators such as constant or linear, since they do not have to interact with random numbers. Jitter iterators override this to use the provided generator.
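The described default could be spelled as a protocol extension along these lines (the protocol is repeated so the sketch compiles on its own; ConstantIterator is a hypothetical deterministic iterator, not the pitched API):

```swift
public protocol BackoffIterator {
  associatedtype Duration: DurationProtocol
  mutating func nextDuration() -> Duration
  mutating func nextDuration(using generator: inout some RandomNumberGenerator) -> Duration
}

extension BackoffIterator {
  // Default: deterministic iterators simply ignore the generator, so only
  // jitter iterators need to provide a custom implementation.
  public mutating func nextDuration(
    using generator: inout some RandomNumberGenerator
  ) -> Duration {
    nextDuration()
  }
}

// A deterministic iterator that gets nextDuration(using:) for free.
struct ConstantIterator: BackoffIterator {
  let constant: Swift.Duration
  mutating func nextDuration() -> Swift.Duration { constant }
}
```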

New Retry Overloads

Before: There was only one retry function that took a strategy: (Error) -> RetryAction closure. If you wanted to use a BackoffStrategy, you had to manage the iterator yourself:

var iterator = backoff.makeIterator()
try await retry(maxAttempts: 5) {
  try await operation()
} strategy: { error in
  return .backoff(iterator.nextDuration())
}

After: There are now convenience overloads that accept a BackoffStrategy directly. The function creates the iterator internally and advances it on each retry:

try await retry(maxAttempts: 5, backoff: backoff) {
  try await operation()
}

For these overloads, the strategy closure is simplified to return Bool (retry or stop) rather than RetryAction, since the backoff duration comes from the strategy. There are also overloads that accept inout some RandomNumberGenerator and forward it to the iterator on each attempt.

The base retry with the RetryAction closure is still available for cases where you need full control (e.g., honoring a server-provided Retry-After header).

1 Like

This is starting to look quite elegant; is it perhaps the case that the backoff itself might contain the concept of a max attempts?

e.g.

try await retry(backoff: .exponential(maxAttempts: 5)) {
  try await operation()
}
2 Likes

I’m with @Philippe_Hausler here on including the maxAttempts. But otherwise that looks great.

Passing along the attempts is certainly interesting; I will think about how we can spell that. Either way, unfortunately, the leading-dot syntax for backoff strategies is currently not possible due to the opaque return types, even though I'd also prefer to have it.

This does not compile (or rather the usage of this would not compile):

@available(macOS 15.0, *)
extension BackoffStrategy {
  static func constant<D: DurationProtocol>(_ d: D) -> some BackoffStrategy<D> {
    ConstantBackoffStrategy(constant: d)
  }
}

However, this would:

@available(macOS 15.0, *)
extension BackoffStrategy {
  static func constant<D: DurationProtocol>(_ d: D) -> ConstantBackoffStrategy<D> where Self == ConstantBackoffStrategy<D> {
    ConstantBackoffStrategy(constant: d)
  }
}

This is why I initially introduced the Backoff namespace enum, but I am unsure if this is the right call, or if exposing those types would have enough benefits to outweigh the cost of publicly exposing the concrete backoff strategies. (It would also reduce the amount of overloads of the static factories, due to the conditional conformances of Sendable...)

One pattern that we use in our system that might be of interest here is to define an error type that stops retrying, something like NonRetryableError(underlying: Error). While this functionality can be implemented with a custom strategy, it might be useful if one would like to handle error cases inside the main closure. An example would be:

let response = try await retry(maxAttempts: 5) {
    let response = try await httpClient.execute(request)
    if response.status == .forbidden { // or any other 4xx code
        // This is not retried, as would likely lead to similar result
        throw NonRetryableError(underlying: AppError.requestForbidden)
    }
    if response.status != .ok { // all 5xx errors for example, or a set of them
        // This is retried
        throw AppError.serviceError(response.status)
    }
    return response
}

While this is definitely possible by passing a strategy closure, it moves some logic further down in the code, possibly making it more difficult to follow?

Also, I wonder if it would be useful to pass in current attempt number? It might be useful to log current attempt in some cases. As an example:

try await retry(maxAttempts: 5) { attempt in
    logger.log("Attempting to read data", metadata: ["attempt": "\(attempt)"])
    // do request
}
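The suggestion above could be sketched as a retry variant that hands the 1-based attempt number to the operation. This is a hypothetical, minimal spelling; the real pitch also involves backoff, a clock, and a strategy closure, all omitted here:

```swift
// Minimal retry loop that forwards the attempt number; illustrative only.
func retry<Success>(
  maxAttempts: Int,
  operation: (_ attempt: Int) async throws -> Success
) async throws -> Success {
  precondition(maxAttempts >= 1, "need at least one attempt")
  var lastError: (any Error)?
  for attempt in 1...maxAttempts {
    do {
      return try await operation(attempt)
    } catch {
      lastError = error   // remember the failure; loop retries if attempts remain
    }
  }
  // The loop body ran at least once, so lastError is always set here.
  throw lastError!
}
```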

Thank you for your input!

Interesting that you mention this, I had previously thought of this as well, but dismissed it because of two reasons:

  • You essentially have to decide whether you want the "whitelisting" behavior (RetryableError) or the "blacklisting" behavior (NonRetryableError), and I wanted to leave that decision up to the user.
  • The use case where the server dictates the client's backoff duration still requires you to forward the duration to the retry function in some way. You could handle this with another special error that wraps the underlying error and a duration; however, it felt like I was fighting the second closure too much.

I am not entirely opposed to that, I do see use cases where this makes sense. (Either way, the current design does not prevent the user from doing this themselves, though.)

This had me pondering for a while.
Including maxAttempts in a backoff strategy as currently written came with some rough edges that would make implementing a custom backoff strategy a bit more complex. However, another way of combining maxAttempts and a backoff strategy would be to introduce another type, RetryStrategy:

public struct RetryStrategy<Backoff: BackoffStrategy> {
  public let maxAttempts: Int
  public let backoff: Backoff
  
  public init(maxAttempts: Int, backoff: Backoff) {
    self.maxAttempts = maxAttempts
    self.backoff = backoff
  }
}
@available(AsyncAlgorithms 1.1, *)
extension RetryStrategy: Sendable where Backoff: Sendable { }

We would then end up with these two retry overloads (six, implementation-wise, but either way):

func retry<Success, Failure: Error, DurationType: DurationProtocol>(
  maxAttempts: Int,
  tolerance: DurationType? = nil,
  clock: any Clock<DurationType> = .continuous,
  operation: () async throws(Failure) -> Success,
  action: (Failure) -> RetryAction<DurationType>
) async throws -> Success

and ...

func retry<Success, Failure: Error, DurationType: DurationProtocol>(
  strategy: RetryStrategy<some BackoffStrategy<DurationType>>,
  tolerance: DurationType? = nil,
  clock: any Clock<DurationType> = .continuous,
  using generator: inout some RandomNumberGenerator = SystemRandomNumberGenerator(),
  operation: () async throws(Failure) -> Success,
  action: (Failure) -> Bool
) async throws -> Success
// this would actually expand to 4 overloads, because inout types can't have default values, nor can you spell `.continuous` this way, because `DurationType` could be inferred from multiple parameters (clock & strategy)

They are a bit asymmetrical now: one takes maxAttempts and requires a RetryAction, the other takes a RetryStrategy and returns a Boolean from action. But in the end, those two functions are used in different contexts.

At call site, the most minimal client-side-backoff-version would look like this:

try await retry(
  strategy: .init(maxAttempts: 3, backoff: .constant(.seconds(3)))
) {
  try await doSomething()
} action: { error in
  return Bool.random()
}

and the other, non-client-side-backoff-version:

try await retry(
  maxAttempts: 3
) {
  try await doSomething()
} action: { error in
  if Bool.random() {
    return .backoff(.seconds(3))
  } else {
    return .stop
  }
}

These versions force the user to think about backoff, though, either when creating a RetryStrategy or when returning a RetryAction. We could still provide default versions where both use .constant(.zero) as the backoff strategy, but I am unsure whether this is a good idea.

I'd also make the backoff strategies that come with this pitch public, so we get the continuous way of spelling the retry strategy: RetryStrategy(maxAttempts: 3, backoff: .constant(.seconds(3))). Previously I had the Backoff namespace enum, which made it possible to return opaque types. In my opinion, that is a rather unintuitive way of spelling this at the call site, though.

Either way, what do you think?

It’s not clear to me why we’d need an overload with a distinct retry callback signature if we make the max attempts a strategy. I would expect to be able to write something like:


try await retry(
  strategy: .constant(.seconds(3))
    .maxAttempts(3)
) {
  try await doSomething()
} action: { error in
  return Bool.random()
}

Which would compose with any retry strategy. You could also have a static factory to be able to use maxAttempts with no backoff.

Making maxAttempts optional, which you'd essentially do if you were to make it part of this builder pattern, means you'd have to decide between:

  • accept that the default behavior is trying indefinitely
  • provide a fallback number of maximum attempts which is hardcoded into this library

BackoffIterator would also have to return an optional DurationProtocol from nextDuration() to indicate that the maximum number of retries has been reached. If we were to do this, implementing backoff strategies would gain complexity similar to IteratorProtocol and AsyncIteratorProtocol around handling past-end iteration.
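To illustrate the cost being described, here is a sketch of what folding the attempt limit into the iterator would require: nextDuration() becomes optional and every conformer must handle the past-end case, much like IteratorProtocol.next(). Protocol and type names here are hypothetical, not part of the pitch:

```swift
// Hypothetical iterator protocol where exhaustion is signalled with nil.
protocol ExhaustibleBackoffIterator {
  associatedtype Duration: DurationProtocol
  // nil means the maximum number of attempts has been used up.
  mutating func nextDuration() -> Duration?
}

struct ConstantWithLimit: ExhaustibleBackoffIterator {
  let constant: Swift.Duration
  var remainingAttempts: Int

  mutating func nextDuration() -> Swift.Duration? {
    guard remainingAttempts > 0 else { return nil }  // past-end handling that
    remainingAttempts -= 1                           // every conformer must get right
    return constant
  }
}
```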

And I guess the last thing, which might be fixable with different naming, is that the maximum number of attempts is not something backoff strategies should be concerned with. It is related to retrying, not backoff.

if you were to make it part of this builder pattern, you'd have to decide:

  • accept that the default behavior is trying indefinitely
  • provide a fallback number of maximum attempts which is hardcoded into this library

I don’t think it’s true that the default behavior needs to be indefinite retries in lieu of a hardcoded fallback. We can provide a default value for the strategy to provide a reasonable backoff/max attempts (as the API in your lib does today IIRC).

the maximum number of attempts is not something backoff strategies should be concerned of

I agree under the current naming scheme.

FWIW when you first previewed your library on the forums I had the same reaction that max attempts should be a part of the strategy. I went down the same path you have and came to the same conclusion.

I believe the options are:

  1. Make max attempts a distinct concept.
  2. Make max attempts something each strategy implementation has to consider.
  3. Make max attempts its own strategy that can easily be composed with other strategies.

Other retry operations I’ve used in the wild go with (1) where maxAttempts is a convenience for tracking the count in the action parameter.