The Context Passing Problem

ktoso · August 17, 2020, 6:10am

Thanks a lot Tanner & Nathan for giving this some thought! Let's unroll and deep dive into the differences here.

Before we dive in, allow me to share some context in which we're operating, as it shapes a lot of the needs and tradeoffs we need to take when designing these APIs.

Very roughly, one can claim that there are two "kinds" of users (and everyone somewhere on a line between those extremes), which shapes what requirements they pose to their tools and frameworks:

[a] teams with a simple not-very-distributed system
- it may have one or two databases, caches, maybe it talks to one or two other services in the same company; it's still "fits in your head" and maybe you even know teams which operate the other services
- (though note: all systems start small, and some of them then inevitably become more complex large distributed systems.
  - We want to avoid "trapping" users in situations where they and their frameworks can't "grow" with them (i.e. let's avoid having people hit a wall and go off writing "we had to rewrite our system from Swift to X because the tools did not scale as our needs grew!", let's avoid these by planning forward)
[b] teams operating in large organizations, interacting with tens or hundreds of other services
- in such teams/organizations, it may even be a requirement that a language/runtime has a great observability (meaning: logging + metrics + tracing) story, otherwise it'll be dismissed as cute toy but not "serious" for the backend needs of such organizations.
- for Swift to break through and be taken seriously in such organizations, the core SSWG APIs such as Logging, Metrics and Tracing (often called the "Three Pillars of Observability") are crucial, and already with Logging with Metrics we saw the situation improve by a lot.
  - Now we need to take the 3rd step here to offer a complete picture to serious large organizations evaluating if they should or shouldn't trust Swift on their backends. In other words, this is about answering the infamous "is it ready for production yet?" with a resounding yes! when the organization asking has very strong requirements on operating their services.

So today, Swift is already doing great with [a]-style use-cases on the server! And this is very much thanks to all the libraries/frameworks people in this thread have developed - this is no small feat and we should keep and carter to this audience.

But the [b]-style use-cases still have a lot of nitpicks they can point at and dismiss Swift without seriously evaluating it; Tracing is one of those boxes that we can not only deploy "somewhat", but do it truely great which will cause such teams looking at Swift see it in a much better light, even as an improvement over what they used to use for their backends. This is what we need to aim for here.

These two "styles" are obviously somewhat conflicting: [a] want the least amount of noise and friction possible, they may not care much about all these things which matter only in larger deployments, yet we also need to grow server side swift in those large organizations / systems [b] which have more requirements, and also a slightly larger tolerance for some noise to get the benefits they care about (of course no noise would be best, but we can't get there in one magical step).

So, how do we make APIs that can serve both types of users, and grow the impact Swift has on the server?

Why context is not configuration, and should not be hidden away

Let's use the most recent example of the "hide context in client protocol" style that Nathan provided:

// what-if (this API shape has issues/pitfalls)
protocol RedisClient {
  var eventLoop: EventLoop { get }

  func send(command: String, with: [RESPValue]) -> EventLoopFuture

  func logging(to: Logger) -> RedisClient
  func with(context: BaggageContextCarrier) -> RedisClient
}

This is easy but not simple, it's easy for the library maintainance to ask users always do the right thing, but it invites a few kinds of programmer errors:

First, storing a client with context:

// don't do this
var redis: RedisClient
func something(context: BaggageContext) -> Future<...> {
  redis = redis.with(context) // oh no!

  // some many redis operations
  ... = redis.something(...)
  ... = redis.something(...) // maybe ok, maybe broken
}

func other(context: BaggageContext) -> Future<...> { 
  // <100 lines of code here>, 
  // because that's how real world projects get messy after a long time...
  redis.get("foo") // whoops!
}

So I have a few redis operations here, and I got lazy and decided to "oh well, it's all in the same context anyway!" and I can accidentally cause such mistake. So let's say the code in something() maybe even still is correct, but I stored the client somewhere and in other() (maybe because some refactoring, merge conflict gone wrong etc) I missed to do this and now it not only has no trace but it has a completely unrelated trace Such things are very painful to discover when "production is on fire", and that is exactly when we look at those traces/metrics.

One can always say "don't do that", but it is a good APIs job to lead to the right out-comes and prevent abuse and mistakes. We'll discuss below also the linter situation -- we can't write good linters for this pattern making it more difficult to recommend to larger teams.

client protocol style: this pattern makes it possible to make silly, hard to "visually" (or automatically) track down mistakes;

This problem is not new to Swift, albeit with more fatal consequences, the same manifests with getting pointers to underlying storage. It should not be valid to just "get the pointer out and store it," in the same way it should not be possible to store a tracing context. This however would yield a "meh" API, such as this:

let res: Res = client.with(context: context) { client in 
  client.operation(...) -> Res
}

// in the same style / for the same reason as:
// array.withUnsafeBytes { pointer ... }

this is "better" because it stops me from doing the storing the wrong context mistake, but since context should be passed "in a lot of places" this is obviously too annoying of an API. So I don't think this is viable.

client protocol style: Trying to use the protocol as a with() { ... } would prevent the bugs of storing contexts, but is quite annoying both for API designers and end-users.

explicit context style: it easy to express the rule of "don't store context objects, just pass them along," and as we'll see below, we can write linters to help with this.

To see how really this pattern is more complex and makes some automation and tooling impossible, let's look at function wrappers as well as tooling and linters below:

Simplicity and "Function wrappers"

While function wrappers have not been proposed "seriously", it is not unreasonable to speculate they would be a very useful and natural addition to Swift. And since we're trying to form a standard way to pass context around for instrumentation/tracing, and such annotations would be tremendously useful to these use cases, we should make sure we're future-proof for those.

Today, in order to trace executions one has to do "manual weaving" of managing spans, such as:

func send(work: Work, context: BaggageContextCarrier) -> EventLoopFuture<> { 
  let span = tracer.startSpan("#function-\(work.type)", context: context)
  
  let someFuture = ...
    .always { result in
      if case .failure(let error) = result {
        span.addError(error)
      }
      span.end()
    }
  return someFuture
}

This is a bit annoying, and while it is fine for "that one single send() function in my entire library" this is not something one would like to sprinkle around one's codebase everywhere when I'd like to trace some time a piece of code takes. (Sure, we can work around this with sugar on top of futures and with... functions there etc, but that's not the end-game because we will have asynchronous functions).

Specifically, in Redis client and other databases there usually is exactly one (or very few) function which handle the "send the query" and then a lot of convenience API on top of them. This dance today only has to be performed once, in that send function.

What we'd want to be able to do though, is the following:

@Traced
func send(work: Work, context: BaggageContextCarrier) -> EventLoopFuture<> { 
  return ...
}

much nicer...! And in the future, the same composes just as perfectly with async functions:

@Traced
func send(work: Work, context: BaggageContextCarrier) async -> Thing { 
  return ...
}

Now, we could not nicely express this if the context is hidden away in library specific types, since each library has its own "client" type. Trying to express this with "the thing I call has to conform to some protocol on which I can set a context before I call it" becomes all kinds of weird and not understandable IMO. (I tried to PoC this a bit but it looks terrible each time).

Hide context in client-protocol: It is impossible to implement @Traced if we hide away the context parameter from the function.

Explicit context (even if optional): It will be possible to implement @Traced naturally if/when some form of function wrappers were to land in the language.

So yes, it is natural to express the context to be "flowing with the calls" (same as a stacktrace isn't a thing "on the side" a trace is not a thing on the side after all -- it is the lineage of the execution). but we still have that pesky explicit parameter that people don't like...

Bear with me and read on, we'll get rid of the explicit context parameter it in the future (!).

Simplicity and Tooling

Keeping context passing Simple is very powerful, not only to avoid "having to remember which special DSL I need to call here" but also because it enables tooling.

I mentioned this above a little bit, but let's dive into it deeper. Funnily enough I had forgotten this is also a reason fow why Go's context wroks like it does (and here I thought I invented the reason, silly me, of course not -- it's been done before ):

Note:

// context package - context - Go Packages
Do not store Contexts inside a struct type; instead, pass a Context explicitly to each function that needs it. The Context should be the first parameter, typically named ctx:

(We're not doing "first" but "last", after consultation, sswg and nio teams and the std-lib team and just general swift function naming guidelines; but the general idea remains similar)

And notably:

// context package - context - Go Packages
Programs that use Contexts should follow these rules to keep interfaces consistent across packages and enable static analysis tools to check context propagation [...]

Here is what this specifically means for us. The following "hidden via protocol" pattern:

client.with(context: context).hello()

client.helloAgain() // ⚠️ it... *probably* does not have the context?
someClient.hello() // ⚠️ does it have the context or not? we *can't* know!

How could we decide where to issue warnings with a linter or even xcode itself? It becomes impossible to statically definitely decide if a call has the right context or not

The following however is trivially machine decide-able:

// hello(<other-parameters>, context: BaggageContextCarrier?)
client.hello(one, two, three) // 💡 missing context pass spotted!

Note: We can say we'd do optional context if that to ease the adoption adoption and allow people who don't care about traces to ignore the context dance. This is a tradeoff, but one we can make to ease adoption and future evolution IMO.

This allows [a] style of users to not care about it, and [b] style of users can use a linter that we can build already today using SwiftLint's analyze mode, that would detect missing context passes easily. It can be warnings or errors and thus it can be tuned to what a team needs. Say a team really wants to make sure tracing works because they have a complex system, and need all the insight they can get during an outage or performance tuning, they can make a not passed context fail to compile.

We can also provide fixits:

client.hello(name: "Charlie", option: .something)
// vvvvvvvvv
client.hello(name: charlien, option: .something, context: context)

* not a real fixit in Xcode just an image to keep your attention

This is why the concistent naming of the context parameter (if it's BaggageContext instance, or FrameworkContext that carries a context) always as "context" is important. We can easily fix things and it will be a smooth ride to add propagation when needed.

hiding context in client protocols: impossible to write linters and fixits for teams which care and want to always ensure the highest standard and quality of their traces.

explicit context passing: It is possible and relatively simple to build tooling, linters, and fixits that would fill in the missing parameter.

What we'd really want: Asynchronous Contexts / "Execution Scoped Variables"

Let's also keep reminding ourselves that this isn't about the Swift today but where Swift is heading. And while John has not yet published his concurrency proposal, it is fair to say it involves some form of awaiting and coroutines (although not just that).

So these APIs need to work well with async/await style and the arrival of these features means having to change all signatures and patterns anyway, e.g. because you may want to avoid returning EventLoopFutures perhaps, i.e.

func hello(context: BaggageContext?) -> EventLoopFuture<Hello> {} 
// vvvvvvv
func hello(context: BaggageContext?) async -> Hello {}

Then, it is only natural to want to express "some data that flows with the execution" along with async tasks and functions. This would be "Async Context" (or similar, other languages call these Scope Variables), and if we'd get to that point we can remove the explicit passing around but semantically it is exactly the same and migration will be simple:

@deprecated(explicit passing is not necessary anymore)
func hello(context: BaggageContext?) -> EventLoopFuture<Hello> {} 

// In Swift "X"
func hello() -> EventLoopFuture<Hello> {
  BaggageContext.current // which would use AsyncContext<BaggageContext>.get
}

this looks like but is not thread local variables. Thread locals cannot work well in Swift with its various executors and concurrency libraries; The use of thread locals to pass around context should not be part of the core story; though if you are some single threaded framework... you could get away with it -- but really, it's 2020 and being async and non-blocking is the norm, and the same will be true to distributed tracing soon enough.

In any case, we don't know when or if we'll get this, so it does not make sense to hold our horses until this feature lands, but we should make the evolution towards what we really want a stright one, and then allow the language to help us removing more and more noise from it.

explicit context passing: is a natural "predecessor" to asynchronous, implicit, context propagation that could be offered by a coroutine runtime, which has proven useful in existing coroutine implementations, and as such it is not unreasonable to expect such feature to exist someday.

More reading on this topic:

A good read on CoroutineContext and Scope (Scope.coroutineContext) in Kotlin: Coroutine Context and Scope highly recommended to read this

this specifically applies a lot to us, as concurrency's fundamental building block basically known to be coroutines by now; Note how a coroutine context can have values added to it, and retrieved from by type; this is the same how we'd carry a baggage in such "coroutine context" / "async context"

Java's pursuit of Scope Variables in project loom (coroutines)

A few specific ways forward to consider

As I see it, we have a few "ways" forward, and I'll list them below explicitly, and we have to keep in mind the upcoming "Swift Concurrency" story, so things will change, so the various ways to can phrase the APIs can either prepare for migration to that future concurrency, or we can wait it out, and do nothing until the new things land perhaps (but I think it'd be a waste to completely punt/wait until then).

Let's focus on the concerns you mention first though:

tanner0101:

req.client.get(...)
As far as I understand, req already has all the context you need. It has an EventLoop, Logger, and in the future will have BaggageContext (with trace ids deserialized from the incoming request). So can't we keep the actual call site clean?

I understand your desire to keep the context out of here, I really do.

But looking at the examples above, I really think we're trying to "eat the care and keep the cake" if you want to provide a super fluent DSL that strongly relies on Vapor style / protocol style.

I really do understand your desire to keep the APIs as lean as possible, this is what Vapor promises to its users, but I do not think (see examples above) this is the right style for context passing and it is counter to how asynchronous and "execution scoped variables" (which is what (baggage) context is) really work. I fear that the more we try to hide it, the more pain you'll be inflicting on future adoption, unless holding out i-don't-know-how-many-years until async contexts become a thing.

Sadly I don't know if or when those would land, because that would solve the entire "explicit" problem of the explicit context passing, but I don't think we should adopt patterns which hide what we're really doing, as it will only make composition, tooling, and future evolution towards async functions (and more) very difficult.

What is wrong with not wrapping all those clients and saying the same everywhere, Vapor need not be in the business of solving every possible http / database / service client's APIs with a super fluent DSL -- the goal is consistency, not short-function-calls-regardless-of-complexity-cost-to-get-there?

If one want traces, pass the context + it will prepare us well for the future asynchronous swift world. I can't make promises about the future, but the trajectory of such APIs is consistent and in alignment with how coroutines and asynchronous contexts usually work.