Server Distributed Tracing

Hi fellow server developers :wave:

I'm reaching out to let you know that the work on "Distributed Tracing" for Swift has officially started. We are still in the early PoC stages and look forward to your input in the form of forum discussions and/or GitHub issues.

Summary

In an approach similar to Logging and Metrics, we want to enable universal instrumentation for systems written in Swift. Instead of focusing on one use case only (Tracing), we want systems to be instrumented once and the result to be usable in various ways. If you know the Logging and Metrics libraries this may sound familiar. In Logging, for example, users access logging through one central API, no matter how they have set up the handling of those logs (LogHandler).

Regarding cross-cutting tools, there's also prior art that inspired this instrumentation approach: the "Tracing Plane" paper.

Current state

The repo contains three Swift libraries so far:

:package: BaggageContext

Contains a storage type of the same name. This type is intended to be used as a container for storing all sorts of things in a type-safe fashion, uniquely keyed similar to a dictionary. So if you're a cross-cutting tool developer you have the freedom to store your custom values in the baggage without the nasty casting of a [String: Any].
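To make the "type-safe, uniquely keyed" idea concrete, here is a minimal sketch of how such a container could work. It is not the repo's actual implementation: the `BaggageContextKey` protocol, the subscript shape, and the two example keys are assumptions made up for illustration.

```swift
// Hypothetical sketch; names and shapes are illustrative, not the repo's API.
protocol BaggageContextKey {
    // Each key type declares the type of the value stored under it.
    associatedtype Value
}

struct BaggageContext {
    // Values are stored type-erased, keyed by the identity of the key type.
    private var storage: [ObjectIdentifier: Any] = [:]

    init() {}

    subscript<Key: BaggageContextKey>(_ key: Key.Type) -> Key.Value? {
        get {
            // The cast back to Key.Value is hidden here, so callers never cast.
            storage[ObjectIdentifier(key)] as? Key.Value
        }
        set {
            if let value = newValue {
                storage[ObjectIdentifier(key)] = value
            } else {
                storage.removeValue(forKey: ObjectIdentifier(key))
            }
        }
    }
}

// Two example keys a cross-cutting tool might define (hypothetical).
enum TraceIDKey: BaggageContextKey {
    typealias Value = String
}

enum HopCountKey: BaggageContextKey {
    typealias Value = Int
}

var baggage = BaggageContext()
baggage[TraceIDKey.self] = "abc-123"
baggage[HopCountKey.self] = 3

let traceID: String? = baggage[TraceIDKey.self]
let hops: Int? = baggage[HopCountKey.self]
```

Because the value type is tied to the key type, reading `baggage[HopCountKey.self]` yields an `Int?` directly, which is the difference from a `[String: Any]` dictionary.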

:package: Instrumentation

This is the API that library developers would use to mark the points in their libraries where values may need to be injected or extracted. It defines the InstrumentProtocol, which will eventually be implemented by different cross-cutting tools to extract/inject metadata through a BaggageContext. This layer of abstraction enables us to instrument not only with tracers but also with other tools, all at the same time.
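As a rough sketch of what an instrument along these lines could look like: the protocol below uses the associated type names InjectInto and ExtractFrom that come up later in the thread, but the method signatures, the string-keyed stand-in for BaggageContext, and the `RequestIDInstrument` are assumptions for illustration only.

```swift
// Simplified stand-in for the baggage type (string-keyed for brevity; assumption).
struct BaggageContext {
    var values: [String: String] = [:]
}

// Hypothetical shape of the instrument protocol; signatures are assumed.
protocol InstrumentProtocol {
    associatedtype InjectInto
    associatedtype ExtractFrom

    // Copy metadata from the baggage into an outgoing carrier (e.g. headers).
    func inject(from baggage: BaggageContext, into carrier: inout InjectInto)
    // Read metadata from an incoming carrier and store it in the baggage.
    func extract(from carrier: ExtractFrom, into baggage: inout BaggageContext)
}

// A made-up instrument that propagates a request ID through header-like dictionaries.
struct RequestIDInstrument: InstrumentProtocol {
    typealias Headers = [String: String]

    let headerName = "x-request-id"

    func inject(from baggage: BaggageContext, into carrier: inout Headers) {
        if let id = baggage.values["request-id"] {
            carrier[headerName] = id
        }
    }

    func extract(from carrier: Headers, into baggage: inout BaggageContext) {
        if let id = carrier[headerName] {
            baggage.values["request-id"] = id
        }
    }
}

let instrument = RequestIDInstrument()

// Inbound: pull the ID out of the incoming headers.
var incoming = BaggageContext()
instrument.extract(from: ["x-request-id": "42"], into: &incoming)

// Outbound: put the same ID onto the headers of a downstream call.
var outgoing: [String: String] = [:]
instrument.inject(from: incoming, into: &outgoing)
```

The library being instrumented only calls `inject` and `extract`; which metadata actually travels is decided by whichever instrument the end user configured.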

:package: NIOInstrumentation

Contains both an inbound and an outbound NIO handler that invoke Instruments on read/write. (These handlers will likely change as soon as we have decided how best to integrate BaggageContext into NIO.)

Use cases

Additionally, in the UseCases folder you can find some existing examples.

Outlook

  • Integrate the BaggageContext into NIO so that it's easier to carry through the Channel
  • Add the ability to merge BaggageContexts
  • Integrate with swift-log
  • Showcasing instrumented end-to-end AsyncHTTPClient <--> NIO Server communication

Last but definitely not least, shout out to my awesome GSoC mentor @ktoso :star_struck:

This is really exciting work to see happening! Distributed Tracing is a fantastic power tool in complex systems, and making this investment fairly early in the ecosystem lifetime is a great way to try to bed it in widely. I'm very excited to see where this goes.

That's super cool, really happy to see this! Let me CC some other folks who may be interested: @tanner0101 / @tachyonics / @fabianfett / @adam-fowler

This looks great! The BaggageContext is very similar to Vapor's Storage type. Having a consistent, interoperable API for this sounds great. The examples are straightforward and what I would expect. The NIO channel handlers also look good.

Questions:

  • Why the name Instrumentation instead of more commonly known Tracing?
  • Will this eventually have a static bootstrapping system like MetricsSystem and LoggingSystem?
  • Being able to access channel-specific context statically (in this case, BaggageContext it seems) has been discussed in the past. Will you be looking into this at all?
  • Do you have any examples of instruments that would not use HTTPHeaders or instruments that would have asymmetric inject / extract types?

Notes:

  • Naming a type the same as its module makes qualifying types difficult. I'd recommend naming the module Baggage and the type BaggageContext.

Nits:

  • What about Storage and StorageContext instead of Baggage?
  • More descriptive names than ...Protocol are better where possible. Maybe InstrumentHandler like SwiftLog and SwiftMetrics?
  • The associated types InjectInto and ExtractFrom are prepositions, not nouns. Maybe something like Source / Destination would be better?

Thanks for the great work so far @slashmo and @ktoso !

So far looks great. SmokeFramework/SmokeAWS 2.x has a rudimentary tracing mechanism that from the examples looks similar so should be relatively straightforward to migrate over.

Is the plan to make this compatible with mechanisms such as OpenTracing?

This is fantastic progress. Together with logging and metrics, tracing is a pillar of observability, and as the ecosystem matures and more complex distributed systems are being built with Swift, distributed tracing becomes increasingly critical. It is encouraging to see that the proposal is addressing it in a generic and forward looking way that enables scaling the ecosystem without tying it down to a specific tracing solution.

Thanks! I'm really excited about different frameworks/libraries potentially benefiting from this work. OpenTracing: Yes, ideally OpenTracing/OpenTelemetry would be one Instrument that can be used by systems. We imagine Smoke/Vapor/NIO to implement a point of configuration for which instruments should be used, and they then use the Instrument APIs to instrument without (ideally) knowing what specific tools their end-users are using.

@tanner0101 I think this also answers your question about why we decided to go with "Instrument" instead of "Tracer". We want to keep this part of the API as generic as possible so that you wouldn't have to set up instrumentation multiple times, but instead "just" set up the systems to use different cross-cutting tools, not only tracers.

Thanks for the detailed reply. Feedback like this helps a lot!

Interesting! Where is the Storage type being used in Vapor?

I hope this was already answered in my previous reply :slight_smile: Basically, we don't want to only enable Tracing but be flexible enough to let users use whatever cross-cutting tools they want without having to change the systems being instrumented. (This is the part that is inspired by the "Tracing Plane" paper I linked above.)

Good question. We didn't discuss this so far, but I think it probably makes sense. Our current approach to things like this is to create many examples and "discover the pain points".

How to access the BaggageContext is one of the things we'll look into next. I think the idea for NIO is to have it accessible through the ChannelHandlerContext. I'm afraid I'm not quite sure what you mean in this case by accessing "statically"; could you please elaborate a bit? Also, if possible, could you please include a link to the mentioned discussions so that I can take a look?

So far we don't have any in the repo, but I think @ktoso could give some more examples.

---

Also, thanks for your Notes and Nits, these are very helpful.

Both Application and Request have Storage as a stored property. This allows for the developer and third party packages to easily extend these types. This works particularly well with Swift's extensions, for example:

extension Request {
    private struct Foo: StorageKey {
        typealias Value = Bool
    }
    
    var foo: Bool? {
        get { self.storage[Foo.self] }
        set { self.storage[Foo.self] = newValue }
    }
}

It functions similarly to the userInfo: [AnyHashable: Any] pattern many Apple libraries use.

Ah, ok I see. That makes sense.

Most of these conversations have been IRL. (I found an SSWG meeting note about it here but that's not super helpful).

Static context access might look something like this:

func foo() {
    BaggageContext.current[SomeKey.self] = someValue
}

Doing this without static context access would look like:

func foo(baggage: inout BaggageContext) {
    baggage[SomeKey.self] = someValue
}

The key benefit of static access is that you don't need to "clutter" your API with context passing. Achieving static context passing in a one-thread-per-request based framework is fairly straightforward since you can use thread storage. Doing it in an event loop design like NIO is more complex. I'm not a Node.js expert, but it seems they allow static context passing in an event loop system with something called Continuation-Local-Storage. Whether or not doing this is a good idea, I don't know (http://asim-malik.com/the-perils-of-node-continuation-local-storage/).

FWIW, I'm not personally very invested in this type of context passing. Vapor opts for making it easier to pass things explicitly instead. I'm just interested if you planned on addressing it as a part of this proposal or had any thoughts.

Thanks!

Thanks everyone for chiming in!

I'll fill in some additional info on a few of the questions that were asked so far:


@slashmo already answered this well I think, but to expand on our thought process here a bit more:

The choice of words here is not accidental and implies a few "layers" at which tools and implementations exist. The work we're doing with baggage here is at the very "bottom" of all possible cross-cutting tool implementations. I could not phrase it better myself, so I'll quote the tracing plane paper:

Despite their demonstrated usefulness, organizations report that they struggle to deploy cross-cutting tools. One fundamental reason is that they require modifications to source code that touch nearly every component of distributed systems. But this is not the only reason. To see why, we can break up cross-cutting tools into two orthogonal components:

  • first (1) is the instrumentation of the system components to propagate context;
  • second (2) is the tool logic itself.

... what the metadata is, such as IDs, tags, or timings, and when it changes – depends on the tool (2),
while the context propagation – through threads, RPCs, queues, etc. – only depends on the structure of the instrumented system and its concurrency (1)

In other words, a Tracer is a specific tool, while instrumentation is the tool-agnostic "carry these values please" part of the system.

We specifically choose to talk about "cross-cutting tools" on this layer, but perhaps it's too abstract without more examples of what we actually mean, so here's a few examples of what could be implemented as baggage instruments but is not "directly" distributed tracing:

  • deadlines
    • in a multi-service system, where a request has to go "through" n services from the edge, and the edge has a strict SLA of "must reply within 200ms", we may want to carry around the deadline value with the requests as they are propagated to downstream systems. If a system receives a request and the deadline is exceeded (wallclock time makes this tricky in dist-systems of course, but bear with me) we know that the upstream has already replied with "well, request timed out" so there is no reason to even start working on the request in the downstream service, so we can drop it.
    • This is a pattern built-in to Go's Context as well as gRPC Deadlines [1]
    • I've also heard about some developers wanting to do a TTL in terms of "how many hops a request makes before we abort it". Such an instrument would get a ttl-hops counter and keep decrementing it at each remote hop a workload causes.
  • resource utilization / analysis / management
    • in multi-tenant environments it may be useful to capture congestion of resources and "who is responsible for this overload".
    • Again, this is not exactly tracing, but very similar to it, and needs the same kinds of baggage context propagation. Say we want to group all "work" caused by a "request" made by a Client, and give it some quota, and if that client exceeds its quota, we want to de-prioritize serving it, because it's badly behaved and starving well-behaved clients of resources. A simple example would be Client calls [A calls B calls C] and exceeds its allocated quota (however we measure that...) on service C; since Client always enters the system on A, we'll want to tell A that "hey, that Client is not well behaved, throttle it a bit". But we can only do this if we can track
    • one example of such a system is Retro [2]
  • authentication, delegation / access control / auditing
    • This is not an area I'm an expert on, but it does come up as another use case for such instruments. It feels right, since these usually also mean carrying some identity information along with the execution of a task. I do not encourage building ad-hoc security if anyone ever gets to this; there's plenty of literature about it. Our only hope is that if such a system needs to carry metadata, it should be able to use the same instrumentation "points" as tracers would. :wink:
    • Baggage can be used to carry around information about "on whose behalf" we are performing actions and similar, which can be used for auditing etc.
    • The Universal Context Propagation for Distributed System Instrumentation [3] paper lists a number of such use cases, but I'm not familiar enough to say much about them.
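The deadline and ttl-hops ideas from the list above could be sketched roughly as follows. Everything here is hypothetical: the `Baggage` struct with two fixed fields, the `Deadline` type, and the `decide`/`forwardingBaggage` helpers are illustration only, and time is passed in as a plain number to sidestep the wallclock caveat mentioned above.

```swift
// Hypothetical deadline value carried with a request.
struct Deadline {
    let secondsSinceEpoch: Double

    func isExceeded(now: Double) -> Bool {
        now >= secondsSinceEpoch
    }
}

// Simplified baggage with just the two values this sketch needs (assumption).
struct Baggage {
    var deadline: Deadline?
    var ttlHops: Int?
}

enum RequestDecision: Equatable {
    case process
    case drop
}

// Called by a downstream service before it starts any work.
func decide(for baggage: Baggage, now: Double) -> RequestDecision {
    // The upstream has already given up on this request, so skip the work.
    if let deadline = baggage.deadline, deadline.isExceeded(now: now) {
        return .drop
    }
    // The request has used up its allowed number of remote hops.
    if let hops = baggage.ttlHops, hops <= 0 {
        return .drop
    }
    return .process
}

// Called when making a remote call: the hop counter is decremented per hop.
func forwardingBaggage(_ baggage: Baggage) -> Baggage {
    var next = baggage
    if let hops = next.ttlHops {
        next.ttlHops = hops - 1
    }
    return next
}

let atEdge = Baggage(deadline: Deadline(secondsSinceEpoch: 100.2), ttlHops: 1)
let atServiceB = forwardingBaggage(atEdge)
```

Neither check is tracing, yet both need exactly the same propagation points a tracer would use.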

The word "Tracer" is bound to appear in implementations, but it is slightly different than the instruments I believe.

[1] gRPC and Deadlines | gRPC
[2] https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/mace
[3] https://people.mpi-sws.org/~jcmace/papers/mace2018universal.pdf

Yes, I think it's quite likely we'll end up with a similar shape to these.

Same as with metrics, most libs will be fine to just call "whatever the current tracer/instrument is", while some tracers will very likely need to offer user-level APIs for folks who want to explicitly start/stop spans in their code. In that case they'd get and specifically call CoolTracer.[...] things (another reason we don't necessarily want to use up the word tracer just yet).

We've already seen this pattern with e.g. SwiftPrometheus [4], where for most things the generic "just use the global one" is fine (esp. in frameworks, since you can't assume which metrics backend users will run with). But users may want to utilize some extra features Prometheus offers, which in this example is "help texts" [4] which are prom specific so if one wants to use those, one cannot use the generic instrument but would reach for the PrometheusClient metric factories.

We're expecting the same to happen with tracer implementations where some impls may have some specific features which don't fully fit into the instruments' APIs.

[4] GitHub - swift-server/swift-prometheus: Prometheus client library for Swift

Great catch, thanks: ticketified Rename BaggageContext module to Baggage · Issue #26 · slashmo/gsoc-swift-tracing · GitHub

I think it's very important we do not tie ourselves to HTTP as it opens up very interesting future development potential.

There's two main categories here I suppose:

First, "some binary protocol that is not HTTP" for whatever reason those exist, and we also want to carry metadata in them. It could be local XPC services, some other IPC mechanisms, or it could be some custom binary wire protocols (e.g. RSocket [5]) that also have metadata fields but they are not specifically "HTTPHeaders".

[5] http://rsocket.io/docs/Protocol#frame-header-format We're not specifically looking at rsocket, but it's a good example for making the case :slight_smile:

Second, databases! We might be able to trace "into" Cassandra requests [6] (ancient post, but has nice pictures :wink:), meaning that you would not only get the trace of your HTTP calls and futures, but also where the time is being spent inside the database to serve your request. This can help you notice problems with your schemas, indexes etc. It's a bit of a power user feature, but again, we want this BaggageContext to be ubiquitous, meaning that a database client could expose APIs which allow this use case by doing fetch(query, context: context). There the inject() implementation would take some form of statement.setOutgoingPayload("zipkin", ...) or similar, so a ZipkinCassandraInstrument would do just that, and some other tracer would set some other things there, while the Cassandra client library simply knows "there is some Instrument<CassandraStatement, ...> I will call here".
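A rough sketch of that database case, to show that nothing in it depends on HTTP. `CassandraStatement`, `setOutgoingPayload`, the single-field `BaggageContext`, and the `fetch` signature are all stand-ins invented for this example; no real Cassandra client API is implied.

```swift
// Simplified baggage carrying only a trace ID (assumption).
struct BaggageContext {
    var traceID: String?
}

// Inject-only instrument protocol, reduced to what this sketch needs (assumption).
protocol Instrument {
    associatedtype InjectInto
    func inject(from baggage: BaggageContext, into carrier: inout InjectInto)
}

// Hypothetical stand-in for a database client's statement type.
struct CassandraStatement {
    let query: String
    private(set) var outgoingPayload: [String: [UInt8]] = [:]

    init(query: String) {
        self.query = query
    }

    mutating func setOutgoingPayload(_ key: String, _ value: [UInt8]) {
        outgoingPayload[key] = value
    }
}

// A tracer-specific instrument: it alone knows the "zipkin" payload key.
struct ZipkinCassandraInstrument: Instrument {
    func inject(from baggage: BaggageContext, into carrier: inout CassandraStatement) {
        guard let traceID = baggage.traceID else { return }
        carrier.setOutgoingPayload("zipkin", Array(traceID.utf8))
    }
}

// The client library only knows "there is some instrument I will call here".
func fetch<I: Instrument>(
    _ query: String,
    context: BaggageContext,
    instrument: I
) -> CassandraStatement where I.InjectInto == CassandraStatement {
    var statement = CassandraStatement(query: query)
    instrument.inject(from: context, into: &statement)
    // A real client would send the statement here; the sketch just returns it.
    return statement
}

let statement = fetch(
    "SELECT * FROM users",
    context: BaggageContext(traceID: "abc"),
    instrument: ZipkinCassandraInstrument()
)
```

Swapping in a different tracer means passing a different instrument; the `fetch` implementation stays untouched.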

[6] Replacing Cassandra's tracing with Zipkin

We can do the same with any client library just "on the client", though that would not really require the inject/extract calls... We'll get to these use cases soon enough, I think. It is likely this may need a generic tracer type. There are folks who instrument Redis, but it's all "on the client", which is also valuable.

Yup that's the same usage-style we'd envision here (and is possible today).

Yeah that's one of those "looks great in small snippet, completely breaks down in complex system" things... Having that said, it definitely is very helpful sometimes.

My personal hope is that we can aim for explicit passing as it's less error prone, and make it pleasant enough. Yet for cases where a framework really would like to do the .current style there should be some way to do so. I'm not sure on what type those would exist, but there is some prior art for this in Rust's tracing crate, where a tracing provider can implement .current but all APIs require passing a context in. Thus if a user uses implicit passing, they can always summon the context when they hit an API they should propagate it to: request(url, context: .current /* summon */).

I want to be very cautious about it though, because implicit passing via TLs in highly async frameworks is notoriously difficult to get right and hard to debug (resulting in "dropped traces" which are a nightmare to track down), since it's not "visible" where a baggage context was "dropped" ("forgot to pass it along"). So I don't think it should be the default way, but it can be an optional way for end-users perhaps...

For ThreadLocals (TLs) to work well all participants which are async have to be aware of the fact that they must store/restore the context when they are about to fire off async work, and then when they're about to actually execute it (by storing the current TL "somewhere", and then getting from "somewhere" into current thread's TL again). This sadly breaks down the moment a library is not aware of it and can break in subtle ways; TLs also have the problem of not being "scoped" so if you keep setting stuff on a TL, but forget to clear it when you're "done" you might have "polluted" the storage -- and suddenly a request without a traceID attached shows up as if it was traced and as part of the previous trace :scream: These are annoying to debug and fix... but yes, it's possible to make it pleasant if you control all the threading of an application.
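To illustrate the "scoping" problem, here is a small sketch of a thread-local baggage slot that restores the previous value when a scope ends, so nothing leaks into the next request handled on the same thread. It uses Foundation's `Thread.current.threadDictionary`; the `CurrentBaggage` helper, the dictionary key, and the one-field `Baggage` struct are made up for illustration. It only covers a single thread and does nothing for work that hops to another thread, which is exactly the hard part described above.

```swift
import Foundation

// Simplified baggage (assumption).
struct Baggage {
    var traceID: String?
}

enum CurrentBaggage {
    // Hypothetical key under which the baggage lives in the thread dictionary.
    private static let key = "example.current-baggage"

    // The baggage bound to the current thread, if any.
    static var value: Baggage? {
        Thread.current.threadDictionary[key] as? Baggage
    }

    // Binds `baggage` for the duration of `body`, then restores what was there before.
    static func withValue<T>(_ baggage: Baggage, _ body: () throws -> T) rethrows -> T {
        let dictionary = Thread.current.threadDictionary
        let previous = dictionary[key]
        dictionary[key] = baggage
        defer {
            // Without this cleanup the next request on this thread would
            // appear to belong to the previous trace.
            if let previous = previous {
                dictionary[key] = previous
            } else {
                dictionary.removeObject(forKey: key)
            }
        }
        return try body()
    }
}

let seenInside = CurrentBaggage.withValue(Baggage(traceID: "t1")) {
    CurrentBaggage.value?.traceID
}
let seenAfter = CurrentBaggage.value
```

The `defer` is the piece that is easy to forget when values are set on a thread-local by hand.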

So... Thread Locals have known and very annoying limitations, and also incur an annoying performance hit to access them. But there are ideas for how a structured concurrency runtime could improve on the state of the art here, and there's some ongoing work in Java's Project Loom called Scope Variables [7] which is pretty promising. If we were in a position to build something like that, it would be much more interesting for baggage than jumping onto the TL train right away. It's an interesting read, give it a look :slight_smile:

[7] State of Loom: Part 2


Whoo~... that ended up much longer than I expected, but I hope it's interesting and shines some light at how we're looking at the problem space :slight_smile:

Please keep the feedback coming and stay involved on forums and the repository! :pray:

Looks good, thank you!
One question though: why BaggageContext and not just Context? It seems like a lot to type, making signatures even longer, and it will be in contrast to other languages, like Go, where it's called just Context. Isn't just Context a widely accepted and understood type? Or am I confusing something here?

Thanks for the question. @ktoso & I discussed this as part of Naming naming naming :-) · Issue #12 · slashmo/gsoc-swift-tracing · GitHub. Basically, we think that Context is too generic and already used in many libraries (ChannelHandlerContext, LambdaContext, and even Context itself), where the argument is often named context. Also, the way I imagine BaggageContext being used is as a property (if possible) of a library's existing Context type, instead of replacing it entirely.

The first two examples do not collide in naming; where is the third one coming from? Is it coming from other context libraries?
:sad panda:
I was hoping for this to be the context library, that would be proposed to SSWG and be widely adopted like swift-log and swift-metrics? There were no problems using names like Logger and Counter, even though they were probably used in other libraries. Maybe it's actually a good thing that we use a name that is the same, it will make migration easier (this is what I would prefer for replacing library I use)
Can't we disambiguate by using module name?
Sorry for my very silly comments, but I feel like reinventing names is bad, especially if one has to work with projects in different languages...

Just saw that module is named Baggage, for some very silly reason, I reeally don't like that name :(. Though I would prefer for module to be named Baggage and have Context (disambiguated as Baggage.Context), since I'll have to use it once, and Context will be everywhere...

One other comment I have (this one is not silly, I promise :smiley:). Was the dependency on NIO a requirement? Do we not expect this to be used in non-NIO projects? Or, since it's server-oriented, do we expect all projects to be using NIO anyway?

Edit: What about naming the module Tracing and type TracingContext or just Context?

Thanks for the feedback @artemredkim.

Before I dive in, it's important to keep in mind that this is not a "pitch of the final API" and yes, we're still exploring the space. There are different tradeoffs that were considered that led us here though. So yes, things are up for change, but changes need to address all the considerations / cases we need to handle.

I do want to push back on "reinventing names", if anything then we're following prior art here to be honest. The Go example is somewhat of an exception to be honest, since it is the absolute core of the language, and libraries don't often offer "framework context" which many of ours do (including non SSWG where we'd also want to "attach the context to"). Go also has some notable rules about Context which I'm not sure will work out in reality for us, including "one must NEVER wrap Context in another type", given that we have existing types that would be well served to "carry a baggage/context" (examples below).

So the baggage naming follows prior art and keeps in mind that we're not likely to pull off the "Go style" of passing context around (which we are also consulting on with stdlib folks).

Examples of prior art terming this type of object as "baggage":

  • Jaeger tracing
    • It is useful for manually providing some baggage items for testing purposes, which we can exploit. Alternatively, the baggage can always be explicitly set on the span inside the application by using the Span.SetBaggageItem() API.
  • 2016 Pivot Tracing
  • 2018 Universal Context Propagation for Distributed System Instrumentation, which effectively explains the "future of context propagation in distributed systems"
  • OpenTracing's "Baggage" is
    • The SpanContext carries data across process boundaries. Specifically, it has two major components:
      • An implementation-dependent state to refer to the distinct span within a trace
        • i.e., the implementing Tracer's definition of spanID and traceID
      • Any Baggage Items
      • These are key:value pairs that cross process-boundaries.
    • OT's general "bag" type is the TraceContext and it contains the Baggage
  • Related thread in B3 (the zipkin protocol)

So... things are not set in stone, but we're trying to explore the space and see what will read and use the best in various usage patterns. We do however find the problem of "everyone already has a context object and will keep it" problematic.

Specifically, we do anticipate frequent use cases (lambda, nio, other libraries we're working on) to already have a "highly framework specific context" which may include objects which are not meant to be propagated across threads etc. I.e. "do not call this from another event loop" style values.

For library composition however, we do anticipate libraries accepting a (Baggage)Context, since that is the interop and "the carry values across threads" type. When the two meet, we anticipate the following to happen frequently:

// a framework specific context exists:
context: ChannelHandlerContext 
/* or Lambda Context or Other.Context or Request (Vapor) */

// since the other lib does not know about the above framework:
someOtherLib(param1, param2, context.baggage) 

// additional sugar to accept context would be possible, 
// if it conformed to some HasBaggage protocol (idea):
// protocol ???Baggage { var baggage: Baggage??? { ... } }
// someOtherLib(param1, param2, frameworkContext) 

Specifically, how would we in this situation (a framework context, and a library unaware of the framework context accepting a context) avoid the following:

someOtherLib(param1, param2, context.context) // ouch 

So I think we have room here to wiggle around with the types, including for existing APIs which do not want to break API. (The above example ain't far from reality, as we are likely to introduce/need an "associate some value with a channel" type, so again the ChannelHandlerContext would not "be" the (baggage) context, but it would offer one.)
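One way the "HasBaggage protocol (idea)" could play out is sketched below. The protocol name `CarriesBaggage`, the string-keyed `BaggageContext`, and `FrameworkContext` are placeholders for illustration, not a proposed API.

```swift
// Simplified baggage (assumption).
struct BaggageContext {
    var items: [String: String] = [:]
}

// Hypothetical protocol for "anything that can hand out a baggage".
protocol CarriesBaggage {
    var baggage: BaggageContext { get }
}

// The baggage itself trivially carries a baggage.
extension BaggageContext: CarriesBaggage {
    var baggage: BaggageContext { self }
}

// Stand-in for a framework-specific context that also holds values which
// must not be propagated (event loop references and the like).
final class FrameworkContext: CarriesBaggage {
    var baggage = BaggageContext()
    let eventLoopName: String

    init(eventLoopName: String) {
        self.eventLoopName = eventLoopName
    }
}

// A library that knows nothing about the framework accepts any baggage carrier.
func someOtherLib<Context: CarriesBaggage>(_ param: String, context: Context) -> String? {
    context.baggage.items["trace-id"]
}

let frameworkContext = FrameworkContext(eventLoopName: "el-1")
frameworkContext.baggage.items["trace-id"] = "abc"

// Both spellings work, and neither needs `context.context`.
let viaFramework = someOtherLib("query", context: frameworkContext)
let viaBaggage = someOtherLib("query", context: frameworkContext.baggage)
```

The framework context "offers" a baggage without being one, which matches the ChannelHandlerContext situation described above.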

I would argue (strongly) that BaggageContext should not be a "random bag of any random stuff", including non-serializable values (e.g. closures and other values that cannot be carried over process/network boundaries). Specifically, it is unreasonable to now claim that NIO and all libraries have to express all their parameters AS (Baggage)Context; however, it makes much sense to allow them to "carry the baggage" if they already have context objects.

I.e. in NIO (or Lambda, or similar "framework") use cases, it is frequent that the "framework context object" already is being passed around when necessary, so we want to avoid having to pass "two context objects" (keep in mind we may not be able to force all implementations to IS-A (Baggage)Context).

The second point could perhaps be resolved in other ways, but we will very likely want to carry W3C TraceContext (not yet final) values (as a type) in a baggage. It is easier to spell and understand if the two are not both called context, i.e. accessing a W3C context inside a baggage would be baggage.traceContext rather than context.traceContext, and similarly with other tracers: baggage.zipkin.traceID etc.

The name TraceContext is problematic because we want to enable use cases which are not-just-tracing, and the naming would feel quite wrong if we tied it all to tracing using that type name.

So I agree that we definitely have to keep exploring the naming here (and it is likely to change more than a few times before this gets "stable"), but I do strongly believe that we have to do so while implementing specific tracers and use cases in specific frameworks. Currently I think the case for "let's use Context as the type name" is problematic for the adoption and implementation reasons listed above. We could be wrong though, and as goes with making common abstractions, we do need a few real-ish implementations to really figure out what will work and what will not. That's the upcoming 2~3 months of the GSoC before us (and we're only in week 2 now) :slight_smile:

Rest assured: The "context" project depending on NIO is a strict non-goal :slight_smile: ... and is only accidental for the time in order to get the GSoC running and get the UseCases implemented as soon as we can.

While it is server-oriented we do envision its use in non-server scenarios as well, thus the hard zero dependency rule here.

You're right that the "context project" will be stand alone and zero dependency, because indeed we do want to use it in use-cases which do not care about NIO at all.


Summing up: yes I agree the naming needs to be fleshed out, but I disagree that "just use Context"™ is the obvious answer that we don't even need to think about. We do need to investigate more, and perhaps we'll realize it'd be possible, but currently I see a few roadblocks to achieving that.

How about we open another ticket to "revisit naming" and do so explicitly once we have at least one end-to-end use case as well as more than one "what type of metadata is being carried around" and PoCs of Tracers?


I do strongly recommend reading up on https://people.mpi-sws.org/~jcmace/papers/mace2018universal.pdf (though we can argue if it makes sense or not of course) which is a strong inspiration for this work, and highlights some reasons why "(baggage) context" just™ being a glorified dictionary (even if it is today implemented as such) is not necessarily the end goal.

Thank you! this is waay too many words for my silly concern :) I was just a bit surprised about the clashing part, I did a quick and not thorough enough search of Context libraries and didn't find that many, what libraries are you referring to when you say that other libraries use Context? (this is just curiosity, feel free to ignore it :D)

Out of open source ones I can think of:

All of those are frameworky "I control execution semantics" and are the right/easy place for the framework to "set some headers/context i got from somewhere".

In all those cases it feels natural (IMO) to extend them to also carry baggage, as they are often passed around already anyway to achieve some framework specific task. It is only when we hit a non-framework function or another framework that we'd need to pass the "generic context", which I hope we can make happen by some CarriesBaggage (name invented on the spot, let's not bikeshed it here yet) :slight_smile:

Aren't those exactly not clashing types? They are either namespaced or have distinct naming?

How would we spell extracting things to avoid context.context?
(Perhaps that's possible, if we had that CarriesContext protocol :thinking:, worth looking into)

Go avoids this by:

Do not store Contexts inside a struct type; instead, pass a Context explicitly to each function that needs it. The Context should be the first parameter, typically named ctx:

I.e. "never wrap", how would we then avoid having to pass function(context, context)?