Distributed Tracing in Swift - our experience so far

Hi all,

After a fair bit of interest recently, I thought I'd write down some thoughts on our experience using distributed tracing in Swift, especially as, to my knowledge, we're one of the first (if not the first) to ship it in a production app. You can consider this the technical background to my talk from Borderless Engineering Conference.

App Background

Our app currently consists of 11 microservices and growing, all written in Swift. In addition we have a few Lambdas (again in Swift) triggered by events that either poke our API (e.g. image upload notifications) or do data transformation (e.g. reading a DynamoDB stream and creating a search item that's inserted into Elasticsearch).

As the number of services grew, debugging errors got more and more difficult, so we decided that enough pieces were in place to make tracing work. (It helped that @slashmo is part of the team, since he did the initial GSoC implementation and has been involved in tracing in Swift ever since!)

Tracing implementation

All of our services are built on top of Vapor. Most of them (apart from our API gateway) use Soto to talk to some sort of AWS service, such as DynamoDB, SNS, SES, Elasticsearch etc., and they all use Async HTTP Client to talk to each other (though this goes through a couple of levels of abstraction, via both Vapor and our own wrapper, to make testing and local development easier). So we knew at a minimum we'd have to implement tracing in those three libraries.

Async HTTP Client

@slashmo already maintains a fork of Async HTTP Client with a tracing implementation, so we could stick with that. We then forked both Vapor and Soto to point to this AHC fork.

Vapor

Implementing tracing in Vapor was fairly easy (here's the draft PR back to upstream). Essentially, once we'd pointed to the AHC fork and added dependencies on tracing, we just needed to ensure we passed the LoggingContext everywhere. We also exposed Baggage on Request and made Request conform to LoggingContext so you can pass that around easily. Finally, we changed the DefaultResponder to create a span when a request comes in, to kick off tracing in that app.
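To give a rough idea of the shape of the change, here's a minimal sketch (an illustration under assumptions, not the actual fork's code -- the storage key and responder function are made up, and the exact tracing API differs between versions):

import Tracing
import Vapor

// Sketch: store Baggage on the Request via Vapor's storage API so handlers
// can pass it on to Soto / AHC calls further down.
extension Request {
    private struct BaggageKey: StorageKey {
        typealias Value = Baggage
    }

    var baggage: Baggage {
        get { storage[BaggageKey.self] ?? .topLevel }
        set { storage[BaggageKey.self] = newValue }
    }
}

// Sketch of the DefaultResponder change: start a server-kind span per
// request and end it when the response future completes.
func respond(to request: Request, chainingTo responder: Responder) -> EventLoopFuture<Response> {
    let span = InstrumentationSystem.tracer.startSpan(
        "\(request.method) \(request.url.path)",
        baggage: request.baggage,
        ofKind: .server
    )
    request.baggage = span.baggage
    return responder.respond(to: request).always { result in
        if case .failure(let error) = result {
            span.recordError(error)
        }
        span.end()
    }
}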

We then forked a couple of our other dependencies that use Vapor, though we've hit a bug/feature in SwiftPM: we depend on brokenhandsio/vapor while one of our dependencies (e.g. Leaf) still depends on vapor/vapor, and SwiftPM seems to resolve fine using our fork ¯\_(ツ)_/¯
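If you want to reproduce the setup, the dependency swap in Package.swift looks roughly like this (the branch name below is a placeholder, not the real one -- check the forks, linked at the end, for the actual branches):

// Package.swift (sketch) -- branch names are placeholders.
dependencies: [
    .package(url: "https://github.com/brokenhandsio/vapor.git", .branch("tracing")),
    // plus the AHC and Soto forks, pinned the same way
]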

Soto

Soto was by far the hardest library to add tracing to, because of the size of the code base, the code generation, and the number of services it touches. This is the main repo where we'll be doing most of the future changes.

Soto is a third-party AWS SDK written in Swift. It uses the JSON API specs from the Go AWS SDK to generate code for every single AWS service, which would be impossible to write manually as there are so many. Soto then interfaces with soto-core, which does the actual interaction with AWS, sending and receiving requests using Async HTTP Client and signing the requests.

Adding tracing to Soto Core was similar to Vapor and AsyncHTTPClient: essentially, add the tracing dependencies and then pass LoggingContext to everything. We then changed AWSClient to start a span before a request to AWS and add any attributes we want to pass through. This bit is the hardest and something we'll likely change significantly going forward (it was mainly hacked in to get some initial images!). Each service should specify a number of attributes that are not only unique to that service but also unique to the operation being performed in that service. Currently we've manually added enough to make DynamoDB work, and there's some context leaking between Soto and Soto Core that we need to work out how best to resolve.
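Hand-waving a lot, the AWSClient change looks something like this sketch (the function shape, stand-in types and attribute keys are all hypothetical, not Soto Core's actual internals):

import Tracing

struct AWSResponse {} // stand-in for Soto's response type

// Hypothetical sketch: wrap each AWS request in a client-kind span and
// attach service/operation attributes before executing it.
func execute(
    operation: String,
    service: String,
    context: LoggingContext,
    send: (LoggingContext) async throws -> AWSResponse
) async throws -> AWSResponse {
    try await InstrumentationSystem.tracer.withSpan(operation, baggage: context.baggage, ofKind: .client) { span in
        // Ideally each service/operation would contribute its own attributes;
        // for now these are added manually for DynamoDB.
        span.attributes["aws.service"] = .string(service)
        span.attributes["aws.operation"] = .string(operation)
        return try await send(context)
    }
}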

Once Soto Core was done, we changed the code generation steps to pass LoggingContext everywhere and then manually added some of the required spans to the DynamoDB service. As mentioned above, we need to find a better way to make this work across code generation and all the different services.

You can see the PR for Soto Core here and the PR for Soto here.

Bootstrapping

Now that everything implements tracing, we just need to set it up so we can capture traces. We use the OpenTelemetry client with X-Ray support to push traces to an OpenTelemetry Collector running in a sidecar. The collector is configured to push traces to AWS X-Ray. The setup for our app is pretty much the same as the docs, with an added sampling rule to ignore health checks.
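The bootstrap itself is only a few lines. A minimal sketch, assuming the OTel client's API at the time (the initializer, start(), and tracer() calls follow its README and may have changed since; the service name is illustrative):

import Instrumentation
import NIO
import OpenTelemetry // the OpenTelemetry client for Swift

let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)

// Assumption: by default the client exports spans via OTLP to a collector
// on localhost, which in our setup forwards them on to AWS X-Ray.
let otel = OTel(serviceName: "orders", eventLoopGroup: group)
try otel.start().wait()

// Make the OTel tracer the process-global tracer picked up by Vapor/Soto/AHC.
InstrumentationSystem.bootstrap(otel.tracer())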

End Result

When it's all implemented, you can then use your tracing tool of choice to make the most of it! Here are some screenshots from our test environment.

You can see a successful request trace below. This request gets all the orders for a user, and you can see it hitting the different services needed to fulfil the request, including the requests to the database with the specific query type and table used.

In the screenshot below you can see a failed request trace. This request was looking for an item that didn't exist in the database, and we can see that whilst the request to services-stores was successful, we got a 404 for the request to services-items. This massively reduces the time taken to debug errors, as we know the exact service request that failed, and some tracing tools will even collate logs and metrics to help work out which line of code it failed on.

Both the above screenshots show the time taken for a request across each service and database query. This is another advantage of tracing - finding performance bottlenecks or slow queries is really easy.

Finally, in the below screenshot you can see the service map generated by AWS X-Ray for us. This allows us to see which services talk to other services, the average time for requests and any failures.

The above screenshots were taken from AWS X-Ray, but we'll likely end up just ingesting the trace data into Grafana as we get the extra trace information from AWS services and can integrate it with metrics and logs easily.

Future Work

We will continue to evolve our forks as we get more experience using tracing and roll it out across more services. As mentioned above, we have a number of things we need to address, including several hacks made just to get this shipped, and better support for more AWS services. We'll be focusing mainly on the services we use, but if you want to contribute support for other services that would be awesome and we'd definitely welcome it (across any of the forks and/or the Vapor packages that are yet to have tracing implemented, e.g. Fluent).

We also know there are a few bugs with our implementations that we need to resolve (e.g. X-Ray registers an extra POST request for service-to-service requests, instead of integrating it with the service request) but overall it's working well.

Can I Use It?

If you want to add tracing to your apps then you can definitely do that. The forks are all available for use and we're using them in production so consider them 'production ready'.

The good part - we'll be maintaining our forks of Vapor, AHC and Soto until we can upstream our tracing implementation.

The bad part - the work won't be upstreamed until task local values are implemented in Swift 5.5, which means that you'll need to depend on the forks until then (and potentially maintain your own forks of other dependencies). And whilst we consider the forks to be production ready, they certainly aren't API stable. We may need to make breaking changes several times as our implementation evolves, and tracing won't be widely supported by libraries until 5.5 lands and the distributed tracing library is tagged as 1.0. And when task local values do land, we'll be making major breaking changes to pretty much every interface, as TLVs will replace passing LoggingContext everywhere.

TL;DR - whilst tracing is massively beneficial and ready for use, I can't fully recommend it, as you'll need to decide if it's worth the maintenance effort.

To conclude

Working with tracing has been great! We've successfully shipped it in a production app and will continue to roll it out across our services. We've already used it to debug errors, and it's made a huge difference.

We still have a significant amount of work to do but I'm excited for task local values to land and for tracing to be widely adopted by the server-side Swift ecosystem!

Finally, a shout out to @slashmo - not only did he do the original GSoC implementation of tracing, he also did most of the tracing work for this project. So when I refer to 'we' above, in most cases it mainly means him!

I'm also happy to answer any technical questions you may have!


Oh and I forgot to add the actual important links. Our forks of the libraries (and their branches) are:


Thanks so much for the great efforts here and the writeup!

I really can't wait until we're able to use the 5.5 task-local features to simplify all of those integrations! :slight_smile: We should perhaps even be able to offer tracing support in the HTTP client without breaking API - if the application is using Swift 5.5, that is.

Awesome work on all the integrations and the tracer implementations themselves @0xTim, @slashmo!


This is really great work! I'm very excited for the upcoming Swift 5.5 releases that will allow us to pervasively add tracing support to the Swift on Server ecosystem. This kind of introspection is hugely valuable, and it has a fantastic network effect: the more of the ecosystem that supports it out of the box, the more value you get.


This is great, thanks for the hard work on this. At my company we have 4 services already running in production, and getting them integrated with the New Relic C agent has been interesting. Once this is in place we could remove so much code and use OpenTelemetry instead.


I apologize for bumping an older thread, but with Swift 5.5 out the door and work ongoing in the ramp up to 6.0, I wanted to circle back and see where these various branches stand - and in particular what they might be waiting on before things get merged into upstream releases.

I've got a side project ramping up now where I'd love to enable tracing from the start, and I'm sorting through what I might need to do to make that happen.

Can you share more details on the project and the patterns it uses? The revised tracing libraries "just work" because they use task-local values, so as long as you play well with Swift concurrency and child tasks, and don't use detached tasks or callbacks, you'll be covered -- at least on the API layers.
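To illustrate why detached tasks break propagation, here's a self-contained toy example using a plain task-local (the real libraries use their own task-local context rather than this made-up TraceContext):

enum TraceContext {
    // Toy stand-in for the tracing library's task-local context.
    @TaskLocal static var traceID: String?
}

func demo() async {
    await TraceContext.$traceID.withValue("abc-123") {
        // Child tasks (async let, task groups) inherit task-locals,
        // so the trace context flows automatically:
        async let inherited = TraceContext.traceID // "abc-123"
        _ = await inherited

        // Detached tasks start from a clean slate, so the context is lost:
        let detached = Task.detached { TraceContext.traceID } // nil
        _ = await detached.value
    }
}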


In general, the status is that @slashmo, @fabianfett, and I have been thinking about the last missing piece here, which is connecting tracing up with swift-log. I anticipate we'll get to this early in 2022.

On my end the distributed actor work is taking up all my attention, so until we get through evolution and the implementation calms down a bit there, I won't be able to focus on the 1.0 of tracing -- but we are very close. So once the distributed work calms down a bit, we'll focus on integrating with logging and should be able to call it a 1.0 :slight_smile:


The project is a simple service that provides an HTTP(S) endpoint for a library I'm experimenting with. It's not even fronting other microservices elsewhere - but I'll be interacting with it from an iOS client app.

I wanted to leverage the tracing work done so far and wire it up myself - and it looks like in order to do that, I'll likely need to swap out some of the standard Vapor dependencies for the branches/projects listed above. And to be honest, I was hoping to leverage some of the async/await improvements just included in Vapor - so I wasn't sure if there would be issues between those branches and that development work.

My goal is to get started using the libraries just to see the simplest traces - the duration from calling an HTTP endpoint to having a response - and get that all piped through, so I can understand what it would take to set it up and use it on a more complex project in the future.

That's simple - wrap your call with a span and you're done. Though if you want it done automatically for you, that's not a thing in Vapor yet, pending the 1.0 release of tracing.

You could perhaps even try writing a middleware that does it for you, now that I think of it? https://docs.vapor.codes/4.0/middleware/ Could you give that a shot and reach out if you need any help with it? :slight_smile:

import Vapor
import Tracing // the tracing API

// A minimal extractor that reads propagation headers off the incoming request.
struct HTTPHeadersExtractor: Extractor {
    func extract(key: String, from headers: HTTPHeaders) -> String? {
        headers.first(name: key)
    }
}

struct DistributedTracingMiddleware: AsyncMiddleware {
    func respond(to request: Request, chainingTo next: AsyncResponder) async throws -> Response {
        let tracer = InstrumentationSystem.tracer
        // extract baggage from the incoming headers ...
        // https://github.com/apple/swift-distributed-tracing#extracting--injecting-baggage
        var baggage = Baggage.topLevel
        InstrumentationSystem.instrument.extract(request.headers, into: &baggage, using: HTTPHeadersExtractor())
        return try await tracer.withSpan("\(request.method) \(request.url.path)", baggage: baggage, ofKind: .server) { _ in
            try await next.respond(to: request)
        }
    }
}

I think that could work, but I'm not that fluent in Vapor -- wdyt @0xTim @slashmo?


Lol. Sweet nerdsnipe! Sure, I'll give it a go!


Middleware would definitely be the simplest to set up. You could also set up a metrics backend to emit request timing data - that's already integrated into Vapor mainline without any hacks needed.
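For reference, bootstrapping a metrics backend is only a couple of lines with swift-metrics; a sketch assuming SwiftPrometheus as the backend:

import Metrics
import Prometheus // SwiftPrometheus

// Make Prometheus the process-global swift-metrics backend, so Vapor's
// built-in request timing metrics have somewhere to go.
let prometheus = PrometheusClient()
MetricsSystem.bootstrap(PrometheusMetricsFactory(client: prometheus))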


Adding a middleware wouldn't be a hack tho :slight_smile:

How do the tasks work with that next.respond? Assuming the controller is invoked from a child task inside that respond call, or from a Task it creates (a child task would be preferable, but even an unstructured Task would work, just less efficiently), tracing would work and the trace would be available in the controller to keep adding more details there :slight_smile:

Sorry, I meant a hack as in using all the forked Vapor libraries to integrate swift-distributed-tracing.

The top level responder is future-based; any tasks are wrapped in a future using NIO's helpers. Once tracing is released we should be able to make it work by either:

1. handling tracing inside the AsyncResponder that's wrapped,
2. moving the base responder over to async and then wrapping the futures in the task helpers, or
3. having an async responder behind a switch.

Option 3 would probably be the cleanest and easiest to implement. Option 2 is probably a no-go until we a) better understand the performance implications of using async, and b) are able to back-deploy async to older OSes - otherwise we're going to stop a large chunk of people from updating.
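For context, the NIO helper mentioned above bridges a Task into a future, roughly like this (a sketch assuming NIO's makeFutureWithTask helper; note the Task it spawns is created from non-async code, so there's no surrounding task context for tracing baggage to be inherited from):

import NIOCore

// Sketch: bridge an async handler into the future-based responder chain.
// makeFutureWithTask spawns an unstructured Task and completes the future
// with its result.
func bridge(on eventLoop: EventLoop, _ handler: @escaping @Sendable () async throws -> String) -> EventLoopFuture<String> {
    eventLoop.makeFutureWithTask {
        try await handler()
    }
}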


I see - I naively assumed it was async from there onwards into the client code. Right, so we'll need some changes in there... The middleware could do the naive start/stop span to get the durations, but it won't be able to propagate to the client code without changes in Vapor then... I see, thanks for clarifying @0xTim!


Yeah, unfortunately, to make ELFs (EventLoopFutures) play nice, not break existing code, and not leave a number of people behind, it's async at the route handler level and down.
