Hi all,
After a fair bit of interest recently, I thought I'd write down some thoughts on our experience using distributed tracing in Swift, especially as, to my knowledge, we're one of the first (if not the first) to ship it in a production app. You can consider this the technical background to my talk from Borderless Engineering Conference:
App Background
Our app currently consists of 11 microservices and growing, all written in Swift. In addition, we have a few Lambdas (again in Swift) triggered by events that either poke our API (e.g. image upload notifications) or do data transformation (e.g. reading a DynamoDB stream and creating a search item that's inserted into Elasticsearch).
As the number of services grew, debugging errors got more and more difficult, so we decided that enough pieces were in place to make tracing work. (It helped that @slashmo is part of the team since he did the initial GSoC implementation and has been involved in tracing with Swift since!)
Tracing implementation
All of our services are built on top of Vapor. Most of them (apart from our API gateway) use Soto to talk to some sort of AWS service, such as DynamoDB, SNS, SES or Elasticsearch, and they all use Async HTTP Client to talk to each other (though this is done through a couple of levels of abstraction, through both Vapor and our own wrapper, to make testing and local development easier). So we knew at a minimum we'd have to implement tracing in those three libraries.
Async HTTP Client
@slashmo already maintains a fork of Async HTTP Client with a tracing implementation so we could stick with that. We then forked both Vapor and Soto to point to this AHC fork.
Vapor
Implementing tracing in Vapor was fairly easy (here's the draft PR back to upstream). Essentially, once we'd pointed to the AHC fork and added dependencies on tracing, we just needed to ensure we passed the LoggingContext everywhere. We also exposed Baggage on Request and made it conform to LoggingContext so you could pass that around easily. Finally, we changed the DefaultResponder to create a span when a request comes in, to kick off tracing in that app.
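The responder change boils down to wrapping every incoming request in a span. Here's a minimal sketch of that pattern; the types below are simplified stand-ins for Vapor's responder and the swift-distributed-tracing API (the real code is async and uses the library's span types), so treat it as the shape of the change rather than the actual implementation:

```swift
// Simplified stand-ins for the tracing API and Vapor's responder.
struct Span {
    let operationName: String
    var attributes: [String: String] = [:]
    var ended = false
    mutating func end() { ended = true }
}

struct Request {
    let method: String
    let path: String
}

struct Response {
    let status: Int
}

// A toy tracer that records finished spans so we can inspect them.
final class Tracer {
    private(set) var finishedSpans: [Span] = []

    func withSpan(_ name: String, _ body: (inout Span) -> Response) -> Response {
        var span = Span(operationName: name)
        let response = body(&span)
        span.attributes["http.status_code"] = String(response.status)
        span.end()
        finishedSpans.append(span)
        return response
    }
}

// Mirrors what the changed responder does: start a span when a request
// comes in, run the route handler, record the outcome, end the span.
struct TracingResponder {
    let tracer: Tracer
    let handler: (Request) -> Response

    func respond(to request: Request) -> Response {
        tracer.withSpan("\(request.method) \(request.path)") { span in
            span.attributes["http.method"] = request.method
            span.attributes["http.target"] = request.path
            return handler(request)
        }
    }
}
```

With this in place, every request that reaches the responder produces exactly one span named after the route, which is what kicks off the trace for the whole downstream call chain.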
We then forked a couple of our other dependencies that use Vapor, though we've hit a bug/feature in SwiftPM where we depend on brokenhandsio/vapor and one of our dependencies (e.g. Leaf) still depends on vapor/vapor, and SwiftPM seems to resolve fine using our fork ¯\_(ツ)_/¯
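Depending on the fork looks roughly like the manifest below. The repository URLs follow the naming above, but the branch name and package/target names are assumptions for illustration:

```swift
// swift-tools-version:5.3
import PackageDescription

// Hypothetical manifest: the fork URL pattern matches the repos named
// above, but "tracing" as the branch name is an assumption.
let package = Package(
    name: "MyService",
    dependencies: [
        // Point at the tracing fork instead of vapor/vapor.
        .package(name: "vapor", url: "https://github.com/brokenhandsio/vapor.git", .branch("tracing")),
        // Leaf still declares a dependency on vapor/vapor, yet SwiftPM
        // resolves it against the fork above.
        .package(url: "https://github.com/vapor/leaf.git", from: "4.0.0"),
    ],
    targets: [
        .target(name: "MyService", dependencies: [
            .product(name: "Vapor", package: "vapor"),
            .product(name: "Leaf", package: "leaf"),
        ]),
    ]
)
```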
Soto
Soto was by far the hardest library to add tracing to, because of the size of the code base, the code generation, and the number of services it touches. This is the main repo where we'll be doing most of the future changes.
Soto is a third-party AWS SDK written in Swift. It uses the JSON API specs from the Go AWS SDK to generate code for every single AWS service, which would be impossible to write manually as there are so many. Soto then interfaces with soto-core, which does the actual interaction with AWS, sending and receiving requests using Async HTTP Client and signing the requests.
Adding tracing to Soto Core was similar to Vapor and AsyncHTTPClient: essentially, add the tracing dependencies and then pass LoggingContext to everything. We then changed AWSClient to start a span before a request to AWS and add any attributes we want to pass through. This bit is the hardest and something we'll likely change significantly going forward (as it was mainly hacked in to get some initial images!). Each service should specify a number of attributes that are not only unique to that service, but also unique to the operation being performed in that service. Currently we've manually added these to make DynamoDB work, and there's some context leaking between Soto and Soto Core that we need to work out how best to resolve.
Once Soto Core was done, we then changed the code generation steps to pass LoggingContext everywhere and manually added some of the required spans to the DynamoDB service. As mentioned above, we need to find a better way to make this work across code generation and all the different services.
You can see the PR for Soto Core here and the PR for Soto here
Bootstrapping
Now that everything implements tracing, we just need to set it up so we can capture traces. We use the OpenTelemetry Client with X-Ray support to push traces to an OpenTelemetry Collector running in a sidecar. The collector is configured to push traces to AWS X-Ray. The setup for our app is pretty much the same as the docs, with an added sampling rule to ignore health checks.
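The bootstrapping step itself is a one-time, process-global installation of the tracer at startup. The snippet below models that pattern with stand-in types; the real code calls `InstrumentationSystem.bootstrap` from swift-distributed-tracing with the OpenTelemetry tracer, and the type names here are simplified stand-ins:

```swift
// Stand-in for swift-distributed-tracing's InstrumentationSystem:
// a process-global slot that is bootstrapped exactly once at startup.
protocol TracerBackend {
    var name: String { get }
}

struct OTelTracer: TracerBackend {
    let name = "opentelemetry-xray"
}

struct NoOpTracer: TracerBackend {
    let name = "no-op"
}

enum InstrumentationSystem {
    private static var installed: TracerBackend = NoOpTracer()
    private static var isBootstrapped = false

    // Called once in main.swift before the server starts; everything
    // downstream (Vapor, Soto, AHC) reads the shared tracer from here.
    static func bootstrap(_ tracer: TracerBackend) {
        precondition(!isBootstrapped, "bootstrap may only be called once")
        installed = tracer
        isBootstrapped = true
    }

    static var tracer: TracerBackend { installed }
}
```

Because the slot defaults to a no-op tracer, libraries that support tracing cost essentially nothing until an app actually bootstraps a real backend.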
End Result
When it's all implemented, you can then use your tracing tool of choice to make the most of it! Here are some screenshots from our test environment.
You can see a successful request trace below. This request gets all the orders for a user, and you can see it hitting the different services needed to fulfil the request, including the requests to the database with the specific query type and table used.
In the below screenshot you can see a failed request trace. This request was looking for an item that didn't exist in the database, and we can see that whilst the request to services-stores was successful, we got a 404 for the request to services-items. This massively reduces the time taken to debug errors, as we know the exact service request that failed, and some tracing tools will even collate logs and metrics to help work out which line of code it failed on.
Both the above screenshots show the time taken for a request across each service and database query. This is another advantage of tracing - finding performance bottlenecks or slow queries is really easy.
Finally, in the below screenshot you can see the service map generated by AWS X-Ray for us. This allows us to see which services talk to other services, the average time for requests and any failures.
The above screenshots were taken from AWS X-Ray, but we'll likely end up just ingesting the trace data into Grafana as we get the extra trace information from AWS services and can integrate it with metrics and logs easily.
Future Work
We will continue to evolve our forks as we gain more experience using tracing and roll it out across more services. As mentioned above, we have a number of things to fix, including several hacks we made to get it shipped, and better support for more AWS services. We'll be focusing mainly on the services we use, but if you want to contribute support for other services, that would be awesome and we'd definitely welcome it (across any of the forks and/or Vapor packages that are yet to have tracing implemented, e.g. Fluent).
We also know there are a few bugs with our implementations that we need to resolve (e.g. X-Ray registers an extra POST request for service-to-service requests, instead of integrating it with the service request) but overall it's working well.
Can I Use It?
If you want to add tracing to your apps then you can definitely do that. The forks are all available for use and we're using them in production so consider them 'production ready'.
The good part - we'll be maintaining our forks of Vapor, AHC and Soto until we can upstream our tracing implementation.
The bad part - the work won't be upstreamed until task local values are implemented in Swift 5.5, which means that you'll need to depend on the forks until then (and potentially maintain your own forks of other dependencies). And whilst we consider the forks to be production ready, they certainly aren't API stable. We may need to make breaking changes several times as our implementation evolves, and tracing won't be widely supported by libraries until 5.5 lands and the distributed tracing library is tagged as 1.0. When task local values do land, we'll be making major breaking changes to pretty much every interface, as TLVs will replace passing LoggingContext everywhere.
TL;DR - whilst tracing is massively beneficial and ready for use, I can't fully recommend it yet: you'll need to decide if it's worth the maintenance effort.
To conclude
Working with tracing has been great! We've successfully shipped it in a production app and will continue to roll it out across our services. We've already used it to debug errors and it's made a huge difference.
We still have a significant amount of work to do but I'm excited for task local values to land and for tracing to be widely adopted by the server-side Swift ecosystem!
Finally, a shout out to @slashmo - not only did he do the original GSoC implementation of tracing, he also did most of the tracing work for this project. So when I refer to 'we' above, in most cases it mainly means him!
I'm also happy to answer any technical questions you may have!