[Discussion] Server Metrics API


#1

SSWG Metrics API

Introduction

Almost all production server software needs to emit metrics for observability. The SSWG aims to provide a number of packages that can be shared across the whole Swift on Server ecosystem, so we need some amount of standardisation. Because it's unlikely that all parties can agree on one full metrics implementation, this proposal attempts to establish a metrics API that can be implemented by various metrics backends, which then post the metrics data to systems like Prometheus or Graphite, publish over statsd, write to disk, etc.

Motivation

As outlined above, we should standardise on an API that, if well adopted, would allow application owners to mix and match libraries from different vendors with a consistent metrics solution.

Proposed solution

The proposed solution is to introduce the following types that encapsulate metrics data:

Counter: A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.

counter.increment(100)

Recorder: A recorder collects observations within a time window (usually things like response sizes) and can provide aggregated information about the data sample, for example count, sum, min, max and various quantiles.

recorder.record(100)

Gauge: A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. Gauges are typically used for measured values like temperatures or current memory usage, but also for "counts" that can go up and down, like the number of active threads. Gauges are modeled as a Recorder with a sample size of 1 that does not perform any aggregation.

gauge.record(100)

Timer: A timer collects observations within a time window (usually things like request durations) and provides aggregated information about the data sample, for example min, max and various quantiles. It is similar to a Recorder but specialized for values that represent durations.

timer.recordMilliseconds(100)
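The four calls above suggest protocol shapes along the following lines. This is only a sketch: the names SketchCounter, SketchRecorder, SketchTimer and InMemoryTimer are hypothetical, and the method signatures are inferred from the calls shown here and from the example backend later in this post. It also shows one way recordMilliseconds could be layered on top of a single nanosecond-based requirement, assuming the backend only needs to implement recordNanoseconds.

```swift
import Foundation

// Hypothetical protocol shapes, inferred from the calls in the proposal text.
protocol SketchCounter {
    func increment<DataType: BinaryInteger>(_ value: DataType)
}

protocol SketchRecorder {
    func record<DataType: BinaryInteger>(_ value: DataType)
    func record<DataType: BinaryFloatingPoint>(_ value: DataType)
}

protocol SketchTimer {
    // The single requirement a backend would implement in this sketch.
    func recordNanoseconds(_ duration: Int64)
}

extension SketchTimer {
    // Convenience built on top of the requirement: converts to nanoseconds.
    func recordMilliseconds<DataType: BinaryInteger>(_ duration: DataType) {
        self.recordNanoseconds(Int64(duration) * 1_000_000)
    }
}

// Minimal in-memory timer that just stores every observation.
final class InMemoryTimer: SketchTimer {
    private let lock = NSLock()
    private var durations = [Int64]()

    func recordNanoseconds(_ duration: Int64) {
        self.lock.lock()
        defer { self.lock.unlock() }
        self.durations.append(duration)
    }

    // Snapshot of everything recorded so far, in nanoseconds.
    var recorded: [Int64] {
        self.lock.lock()
        defer { self.lock.unlock() }
        return self.durations
    }
}
```

With this shape, calling recordMilliseconds(100) on an InMemoryTimer stores a single observation of 100_000_000 nanoseconds.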

How would you use counter, recorder, gauge and timer in your application or library? Here is a contrived example of request processing code that emits metrics for total request count per URL, request size, request duration and response size:

    func processRequest(request: Request) -> Response {
      let requestCounter = Metrics.makeCounter("request.count", ["url": request.url])
      let requestTimer = Metrics.makeTimer("request.duration", ["url": request.url])
      let requestSizeRecorder = Metrics.makeRecorder("request.size", ["url": request.url])
      let responseSizeRecorder = Metrics.makeRecorder("response.size", ["url": request.url])

      requestCounter.increment()
      requestSizeRecorder.record(request.size)

      let start = Date()
      let response = ...
      requestTimer.record(Date().timeIntervalSince(start))
      responseSizeRecorder.record(response.size)
      return response
    }

To ensure performance, Metrics.makeXxx will return a cached copy of the metric object, so it can be called on the hot path.
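As a rough illustration of what "returns a cached copy" could mean, here is a minimal sketch of a cache keyed by label plus dimensions. CachedCounter and CounterCache are hypothetical names, and the key format is an assumption made for this example, not something the proposal specifies.

```swift
import Foundation

// Hypothetical metric object; a real counter would also hold a value.
final class CachedCounter {
    let label: String
    init(label: String) { self.label = label }
}

// Sketch of a cache that hands back the same object for the same
// label and dimensions, so repeated lookups on the hot path are cheap.
final class CounterCache {
    private let lock = NSLock()
    private var counters = [String: CachedCounter]()

    func makeCounter(_ label: String, _ dimensions: [String: String] = [:]) -> CachedCounter {
        // Sort the dimensions so the key does not depend on dictionary ordering.
        let dimensionKey = dimensions
            .sorted { $0.key < $1.key }
            .map { "\($0.key)=\($0.value)" }
            .joined(separator: ",")
        let key = label + "|" + dimensionKey

        self.lock.lock()
        defer { self.lock.unlock() }
        if let cached = self.counters[key] {
            return cached
        }
        let counter = CachedCounter(label: label)
        self.counters[key] = counter
        return counter
    }
}
```

Two calls with the same label and dimensions return the identical object, while a different dimension value yields a separate one. Note that a cache like this grows with the number of distinct dimension values, which matters for high-cardinality dimensions such as URLs.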

Detailed design

Implementing a metrics backend (e.g. Prometheus client library)

As seen above, the general function Metrics.makeXxx provides a metric object. This raises the question: which metrics backend will I actually get when calling Metrics.makeXxx? The answer is that it's configurable per application. The application sets up the metrics backend it wishes the whole application to use. Libraries should never change the metrics implementation, as that is something owned by the application. Configuring the metrics backend is straightforward:

    Metrics.bootstrap(MyFavouriteMetricsImplementation.init)

This instructs the Metrics system to install MyFavouriteMetricsImplementation as the metrics backend (MetricsHandler) to use. This should only be done once at the beginning of the program.
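To make the "bootstrap once, owned by the application" rule concrete, here is a minimal sketch of how such a global could be structured. SketchHandler, NoOpHandler, MyFavouriteHandler and SketchMetrics are hypothetical stand-ins for this example; the precondition on repeated bootstrapping is an assumption about how "only once" might be enforced, not a description of the proposal's source.

```swift
import Foundation

// Hypothetical stand-in for the MetricsHandler protocol.
protocol SketchHandler {
    var name: String { get }
}

// Default used until the application bootstraps a real backend.
struct NoOpHandler: SketchHandler {
    let name = "noop"
}

struct MyFavouriteHandler: SketchHandler {
    let name = "favourite"
    init() {}
}

enum SketchMetrics {
    private static let lock = NSLock()
    private static var installed: SketchHandler = NoOpHandler()
    private static var bootstrapped = false

    // Installs the backend produced by `factory`. Assumed to trap when
    // called twice, since only the application should ever call it.
    static func bootstrap(_ factory: () -> SketchHandler) {
        lock.lock()
        defer { lock.unlock() }
        precondition(!bootstrapped, "bootstrap must only be called once")
        installed = factory()
        bootstrapped = true
    }

    // The backend that every library in the process ends up using.
    static var handler: SketchHandler {
        lock.lock()
        defer { lock.unlock() }
        return installed
    }
}
```

In this sketch, libraries that create metrics before the application calls bootstrap simply get the no-op backend, so they never need to know which implementation is installed.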

Given the above, an implementation of a metric backend needs to conform to protocol MetricsHandler:

public protocol MetricsHandler {
    func makeCounter(label: String, dimensions: [(String, String)]) -> Counter
    func makeRecorder(label: String, dimensions: [(String, String)], aggregate: Bool) -> Recorder
    func makeTimer(label: String, dimensions: [(String, String)]) -> Timer
}

Here is an example in-memory implementation:

import Foundation // NSLock, Date

class SimpleMetricsLibrary: MetricsHandler {
    init() {}

    func makeCounter(label: String, dimensions: [(String, String)]) -> Counter {
        return ExampleCounter(label, dimensions)
    }

    func makeRecorder(label: String, dimensions: [(String, String)], aggregate: Bool) -> Recorder {
        let maker:(String,  [(String, String)]) -> Recorder = aggregate ? ExampleRecorder.init : ExampleGauge.init
        return maker(label, dimensions)
    }

    func makeTimer(label: String, dimensions: [(String, String)]) -> Timer {
        return ExampleTimer(label, dimensions)
    }

    private class ExampleCounter: Counter {
        init(_: String, _: [(String, String)]) {}

        let lock = NSLock()
        var value: Int64 = 0
        func increment<DataType: BinaryInteger>(_ value: DataType) {
            self.lock.withLock {
                self.value += Int64(value)
            }
        }
    }

    private class ExampleRecorder: Recorder {
        init(_: String, _: [(String, String)]) {}

        private let lock = NSLock()
        var values = [(Int64, Double)]()
        func record<DataType: BinaryInteger>(_ value: DataType) {
            self.record(Double(value))
        }

        func record<DataType: BinaryFloatingPoint>(_ value: DataType) {
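            // this may lose precision, but good enough as an example
            let v = Double(value)
            // timestamp of the observation, in nanoseconds since 1970
            let now = Int64(Date().timeIntervalSince1970 * 1_000_000_000)
            // TODO: sliding window
            lock.withLock {
                values.append((now, v))
                self._count += 1
                self._sum += v
                // the first observation initialises min and max, so a
                // legitimate value of 0 is not mistaken for "unset"
                if 1 == self._count || v < self._min { self._min = v }
                if 1 == self._count || v > self._max { self._max = v }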
            }
        }

        var _sum: Double = 0
        var sum: Double {
            return self.lock.withLock { _sum }
        }

        private var _count: Int = 0
        var count: Int {
            return self.lock.withLock { _count }
        }

        private var _min: Double = 0
        var min: Double {
            return self.lock.withLock { _min }
        }

        private var _max: Double = 0
        var max: Double {
            return self.lock.withLock { _max }
        }
    }

    private class ExampleGauge: Recorder {
        init(_: String, _: [(String, String)]) {}

        let lock = NSLock()
        var _value: Double = 0
        func record<DataType: BinaryInteger>(_ value: DataType) {
            self.record(Double(value))
        }

        func record<DataType: BinaryFloatingPoint>(_ value: DataType) {
            // this may lose precision, but good enough as an example
            self.lock.withLock { _value = Double(value) }
        }
    }

    private class ExampleTimer: ExampleRecorder, Timer {
        func recordNanoseconds(_ duration: Int64) {
            super.record(duration)
        }
    }
}

which is installed using

    Metrics.bootstrap(SimpleMetricsLibrary.init)

Seeking Feedback

  • if anything, what does this proposal not cover that you will definitely need?
  • if anything, what could we remove from this and still be happy?
  • API-wise: what do you like, what don't you like?

Feel free to post feedback as a message on the SSWG forum and/or as GitHub issues in this repo.


February 7th, 2019
(Konrad `Ktoso` Malawski) #2

Thanks for the writeup and APIs Tom!

There's a slight mismatch between the code and the proposal text: either callers need to write Metrics.global.makeCounter (as the sources indicate), or we provide extensions that do this for us and allow Metrics.makeCounter, as the proposal text has it. I don't think the hop over global changes much, as it is visibly global already, so I think this was an omission. I added the missing extensions here: https://github.com/tomerd/swift-server-metrics-api-proposal/pull/5


As for the proposal itself:

What is your evaluation of the proposal? [added from the questions in the previous Logging proposal, good question IMHO :slight_smile:]

Overall I'm in favor of the proposal :+1:

Some smaller things still need fleshing out though from my perspective, see below.
The types and initialization look fine, and it seems we can express all typical needs and build on top of those core types.

if anything, what does this proposal not cover that you will definitely need?

I believe we are missing an explicit lifecycle. Without defining some way to release() a metric, even if many implementations would make it a no-op, we risk the API being leaky once an implementation arrives that needs eager releasing.

I explain the need for an explicit lifecycle in https://github.com/tomerd/swift-server-metrics-api-proposal/issues/6 and the illustrating PR: https://github.com/tomerd/swift-server-metrics-api-proposal/pull/7 If we agree that we need a lifecycle, I'd stick to the style I proposed there, where we call metrics.release(metric), rather than what I initially thought of, which was metric.release() (that makes the metric objects needlessly heavier).

The rationale is that for systems that poll or cache metrics and may contain heavy structures (high resolution histograms), we really want to be in charge of when to release them. And this has to be in the common API, as otherwise 3rd party libraries will not be able to abide by it if a metrics library comes in which needs a lifecycle (Prometheus is an example).
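To illustrate the metrics.release(metric) style, here is a minimal sketch of a registry that holds heavy metric objects and lets the caller hand them back. HeavyHistogram and ReleasingRegistry are hypothetical names made up for this example; the linked PR may look quite different.

```swift
import Foundation

// Hypothetical heavy metric, e.g. a high resolution histogram.
final class HeavyHistogram {
    let label: String
    var buckets = [Double](repeating: 0, count: 1024)
    init(label: String) { self.label = label }
}

// Sketch of a backend-side registry with an explicit release step.
final class ReleasingRegistry {
    private let lock = NSLock()
    private var histograms = [String: HeavyHistogram]()

    func makeHistogram(label: String) -> HeavyHistogram {
        self.lock.lock()
        defer { self.lock.unlock() }
        if let existing = self.histograms[label] {
            return existing
        }
        let histogram = HeavyHistogram(label: label)
        self.histograms[label] = histogram
        return histogram
    }

    // Release lives on the registry, not on the metric, so metric
    // objects do not need a back-reference to whoever created them.
    func release(_ histogram: HeavyHistogram) {
        self.lock.lock()
        defer { self.lock.unlock() }
        // Only drop the entry if it is still this exact instance.
        if self.histograms[histogram.label] === histogram {
            self.histograms[histogram.label] = nil
        }
    }

    var count: Int {
        self.lock.lock()
        defer { self.lock.unlock() }
        return self.histograms.count
    }
}
```

A backend that does not need eager releasing could implement release as a no-op, while one that polls or scrapes could defer the actual removal until the metric has been collected.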

if anything, what could we remove from this and still be happy?

I believe we should not ship with a caching proxy (CachingMetricsHandler) like the one in: https://github.com/tomerd/swift-server-metrics-api-proposal/blob/master/Sources/CoreMetrics/Metrics.swift#L121-L178 Since it cannot assume anything about the metrics types, it falls back to very simple caching, and releasing resources is then quite heavy. Real implementations would know their metrics types exactly (and the IDs under which they store them), and can therefore implement the cache more efficiently and smartly (e.g. yes, release, but only once the metric has been scraped at least once by a collector).

API-wise: what do you like, what don't you like?

I'll avoid the naming bikeshed :slight_smile: The names are good enough.

I think we should align naming of the "MUX" with what the logging proposal has; i.e. MultiplexingLogHandler -> MultiplexingMetricsHandler.
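For readers unfamiliar with the "MUX" idea, here is a minimal sketch of a multiplexing handler that fans each metric out to several backends. It is reduced to counters only, and every type besides the MultiplexingMetricsHandler name is a hypothetical stand-in; it is not thread-safe and is meant only to show the fan-out.

```swift
// Hypothetical, counter-only stand-ins for the proposal's protocols.
protocol SketchCounter: AnyObject {
    func increment(_ value: Int64)
}

protocol SketchCounterHandler {
    func makeCounter(label: String) -> SketchCounter
}

// Simple backend used to observe what the multiplexer forwards.
final class InMemoryCounter: SketchCounter {
    private(set) var value: Int64 = 0
    func increment(_ value: Int64) {
        self.value += value
    }
}

final class InMemoryHandler: SketchCounterHandler {
    private(set) var counters = [String: InMemoryCounter]()
    func makeCounter(label: String) -> SketchCounter {
        if let existing = self.counters[label] {
            return existing
        }
        let counter = InMemoryCounter()
        self.counters[label] = counter
        return counter
    }
}

// A counter that forwards every increment to all wrapped counters.
final class MuxCounter: SketchCounter {
    private let counters: [SketchCounter]
    init(_ counters: [SketchCounter]) {
        self.counters = counters
    }
    func increment(_ value: Int64) {
        self.counters.forEach { $0.increment(value) }
    }
}

// Creates one counter per backend and wraps them in a MuxCounter.
struct MultiplexingMetricsHandler: SketchCounterHandler {
    private let handlers: [SketchCounterHandler]
    init(handlers: [SketchCounterHandler]) {
        self.handlers = handlers
    }
    func makeCounter(label: String) -> SketchCounter {
        return MuxCounter(self.handlers.map { $0.makeCounter(label: label) })
    }
}
```

An application could bootstrap with such a handler to send the same metrics to, say, a Prometheus backend and a statsd backend at once.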

The MetricsHandler interface seems good enough to build all kinds of abstractions on top of it.

Initializing via bootstrap is fine, as it keeps this similar to logging.

Overall I think this is heading in a solid direction, and I look forward to the last polishing touchups.


(Jari (LotU)) #3

Great post @tomerd. I like the direction of this, I only have one point of "worry".

There are many metric providers out there, and each of them handles things differently. What "worries" me is that by abstracting metrics like we abstracted logging, we might make it harder to use the full capability and flexibility of a certain metrics provider.

I think with logging this is different because, even though there are multiple ways to log, most if not all logging is about writing some kind of text buffer to some kind of output, where the way the text is processed and the output are interchangeable. With metrics, however, it's not always a person who will read this output; it might be another program that needs specific formatting and expects data in a specific way.
For example, with the implementation you sketched out above, a metric would have a label, a set of dimensions and a value, which would be the actual metric. If I take Prometheus as an example, it allows for an extra property, help, which in my opinion is a very useful bit of information to have, since it describes what the metric actually does. We could include this in the MetricsHandler protocol, but there might be a lot of handlers that don't need a help property, or that need some other specific property, which would then blow up the protocol.
Another option would be to add an extra, Prometheus-specific method to your handler implementation, specifically the one where you create a metric, that takes in a help text. But I think if we did that, we'd defeat the point of centralizing the API, since you'd use the Prometheus-specific one anyway.

I hope that what I wrote down here makes sense, and if I'm overlooking something obvious, please let me know :slight_smile:

Other than this, I see no big flaws or other things worth noting in this proposal, and I like the general direction of where it's going.

Sidenote: This does not mean I'm not in favor of a centralized metrics API, I just want to point out this (for me) hole in the road, that I think we should fix.


(Nathan Harris) #4

@MrLotU What else does the help property do in Prometheus' context?

My assumption is thus:

If an application chooses to use Prometheus and wants to make use of the help property, it will work in a PrometheusMetrics "context", using more concrete types in functions, properties, etc. rather than the high-level Metrics context.

However, in some low-level shared library, such as NIORedis, metrics can be provided if a Metrics object is provided. At that level, it might not be as necessary to have the help property, and the trade-off of losing that property over having metrics at all for the layer is probably worth it.

However, I could be missing what some of these types of properties, like help that you mentioned, ultimately do for each framework / service.

EDIT: Cleaned up how thoughts were expressed to be clearer.


(Jari (LotU)) #5

@Mordil good point, I hadn't thought of the fact that we could use the lower level in some parts and the higher level in other parts.

In this case, help does not do much in a Prometheus context; I just took it as an example of a metrics-library-specific field that would be lost if we abstracted it away.

As a possibility it might be a solution to have something like this:

Metrics.bootstrap(PrometheusClient.init)

And then later in your code, where you'd need your specific instance, you could do something like this:

Metrics.Prometheus().makeCounter(...)

I think this could be achieved like this:

// In this case this'd go in the Prometheus library/package
extension Metrics {
    static func Prometheus() throws -> PrometheusClient {
        // Get a hold of the bootstrapped provider
        // Note: Pseudo code
        guard let provider = self.provider as? PrometheusClient else {
            throw MetricsError.wrongBootstrapType
        }
        return provider
    }
}

You would only have to use this if you really care about or really need a provider-specific thing.

Please let me know what you think :smile:


(Nathan Harris) #6

@MrLotU Correct! That was exactly my point (I should've written the code myself to better explain :slight_smile:)


(Jari (LotU)) #7

Cool! Nice that we’re on the same page :smiley: