Logging

Good point, we never really defined what 'lazy rendering' actually is. Let's say we use os_log to log

os_log_error(OS_LOG_DEFAULT, "%d %d", 1, 2);

it doesn't actually do the string interpolation at the call site. It just ships the static string "%d %d" and the 1 and the 2 as binary data to the logging system. The string interpolation is done when you actually look at the message using the log utility, Console.app, or similar. That's really quite cool because you shift some of the cost from the producer of log messages to the consumer. On the server, however, there are lots of standard solutions which expect textual/structured output in a certain format, so there's not necessarily that much value in it. Again, very happy to change my mind if there's evidence that it will help.
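
Just to illustrate the idea in Swift terms, here's a minimal sketch of how a logging API can get a (weaker) form of this with @autoclosure, so the interpolation only runs if the message is actually going to be emitted. All names are made up for illustration; os_log goes further by shipping the format string plus the raw arguments even for enabled messages.

    struct LazyLogger {
        var isEnabled: Bool = true

        // The message is passed as an unevaluated closure; the string
        // interpolation only runs if this logger is enabled and the backend
        // actually renders the message.
        func error(_ message: @autoclosure @escaping () -> String) {
            guard isEnabled else { return }
            ship(messageProducer: message)
        }

        func ship(messageProducer: @escaping () -> String) {
            // A real backend would hand the closure (or, like os_log, the
            // format string plus the raw arguments) further down the pipeline
            // and render it on the consumer side.
            print(messageProducer())
        }
    }

    let logger = LazyLogger(isEnabled: false)
    logger.error("expensive: \(1) \(2)")   // the interpolation never runs here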

@Joe_Groff, was it something like this:

[app thread 1] \
                \
[app thread 2]   +===logging firehose (1)===> [logger daemon thread] ==(2)==> logging mechanism (ie. file/db/stdout/...)
                /
[app thread 3] /

where we have essentially two pipes: the logging firehose (1) and then a rendering stage (2). I think Joe was suggesting to support lazy rendering in (1) but not in (2). That way we could shift the rendering overhead from the log message producers (app thread N) to the logger daemon thread. The benefit would be that, in case the logging mechanism supports lazy rendering, we could just forward the unrendered message to it.
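
To sketch the shape of (1) and (2), with a single serial queue playing the "logger daemon" and draining unrendered messages (illustrative only, not a concrete API):

    import Dispatch

    // Stage (1): app threads enqueue an unrendered message (a closure) cheaply.
    // Stage (2): a single serial queue acts as the "logger daemon" and renders it.
    struct FirehoseLogger {
        let daemon = DispatchQueue(label: "logger-daemon")

        func log(_ makeMessage: @escaping () -> String) {
            daemon.async {
                // Rendering happens here, not on the app thread. If the backing
                // logging mechanism supported lazy rendering itself, we could
                // forward `makeMessage` instead of rendering it now.
                let rendered = makeMessage()
                print(rendered)   // stand-in for file/db/stdout/...
            }
        }
    }

    let log = FirehoseLogger()
    log.log { "handled request \(42) in \(7)ms" }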

4 Likes

:+1: This makes sense to me.

Injection works, but it is more "setup/boilerplate", which is often annoying (but not insurmountable). I think it would be easier to have how you reach it defined here, and have all general-purpose libraries use the same mechanism.

Yes, that's how I would imagine it too. An HTTP server carries around some context anyway, so it'd be able to hand you the logger from there. For SwiftNIO we can build a feature that lets you attach context to a Channel. I just wanted to be upfront because there are a lot of 'general purpose' libraries, and carrying around extra context is invasive (but IMHO what we need to do).
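
To make the "hand you the logger from the context" idea concrete, a hypothetical shape (none of these types exist; a real framework would define its own context):

    // Hypothetical minimal logger protocol, purely for illustration.
    protocol Logger {
        func info(_ message: String)
    }

    // Hypothetical request context that a framework (or a NIO Channel
    // attachment) could thread through to handlers and libraries.
    struct RequestContext {
        var logger: Logger
        var requestID: String
    }

    // A general-purpose library function doesn't reach for a global logger;
    // it uses whatever logger the caller's context hands it.
    func handleUpload(context: RequestContext) {
        context.logger.info("handling upload for request \(context.requestID)")
    }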

In a previous life I worked on the "trace engine" inside IBM's Java Virtual Machine. It uses a somewhat similar model.

Each thread in the JVM (Java threads that are running bytecode etc. and internal JVM threads) has an in-memory trace buffer. When an enabled tracepoint is hit (either via a hook from the running Java code or directly from the JVM's C++ code) the tracepoint data is written into the thread's local buffer. This data includes the tracepoint identifier, optionally a high-res timestamp, and the tracepoint's parameter data (the final three parameters in your os_log_error() example). Because the trace buffer is thread-local, writing to it is extremely fast and does not require a lock.

When the trace buffer fills, the thread grabs a new buffer from a pool (using CAS), and ships the full buffer to a separate "buffer writer" thread for processing. If the user has enabled file tracing, the buffer writer thread writes the raw unformatted trace buffer to disk (along with some metadata).
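
Boiled down into Swift-flavoured pseudocode (the types and the buffer pool are invented for illustration; the real engine is C++), the per-thread write path looks roughly like this:

    import Dispatch

    // Illustrative stand-in for the per-thread trace buffer.
    final class TraceBuffer {
        var bytes: [UInt8] = []
        let capacity = 8 * 1024
        var isFull: Bool { return bytes.count >= capacity }

        func append(tracepointID: UInt32, timestamp: UInt64, parameters: [UInt8]) {
            // No lock needed: this buffer is only ever touched by its owning thread.
            withUnsafeBytes(of: tracepointID.littleEndian) { bytes.append(contentsOf: $0) }
            withUnsafeBytes(of: timestamp.littleEndian) { bytes.append(contentsOf: $0) }
            bytes.append(contentsOf: parameters)
        }
    }

    // The dedicated "buffer writer" thread, modelled here as a serial queue.
    let bufferWriter = DispatchQueue(label: "trace-buffer-writer")

    func persist(_ buffer: TraceBuffer) {
        // Write the raw, unformatted buffer (plus metadata) to the trace file;
        // formatting into text happens offline, in the trace formatter.
    }

    // Called whenever an enabled tracepoint is hit on a given thread.
    func tracepointHit(current: inout TraceBuffer, id: UInt32, parameters: [UInt8]) {
        current.append(tracepointID: id,
                       timestamp: DispatchTime.now().uptimeNanoseconds,
                       parameters: parameters)
        if current.isFull {
            let full = current
            current = TraceBuffer()   // stand-in for grabbing a fresh buffer from the pool (CAS in the real engine)
            bufferWriter.async { persist(full) }
        }
    }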

There is a separate offline "trace formatter" tool which processes the binary trace file and generates a text file output for analysis.

It's obviously quite a complex subsystem, but it has proven invaluable over the years, for debugging both problems in user code (because you can pipe arbitrary Java methods to the trace engine, including support for globs) and bugs in the JVM (which is very well instrumented internally).

If anyone is interested in learning more, have a look at the IBM Documentation.

3 Likes

Oh cool, very good to have you over here! Just so I fully understand: if the thread-local buffer has filled up and a fresh one is created using an atomic compare-and-swap, what happens to the old one? Will it still be processed? If yes: what happens if the production of messages (into the buffers) is continuously faster than the consumption (out of the buffers)? Does that just balloon in memory, or is there some back-pressure mechanism?

Yes, the full buffer is shipped to a separate thread for processing.

By default, if the production of tracepoints into buffers is faster than they are processed, memory usage increases over time. There are diagnostics emitted which indicate that this may be happening.

Alternatively, if the user chooses to use "non-dynamic" buffers, each thread is allowed a maximum of 2 buffers: one that it's using to write tracepoints, and one that is being processed. If the current buffer fills while the other buffer is still being processed, tracepoints are dropped and a diagnostic is emitted.
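
Boiled down, the "non-dynamic" policy is essentially this (purely illustrative):

    // Per-thread state under the "non-dynamic" policy: at most two buffers,
    // one being written into and at most one in flight to the buffer writer.
    struct NonDynamicBuffers {
        var current: [UInt8] = []
        var inFlight: [UInt8]? = nil   // set back to nil once the writer has processed it
        var droppedTracepoints = 0
        let capacity = 8 * 1024

        mutating func write(_ record: [UInt8]) {
            if current.count + record.count <= capacity {
                current.append(contentsOf: record)
            } else if inFlight == nil {
                inFlight = current        // ship the full buffer to the writer
                current = record          // and keep writing into the second buffer
            } else {
                droppedTracepoints += 1   // both buffers busy: drop it and emit a diagnostic later
            }
        }
    }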

I should highlight a major drawback of each thread having its own in-memory buffer. Because different threads fill and ship their buffers at different rates, trace data is written to disk out of order. It is the job of the offline formatter to sort the tracepoints into chronological order.

I am not proposing we use this model for logging in Swift, because I don't think requiring an offline formatter in order to get a chronological log is a viable option.

Well, I think you may have misunderstood me a little here. I don't think the user should do this. What I'm saying is that those low-level NIO libraries would be integrated by a bigger framework like Kitura. So the user wouldn't deal with those details but rather have a higher-level framework do the setup and configuration for them.

I.e. put as little "opinionation" as possible into those libraries and leave it to the frameworks to provide the APIs they consider reasonable.

1 Like

I really like this concept :+1:.

If logs are going to be buffered, what happens when the application crashes? Are log messages going to be missing?

2 Likes

Very good question that we need to consider. Options we have:

  • decide against any buffering
  • accept that crashes lose the last messages (probably a bad idea)
  • fork() a process instead of using a thread so we can write them out even after a crash (not portable to Windows, I suspect)
  • catch the SIGILL in-process and only exit the process once the logger daemon thread has caught up and processed everything in the buffer. This works but relies on Swift implementation details. I suspect @Joe_Groff has an opinion here and probably doesn't think that's a good idea ;)

Did I forget anything?

3 Likes

FWIW the IBM JVM installs a sigaction() handler at startup which flushes buffered tracepoints so they are not lost on SIGSEGV etc. I remember someone saying that Swift doesn't play nice with that approach, but I don't remember why.

Yes, that's what I meant by 'catching the SIGILL'. It works fine, I've used it, but it relies on implementation details of how Swift implements traps. If it were to use kill(getpid(), SIGKILL); pause() in the future, this trick wouldn't work.
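
Roughly, the trick looks like this (flushLogBuffers is hypothetical, async-signal-safety is hand-waved, and it assumes the trap actually raises a signal we can catch):

    #if canImport(Darwin)
    import Darwin
    #else
    import Glibc
    #endif

    // Hypothetical: drains whatever is still sitting in the in-memory log
    // buffers. A real handler would have to stick to async-signal-safe calls
    // (realistically: write(2) to an already-open file descriptor).
    func flushLogBuffers() { /* ... */ }

    // Flush the buffered messages, then restore the default disposition and
    // re-raise so the process still dies (and still produces a crash report).
    func installCrashFlushHandler(for sig: Int32) {
        signal(sig) { caught in
            flushLogBuffers()
            signal(caught, SIG_DFL)
            raise(caught)
        }
    }

    installCrashFlushHandler(for: SIGILL)    // the signal Swift traps surface as today
    installCrashFlushHandler(for: SIGSEGV)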

But we control all the pieces, don't we? ("we" being "the Swift project")

1 Like

Yes and no, I guess. No, because probably not all platforms surface a trap using a signal that you can catch in user space. Yes, because on those platforms the runtime could offer a hook :upside_down_face:. But this is really something for the compiler folks to weigh in on. CC @Joe_Groff/@John_McCall

1 Like

We need to be careful about installing signal handlers for the specific requirements of one library because it can interfere with other legitimate reasons that the application might want to install signal handlers. Swift generally tries to not "take over" the process.

If platforms don't provide any other way to perform emergency process cleanup, then maybe we do need to make our own system. It should be a cohesive and well-designed system in that case, though, so that e.g. other systems that need to take over the signal handler themselves can still trigger our cleanups.

4 Likes

The topic of emergency cleanup also came up in this thread about the commonly-requested ability to soft-land when subprocesses crash so they don't immediately take down an entire supervising process. That would also require a way for us to install some kind of handling logic without interfering with other signal handlers the process may have set up independently of Swift (and as @johannesweiss noted, it'd require us to standardize what trap Swift code emits to take down a process in a semi-recoverable way). It would be nice if we could guarantee that at least some level of log messages get flushed even when the process crashes, since log messages are often intended to diagnose crash conditions.

3 Likes

This is not dissimilar to how logging works in a Cloud Foundry environment. Applications hosted in such an environment have lightweight logging methods that simply write to an application container's sysout, syserr, and /dev/log. From there, a daemon called the "Metron Agent" gathers logs from these sources, converts them into events (encoded using GPB), and sends the events to a server for distribution to Firehose end-points.

In my experience, this works great for the application. However, environments hosting applications that generate great volumes of logs need to throw serious resources at supporting the Firehose.

I know of no log aggregator (e.g., Splunk, ELK) that can consume the Firehose protocol directly. Hence, "nozzles" are necessary to filter and transform log events into RFC 5424-compliant syslog messages, which consumes more resources.

However, the Cloud Foundry Firehose does convey events other than log messages, such as metrics, HTTP start-stop times, heartbeats and more. AppDynamics is one company I know of that consumes metrics directly from the Firehose for Enterprise scale monitoring.

1 Like

I think Python does a great job with its logging module. It allows you to:

  • Use the global logger or create custom local ones.
  • Filter the emitted messages by multiple criteria, including the origin module.
  • Attach multiple handlers (or customize the default ones) to each logger and configure them to only process some filtered messages.

Having similar features in a new Swift Logger would be very handy and powerful.
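
Purely as a thought experiment, a Swift shape with those three capabilities might look something like this (all names invented here, not a proposal):

    // Invented types, just to show the shape of "multiple handlers + filters".
    enum LogLevel: Int, Comparable {
        case debug, info, warning, error
        static func < (lhs: LogLevel, rhs: LogLevel) -> Bool { return lhs.rawValue < rhs.rawValue }
    }

    struct LogRecord {
        var level: LogLevel
        var module: String
        var message: String
    }

    protocol LogHandler {
        func emit(_ record: LogRecord)
    }

    struct StdoutHandler: LogHandler {
        func emit(_ record: LogRecord) {
            print("[\(record.level)] \(record.module): \(record.message)")
        }
    }

    struct Logger {
        var module: String
        var handlers: [LogHandler] = [StdoutHandler()]          // attach one or more handlers
        var filters: [(LogRecord) -> Bool] = []                  // e.g. filter by level or origin module

        func log(_ level: LogLevel, _ message: String) {
            let record = LogRecord(level: level, module: module, message: message)
            guard filters.allSatisfy({ $0(record) }) else { return }
            handlers.forEach { $0.emit(record) }
        }
    }

    var logger = Logger(module: "MyHTTPServer")
    logger.filters.append { $0.level >= .warning }   // only warnings and up
    logger.log(.error, "something went wrong")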