Clock synchronization

ktoso · November 2, 2021, 10:33am

Hello everyone,
now that we've released our work-in-progress distributed actor cluster, I've got a bit more time to reply here.

To be clear though, the "deadlines are a point in time" idea is not really because of distributed actors per se, but because any such "multiple threads / tasks / processes / computers" system (even a bunch of microservices) will require such a representation. Hopefully this post can clarify why.

@Dante-Broggi's example is quite good -- that's one of the issues. That's what we mean by "timeouts don't compose". To give another typical and very visual example of how "timeouts" are just outright wrong when transmitted to other peers:

The client wants to express: "give me a response within 5 seconds" (I'll use seconds for simple math).
They do so at time T0.

Consider sending the request with a timeout:

- set timeout 5 seconds
- locally schedule a timer to fail / stop waiting after 5 seconds (so far so good)
- also, transmit to the server "If it takes you longer than 5 seconds to produce a result, I'll already have cancelled my waiting and won't use the value; so don't bother producing the result anymore"

a) if you encode this as "5 seconds" in the request this is what happens:

"client: ok, timeout 5 seconds"
T0                                         Tcd                             Tx
|----\[serialize][send] ~ network latency ~ [server receive][server replies]|
      \---------\                          <client timeout>                 |
                 \----\                                                     |
                       \~~~~~~~~~~~~~~~~~~~\                                |
                                            \--------------\                |
                                                            \--------------\|
                                            ^
                                            ^
                                            ^
"oh, client wants 5 seconds... ok, set timeout 5 seconds..." --------------->

The server simply taking "oh, 5 seconds, ok" at face value is not what the client expressed, and that timeout is pretty useless: it fails to account for any async/serialization/network latency involved in the message transfer.

b) the same expressed as a "point in time" deadline works correctly, assuming the clock drift between the two isn't too terribly out of whack. I'd suggest using generous deadlines here, not super tight ones: the goal is to avoid "clearly unnecessary work", not to promise "ok, it'll definitely complete in 20ms"...

"client: ok, deadline = now() + 5 seconds"
T0                                         Tcd                             
|----\[serialize][send] ~ network latency ~ [server receive][server replies]
      \---------\                          <client deadline exceeded>       
                 \----\                                                     
                       \~~~~~~~~~~~~~~~~~~~\                                
                                            \--------------\                
                                                            \ cancelled=true
                                            ^
                                            ^
                                            ^
"ok, deadline min(now + maxAllowedTimeout, Tcd) == Tcd" --> xxxxxxxxxxxxxxxxxxxxxxx
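To put numbers on the two diagrams, here is a minimal sketch. The 2 second transit latency is an assumed value chosen only for illustration, and all times are plain seconds on one shared, idealized timeline (no clock drift).

```swift
// All times are seconds on a shared, hypothetical timeline starting at T0 = 0.
let t0 = 0.0               // client sends the request
let clientBudget = 5.0     // "give me a response within 5 seconds"
let transitLatency = 2.0   // ASSUMED serialize + send + network + receive time

let serverReceivesAt = t0 + transitLatency

// a) Relative timeout on the wire: the server restarts the 5 s count on arrival.
let serverCutoffFromTimeout = serverReceivesAt + clientBudget

// b) Absolute deadline on the wire: the instant Tcd is fixed by the client.
let tcd = t0 + clientBudget
let serverCutoffFromDeadline = tcd

// Window in which the server may still work although the client stopped waiting.
let wastedWithTimeout = max(0, serverCutoffFromTimeout - tcd)
let wastedWithDeadline = max(0, serverCutoffFromDeadline - tcd)

print("timeout:  server gives up at \(serverCutoffFromTimeout)s, wasted window \(wastedWithTimeout)s")
print("deadline: server gives up at \(serverCutoffFromDeadline)s, wasted window \(wastedWithDeadline)s")
```

With the relative timeout, the wasted window is exactly the transit latency; with the deadline it is zero, as long as both sides roughly agree on what time it is.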

Notice that the server does not have to JUST blindly trust that point in time given by the client.

The client's clock could be complete nonsense, and we guard against that in two ways:

- if the Tcd point in time we got is already in the past -> we can immediately cancel and not even begin work on this request; the client is out of whack (we can have some tolerance threshold here etc...)
  - again... those deadlines are best-effort and generous timeouts -- and not "100ns"
- if the Tcd is way too far in the future, we bound it with something the server considers a max timeout for a request.
  - This makes sense because services usually have some SLO (service level objective) for how fast they should be able to reply to requests. So if the SLO is 400ms, we can set this to some "well, if this is taking longer than our SLO, definitely cancel the work" (and alert that we're missing our objectives).
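The two guards could be sketched roughly like this. `DeadlineDecision`, `effectiveDeadline`, and the `tolerance` parameter are hypothetical names made up for this illustration (only `maxAllowedTimeout` appears in the diagram above); this is not an existing API.

```swift
import Foundation

/// Outcome of validating a client-supplied deadline on the server.
enum DeadlineDecision: Equatable {
    case reject                   // Tcd is already in the past: don't start the work
    case accept(effective: Date)  // run with this (possibly tightened) deadline
}

/// Hypothetical helper; `maxAllowedTimeout` and `tolerance` are illustrative knobs.
func effectiveDeadline(
    clientDeadline tcd: Date,
    now: Date,
    maxAllowedTimeout: TimeInterval,
    tolerance: TimeInterval = 0
) -> DeadlineDecision {
    // 1. Deadline already passed (beyond the tolerance): cancel immediately.
    if tcd.addingTimeInterval(tolerance) < now {
        return .reject
    }
    // 2. Deadline too far in the future: bound it by the server's own limit,
    //    i.e. min(now + maxAllowedTimeout, Tcd).
    let serverLimit = now.addingTimeInterval(maxAllowedTimeout)
    return .accept(effective: min(tcd, serverLimit))
}

// Usage: a client asks for 5 s, but the server's SLO-derived limit is 400 ms.
let arrival = Date(timeIntervalSince1970: 1_000)
let decision = effectiveDeadline(
    clientDeadline: arrival.addingTimeInterval(5),
    now: arrival,
    maxAllowedTimeout: 0.4
)
print(decision)
```

Passing `now` in explicitly rather than calling `Date()` inside keeps the helper easy to test; a real service would likely also log or alert when it rejects or tightens a deadline.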

So... does anyone do this? Yeah, it's common in Go services, for example via deadlines carried on a `context.Context`.

The same applies equally to plain asynchronous tasks running locally -- you want to set a specific deadline by which tasks should be completed, and not just "5 seconds", which doesn't mean anything, as shown by @Dante-Broggi's example. E.g. every time you enter a function with "timeout 5 seconds" it keeps being the same 5 seconds... even if it's already many seconds past the "5 seconds from the beginning of the first task".
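A small simulation of that local case, using a made-up clock in plain seconds. The function names and the three 3-second steps are assumptions for illustration only, and the deadline check is cooperative (it runs between steps, not in the middle of one).

```swift
/// Timeout style: every step gets a fresh budget measured from its own start.
func runWithTimeouts(stepDurations: [Double], timeout: Double) -> (completed: Int, finishedAt: Double) {
    var clock = 0.0
    var completed = 0
    for duration in stepDurations {
        // The budget restarts here, so time spent by earlier steps is forgotten.
        if duration > timeout { break }
        clock += duration
        completed += 1
    }
    return (completed, clock)
}

/// Deadline style: one instant is fixed up front and checked before every step.
func runWithDeadline(stepDurations: [Double], deadline: Double) -> (completed: Int, finishedAt: Double) {
    var clock = 0.0
    var completed = 0
    for duration in stepDurations {
        // Cooperative check: do not start new work once the deadline has passed.
        if clock >= deadline { break }
        clock += duration
        completed += 1
    }
    return (completed, clock)
}

let steps = [3.0, 3.0, 3.0]
let viaTimeouts = runWithTimeouts(stepDurations: steps, timeout: 5)
let viaDeadline = runWithDeadline(stepDurations: steps, deadline: 0 + 5)
print("timeouts: \(viaTimeouts.completed) steps, done at \(viaTimeouts.finishedAt)s")
print("deadline: \(viaDeadline.completed) steps, done at \(viaDeadline.finishedAt)s")
```

Each step individually fits inside "5 seconds", so the timeout version happily runs all three and finishes at 9 s; the deadline version stops starting new work as soon as the single 5 s instant has passed.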

Also, in Swift Concurrency we'll want to do the same thing:

// PSEUDO CODE, NOT A PROPOSAL
task.deadline = .now() + .seconds(5)
// oh, the deadline was already "now() + .seconds(1)" -> DON'T UPDATE THE DEADLINE

If this task, or the parent task, already had a deadline set, and it is earlier than this new one -- we'd NOT extend the deadline. And that's a feature.
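The "never extend" rule boils down to taking the minimum of the existing and the proposed deadline. A minimal sketch, where `DeadlineBox` and `narrow(to:)` are hypothetical stand-ins and not Swift Concurrency APIs:

```swift
import Foundation

/// Hypothetical stand-in for a task's deadline storage; NOT a Swift Concurrency API.
struct DeadlineBox {
    private(set) var deadline: Date?

    /// Tightens the deadline, but never extends one that is already earlier.
    mutating func narrow(to proposed: Date) {
        if let current = deadline {
            deadline = min(current, proposed)
        } else {
            deadline = proposed
        }
    }
}

let start = Date(timeIntervalSince1970: 0)
var box = DeadlineBox()
box.narrow(to: start.addingTimeInterval(1)) // parent: 1 second from start
box.narrow(to: start.addingTimeInterval(5)) // child asks for 5 seconds: ignored
print(box.deadline == start.addingTimeInterval(1)) // the earlier deadline wins
```

Because the setter is private and the only mutation path goes through `min`, a child can shorten the time budget it inherited but can never grant itself more.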

As for the questions on "can we trust clocks"...

You have this reversed: datacenters are the ones with very well synchronized clocks, vastly better network hardware and (more) predictable latencies, especially compared to devices on random flaky networks.

Since we're nerding out here a bit... here's some more fun reading about advanced clock synchronization systems. They're only used by specialized applications, but just FYI that NTP isn't the "endgame" for these things:

Clock synchronization techniques have had an exciting renaissance recently, ever since Google's TrueTime (check out Spanner's TrueTime [1], Amazon TSS [2], Sundial [3]). Those can get clocks synchronized down to hundreds of ns.

But anyway, not all systems need these super synchronized clocks at all (Spanner needs them because it uses them to commit transactions). NTP gets you around 100ms synchronization AFAIR (again, don't trust client devices though), so honestly for "plain old" service stuff it's quite enough...

Fun reading... none of this is a requirement for "normal boring 1 second resolution" deadlines for best effort request cancellation, but it's a very nice read:

[1] Spanner: "Spanner: Google’s Globally Distributed Database" and "Spanner, TrueTime and the CAP Theorem" (Google Research)
[2] Amazon's Time Sync Service (it's not ntpd): "Manage Amazon EC2 instance clock accuracy using Amazon Time Sync Service and Amazon CloudWatch – Part 1" (AWS Cloud Operations & Migrations Blog). It seems this has been out since 2017 as well: "Introducing the Amazon Time Sync Service"
[3] Google's recent paper on Sundial: "Sundial: Fault-tolerant Clock Synchronization for Datacenters" (Google Research)
- Sundial is very very impressive (TrueTime already was very impressive)
- "Through experiments in a 500-machine testbed and large-scale simulations, we show that Sundial can achieve ∼100ns time-uncertainty bound under different types of failures, which is more than two orders of magnitude lower than the state-of-the-art solutions."