Hello everyone,
now that we've released our work-in-progress distributed actor cluster, I have a bit more time to reply here
To be clear though, "deadlines are a point in time" is not really because of distributed actors per se, but because any such "multiple threads / tasks / processes / computers" system (even a bunch of microservices) requires such a representation. Hopefully this post can clarify why.
@Dante-Broggi's example is quite good -- that's one of the issues. That's what we mean by "timeouts don't compose". To give another typical and very visual example of how "timeouts" are just plain wrong when transmitted to other peers:
- Client wants to express: "give me a response, within 5 seconds" (I'll use seconds for simple math).
- they do so at time T0
Consider sending the request with a timeout:
- set timeout 5 seconds
- locally schedule a timer to fail / stop waiting after 5 seconds (so far so good)
- also, transmit to the server "If it takes you longer than 5 seconds to produce a result, I'll already have cancelled my waiting and won't use the value; so don't bother producing the result anymore"
a) If you encode this as "5 seconds" in the request, this is what happens:
"client: ok, timeout 5 seconds"
T0 Tcd Tx
|----\[serialize][send] ~ network latency ~ [server receive][server replies]|
\---------\ <client timeout> |
\----\ |
\~~~~~~~~~~~~~~~~~~~\ |
\--------------\ |
\--------------\|
^
^
^
"oh, client wants 5 seconds... ok, set timeout 5 seconds..." --------------->
The server just taking the "oh, 5 seconds, ok" at face value is not what the client expressed, and that timeout is pretty useless: it fails to account for any async/serialization/network latency involved in the message transfer.
b) The same request expressed as a "point in time" deadline works correctly, assuming the clock drift between the two machines isn't too terribly out of whack. I'd suggest using generous deadlines here, not super tight ones -- this is about avoiding "clearly unnecessary work", not about guaranteeing "ok, it'll definitely complete in 20ms"...
"client: ok, deadline = now() + 5 seconds"
T0 Tcd
|----\[serialize][send] ~ network latency ~ [server receive][server replies]
\---------\ <client deadline exceeded>
\----\
\~~~~~~~~~~~~~~~~~~~\
\--------------\
\ cancelled=true
^
^
^
"ok, deadline min(now + maxAllowedTimeout, Tcd) == Tcd" --> xxxxxxxxxxxxxxxxxxxxxxx
Notice that the server does not have to JUST blindly trust the point in time given by the client.
The client's clock could be complete nonsense, and we guard against that in two ways:
- if the Tcd point in time we got is already in the past -> we can immediately cancel and not even begin work on this request; the client's clock is out of whack (we can allow some tolerance threshold here etc...)
  - again... those deadlines are best-effort and generous -- not "100ns"
- if Tcd is way too far in the future, we bound it with whatever the server considers a max timeout for a request.
  - This makes sense because services usually have some SLO (service level objective) for how fast they should be able to reply to requests. So if the SLO is 400ms, we can set this to "well, if this is taking longer than our SLO, definitely cancel the work" (and alert that we're missing our objectives).
So... Does anyone do this? Yeah, it's common in Go services:
The same applies equally to plain asynchronous tasks locally -- you want to set a specific deadline by which tasks should be completed, not just "5 seconds", which doesn't mean anything, as shown by @Dante-Broggi's example. E.g. every time you enter a function with "timeout 5 seconds", it starts a fresh 5 seconds... even if it's already many seconds past "5 seconds from the beginning of the first task".
Also, in Swift Concurrency we'll want to do the same thing:
```swift
// PSEUDO CODE, NOT A PROPOSAL
task.deadline = .now() + .seconds(5)
// if the deadline was already, say, .now() + .seconds(1) -> DON'T UPDATE THE DEADLINE
```
If this task, or the parent task, already had a deadline set and it is earlier than this new one -- we'd NOT extend the deadline. And that's a feature.
As for the questions on "can we trust clocks"...
You have this reversed: datacenters are the ones with very well synchronized clocks, vastly better network hardware, and (more) predictable latencies, especially compared to devices on random flaky networks.
Since we're nerding out here a bit... here's some more fun reading about advanced clock synchronization systems. They're only used by specialized applications, but just FYI that NTP isn't the "endgame" for these things:
Clock synchronization techniques have had an exciting renaissance recently, ever since Google's TrueTime. Check out Spanner's TrueTime [1], Amazon TSS [2], and Sundial [3]. Those can get clocks synchronized down to hundreds of nanoseconds.
But anyway, not all systems need such tightly synchronized clocks at all (Spanner needs them because it uses them to commit transactions). NTP gets you to around ~100ms synchronization AFAIR (again, don't trust client devices though), so for "plain old" service stuff it's honestly quite enough...
Fun reading... none of this is a requirement for "normal boring 1 second resolution" deadlines for best effort request cancellation, but it's a very nice read: