Pitch: [stdlib] Error recovery hook, PR #12025

Hi S/E,

I’ve raised a rather speculative PR suggesting a hook be added to stdlib
that would allow users to experiment with error recovery in their apps.
I’ve been asked to put it forward on Swift Evolution to gather opinions
from the wider community about such a design.

https://github.com/apple/swift/pull/12025

Ultimately, it comes down to being able to do something like this:

            do {
                try Fortify.exec {
                    var a: String!
                    a = a!
                }
            }
            catch {
                NSLog("Caught exception: \(error)")
            }

This was primarily intended for use in "Swift on the server" but could also
help avoid crashes in the mobile domain. The rationale and mechanics
are written up at the following URL:

http://johnholdsworth.com/fortify.html

I'll accept this won’t be everybody’s cup of tea, but at this stage this is
only an opt-in facilitating patch. Developers need not subject their apps
to this approach, which requires a separate experimental implementation.

The recovery is reasonably well behaved, except that it will not recover
objects and system resources used in intermediate frames. It’s not
as random as something like, for example, trying to cancel a thread.

The debate about whether apps should try to soldier on when something
is clearly amiss is a stylistic one about which there will be a spectrum of
opinions. The arguments weigh more in favour in the server domain.

Over to you,

John

Thanks for raising this topic! Graceful partial recovery is important and useful, and although we've tended to invoke "actors" as the vague savior that will answer all the questions in this space, I think we can provide useful functionality with a smaller scope that won't interfere with future directions. At a language level, questions to answer include:

- What can we guarantee about the process state after a trap?
- What does the interface for setting up a trap handler look like?

Instead of thinking of a trap as completely invalidating and ending the program like we do today, we can think of it as deadlocking the current execution context (setting aside for a moment the question of what "execution context" means), as if it got stuck in an infinite loop. As you noted, this means we can't reclaim any memory, locks, or other resources currently being held by the trapped context, but other contexts can continue executing. In fantasy actor land, the definition of "execution context" would ideally be "current actor"; in the world today, we have a few choices. We could say that a trap takes down the current thread, though that might be a bit too much for single-threaded or workqueue-based architectures. Another alternative is to delimit the scope affected by a trap with a setjmp/longjmp-like mechanism, sort of like what you have, though that then requires care to ensure that state "above" the trap line isn't entangled with the invalidated state "below" the trap line.

That leads into the question of what the interface for handling a trap should look like. Personally, I don't think trying to turn fatal errors into exceptions is the right answer, since that makes it way too easy to do the kinds of harmful things people do with Java runtime errors, SEH, etc. to swallow and ignore serious problems. I think it'd be better to have an interface that's clearly tuned toward supervisory reaction to unexpected failure, rather than one for routine handling of expected errors. Turning traps into catchable errors also potentially creates safety problems for the ownership model. It's tempting to think of the block passed to your `Fortify.exec` as nonescaping, but that's problematic with inouts:

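  var x: Int = 0
  do {
    // Hypothetical API, equivalent in spirit to `Fortify.exec`.
    try execWhileTurningTrapsIntoErrors {
      foo(&x)
    }
  } catch {
    // Runs although foo's inout access to x was never formally ended.
    print(x)
  }

  func foo(_ x: inout Int) { fatalError() }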

The compiler will reason that x is statically exclusively held only during the call to `foo`, but that's not really the case—foo trapped and deadlocked in the middle of the access, and we essentially left it hanging and went and ran our catch handler with the inout access still active.
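
For contrast, here is a minimal sketch of the same shape where the callee fails by throwing instead of trapping. The names `Failure`, `bump`, and `counter` are made up for illustration. Because a throw unwinds through the callee, the inout access ends before the catch block runs, which is exactly the guarantee a trap-turned-error would not give:

```swift
struct Failure: Error {}

// Mutates its inout argument, then fails by throwing rather than trapping.
func bump(_ value: inout Int) throws {
    value += 1
    throw Failure()
}

var counter = 0
do {
    try bump(&counter)
} catch {
    // The throw unwound through bump, so its inout access has ended
    // and the mutation made before the throw is visible here.
    print(counter)  // prints 1
}
```

With a trap there is no such unwinding step, so nothing ends the access before the handler observes the variable.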

If we were to say that a trap takes down the current thread, then we could have a signal-like interface for installing a handler that's the last thing to run on the thread before taking it down, like this:

func ifTrapOccursOnCurrentThread(_ handler: @escaping () -> ())

ifTrapOccursOnCurrentThread {
  supervisor.notifyAboutTrap(on: pthread_self())
}
doStuff()
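
To make the intended control flow concrete, here is a rough simulation of the thread-scoped handler, not an implementation of it. Everything beyond the name `ifTrapOccursOnCurrentThread` is an assumption: the `Supervisor` type, `TrapHandlerBox`, the dictionary key, and `simulatedTrap()`, which stands in for a real trap by running the registered handler and then parking the thread forever. It assumes Swift 5 language mode; stricter concurrency checking would want the shared state restructured.

```swift
import Foundation

// Hypothetical supervisor: it learns which thread trapped, never why.
final class Supervisor {
    private let lock = NSLock()
    private var names: [String] = []

    func notifyAboutTrap(on threadName: String) {
        lock.lock(); defer { lock.unlock() }
        names.append(threadName)
    }

    var trappedThreads: [String] {
        lock.lock(); defer { lock.unlock() }
        return names
    }
}

// Reference box so the handler can be stored in the thread dictionary.
final class TrapHandlerBox {
    let handler: () -> ()
    init(_ handler: @escaping () -> ()) { self.handler = handler }
}

let trapHandlerKey = "example.trapHandler"  // hypothetical key

// Registers the handler for the calling thread only.
func ifTrapOccursOnCurrentThread(_ handler: @escaping () -> ()) {
    Thread.current.threadDictionary[trapHandlerKey] = TrapHandlerBox(handler)
}

// Stand-in for a real trap: run the handler as the last useful work on
// this thread, then never return, as if the thread had deadlocked.
func simulatedTrap() -> Never {
    if let box = Thread.current.threadDictionary[trapHandlerKey] as? TrapHandlerBox {
        box.handler()
    }
    while true { Thread.sleep(forTimeInterval: 60) }
}

let supervisor = Supervisor()
let notified = DispatchSemaphore(value: 0)

let worker = Thread {
    ifTrapOccursOnCurrentThread {
        supervisor.notifyAboutTrap(on: Thread.current.name ?? "unnamed")
        notified.signal()
    }
    simulatedTrap()
}
worker.name = "worker-1"
worker.start()

// The main thread is a separate execution context and keeps running.
notified.wait()
print(supervisor.trappedThreads)
```

The worker never releases anything it held, which matches the "deadlocked context" model: the supervisor can only react, not clean up on the worker's behalf.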

A scoped handler could still be made to work, with an interface something like this:

func run(_ body: @escaping () -> (), withTrapHandler: () -> ())

run({
  doStuff()
}, withTrapHandler: {
  supervisor.notifyAboutTrap(on: pthread_self())
})

The `@escaping` annotation on the body would prevent the compiler from making invalid static assumptions about lifetimes in the body that would be violated if it traps. IMO, the handler block also shouldn't receive any information about the trap other than that one happened—the runtime ought to handle logging the reason for a trap, and anything you do to plan your shutdown or continue running unrelated subtasks should have no other business knowing why the trap occurred.

At an implementation level, enabling trap handling would also require us to standardize on an ABI for handlable runtime errors. We currently don't have a standard mechanism here. Failures don't necessarily funnel through any fixed set of runtime entry points; the compiler also directly generates @llvm.trap() calls, and LLVM doesn't make any guarantee about how llvm.trap is implemented. It's also an open question whether things like C's abort(), null pointer dereferences, segfaults, etc. should be treated as runtime failures that can be handled by this mechanism.
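
As a point of reference for what is observable today, here is a small sketch using the POSIX signal API from Swift. It only covers `SIGABRT`, the signal that C's `abort()` raises; which signal, if any, a compiler-generated trap produces depends on how `llvm.trap` is lowered on the platform, so that is deliberately not assumed here. The `SignalLog` type is made up for illustration, and the signal is delivered with `raise` so the example can keep running afterwards.

```swift
import Foundation

// A C signal handler cannot capture Swift context, so the flag is static.
enum SignalLog {
    static var sawAbort: Int32 = 0
}

// Install a handler for SIGABRT. A production handler should restrict
// itself to async-signal-safe operations; this one only sets a flag.
signal(SIGABRT) { _ in
    SignalLog.sawAbort = 1
}

// Deliver the signal directly instead of calling abort(), which would
// still terminate the process once the handler returned.
raise(SIGABRT)
print(SignalLog.sawAbort)  // prints 1
```

Note that the handler learns only a signal number, and a real `abort()` still ends the process after the handler returns, which is part of why a standardized ABI for handlable failures would be needed.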

-Joe

On Sep 21, 2017, at 12:14 AM, John Holdsworth via swift-evolution <swift-evolution@swift.org> wrote:
