[Pitch] Syntactic tail calls (STC)

Motivation

This is going to be a bit long, but bear with me, because this is the important part.

This use case came to me while reading the interpreter code of WAMR (WebAssembly Micro Runtime), and wondering whether a similar interpreter could be implemented in Swift.

WAMR's interpreter actually comes in 3 versions, or rather, "speeds".

The "slowest speed" is a very basic while-and-switch. It loops infinitely (or rather, until the program exits), reading the opcode at the current instruction pointer, and performing a switch to jump to the implementation of the opcode. Then it loops and processes the next opcode. A simplified version of it would look like this (Some, but not all, macros expanded for brevity):

    while (frame_ip < frame_ip_end) {
        opcode = *frame_ip++;
        switch (opcode) {
            case WASM_OP_F64_MIN:
            {
                float64 a, b;

                b = POP_F64();
                a = POP_F64();

                PUSH_F64(f64_min(a, b));
                continue;
            }

A port to Swift, or any other language, would be trivial.
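
For illustration, a minimal sketch of such a port, with a toy opcode set and a plain array as the value stack (the Opcode enum, run function, and byte encoding here are hypothetical stand-ins, not WAMR's actual types):

```swift
// Toy opcodes: 0 pushes the constant 1, 1 adds the top two values, 2 halts.
enum Opcode: UInt8 { case push1 = 0, add, halt }

func run(_ code: [UInt8]) -> Int {
    var ip = 0
    var stack: [Int] = []
    // The same while-and-switch shape as WAMR's basic interpreter.
    while ip < code.count {
        let opcode = Opcode(rawValue: code[ip])!
        ip += 1
        switch opcode {
        case .push1:
            stack.append(1)
        case .add:
            let b = stack.removeLast()
            let a = stack.removeLast()
            stack.append(a + b)
        case .halt:
            return stack.removeLast()
        }
    }
    return 0
}

// run([0, 0, 1, 2]) evaluates 1 + 1
```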

The faster version uses non-standard features of certain C compilers (namely, GCC and Clang) called "labels as values" and "computed goto". First, instead of switch cases, opcodes are organized into goto labels:

HANDLE_WASM_OP_F64_MIN:
            {
                float64 a, b;

                b = POP_F64();
                a = POP_F64();

An array holds a mapping of opcodes to labels:

static const void *handle_table[256] = {
  &&HANDLE_WASM_OP_UNREACHABLE,
  &&HANDLE_WASM_OP_NOP,

Then, the switch is replaced with a goto referencing labels from the array:

goto *handle_table[*frame_ip++];

So far, this is just a glorified switch with a very specific implementation. The real change is that, at the end of each opcode's handler, instead of looping back and switching again, it jumps directly to the next instruction's handler:

HANDLE_WASM_OP_F64_MIN:
            {
                float64 a, b;

                b = POP_F64();
                a = POP_F64();

                PUSH_F64(f64_min(a, b));
                goto *handle_table[*frame_ip++];
            }

This is essentially a glorified reswitch, a concept that has already come up a few times. Additionally, this is absolutely something a Sufficiently Smart™ optimizer could do: detect the while-and-switch pattern, and convert the continue into a reswitch.

So far, nothing that warrants a new language feature. But it's the third, "fastest" speed that is the interesting one. This version expands on the use of labels as values: instead of merely having a static constant array, it creates a dynamic array that directly maps instruction pointer locations to labels, without having to go through an integer opcode in between (some macros and methods expanded for brevity):

    while (p < p_end) {
        opcode = *p++;
        p_org = p;
        disable_emit = false;
        do {
            uint32 label_addr = (uint32)(uintptr_t)handle_table[opcode];
            /* emit uint32 label address in 32-bit target */
            if (ctx->p_code_compiled) {
                STORE_U32(ctx->p_code_compiled, label_addr);
                ctx->p_code_compiled += sizeof(uint32);
            }
            else {
                increase_compiled_code_space(ctx, sizeof(uint32));
            }
            LOG_OP("\nemit_op [%02x]\t", opcode);
        } while (0);

The actual opcode handling then becomes:

HANDLE_WASM_OP_F64_MIN:
            {
                float64 a, b;

                b = POP_F64();
                a = POP_F64();

                PUSH_F64(f64_min(a, b));
                do {
                    const void *p_label_addr = *(void **)frame_ip;
                    frame_ip += sizeof(void *);
                    goto *p_label_addr;
                } while (0);
            }

This technically only saves a single memory lookup per instruction. But that tiny saving adds up quickly, because it is paid once for every instruction the interpreted code executes.

So, when considering whether this algorithm could be implemented in a high-level language like Swift, the question arose of how to replace "labels as values" as a feature. The obvious answer was "functions as values":

// Can't use a typed pointer here, because the function type would recursively reference its own parameters
func wasm_op_f64_min(frame_ip: UnsafeRawPointer, stack: WasmStack) -> Void {
    let b: Double = stack.pop_f64()
    let a: Double = stack.pop_f64()
    stack.push_f64(f64_min(a, b))
    let label_addr = frame_ip.assumingMemoryBound(to: ((UnsafeRawPointer, WasmStack) -> Void).self).pointee
    let next_frame_ip: UnsafeRawPointer = frame_ip + MemoryLayout<(UnsafeRawPointer, WasmStack) -> Void>.stride
    // But this is a call instead of a goto
    label_addr(next_frame_ip, stack)
}

The difference between a function and a label, however, is that a function is called, not jumped into, which creates a call stack. If this repeats every instruction, this quickly causes runaway memory allocation, or simply crashes with a stack overflow.

An alternative might be to use a trampoline: return label_addr and next_frame_ip instead of making the call. But doing so forfeits the performance advantage of the saved memory lookup, by adding an entire stack frame allocation and deallocation in its place. What is needed is a way to explicitly perform a goto into a function instead of a call.
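
A sketch of that trampoline shape, with hypothetical names (Step, countUp, trampoline) standing in for the interpreter's handlers:

```swift
// Each handler returns the next step instead of jumping to it;
// a single driver loop keeps the stack depth constant.
enum Step {
    case next(() -> Step)
    case done(Int)
}

func countUp(_ n: Int, to target: Int) -> Step {
    if n >= target { return .done(n) }
    return .next { countUp(n + 1, to: target) } // returned, not called
}

func trampoline(_ start: Step) -> Int {
    var step = start
    while true {
        switch step {
        case .next(let f): step = f() // one call per "jump", popped immediately
        case .done(let v): return v
        }
    }
}

// trampoline(countUp(0, to: 100_000)) completes without stack overflow
```

Note that every "jump" now costs an enum value plus a closure call through the driver, which is exactly the overhead the pitch is trying to avoid.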

Proposal

The suggestion is to add a new syntax for performing a goto instead of a call. There are various options for the exact keyword used, but for simplicity, the remainder of this pitch will use return_call, named after the matching WebAssembly opcode.

Naturally, this syntax carries some restrictions.

The callee function must match the caller's return type:

func callee_void() -> Void {
}

func callee_int() -> Int {
    return 123
}

func good_caller_void() -> Void {
    return_call callee_void() // Unlike return, this is usable even if the return type is void
}

func good_caller_int() -> Int {
    return_call callee_int() // Returns 123
}

func bad_caller_void() -> Void {
    return_call callee_int() // Error: Cannot discard the value, not even if @discardableResult is used
}

func bad_caller_int() -> Int {
    return_call callee_void() // Error: Cannot create a value, either
}

func bad_caller_int_manip() -> Int {
    return_call callee_int()+1 // Error: Cannot manipulate the value after it is returned
}

The callee must have the same throwing behavior as well:

enum ExampleError: Error {
    case mayThrow
}

func callee_nothrow() -> Int {
    return 123
}

func callee_maythrow() throws -> Int {
    if Bool.random() {
        return 123
    }
    throw ExampleError.mayThrow
}

func good_caller_nothrow() -> Int {
    return_call callee_nothrow()
}

func good_caller_maythrow() throws -> Int {
    return_call callee_maythrow() // No need to use try. The error is implicitly forwarded
}

func bad_caller_nothrow() -> Int {
    return_call callee_maythrow() // Error: Cannot discard the throw
}

func bad_caller_maythrow() throws -> Int {
    return_call callee_nothrow() // Error: Unlike a regular return, it's not possible to add the possibility of an exception
}

func bad_caller_maythrow_try() throws -> Int {
    return_call try callee_maythrow() // Error: Cannot use an expression. try is implicit when throwing, and should not be added
}

func bad_caller_nothrow_try() -> Int {
    return_call try! callee_maythrow() // Error: Cannot manipulate the return value in any way
}

Adding to the above, it's not possible for the caller to catch an error thrown by the callee. If return_call is wrapped in a do-catch, the do-catch is ignored:

enum ExampleError: Error {
    case shallThrow
}

func callee() throws -> Void {
    throw ExampleError.shallThrow
}

func caller() throws -> Void {
    do {
        return_call callee()
    } catch {
        print("Caught in caller")
    }
}

do {
    try caller()
} catch {
    print("Caught in main")
}

The above would print Caught in main.

Likewise, unlike a regular return, any defer blocks in a function that performs a return_call are executed before the callee is invoked:

func callee() -> Void {
    print("1")
}

func caller() -> Void {
    print("2")
    defer {
        print("3")
    }
    print("4")
    return_call callee()
}

print("5")
caller()
print("6")

The above would print

5
2
4
3
1
6

The same applies to releasing local references. However, because not every release results in a deinit, examples related to ARC are left as an exercise for the reader.

Difference from Tail Call Optimization (TCO)

One might consider avoiding the addition of a new syntax, and simply transforming any return callee() into a return_call. However, this approach has several disadvantages:

  1. An optimization is not a guarantee. Optimizations can be turned off, while a given algorithm may not work unless a return_call is used.
  2. Even if the optimization is made mandatory starting from a given Swift version, a program relying on it will compile without warning on an earlier version of Swift, only to fail at run time. When compiling a program that relies on a later version of Swift using an earlier one, it is better to receive a syntax error at compile time than to hit an unexpected error at run time.
  3. TCO, being an optimization, has to preserve the semantics of return. For instance, it cannot apply in the presence of defer or do-catch. Worse, it would "fail" to apply without issuing any warning or error at compile time; the program would instead fail unexpectedly at run time.
  4. Likewise, TCO does not play well with ARC. The mere presence of a local reference variable makes reference handling that much more complex, if the release cannot occur after the callee returns.
  5. While return_call takes less memory than return, it is not guaranteed to be faster. As such, it should be up to the developer to decide when to use it, and when not to.
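
To illustrate point 3, a sketch (with a hypothetical countdown function) of how a defer silently blocks TCO:

```swift
var cleanups = 0

func countdown(_ n: Int) -> Int {
    // The defer must run *after* the recursive call returns, so every
    // level's stack frame stays alive until its defer executes — the
    // compiler cannot legally turn this return into a jump.
    defer { cleanups += 1 }
    if n == 0 { return 0 }
    return countdown(n - 1) // looks like a tail call, but is not one
}

// countdown(5) returns 0 and runs six deferred cleanups, one per frame
```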

Alternative syntax

Other possible names for the new syntax:

tail return
tail_call
tailcall
return continue
goto

Parameters

There is no semantic reason why the callee and caller would be required to have the same parameters. But it may simplify implementation if they are. This aspect is open for debate.

5 Likes

Reliable tail calls in Swift would definitely require an explicit tail call syntax. However, to support cross-function and non-sibling tail calls, we would also need to know that a callee is intended to be tail-callable, since the usual platform calling convention ABI is not amenable to guaranteeing tail calls. With a @tailCallable attribute on the callee, we would be able to emit the callee with a convention that can guarantee tail calls. Maybe we can infer when a function is the callee of a tail call for local functions that are tail-called in the same file, but it would need to be explicit if a module wants to export public API as tail callable.

It'd be interesting to think about how a reswitch feature could be made to lower to the sort of array address lookup you're hoping for. It seems like basically you want the enum's case representations to directly be the case label addresses you want to jump to. If the enum was defined locally in the same scope as the single switch that dispatches over it, maybe an optimization pass could do that transformation.

9 Likes

I don't understand why the ABI of the callee is relevant. Rather, the caller should just deallocate its own stack frame, and then make the call in the context of the parent, just as if the parent had implemented a trampoline.

This still leaves a stack frame allocation and deallocation, but this can be optimized, on platforms that support it, to a goto that skips the stack frame allocation in the callee. This is where forcing the same parameter list can help implementation, because it means the caller's stack frame can be reused as-is. Still, this is, to a certain extent, an implementation detail.

Obviously, adding a @tailCallable attribute makes things much simpler. The callee simply never allocates a stack frame, and can only be called via goto, thereby making it, for all intents and purposes, a label.

As for doing it using reswitch, I don't think it's doable. The reason is the disconnect between the storage of the opcodes (as labels) and the actual switch. Even if you represent the opcodes as an enum, and even if you allow the optimizer to change the enum's representation from an integer type to code addresses, it could only work under the assumption that the switch is the only switch in the entire program using the enum. Otherwise, it has no way to obtain the addresses.

Remember that the translation from opcodes to addresses is done outside the switch, potentially in a completely separate file (It's certainly a separate file in WAMR). So without some sort of explicit syntax to indicate "this is the switch from which addresses should be taken", there's no way to bridge the two.

P.S. Even if tail callable methods have to be constructed differently from normal ones, I don't think there's a reason why they'd be exported across module boundaries. Since they are, to a certain extent, a goto-equivalent, they are best used in a small contained algorithm, and not allowed to leak.

P.P.S. I should probably have added some consideration of how any of this interacts with async. I do think that being able to "delegate" the return to a separate function without awaiting it is potentially useful, even if I don't have an explicit use case off the top of my head.

1 Like

It seems like you want guarantees beyond just (essentially) tail calling…? Because any use of the stack is going to be relatively expensive at the scale of simulating individual processor instructions (e.g. register spills, function arguments). You'd be heavily reliant on the register allocator and compiler in general to minimise stack usage (ideally to the point of ignoring ABI conventions and using distinct sets of registers for "instructions" that are frequently executed sequentially).

(tangentially, I'd love to see the general optimisation of non-overlapping register sets between call-connected functions - I realise stack pushes & pops are relatively cheap, but it still makes me sad seeing bajillions of them even in "fully" optimised code, when it's obvious that every function is basically using the same three or four and ignoring the other twenty-something available GPRs)

More broadly, what's the benefit here vs dynamic code generation, e.g. a JIT compiler? Surely it'd be much faster, in any case, to just compile to the real machine code?

One fundamental problem is that arguments beyond those passed in registers are passed on the stack caller-pop, like C. A callee is allowed to reuse the argument area it received to make another call such as a tail call, but that's only possible when the new callee uses the same amount of stack space or less (a "sibling call" as GCC calls it). Otherwise you would need to allocate more stack space for arguments, but there's no way for the tail-caller's caller to know it needs to pop the added argument space. To guarantee tail calls, we'd want to switch to a Pascal-style callee-pop convention.

Swift also passes large values, or values of unknown size (such as those of generic type) indirectly using a pointer to a value, which is allowed to be anywhere in memory, including the middle of the caller's frame. We would have to ensure that such indirectly-passed arguments are passed on the stack in the argument area, so that they can be callee-popped by the tail callee.

You noted ARC as a potential tail call hazard, but it's a manageable one when we know a function is intended to be the subject of a tail call. Switching the default ownership passing for arguments to consuming would ensure that the cleanup responsibility of any values used as arguments is passed on to the callee. Tail-callable functions could still take explicitly borrowing parameters, maybe, but they would have to have been passed down from the tail-caller's caller.

5 Likes

Ability to run on iOS.

Alternately: Forward compatibility with processors not yet invented, which Swift/LLVM is likely to support, without needing to change anything in your own code other than updating Swift/LLVM and recompiling/cross-compiling to the new architecture.

Outside the EU, specifically? Given the new provisions for custom browser engines (i.e. dynamic code generation).

I'm hopeful Apple will generalise that feature to the world.

But to clarify, you're saying you want to run WASM binaries in a custom app / engine on iOS, not through Safari? Can you elaborate on why; if you're using a custom app anyway, why not just compile to native code?

(not trying to argue your pitch here, I'm just generally curious - I'm not all that familiar with WASM beyond the concept)

So, wait, are you trying to write a WASM interpreter or WASM compiler? Because if you're just executing WASM binaries, what does LLVM's support for other architectures have to do with it?

That sounds like a potential argument for forcing the same argument list for the callee as the caller. For the specific use case I described, that's not an issue. But it might be an issue for other use cases I did not consider, which would make a callee-pop (stdcall-style) convention necessary.

WASM, specifically, is useful as a sandboxing tool for third-party code/plugins. i.e. You can have a "main program", which you compile, which can load libraries created by third parties (e.g. by downloading them off the internet, or reading from SD card), and run them without compromising the security of your app.

But this also applies to languages other than WASM, e.g. Lua, Python, GDScript, etc. Obviously, you could work around the whole issue by using an interpreter written in C as a library. But if you extend that argument, you could, in theory, write your entire app in C, which defeats the purpose of having Swift in the first place.

When you strip it down to its basics, WASM is like a DLL written for a processor that doesn't exist. It has imports and exports, but no inherent functionality other than being loaded by the host. (Although a "standard" host like WASI can impart such inherent functionality) Nor is there anything web-specific about it. It can be loaded from any application, and used for any purpose, provided you have an interpreter/VM for aforementioned "processor that doesn't exist".

Swift uses LLVM as a compilation backend. While the implementation is still in its early stages, there is work towards creating packages that would allow targeting any LLVM triple from any Swift-running host (currently only the Linux-target, macOS-host combination is implemented).

Extending that logic further, if a future processor architecture becomes mainstream, it is quite likely that LLVM would support it as a target, and Swift would as well, allowing the compilation of the exact same program, unmodified, to the new processor. As evidence: this is already the case for C (provided one stays within the confines of the standard library). Programs written in C decades ago can be compiled and executed on modern architectures (e.g. x64, ARM) that did not exist at the time they were written.

Just for the sake of thought, what if you had an attribute on the enum that specified what function it takes on the label addresses for, like:

@labelEnum(for: Interpreter.run(_:))
enum Opcode {
  case add
  case sub
}

extension Interpreter {
  func run(_ opcodes: consuming UnsafeBufferPointer<(Opcode, Int8)>) {
    switch opcodes.popFront() {
    case (.add, let operand):
      self.value += operand
      reswitch opcodes.popFront()
    case (.sub, let operand):
      self.value -= operand
      reswitch opcodes.popFront()
    }
  }
}

Within the associated function, we'd export the label symbols for the switch so they can be used as the enum's representations, and within that switch dispatch on the enum can be done by indirect jumps on the enum representation. You could still use the enum elsewhere, though switching anywhere else would have to do the case-by-case comparison against label values of course.

1 Like

Hmm, associating it with a function rather than a specific switch is still a tad iffy. What are the constraints for such a function? Can it be any function that contains exactly one switch over the enum type?

It does create an advantage in terms of keeping local variables. But I'm wondering at what point it becomes better to just have a dedicated state machine type/construct.

I'm also wondering if the deconstruction .add, let operand can work. It's certainly useful, but the .add is implemented by a goto address (and, hence, must be unique). How is the let operand implemented? i.e. how does it "pass" it when jumping to .add?

Yeah, I think this is the way to go. You don't need to say that this is the only switch in the program over the enum; you just need to say that there's exactly one switch, defined somewhere in the current module, that's optimized by it. Other switches can work by just mapping the label back to a small integer using a lookup table (ideally sorted by address, although I don't think you can currently request LLVM to keep goto labels in a particular order in the function).

For what it's worth, if you're writing this kind of engine not just as a personal toy, you really need to be using some sort of CFI on the goto labels (such as pointer authentication) so that memory bugs in your engine aren't trivially exploitable. That protection can eat up a lot of the benefit of computed goto. If you use pointer authentication, you do still get the memory optimization benefits, and in principle the processor can hide the latency of authenticating the branches. That authentication work does still have to happen, though, and if you jump between a lot of small cases in a row, you can exhaust the processor's ability to schedule authentications and force a stall.

1 Like

I think if an attacker is able to edit the compiled code memory, they already have access to almost everything. Even without editing to something invalid (Jump to something that's not an opcode case), they can alter the functionality of the module. If the OS supports it, the compiled code memory can be "locked" after it's initially written, so it cannot be edited by an attacker. The OS can also limit execution of arbitrary code by requiring code-containing memory pages to be marked for execution, or outright disallow the program to create executable pages (cough iOS cough). Either option at least limits jump targets to existing code locations.

Honestly, all of these issues exist in JIT as well. And a variety of JITs operate in the wild quite reasonably without having per-instruction checks.

Having one switch, which must involve the enum, seems like a reasonable restriction to start with, yeah.

I think it can be made to work. When we emit a compound pattern match, we recursively select a subpattern to dispatch on and divide up the pattern space. We'd need a restriction that the top-level dispatch in the pattern match has to be the @labelEnum enum in order for the optimization to be possible. That should be possible though even with compound pattern matches, such as when the label enum appears in a tuple, or even if the label enum associates payloads directly with the cases. For an enum with associated values we'd use a representation where the tag takes up a full pointer to store the label alongside the payload value.

2 Likes

It's unclear to me whether by "compiled code" you're referring to the statically-compiled code in your program text segments or the instruction listings your interpreter creates. No modern environment allows static program text to be directly written. JITs can create new program text, and a properly-written JIT has to treat that power with the seriousness it deserves; it certainly should not leave executable pages writable unnecessarily. Treating your instruction listings with a similar level of precaution would be wise; obviously they don't need and should never have execute permissions, but mapping them read-only after creation is definitely a good idea.

Security research is quite clear that this is a very weak form of protection. Even limiting jumps to the normal entry point of a function is pretty weak in practice; attackers can usually find useful sequences of functions to call.

These considerations absolutely all apply to JITs. In the year 2024, writing a JIT (or really any execution engine) for public use without taking exploit mitigation seriously is actively irresponsible.

Since you keep bringing up iOS, I will point out that CFI protection is an iOS platform requirement for JITs. I don't know off-hand if that policy currently applies to sub-JIT execution engines relying on things like computed goto, but it should, and I would strongly discourage anyone writing an interpreter from trying to sneak through the requirements.

3 Likes

Are there precedents in other languages?

Should it be an attribute on a function or a call site indication? The latter seems more general:

func foo(level: Int) -> Int {
    if condition { return level }
    else if anotherCondition {
        return 2 * foo(level: level + 1) // non tail call here
    }
    return tail_call foo(level: level + 1) // ✅
}

func bar() -> Int {
    if condition { return 1 }
    else if anotherCondition {
        return 2 * bar() // non tail call here
    }
    return 1 + tail_call bar() // 🛑 Can't use tail call here
}

Off the top of my head, in Perl 5, goto &subroutine does a tail call to subroutine. Scala only supports self tail recursion and uses @tailrec on the function declaration to enable it.

To integrate it into Swift, I think you need both. The function attribute is necessary at least in some situations to ensure the function implementation is emitted in a way that tail call guarantees are possible. But even if you call a marked function, I think you'd want a marker on the return to ensure it's a proper tail call, since Swift has features that introduce implicit behavior after a return, such as defer. If the tail call involves borrowed or inout parameters, a tail return would also enforce those parameters come from the tail-caller's caller or somewhere else not dependent on the tail caller context, whereas a normal return would extend those parameters' lifetime across the call if necessary.

1 Like

Do we need a new keyword, or can the compiler deduce the desire (especially if the enum has a decoration pointing to the containing function)? Because the above seems logically equivalent to (with apparent typos fixed too):

extension Interpreter {
  func run(_ opcodes: consuming UnsafeBufferPointer<(Opcode, Int8)>) {
    while true {
      switch opcodes.popFront() {
      case .add(let operand):
        self.value += operand
      case .sub(let operand):
        self.value -= operand
      }
    }
  }
}

Hmm, in this small example, yes. But it also makes a few restricting assumptions. First, it assumes there's no code after the switch, i.e. that the switch would always reswitch, and never break. But this part can be circumvented using labeled break on the while, so it's fine.

More importantly, it assumes the parameter to the switch/reswitch is always computed the same way. That doesn't really lend itself to state machines without holding a state variable. In the interpreter case, doing opcodes.popFront() works for 99% of instructions, but makes branching, calls, and returns somewhat harder to express.

I suppose an alternative is to have a state variable that is only read in the switch, and only written right before exiting a case (or continuing the loop), and have the compiler optimize it out. This does make things quite a bit harder for the compiler.
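
A sketch of that state-variable shape (the Opcode enum and run function here are hypothetical, simplified from the earlier example):

```swift
enum Opcode { case add(Int8), sub(Int8), halt }

func run(_ opcodes: [Opcode]) -> Int {
    var value = 0
    var ip = 0
    var state = opcodes[ip]  // written only here and at the end of each case...
    loop: while true {
        switch state {       // ...and read only here
        case .add(let operand):
            value += Int(operand)
            ip += 1
            state = opcodes[ip]
        case .sub(let operand):
            value -= Int(operand)
            ip += 1
            state = opcodes[ip]
        case .halt:
            break loop
        }
    }
    return value
}

// run([.add(5), .sub(2), .halt]) evaluates 5 - 2
```

The hope would be that the optimizer folds state away into a direct jump, which, as noted, is asking quite a lot of it.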

This approach also gave me a couple of extra thoughts: Would it be possible to have multiple switches for different @labelEnum enums in the same function? Could they be cascaded? i.e., could you have one inside the other, and perform a labeled reswitch?

1 Like

Strange thought: Maybe you can forgo the return completely, and make a "tail call only" function behave like a function that returns Never. As in, the mere call to it, without a return prefix, is sufficient for the compiler to treat the call site as an exiting block, allowing it as the only operation in a guard, or issuing "unreachable code" for anything after it.
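
For reference, this is roughly how Never-returning functions already behave in that position today; a sketch with hypothetical fail and firstPositive functions (the pitch would extend the same treatment to tail calls):

```swift
func fail(_ message: String) -> Never {
    fatalError(message)
}

func firstPositive(_ values: [Int]) -> Int {
    guard let v = values.first(where: { $0 > 0 }) else {
        // A Never-returning call is accepted as the sole guard body,
        // because the compiler knows control cannot fall through it.
        fail("no positive value")
    }
    return v
}
```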

I also had another thought. If tail-callable functions are stdcall (callee-pop), what happens when the tail caller is cdecl (caller-pop)? The callee can fully remove its own stack frame on return (or prior to an additional tail call), but the caller can't remove its own arguments from the stack prior to the tail call, since those are managed by the caller's caller. I suppose the memory save only applying from the second call onward is not too bad, since a non-O(1) memory save requires the tail callee to make an additional tail call anyway.

1 Like

Syntax suggestions for tail calls:

func bar() -> Int {...}
func foo() -> Int {
    ...
    return tail bar()
}

func bar() {...}
func foo() -> Int {
    ...
    tail bar()
    return 1
}

func bar() -> Int {...}
func foo() {
    ...
    tail bar()
}