Amend/Augment SE-0322: Temporary Uninitialized Buffers to return nil if heap allocation is required

Hi,

I've got a use-case for SE-0322, but as accepted, the API doesn't quite provide what I need.

As accepted, withUnsafeTemporaryAllocation will stack-allocate if possible, and otherwise fall back to the heap. The issue is that sometimes, if stack allocation isn't possible, you'd rather not allocate at all. This can happen if, for example, you are writing the contents of a lazy collection to a contiguous one.
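
For reference, a call site under the accepted API looks roughly like this (the capacity and the work inside are made up; the point is just that the closure always receives a buffer and can't tell where it lives):

// SE-0322 as accepted: the closure always gets a buffer; stack vs. heap is invisible here.
let total = withUnsafeTemporaryAllocation(of: UInt8.self, capacity: 64) { (scratch) -> Int in
  scratch.initialize(repeating: 0)
  // ... use `scratch` as short-lived workspace ...
  return scratch.count
}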

Example Use-Case

Take percent-encoding and -decoding for example. These are horribly inefficient ways of escaping byte sequences - encoding a byte sequence may require as much as 3x the original storage size, and decoding may produce a collection with as few as 1/3 the bytes of the source. Currently, WebURL measures the exact size of the result, requiring one round of encoding/decoding, before allocating a String using String(unsafeUninitializedCapacity:) and writing it in a second pass.
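
Roughly, the current shape is a measure pass followed by a write pass - something like the following, with made-up helpers rather than WebURL's actual code:

// Sketch of the current two-pass approach (hypothetical helpers, not WebURL's real code).
func percentEncoded<Bytes: Collection>(_ source: Bytes) -> String where Bytes.Element == UInt8 {
  // Hypothetical predicate and hex helper, for illustration only.
  func needsEscaping(_ byte: UInt8) -> Bool {
    switch byte {
    case 0x30...0x39, 0x41...0x5A, 0x61...0x7A, 0x2D, 0x2E, 0x5F, 0x7E:  // unreserved characters
      return false
    default:
      return true
    }
  }
  func hexDigit(_ value: UInt8) -> UInt8 {
    value < 10 ? 0x30 + value : 0x37 + value  // '0'...'9', 'A'...'F'
  }
  // Pass 1: encode "into the void" just to measure the exact output size.
  let exactCount = source.reduce(0) { $0 + (needsEscaping($1) ? 3 : 1) }
  // Pass 2: encode again, this time into a correctly-sized String.
  return String(unsafeUninitializedCapacity: exactCount) { buffer in
    var i = 0
    for byte in source {
      if needsEscaping(byte) {
        buffer[i] = 0x25  // '%'
        buffer[i + 1] = hexDigit(byte >> 4)
        buffer[i + 2] = hexDigit(byte & 0x0F)
        i += 3
      } else {
        buffer[i] = byte
        i += 1
      }
    }
    return i
  }
}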

Currently, these are the sort of numbers that I'm seeing for encoding/decoding.

name                                    time        std        iterations
-------------------------------------------------------------------------
URLEncoded.String.urlEncoded            8225.000 ns ±  18.39 %     139543
URLEncoded.String.urlDecoded            4261.000 ns ±  19.24 %     311997

Now, if I remove the measurement step and instead allocate the maximum possible result size, I can bring this down significantly. It makes sense - we're only doing one pass, so it takes almost half the time:

name                                    time        std        iterations
-------------------------------------------------------------------------
URLEncoded.String.urlEncoded            5068.000 ns ±  20.08 %     260771
URLEncoded.String.urlDecoded            2734.000 ns ±  25.52 %     468956

The problem is that this produces Strings which may be wasting as much as 2/3 of their allocated capacity. That's unacceptable. It's a cheap benchmark win, but you wouldn't want a bunch of these floating around in your program. Even if this was limited to relatively small-ish inputs, it's not great.

What I would like to do is this (rough code sketch after the list):

  1. Allocate a stack buffer with the worst-case size (capped to some reasonable limit so we still have room to, you know, call functions)
  2. Write the lazily-calculated contents to the stack buffer, until the source is exhausted or the buffer is full
  3. If the source was exhausted, allocate a correctly-sized String, memcpy the results from the stack, and we're done in a single round of encoding/decoding :sunglasses:
  4. If the source still has elements, calculate the remaining length, allocate a correctly-sized String, memcpy the part we already have sitting on the stack, and take a second pass over just the remainder to add its contents to the String.
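
To make that concrete, here's a rough sketch. The names are invented, and since the runtime's stack-space check isn't exposed, the "if available" primitive below is only approximated by putting an arbitrary cap on top of today's API; the interesting part is the shape of steps (2)-(4):

// Stand-in for the failable primitive: a scratch buffer only if it's plausibly a stack
// allocation, otherwise nil. (Invented for illustration; today's withUnsafeTemporaryAllocation
// would fall back to the heap here instead of returning nil.)
func withScratchBufferIfAvailable<R>(
  capacity: Int, _ body: (UnsafeMutableBufferPointer<UInt8>) -> R
) -> R? {
  guard capacity <= 1024 else { return nil }  // arbitrary cap standing in for the runtime's check
  return withUnsafeTemporaryAllocation(of: UInt8.self, capacity: capacity, body)
}

// Steps (1)-(4), reduced to "drain a (possibly lazy) byte sequence into correctly-sized
// storage": the scratch buffer lets us skip the second pass whenever the output fits.
func collectBytes<Bytes: Collection>(_ source: Bytes, worstCaseCount: Int) -> [UInt8]
where Bytes.Element == UInt8 {
  // (1) Worst-case size, capped so there's still stack room left to call functions.
  let capacity = Swift.min(worstCaseCount, 512)
  let result: [UInt8]? = withScratchBufferIfAvailable(capacity: capacity) { stack in
    // (2) Write elements until the source is exhausted or the buffer is full.
    //     (UInt8 is trivial, so plain subscript assignment into the uninitialized buffer is fine.)
    var written = 0
    var iterator = source.makeIterator()
    var pending = iterator.next()
    while let element = pending, written < stack.count {
      stack[written] = element
      written += 1
      pending = iterator.next()
    }
    if let overflow = pending {
      // (4) The buffer filled up first: keep the prefix we already produced and take a
      //     second pass over just the tail. (The real thing would measure the tail and
      //     reserve exact capacity; plain appends are used here for brevity.)
      var output = Array(stack[..<written])
      output.append(overflow)
      output.append(contentsOf: IteratorSequence(iterator))
      return output
    }
    // (3) Source exhausted: one copy out of the scratch buffer, done in a single pass.
    return Array(stack[..<written])
  }
  // No scratch buffer was available at all: fall back to a plain allocating path.
  return result ?? Array(source)
}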

This should allow for performance that is much closer to the second group of results (as we'd be able to avoid taking two passes over most of the source contents), without over-allocating. I believe this would be useful quite generally for consumers of lazy collections, not just for my use-case.

The Problem with wUTA

The problem is that the heap fallback in withUnsafeTemporaryAllocation really limits the effectiveness of this approach. It all rests on the idea that the temporary allocation created in step (1) is free, or at least much cheaper than a heap allocation. If it goes to the heap, we now need to decide whether the cost of a heap allocation which will later be discarded is less than the cost of making two passes through the collection, and IME heap allocations are expensive enough that in many situations it won't be. Moreover, if I knew this was going to the heap, I'd probably ask for a fair bit more than the relatively meagre cap I set in step (1).

Therefore, using this API would likely be a performance regression in many situations, or its performance would vary too greatly depending on how deep you happen to be in the stack or the platform you happen to be running on.

Proposed Solution

We need a greater ability to control the fallback behaviour if stack-allocation isn't possible.

So I'd like to pitch that this function either be changed to never heap-allocate, or be augmented with a sister API. If it can't allocate on the stack, it would not execute the closure and would instead return nil, mirroring APIs such as withContiguousStorageIfAvailable. The existing behaviour could be trivially composed from this more fundamental building block, and augmented with more carefully-tuned fallbacks that might allocate an even larger buffer if they need to hit the heap anyway.
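
For instance, something like this (names invented; purely a sketch of the composition, not a concrete API proposal):

// Purely illustrative: neither function below exists today. Given a failable primitive,
// the current always-succeeds behaviour is a thin wrapper over it.
func withUnsafeTemporaryAllocationIfAvailable<T, R>(
  of type: T.Type, capacity: Int, _ body: (UnsafeMutableBufferPointer<T>) throws -> R
) rethrows -> R? {
  // Placeholder: a real implementation would consult the runtime's stack-space check and
  // either call `body` with a stack buffer or return nil without calling it at all.
  return nil
}

func withUnsafeTemporaryAllocationHeapFallback<T, R>(
  of type: T.Type, capacity: Int, _ body: (UnsafeMutableBufferPointer<T>) throws -> R
) rethrows -> R {
  // Fast path: stack allocation succeeded.
  if let result = try withUnsafeTemporaryAllocationIfAvailable(of: type, capacity: capacity, body) {
    return result
  }
  // Slow path: an ordinary heap allocation - and a caller who knows it's paying for the
  // heap anyway is free to ask for a much larger capacity at this point instead.
  let heapBuffer = UnsafeMutableBufferPointer<T>.allocate(capacity: capacity)
  defer { heapBuffer.deallocate() }
  return try body(heapBuffer)
}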

In my reading of the proposal, review thread, and decision notes, I can't see that this was brought up as an alternative. The suggestion was made that this function should always stack-allocate regardless of the available space, but that is not what I'm asking for here; I'd like it to stack-allocate or do nothing.

It's worth noting that SE-0322 has not yet shipped in an official release, so there is potentially still time to change it, should it be considered undesirable to have both the failable and heap-fallback functions.

4 Likes

cc @grynspan

1 Like

This seems sensible. +1.

IMO it would be better to add a new API for this. Heap allocation is still a sensible fallback in many situations.

3 Likes

Or add a parameter, with default value matching the current behaviour.

+1

1 Like

As I recall @Karl, you and I discussed the existing design at length. Please keep in mind that the existing/implemented function behaves the way it does intentionally. It provides access to a temporary, scoped buffer that, because it is scoped, just so happens to be more aggressively stack-promoted at compile time. The language doesn't actually guarantee that stack promotion occurs, and the fact that it occurs at all is an implementation detail, just as it would be if you called malloc() or UnsafeMutableBufferPointer.allocate(capacity:) directly.

Changing the behaviour of the existing function to "maybe this will give you a buffer, maybe this won't" would be a problem for a number of envisioned use cases, and I'd object to such a change very strongly.

That said, I do recognize you have a use case here for a subtly different function. If there's a scenario where a heap allocation would be a pessimization, we could supplement the existing function with something like this one:

func withUnsafeTemporaryAllocationIfFast<T, R>(
  of type: T.Type,
  capacity: Int,
  _ body: (UnsafeMutableBufferPointer<T>?) throws -> R
) rethrows -> R {
  if canFastAllocate(...) {
    return try body(fastAllocate())
  } else {
    return try body(nil)
  }
}

(Naming is hard.) The difference is that this function skips the heap-allocation fallback (duh) and, when stack allocation isn't possible, passes nil to body so that the caller can decide on an alternate behaviour.

The existing function is effectively implemented by calling a function that looks like this one, then falling back to UnsafeMutableBufferPointer.allocate(capacity:), so exposing such a function as an additional interface would be minimal engineering effort.

2 Likes

Yes, I wanted this to be done by the optimiser (and I still do! :smile:)

But the optimiser would not be able to handle this particular use-case; it's more of an algorithmic optimisation - essentially, if I'm able to pull a buffer out of thin air for ~free, I can use a very different algorithm to produce my result. It's more like withContiguousStorageIfAvailable.

Indeed, that is the problem I'm having with it. There are two paths, and the performance characteristics of each are so different that it is not feasible to reason about when this function may be beneficial. Having only one path means it is much simpler to reason about, and I don't need to accept the drawbacks of the slow path to take advantage of the fast path.

Effectively, there are always going to be these two paths. The current function checks at runtime to see if a stack allocation is safe, and the compiler would have to be very brave indeed to stack-promote the heap buffer despite the runtime's determination.

Yes, I'd be fine with a separate function. The current one would effectively be a variant of it whose default fallback is the heap.

I just ask you to keep in mind that the following pattern (and equivalent) is common in C:

char stackBuf[1024];
if (sz <= sizeof(stackBuf)) {
  doStuff(stackBuf);
} else {
  char *heapBuf = malloc(sz);
  doStuff(heapBuf);
  free(heapBuf);
}

And, not to ignore your use case, this is the sort of use case that the existing function was designed to allow in Swift.

Sure, I recognise that this is a different use-case. I added that to my comment (you were too quick for me!)

It doesn't have anything to do with C interop.

It's Sunday, traffic on the Internet Superhighway is light. :wink:

In the absence of a suitable stdlib function, would you be able to implement a C helper function in your code like this one?

#include <alloca.h>  /* alloca(); on some platforms it is declared in <stdlib.h> instead */
#include <stddef.h>  /* size_t, NULL */

void withStackBuffer(size_t size, void (^ body)(void *_Nullable buffer)) {
  if (size <= MY_LIMIT) {
    void *buffer = alloca(size);
    body(buffer);
  } else {
    body(NULL);
  }
}

(If blocks are a problem, there's equivalent attribute-heavy syntax that uses a regular old C function pointer but requires @_silgen_name at the call site.)

The stdlib implementation couldn't look like that because the module boundary between the stdlib and clients would prevent inlining/optimization, but that's presumably not an issue for you (C and Swift code can be optimized together in the same module AFAIK.)
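
From the Swift side (assuming the helper is visible through a bridging header or module map), the call site could look something like this; a sketch only, with the capacity and usage made up:

// Hypothetical Swift call site for the C helper above. Blocks import as Swift closures,
// and the nullable void pointer arrives as an optional raw pointer.
let worstCaseSize = 256
withStackBuffer(worstCaseSize) { rawBuffer in
  guard let rawBuffer = rawBuffer else {
    // nil means no stack buffer: fall back to the two-pass / heap-allocating path.
    return
  }
  // The alloca'd memory is untyped; bind it to UInt8 before use.
  let bytes = rawBuffer.bindMemory(to: UInt8.self, capacity: worstCaseSize)
  // ... write up to `worstCaseSize` bytes through `bytes` ...
  bytes[0] = 0
}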

1 Like