Hi,
I've got a use-case for SE-0322, but as accepted, the API doesn't quite provide what I need.
As accepted, withUnsafeTemporaryAllocation
will stack-allocate if possible, and otherwise fall back to the heap. The issue is that sometimes, if stack allocation isn't possible, you'd rather not allocate at all. This can happen if, for example, you are writing the contents of a lazy collection to a contiguous one.
Example Use-Case
Take percent-encoding and -decoding for example. These are horribly inefficient ways of escaping byte sequences - encoding a byte sequence may require as much as 3x the original storage size, and decoding may produce a collection with as few as 1/3 the bytes of the source. Currently, WebURL measures the exact size of the result, requiring one round of encoding/decoding, before allocating a String
using String(unsafeUninitializedCapacity:)
and writing it in a second pass.
Currently, these are the sort of numbers that I'm seeing for encoding/decoding.
name time std iterations
-------------------------------------------------------------------------
URLEncoded.String.urlEncoded 8225.000 ns ± 18.39 % 139543
URLEncoded.String.urlDecoded 4261.000 ns ± 19.24 % 311997
Now, if I remove the measurement step and instead allocate the maximum possible result size, I can bring this down significantly. It makes sense - we're only doing one pass, so it takes almost half the time:
name time std iterations
-------------------------------------------------------------------------
URLEncoded.String.urlEncoded 5068.000 ns ± 20.08 % 260771
URLEncoded.String.urlDecoded 2734.000 ns ± 25.52 % 468956
The problem is that this produces Strings which may be wasting as much as 2/3 of their allocated capacity. That's unacceptable. It's a cheap benchmark win, but you wouldn't want a bunch of these floating around in your program. Even if this was limited to relatively small-ish inputs, it's not great.
What I would like to do is this:
- Allocate a stack buffer with the worst-case size (capped to some reasonable limit so we still have room to, you know, call functions)
- Write the lazily-calculated contents to the stack buffer, until the source is exhausted or the buffer is full
- If the source was exhausted, allocate a correctly-sized String, memcpy the results from the stack, and we're done in a single round of encoding/decoding
- If the source still has elements, calculate the remaining length, allocate a correctly-sized String, memcpy the part we already have sitting on the stack, and take a second pass over just the remainder to add its contents to the String.
This should allow for performance that is much closer to the second group of results (as we'd be able to avoid taking two passes over most of the source contents), without over-allocating. I believe this would be useful quite generally for consumers of lazy collections, not just for my use-case.
The Problem with wUTA
The problem is the heap fallback in withUnsafeTemporaryAllocation
really limits the effectiveness of this approach. This all rests on the idea is that the temporary allocation created in step (1) is free/much cheaper than a heap allocation. If it goes to the heap, we now need to decide whether the cost of a heap allocation which will later be discarded is less than the cost of making two passes through the collection, and IME heap allocations are so expensive that in many situations, it probably won't be. Moreover, if I knew this was going to the heap, I'd probably ask for a fair bit more than the relatively meagre cap I set in step (1).
Therefore, using this API would likely be a performance regression in many situations, or its performance would vary too greatly depending on how deep you happen to be in the stack or the platform you happen to be running on.
Proposed Solution
We need a greater ability to control the fallback behaviour if stack-allocation isn't possible.
So I'd like to pitch either that this function be changed to never heap-allocate, or augmented with a sister API. If it can't allocate on the stack, it would not execute the closure and instead return nil
, mirroring APIs such as withContiguousStorageIfAvailable
. The existing behaviour could be trivially composed from this more fundamental building-block, and augmented with more carefully-tuned fallbacks that potentially allocate an even larger buffer if they need to hit the heap anyway.
In my reading of the proposal, review thread, and decision notes, I can't see that this was brought up as an alternative. The suggestion was made that this function should always stack-allocate regardless of the available space, but that is not what I'm asking for here; I'd like it to stack-allocate or do nothing.
It's worth noting that SE-0322 has not yet shipped in an official release, so there is potentially still time to change it, should it be considered undesirable to have both the failable and heap-fallback functions.