Feature Request: Variable-length array with short-array optimization

MrMage · April 12, 2021, 9:43am

My app has a bunch of cases where many arrays are created. Most of these arrays are expected to be fairly short (<= 3 elements), but the length of the longest arrays is potentially unbounded.

It would be great to have a data structure that could automatically store the first few elements inline to avoid allocating the array buffer altogether. For example, up to 7 Int8 s could be stored in the 8 bytes occupied by Array<Int8> before an allocation would be necessary. With some bit fiddling, it might even be possible to store a single pointer in a regular array's 8 bytes. I am not an expert on that, however; not sure how many bits are "left" for such purposes. For arrays larger than that, the implementation could "fall through" to a regular buffer-backed array.

I would also happily give up some stack space in exchange for saving these allocations. If I could store up to 3 pointers in a 24-byte data structure with the possibility of automatically expanding to the heap when those 3 slots are exceeded, I would be very happy. Given that ArrayBuffer already takes some heap space of its own (not sure how much off the top of my head, but definitely >= 8), I would "lose" at most 8 bytes in the "up to 3 case", potentially zero if ArrayBuffer takes >= 16 bytes on the heap.

Given that this library already implements ArrayN , I was wondering whether such use cases would be in scope for this.

kylemacomber · April 14, 2021, 2:04pm

LLVM has a SmallVector in which the number of inlined elements can be specified via a generic value parameter, i.e. SmallVector<UUID, 3>. This seems like the ideal approach if/when Swift gets generic value parameters.

In the meantime, it could make sense to have a SmallArray that uses an inline tuple of a hardcoded number of elements (e.g. 3). One reason to take this more naive, hardcoded approach, is to guarantee that Element is de-initialized if it's a non-trivial type. A tuple knows how to de-initialize itself. If we were to dynamically write an arbitrary number of elements into the space otherwise occupied by the Array pointer, we don't have a deinit hook today to go back over and de-initialize those elements.

Types, like String, that use this optimization in the standard library today have (the equivalent of) a known hard-coded, trivial Element type, which makes it easier to implement this optimization.

If you're interested in working on this, consider opening a PR against the Swift Collections package!

tyoung · April 15, 2021, 3:26pm

The inline tuple is an interesting approach... having worked extensively with llvm::SmallVector in my C++ code, it's a real pain to work with in the debugger due to the casting that goes on there. An inline tuple would be much easier to inspect.

One nice thing about LLVM's implementation is that there's a shared base class for all SmallVector derivatives, so you can have a function that takes any SmallVector (via type erasure to SmallVectorImpl), regardless of whether it pre-allocates 1, 2, ..., or n elements. If SmallArray did go down the path of hard-coded sizes and opted to have multiple sizes (e.g., SmallArray4, SmallArray8, ... SmallArrayN), a shared base would be nice so that could be passed interchangably.

lorentey · April 23, 2021, 1:09am

I think we already have Sequence (or RandomAccessCollection) for the shared base -- we don't need something that's specific to inline buffers.

I'm not sure we want to implement multiple sizes. The stdlib has a tuple-based FixedArray implementation that used to use gyb to stamp out any size we needed; we've now cut it back to a 16-element tuple. (And we're only using it for String stuff, with UInt8/UInt16 Elements, so it's not pulling its weight at all.)

Small array is one of those types that are useful in practice and desirable to have but (1) we can't do a really good job of implementing it in the language we currently have (because we cannot flexibly stack-allocate n items) and (2) it is also somewhat prone to misuse (because instances have a remarkably large constant memory footprint, even if they spill over into a heap-allocated buffer; so ideally these would only be used in local variables, and never copied).

Not being able to do a really good implementation means that while we could (and probably should) add a small array to the Collections package, it will be on a lesser tier than most other data structure implementations there: we won't have the expectation that it'll be ready to go through Swift Evolution any time soon. (It would be a little like UnsafeAtomics in the Atomics package, although with fewer deathly traps.)

Using tuples for the inline storage takes care of deinitialization, which is the most troublesome obstacle for implementing inline storage with an unconstrained element type. But there are also two lesser technical hurdles:

Representing the array with an enum value with smol and large cases will complicate in-place mutations, perhaps to the extent that we'd need to end up going with a purely struct-based approach -- which may waste some memory. (Giving up on enums is a sadly a frequent theme in this kind of code.)
Tuple elements always need to be initialized, so the array will probably need to take a default value to fill unused slots in its inline storage. (The alternative is to use an Array of optionals, which would probably be a bad idea.) _FixedArray16 works around this by only providing an initializer where Element: FixedWidthInteger, which obviously won't work in a general-purpose array.

I wonder if the ideal implementation for SmallArray would be a generic move-only type, using something like alloca & a custom deinit, with the inline capacity coming from an integer-typed generic argument value. Getting to this ideal state would take a bit of time; these language features don't exist yet.

It is not entirely obvious to me that we can implement generic smol arrays in a way that would be measurably faster than a regular Array. (Inline storage trades allocation costs for having to initialize the entire capacity, and if array values are copied around, they may trade one retain/release pair with n separate ones.) We'll need to carefully compare benchmark results and we need to be mentally prepared to discard the code if we cannot make it work.