Short Array optimisation

We don't optimise short array storage (e.g. "arrays up to 15 bytes could be stored inline") the way we do with strings. Ditto for the quite frequent case of empty arrays, an optimisation that would work for all array types if it were in place. It's not immediately obvious whether it was worth optimising or not; do we have a relevant discussion / white paper / rationale on that? The question is interesting to me both from a historical perspective and as a hint as to whether this optimisation is worth considering in my custom array-like type implementation.

2 Likes

Empty arrays, dictionaries, and sets use singleton storage objects (example), so they don't actually allocate memory. The singleton is marked as an immortal object, so reference-counting operations on it should be extremely cheap.
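
The easy-to-observe part of this (a minimal sketch; the singleton itself is an internal detail of the standard library):

```swift
// Empty arrays of any element type report zero capacity; per the above, their
// storage is a shared immortal singleton rather than a per-instance allocation.
let a: [Int] = []
let b: [String] = []
print(a.capacity, b.capacity)  // 0 0 — no element storage was allocated
```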

As for small arrays, I guess it just wasn't worth it. String is twice the size of Array (16 bytes vs 8 bytes). I remember there being some debate about this, but it was considered very important and desirable that Array not be any larger than a single pointer (I think the reason might have had something to do with multi-dimensional arrays -- i.e. arrays of arrays).

String is not generic - native strings always contain UTF-8 text, and given the way UTF-8 works, it can actually store a reasonable amount of text in its 16 bytes. Array is generic, and for common types like Array<Int>, it wouldn't even be able to store a single element.
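
The sizes are easy to check (values below are for 64-bit platforms):

```swift
// String reserves two words and can use them for small inline UTF-8 payloads;
// Array is a single word (one pointer), so an inline form of Array<Int>
// couldn't hold even one element alongside a count.
print(MemoryLayout<String>.size)  // 16
print(MemoryLayout<[Int]>.size)   // 8
print(MemoryLayout<Int>.size)     // 8
```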

4 Likes

:+1:

On this forum?

An array of strings doesn't look like a less common use case than an array of arrays.

OTOH a single Int / Double or a class reference would fit into 16-byte storage (an array of one element is worth optimising – we do have CollectionOfOne, after all). So would 15 UInt8 values.
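
For reference, CollectionOfOne already does the inline, allocation-free thing for the single-element case:

```swift
// CollectionOfOne stores its single element inline — no heap allocation,
// no reference counting. This is the "array of one" case mentioned above.
let one = CollectionOfOne(42)
print(one.count)                        // 1
print(one.first!)                       // 42
print(MemoryLayout.size(ofValue: one))  // 8 on 64-bit: just the Int itself
```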

1 Like

Well, back on ye olde mailing lists. I tried looking for it again, but couldn't find anything. One of the standard library authors might remember why it was considered important for Array to be a single pointer. I'm quite sure I remember that was a deliberate design decision.

We can ask @dabrahams if he's around and can recall.

The way to think about this (IMO) is that String in particular has a very compelling use-case justifying its larger inline size.

For a general Array<X>, you're not really going to fit much data in 8 or even 16 bytes. Maybe you can fit a couple of small enums, but for structs the internal padding between data members means that a lot of the time they won't fit at all, and if they do, maybe you can get one element inline.

There is a cost to this kind of thing. For String, I think it's very compelling and clearly worth the cost. For Arrays of arbitrary data types, it's not so obvious.

3 Likes

Most enums without payloads have fewer than 256 cases – individual values could be represented as single bytes.

Also, inline storage could work particularly well with a Set<payload-less enum>.
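
A minimal sketch of that last idea (nothing like this exists in the standard library; the names are made up):

```swift
// A payload-less enum with a small raw value can back a set-like type with a
// single word of inline storage instead of Set's heap-allocated hash table.
enum Direction: UInt8, CaseIterable { case north, south, east, west }

struct TinyEnumSet<E: RawRepresentable> where E.RawValue == UInt8 {
    private var bits: UInt64 = 0  // one bit per case; works for raw values 0..<64

    mutating func insert(_ element: E) { bits |= 1 << UInt64(element.rawValue) }
    mutating func remove(_ element: E) { bits &= ~(1 << UInt64(element.rawValue)) }
    func contains(_ element: E) -> Bool { bits & (1 << UInt64(element.rawValue)) != 0 }
}

var visited = TinyEnumSet<Direction>()
visited.insert(.north)
print(visited.contains(.north), visited.contains(.south))  // true false
print(MemoryLayout.size(ofValue: visited))                 // 8 — no heap allocation
```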

Indeed.

1 Like

I also would be surprised if there were not many applications of this today; lots of arrays are of UInt8, for example. Not everything that goes in an array is big.

I don't think I'd ever thought about this, but now that @tera has raised it, I find I am a bit surprised this optimisation isn't already implemented. It does seem like an obvious one, especially given the close parallels between Array and String… perhaps there is a particular reason it has not been?

1 Like

I’m afraid I don’t remember the details at this point, but Array can’t change. Not just its size but a huge part of its representation is baked into the Swift ABI via @inlinable, in order to make it reliably competitive with C. Whether this was a good choice is up for debate, and I would love it if someone else who was around when I was can remember why we ended up with it being one word instead of two, but we are stuck with it as is. There is an escape hatch for an “opaque” representation, kind of like custom NSArray subclasses back in ObjC, but that immediately comes with a performance hit, at least when backwards-deploying, and it still doesn’t get us any more space.

5 Likes

You guys should check out _TinyArray from swift-certificates:

_TinyArray.swift

4 Likes

The first-principles analysis is that one word instead of two is probably a pretty dubious optimization (and likely a pessimization), but even with a two-word (or bigger) Array, a "short array" representation probably wouldn't carry its weight.

You really want general-purpose array subscript to be a bounds check and a memory dereference and nothing else for performance reasons. You don't want to have to dispatch into different logic depending on the array representation. Even if that branch is hoistable, there are lots of situations where you have scattered isolated subscript accesses and you end up doing the check for each one. That sort of access pattern is a lot less common with strings (and short strings are relatively more common), which provides some intuition for why this makes sense for one type but not for the other.

There definitely are use patterns where short-array optimizations make good sense (especially the "zero/one/many" pattern), but we wouldn't want to impose that overhead on all array uses. There's a bunch of stuff I might change about Array if I had a time machine, but this isn't one of them.
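
For illustration, here's a rough sketch of the "zero/one/many" pattern (in the spirit of, but not copied from, swift-certificates' _TinyArray mentioned above) — the per-subscript branch on the representation is exactly the overhead being discussed:

```swift
// Rough sketch of a "zero/one/many" small array. The cost is visible in
// `subscript`: every access must branch on the representation before it can
// touch an element, which general-purpose Array avoids.
enum SmallArray<Element> {
    case zero
    case one(Element)
    case many([Element])

    subscript(index: Int) -> Element {
        switch self {
        case .zero:
            preconditionFailure("Index out of range")
        case .one(let element):
            precondition(index == 0, "Index out of range")
            return element
        case .many(let elements):
            return elements[index]
        }
    }

    mutating func append(_ element: Element) {
        switch self {
        case .zero:
            self = .one(element)
        case .one(let first):
            self = .many([first, element])
        case .many(var elements):
            self = .zero  // release the enum's copy so the append below doesn't trigger CoW
            elements.append(element)
            self = .many(elements)
        }
    }
}
```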

8 Likes

I don't know why no one has mentioned it, but string normalization is one possible use case for [UInt8]. Since Unicode normalization didn't make it into SE-0405, you might want to use [UInt8] to store a normalized string.
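
For example, producing normalized bytes with Foundation's existing API (just one way to get them; the point above is about where to store them):

```swift
import Foundation

// Foundation exposes NFC/NFD normalization on String; the normalized UTF-8
// code units can then be kept in a [UInt8] for byte-wise comparison or hashing.
let input = "e\u{0301}"  // "e" followed by a combining acute accent
let nfc = Array(input.precomposedStringWithCanonicalMapping.utf8)  // [0xC3, 0xA9] — composed "é"
let nfd = Array(input.decomposedStringWithCanonicalMapping.utf8)   // [0x65, 0xCC, 0x81] — decomposed
print(nfc.count, nfd.count)  // 2 3
```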

See also: this same discussion in a more general context (not Swift-specific) on StackExchange: implementation - Why would Short String Optimization not apply to dynamic arrays? - Programming Language Design and Implementation Stack Exchange

1 Like

String is perfectly able to store a normalized string. It doesn't have an API to perform normalization, but as a storage type there's no issue.

It was @Chris_Lattner3's idea IIRC, and it seemed like a good one to me, especially because the small array optimization was unimplementable: Swift doesn't have copy constructors, so you can't even implement it without using enums to store the buffer, and then the elements won't be contiguous in memory. The size of the array would always depend on the element size, which could be a massive pessimization if the element size was large. There's no obviously correct small buffer size for general-purpose use... so you'd want it to be a generic parameter. Experience with LLVM shows that developers guess, and thereby introduce a lot of noise into the code for something whose efficiency they haven't measured and which might be a pessimization. And Swift doesn't have numeric generic parameters, so…
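
The size problem is easy to see with a toy enum-based buffer (an illustrative sketch, not a proposal):

```swift
// If the inline buffer is an enum payload, the whole type grows with
// (inline capacity × element size), whether or not that capacity is ever used.
struct BigElement { var a, b, c, d: Double }  // 32 bytes

enum InlineOrHeap<Element> {
    case inline(Element, Element, Element, Element)  // hypothetical inline capacity of 4
    case heap([Element])
}

print(MemoryLayout<InlineOrHeap<UInt8>>.size)       // a handful of bytes
print(MemoryLayout<InlineOrHeap<BigElement>>.size)  // at least 4 × 32 = 128 bytes
```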

4 Likes

Why do you say it's likely a pessimization? What would you want to store in the additional word, and why?

It's not what I want to store in the additional word; it's that I would want the header and body allocations of the array storage to be separate, and that necessitates another word (and maybe two!). This potentially slows down array construction a little bit, but it puts different types of data in different allocations, which makes hardened implementations that track the type of allocations easier to implement. It also puts these two things, which typically have distinctly different access patterns, on different cachelines, maximizing the cacheline/page reach of the body buffer (and making it more likely that you hit the all-aligned fast path for vector code with smaller arrays).

Separating the body allocation from the ownership sigil and metadata would also allow for some API patterns that are impossible with tail allocation that would unlock some further optimizations (taking ownership of a buffer originally from another type without copying).

It's not at all obvious that this would have been a better design overall, but it definitely would have been better in some ways.
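
A very rough sketch of the shape being described (assumed names; not a workable Array replacement):

```swift
// The refcounted object only tracks ownership of a plain buffer that lives in
// its own allocation; the element storage itself carries no refcount or header.
final class OwnershipSigil {
    let base: UnsafeMutableRawPointer
    init(base: UnsafeMutableRawPointer) { self.base = base }
    deinit { base.deallocate() }
}

struct SeparatedArray<Element> {
    var owner: OwnershipSigil                          // word 1: lifetime/ownership only
    var elements: UnsafeMutableBufferPointer<Element>  // words 2–3: base + count of the dumb storage
}
```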

2 Likes

@scanon
I don't think I understand what you have in mind.

It's not what I want to store in the additional word

I asked the question because I thought it would help me understand what you have in mind.

To me "the array storage" implies the dynamically allocated block (ContiguousArrayStorage), but IIUC you are talking about putting some data directly in the MemoryLayout<Array<T>>.size bytes of the body.

puts different types of data in different allocations, which makes hardened implementations that track the type of allocations easier to implement

...or maybe you want two dynamic allocations per array instance? That sounds like a performance killer, since allocations are costly. It also means twice the ARC traffic when arrays are copied.

these two things, which typically have distinctly different access patterns

One of the things in the header is the array length, which is accessed on every (safe) indexing operation. Isn't that really the same access pattern as the elements?

are on different cachelines, maximizing the cacheline/page reach of the body buffer (and making it more likely that you hit the all-aligned fast path for vector code with smaller arrays).

We could have achieved that with a single allocation and a single pointer by using some internal padding between the header and the elements. Of course Swift has probably baked the array layout into the ABI already.

Don't you simply have to accept either alignment problems for vector code or wasted storage, as long as there's a reference count?

ownership sigil and metadata

Sorry, the what now?

1 Like

That's exactly right. But one of the allocations (the ref-counted metadata thing I tend to call the "ownership sigil", because that's really all it does--it is a refcounted object by which the lifetime and ownership of the dumb buffer are tracked) is fixed-size and tiny, so it comes out of a fast and cheap allocator in practice, and the other thing doesn't need to be refcounted, because it's just dumb storage.

No, because we can amortize accessing the length across indexing operations most of the time. Sometimes we actually do the comparison on every operation, because the access pattern is too complex for the optimizer to reason about, but we can always² hoist the actual load of the length.

Yes, but you wouldn't get the other benefits I mentioned (single-type allocations for hardened allocators and the ability to pass ownership of buffers into/out-of Array), and the padding is wasted space (not a huge deal, but not great either).

You generally have to accept that there will be code capable of handling misalignment, but you can avoid using that code (and paying the performance penalty for it, which can be pretty large in the case of more naive vectorization and autovectorization--it's not uncommon to have "scalar alignment fixup" take more time than the main vectorized body when trip counts are just a few tens. Skilled vectorizers can avoid this, but there's a lot of mediocre vector code out there). If the data is ~always aligned in practice because Array always gives you aligned buffers, naively vectorized code goes a lot faster in the common case.


Âą "would propose" in the sense of "if I had a time machine I would go back and experiment with this". This ship has sailed, so this is all pretty theoretical.

² so long as the compiler can prove that writes to the array storage cannot alias the length field; but if anything this layout makes hinting that to the compiler much simpler.
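
To make the amortization concrete, here's what that hoisting looks like when done by hand today (a sketch; the optimizer often does the equivalent on its own):

```swift
// withUnsafeBufferPointer loads the base address and count once, so the
// per-element work in the loop is just the dereference (plus whatever bounds
// checking the buffer type itself performs) — no repeated trips through the
// array header on every index.
func sum(_ values: [Int]) -> Int {
    values.withUnsafeBufferPointer { buffer in
        var total = 0
        for i in 0..<buffer.count {  // count read once; it cannot alias the elements
            total += buffer[i]
        }
        return total
    }
}
```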

2 Likes

Yeah, but you see, I have a new standard library to write for a different language, so I'm interested in these details ;-)

6 Likes

These are the use cases I have in mind that could benefit from a compact array representation. In all of these examples the Array elements (plus the length) are stored "inline": they don't use separate heap-allocated storage and don't cause ARC traffic (see the sketch after the list). The two numbers are given for 64-bit and 128-bit storage respectively:

  • up to 57 / 120 Booleans
  • up to 29 / 60 tri-state enum values (2 bits per value)
  • up to 19 / 40 UIDeviceOrientation values (3 bits per value)
  • up to 7 / 15 UInt8 values
  • a 5×5 / 7×7 tic-tac-toe board (2 bits per cell)
  • 0 or 1 / 0, 1, or 2 class references
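
A sketch of the first bullet (assumed layout, not an existing type):

```swift
// Up to 57 booleans packed into a single word, with the high bits holding the
// length — no heap allocation, no ARC traffic.
struct InlineBoolArray {
    private var bits: UInt64 = 0  // low 57 bits: values; bits 57-62: count (0...57)

    var count: Int { Int(bits >> 57) }

    subscript(index: Int) -> Bool {
        precondition(index >= 0 && index < count, "Index out of range")
        return bits & (1 << UInt64(index)) != 0
    }

    mutating func append(_ value: Bool) {
        let n = count
        precondition(n < 57, "InlineBoolArray is full")
        if value { bits |= 1 << UInt64(n) }
        bits = (bits & ~(UInt64(0b111111) << 57)) | (UInt64(n + 1) << 57)
    }
}

var flags = InlineBoolArray()
flags.append(true)
flags.append(false)
print(flags.count, flags[0], flags[1])    // 2 true false
print(MemoryLayout.size(ofValue: flags))  // 8
```
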
2 Likes

Sure, that would be great, if you can figure out how to do it. You need to determine that Element is trivially copyable at the very least, and you need a way to map Element values into/out of the bits of storage. And IIUC this "way" would need to be extensible, because the standard library doesn't know about all of these types you want to optimize.
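
A hypothetical shape for that extensibility (every name below is invented for illustration; nothing like this exists in the standard library):

```swift
// A conforming Element describes how to map its values into and out of a
// fixed number of bits, and a compact container could key off that.
protocol BitPackable {
    /// Number of bits needed to represent any value of this type.
    static var bitWidth: Int { get }
    /// Encode the value into the low `bitWidth` bits of the result.
    var packedBits: UInt64 { get }
    /// Decode a value from the low `bitWidth` bits, or nil if they are invalid.
    init?(packedBits: UInt64)
}

enum TriState: UInt8, BitPackable {
    case off, on, unknown

    static var bitWidth: Int { 2 }
    var packedBits: UInt64 { UInt64(rawValue) }
    init?(packedBits: UInt64) { self.init(rawValue: UInt8(truncatingIfNeeded: packedBits)) }
}
```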