Add an extra byte to Data.InlineData.Buffer for 32-bit

After a discussion on Twitter about the underlying memory structure of Data (https://twitter.com/khaost/status/1224158483039174656?s=21), it looks like on 32-bit systems, Data.InlineData would be 7 bytes: a 6-byte Data.InlineData.Buffer plus an extra byte for the length (UInt8). But Data.LargeSlice is 8 bytes (a composite of two pointers). So there is an unused byte when Data._representation is inline.
Would it make sense to add another byte to Data.InlineData.Buffer so that Data.InlineData matches the size of Data.LargeSlice? Or is there a performance issue I haven’t considered?

tl;dr: The layout of InlineData allows the Swift optimizer to fit enum _Representation into two words (8 bytes on 32-bit), but in some cases it cannot do this, and enum _Representation becomes two words + 1 byte (9 bytes on 32-bit).


Note: since you describe 32-bit here, I'll be using 32-bit sizes throughout the post, but this applies to 64-bit as well.

The size of Data.InlineData.Buffer appears to be one byte "short" of what it could be, but this is actually necessary to fit the entire enum _Representation into two words. enum _Representation is the combination of the following four cases:

enum _Representation {
    case empty
    case inline(InlineData)
    case slice(InlineSlice)
    case large(LargeSlice)
}   

Except for .empty, each of these cases has an associated payload, which means that the size of _Representation has to include both the size of the largest payload and the size of a discriminator that stores which case the enum actually represents. The discriminator is hidden from us (it's implicit in the definition of the enum), but since we have four cases here, we need just two bits to store it somewhere.
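
To make that cost concrete, here's a minimal sketch (hypothetical types, not Foundation's; sizes shown are for 64-bit) of what happens when the payloads offer the compiler no spare bits to store the discriminator in:

enum NoSpareBits {
    case empty
    case first(UInt64)   // UInt64 uses all 64 bit patterns
    case second(UInt64)
}

// 9 on 64-bit: 8 payload bytes + 1 whole byte for the discriminator
print(MemoryLayout<NoSpareBits>.size)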

The Swift compiler and optimizer are pretty smart: given enough information about the various cases and payloads, they can actually "steal" two bits from each of the payload cases (plus .empty) to store the discriminator inline in the payloads, without requiring a whole additional byte for the representation, assuming this is actually valid to do.
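
For example (again a sketch with hypothetical types, 64-bit sizes), swapping the integer payload for a class reference lets the optimizer do exactly this:

final class Box {}

enum SpareBits {
    case empty
    case ref(Box)   // class references have bit patterns that are never valid pointers
}

// 8 on 64-bit: the discriminator is stored in bit patterns no valid
// pointer can have, so the enum is no bigger than a bare pointer
print(MemoryLayout<SpareBits>.size)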

So, in what cases is this valid to do?

  • When the compiler has enough room in each payload to perform this inlining
  • When the discriminator can be inlined into the same location (bitwise) in every payload case

We could go into detail about the exact bit patterns the optimizer uses here, but I'll just summarize the (general) layouts:

  1. .empty: bits can be placed anywhere since there's no payload
  2. InlineSlice: word-size struct followed by a class pointer
  3. LargeSlice: two class pointers
  4. InlineData: TBD (answered below)

The key here is that the optimizer is really smart about pointers to class objects. Specifically, on pretty much all platforms, not every bit pattern is a valid pointer (e.g. on x86_64, object pointers are aligned, so their low bits are always zero, and large parts of the address space can never hold objects)*, and the compiler is smart enough to steal some of those bits for its own use. So, if we align a class pointer in the same place across all payloads, the optimizer can actually store the bits we need inline.
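
You don't need Data to observe this: Optional is itself just such an enum, and a quick check (64-bit sizes again) shows the pointer trick in action:

final class Object {}

// 8: .none is encoded as a bit pattern no valid Object pointer can have
print(MemoryLayout<Object?>.size)

// 9: every bit pattern is a valid Int, so .none needs its own byte
print(MemoryLayout<Int?>.size)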

However, the optimizer cannot know this about an arbitrary collection of bytes: if we have an 8-byte inline representation, any bit pattern is valid anywhere inside that buffer. So we cannot line up an 8-byte representation (7 data bytes + length) with a class pointer in a way that lets the optimizer steal bits. What about a 7-byte representation (6 data bytes + length), though? It turns out we can make that work, thanks to the optimizer's rules:

dd: data byte
ll: length byte
ss: struct byte
cc: class pointer byte

.empty:  [__, __, __, __, __, __, __, __] <- can steal anywhere
.inline: [dd, dd, dd, dd, dd, dd, ll, __] <- last byte unused
.slice:  [ss, ss, ss, ss, cc, cc, cc, cc]
.large:  [cc, cc, cc, cc, cc, cc, cc, cc]

We designed the InlineData case to leave that last byte unused so the optimizer can "see" that it has free rein to steal bits from that byte across all cases. Without this, the discriminator would not fit in the same place across all four cases, and we'd need an extra byte for the discriminator, leading to a 9-byte enum _Representation.
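
Here's a runnable model of that design (hypothetical names, not Foundation's real types, using 64-bit sizes: a 14-byte buffer + length byte, leaving the sixteenth byte unused). On platforms where the optimization applies, such as 64-bit Darwin, the whole enum fits in two words:

final class Storage {}

struct ModelInline {   // 14 data bytes + 1 length byte = 15 bytes, 1 byte spare
    var bytes: (UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8,
                UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8)
    var length: UInt8
}

struct ModelSlice {    // word-size struct followed by a class pointer
    var range: Int
    var storage: Storage
}

struct ModelLarge {    // two class pointers
    var a: Storage
    var b: Storage
}

enum ModelRepresentation {
    case empty
    case inline(ModelInline)
    case slice(ModelSlice)
    case large(ModelLarge)
}

// 16 on 64-bit Darwin: two words, with the discriminator stolen from
// the shared spare byte rather than laid out separately
print(MemoryLayout<ModelRepresentation>.size)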

*Here is the big disclaimer: this does not work on platforms where the optimizer cannot guarantee being able to steal bits from a class pointer, either because the ABI makes no such guarantee, or for some other reason. In those cases, it falls back to laying out an additional byte for the discriminator. So the work here enables the optimizer to inline the discriminator wherever possible, but when that doesn't work out, we have to live with 9 bytes. If we added one more byte to the buffer, though, we would always be 9 bytes.


The comment on Data.InlineData.Buffer alludes to this but doesn't explain it: we need Buffer to be two bytes short of two pointers in size in order to fit.
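
You can verify the end result directly (on a 64-bit Darwin toolchain, where the optimization kicks in):

import Foundation

// 16 on 64-bit Darwin: all of Data, including the inline buffer,
// length, and discriminator, fits in two words
print(MemoryLayout<Data>.size)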


Looking at the code, I think that, with a little help from the standard library, you might have been able to eke out another data byte by taking three bits for the size and leaving five for the discriminator (three of those bits would go unused). Unfortunately, this is all frozen now, so it has become ABI and can't be changed, on Darwin at least.
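
For illustration, here's a sketch of the packing being suggested (hypothetical helpers, not anything the stdlib actually exposes): on 32-bit, a 7-byte buffer only ever holds lengths 0 through 7, which fit in the low three bits of the final byte, leaving the other five free.

// Hypothetical packing: low 3 bits hold the length (0...7);
// the high 5 bits would be declared unused for the optimizer to steal.
func packLength(_ length: UInt8) -> UInt8 {
    precondition(length <= 7, "a 7-byte buffer can only hold lengths 0...7")
    return length & 0b0000_0111
}

func unpackLength(_ byte: UInt8) -> UInt8 {
    byte & 0b0000_0111
}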

Curious. Does the stdlib have a way of indicating "I'm only using x bits from this byte", letting the optimizer use the rest? (I don't believe this was available to us at the time, and still wouldn't be available to Foundation if it relies on a builtin.)

Or am I misunderstanding?