After a discussion about the underlying memory structure of `Data` on Twitter (https://twitter.com/khaost/status/1224158483039174656?s=21), it looks like on a 32-bit system, `Data.InlineData` would be 7 bytes, with a 6-byte `Data.InlineData.Buffer` and an extra byte for the length (`UInt8`). But `Data.LargeSlice` is 8 bytes (a composite of 2 pointers). So there is an unused byte when the `Data._representation` is inline.
Would it make sense to add another byte to `Data.InlineData.Buffer` so `Data.InlineData` matches the size of `Data.LargeSlice`? Or is there a performance issue I haven't considered?
tl;dr: The layout of `InlineData` allows the Swift optimizer to fit `enum _Representation` in two words (8 bytes on 32-bit), but in some cases it will not be able to do this, and `enum _Representation` will be two words + 1 byte (9 bytes on 32-bit).
Note: since you describe 32-bit here, I'll be using 32-bit sizes throughout the post, but this applies to 64-bit as well.
The size of `Data.InlineData.Buffer` appears to be one byte "short" of what it could be, but this is actually necessary to represent all of `enum _Representation` in two words. `enum _Representation` is the combination of the following four cases:
```swift
enum _Representation {
    case empty
    case inline(InlineData)
    case slice(InlineSlice)
    case large(LargeSlice)
}
```
Except for `.empty`, each of these cases has an associated payload, which means that the size of `_Representation` has to include both the size of the payload and the size of a discriminator which stores which case the enum actually represents. This discriminator is hidden from us (it's implicit in the definition of the `enum`), but since we have 4 cases here, we just need two bits to store it somewhere.
The Swift compiler and optimizer are pretty smart: given enough information about the various cases and payloads here, they are able to actually "steal" two bits from each of the payload cases (plus `.empty`) to store the discriminator inline in the payloads, without requiring a whole additional byte for the representation, assuming that this is actually valid to do.
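As a quick sanity check of the net effect, `MemoryLayout` reports the final size the compiler settled on (a sketch; the exact number depends on the platform and on whether the optimization described below applies):

```swift
import Foundation

// One word is one pointer: 4 bytes on 32-bit, 8 bytes on 64-bit.
let word = MemoryLayout<UnsafeRawPointer>.size

// Data wraps enum _Representation, so when the optimizer can hide
// the discriminator in the payloads, Data fits in exactly two
// words; otherwise it needs one extra byte.
let size = MemoryLayout<Data>.size
print("Data is \(size) bytes; two words is \(2 * word)")
```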
So, in what cases is this valid to do?
- When the compiler has enough room in each payload to perform this inlining
- When the discriminator can be inlined into the same (bitwise) location in each payload case
We could get pretty detailed about exactly which bitpatterns the optimizer uses here, but I'll just summarize the (general) layouts:
- `.empty`: bits can be placed anywhere since there's no payload
- `InlineSlice`: word-size `struct` followed by a `class` pointer
- `LargeSlice`: two `class` pointers
- `InlineData`: TBD (answered below)
The key here is that the optimizer is really smart about pointers to `class` objects. Specifically, on pretty much all platforms, not all bitpatterns are valid pointers (e.g. on x86_64, the upper bits are all zeros for most objects)*, and the compiler is smart enough to be able to steal some bits for its own use. So, if we align a `class` pointer in the same place across all payloads, the optimizer can actually store the bits we need inline.
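You can see this machinery in miniature with `Optional` (a toy illustration of the principle, not the exact transform used for `Data`): a class reference has invalid bitpatterns, such as the null pointer, so the compiler can encode the extra case in one of them, while a buffer of raw bytes forces a separate tag byte.

```swift
// A class reference is one pointer wide, and not every bitpattern
// is a valid reference.
final class Storage {
    var byte: UInt8 = 0
}

let refSize = MemoryLayout<Storage>.size      // one pointer
let optRefSize = MemoryLayout<Storage?>.size  // still one pointer:
// .none hides in an invalid pointer bitpattern, no tag byte needed.

// A 4-byte buffer uses every bitpattern as valid data, so Optional
// must add a whole extra byte for its tag.
typealias Buffer4 = (UInt8, UInt8, UInt8, UInt8)
let bufSize = MemoryLayout<Buffer4>.size      // 4
let optBufSize = MemoryLayout<Buffer4?>.size  // 5

print(refSize, optRefSize, bufSize, optBufSize)
```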
However, the optimizer cannot know this about an arbitrary collection of bytes: if we have an 8-byte buffer, any bitpattern is valid anywhere inside that buffer. So we cannot align an 8-byte buffer (7 bytes + length) with a `class` pointer so they overlap where the optimizer can steal some bits. What about a 7-byte buffer (6 bytes + length), though? It turns out we can do this, thanks to optimizer rules:
```
dd: data byte
ll: length byte
ss: struct byte
cc: class pointer byte

.empty:  [__, __, __, __, __, __, __, __] <- can steal anywhere
.inline: [dd, dd, dd, dd, dd, dd, ll, __] <- last byte unused
.slice:  [ss, ss, ss, ss, cc, cc, cc, cc]
.large:  [cc, cc, cc, cc, cc, cc, cc, cc]
```
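The layout above can be approximated with a toy model (all names here are hypothetical stand-ins for Data's internals, and the comments use 64-bit sizes since that's what most machines run; whether the enum lands at two words or two words + 1 depends on the platform's pointer guarantees):

```swift
final class Backing {}  // stand-in for Data's storage class

// 14 data bytes + 1 length byte = 15 bytes: one byte short of two
// 64-bit words, mirroring the 6 + 1 = 7 bytes on 32-bit.
struct TinyInline {
    var bytes: (UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8,
                UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8)
    var length: UInt8
}

struct TinySlice {      // word-size struct followed by a class pointer
    var offset: Int
    var storage: Backing
}

struct TinyLarge {      // two class pointers
    var range: Backing
    var storage: Backing
}

enum TinyRepresentation {
    case empty
    case inline(TinyInline)
    case slice(TinySlice)
    case large(TinyLarge)
}

let word = MemoryLayout<UnsafeRawPointer>.size
let enumSize = MemoryLayout<TinyRepresentation>.size
// Two words when the discriminator fits in the shared spare byte,
// two words + 1 when it doesn't.
print("inline: \(MemoryLayout<TinyInline>.size), enum: \(enumSize)")
```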
We designed the `InlineData` case to leave that last byte unused so the optimizer could "see" it has free rein to steal bits from the last byte across all cases. Without this, the discriminator would not fit in the same place across all four cases, and we'd need an extra byte for the discriminator, leading to a 9-byte `enum _Representation`.
*Here is the big disclaimer: this does not work on platforms where the optimizer doesn't guarantee being able to steal bits from a `class` pointer, whether because it lacks that ABI guarantee or otherwise. In those cases, it falls back to laying out an additional byte for the discriminator. So the work here enables the optimizer to inline the discriminator wherever possible, but if it doesn't work out, then we have to live with 9 bytes. If we added one more byte to the representation, though, we would always be 9 bytes.
The comment on `Data.InlineData.Buffer` alludes to this but doesn't explain it: we need `Buffer` to be two bytes short of two pointer sizes in order to fit.
Looking at the code, I think that, with a little help from the standard library, you might have been able to eke out another data byte by taking three bits for the size and leaving five for the discriminator (three of those would be unused). Unfortunately, this is all frozen now, so it's become ABI and can't be changed, on Darwin at least.
Curious. Does the stdlib have a way of indicating "I'm only using `x` bits from this byte", letting the optimizer use the rest? (I don't believe this was available to us at the time, and it still wouldn't be available to Foundation if it relies on a builtin.)
Or am I misunderstanding?