SR-14273: Byte Code Based Value Witnesses

Hi Everyone,

I've been working on SR-14273 for the last little bit. I've gotten to the point of having a working proof of concept (see the GitHub branch gmittert/swift at upstream_runtimevw) and I'm hoping to get some feedback from the community so I can begin stabilizing and hopefully upstreaming the work. Many thanks to @Arnold, who has been helping me along with this.

SR-14273 "Working toward a byte code representation for value witnesses" is an approach to reducing the code size of Swift binaries. When the Swift compiler is forced to create a value witness function for a type, these functions can become very large, especially for large nested value types. These functions tend to have very similar set of operations that they do. Some examples being: "compute the address and alignment of a field", "free a reference at a given address", and "check if an enum has a payload".

Rather than generating code for each type, we propose to compute a compile-time type layout, then have a runtime function interpret that layout, computing alignments and addresses at runtime in order to copy or destroy values.

For a simple example, given a struct like

struct S {
    let a: SomeClass
    let b: UInt32
    let c: SomeOtherClass
}
enum E {
    case A(c: SomeClass)
    case B(s: S)
    case C(i: UInt32)
    case D
}

Normally, the value witness function to destroy it would look something like:

define $s1E1AVwxx(addr, metadata): // value witness destroy for E
  gep to field containing extra inhabitants
  compute enum case based on extra inhabitants
  if A:
     release A
  if B:
     destroy B

With byte code value witnesses, we would instead compute, at compile time, a string representing the type, then pass that string to a runtime function to interpret. Making up a format for clarity:

// Sample format for clarity, see below for actual byte based format
string = enum {
   emptyCases: 1,
   cases: [
     Reference,
     Struct: [ Reference, POD32, Reference ],
     POD32,
   ]
}
define $s1E1AVwxx(addr, metadata): // value witness destroy for E
    call runtime_destroy(addr, string, len(string), metadata)

Our code that had to do all the work is now a call to a runtime function with a precompiled string. This lets us make a code size/runtime cost trade-off. A complete implementation should allow replacing, at minimum, the destroy value witness as well as the assign/initialize with copy/take value witness functions. It likely also extends to the buffer and enum operations, though my focus so far has mainly been on copying and destroying.

Metrics

I don't currently have good metrics on size improvements in a large app, but in a reduced test case defining some nested structs/enums, I was seeing about a 6% text section decrease from replacing the destroy and copy functions. I don't yet have numbers on how big a performance decrease to expect from interpreting the strings. As is, the proof of concept is anything but optimized, so unfortunately I don't think performance numbers taken now would be all that useful.

Some Logistics

As for enabling and disabling this, the proof of concept is behind a global flag, but I think this would be more useful with something more fine-grained. Being able to annotate a specific problematic large type would be more useful than having to globally enable and disable the behaviour. My understanding is that an underscored attribute doesn't require too much process to add.

The Byte Code

The rest of this post is about the byte code format, which is the main thing I'd like feedback on. I'd expect it to be changeable in the future, but somewhat cumbersome to change, so I'd like to make sure it's ship shape before committing to something. Both archetypes and multi-payload enums have a lot of complex cases, and while I did my best to understand the current implementation, there is not a lot of documentation on them, so it's not unlikely my understanding is incorrect or incomplete in places.

Overview

The byte code leverages and adds onto the current type layout infrastructure (see lib/IRGen/TypeLayout.cpp). The type layout infrastructure has several kinds of types that we need to handle:

  • Scalars (non-composite types; mostly ints and references)
  • Single enums (enums with only one payload case)
  • Multi enums (enums with two or more payload cases)
  • AlignedTypes (structs, and also enum cases that have multiple fields)
  • Archetypes (the T in struct Foo<T> {} where we don't know what T is until runtime)
  • ResilientTypes, for which we need enough information to call their value witness functions
VALUE := SCALAR | ALIGNED_GROUP | SINGLE_ENUM | MULTI_ENUM | ARCHETYPE | RESILIENT

Most of the mnemonics below aren't there for any particular reason other than being easy for me to remember and debug, so I'm not particularly attached to them if anyone has alternate suggestions.

General Design

As far as the overall design goes, types are encoded roughly in the order they're defined in the source file. I've leaned towards length-prefixing the byte code of composite types. This allows the implementation to jump to the chunk of the string it needs (e.g. once it figures out which enum case it needs to free), and to recurse by passing itself a pointer into the byte string along with a length, without iterating over the string scanning for a closing brace. It also avoids having to copy the passed-in byte string: we can work with substrings of it by passing around a start pointer and a length.

For said length encodings, I've typically used 32 bits in big endian (no qualms against little endian; big endian was just easier to read when dumping memory). IRGen uses uint32_ts/unsigneds for most things (enum cases, number of struct fields), but I think there's a case to be made for just using 8 or 16 bits for lengths. The vast majority of the time, the top 3 bytes will be empty, and we could fall back to the existing non-runtime implementation if the representation can't handle a value (65535 enum cases should be enough for anyone, right?). One alternative approach would be to default to a single byte, then reserve 0xFF to indicate a larger amount:

LENGTH := UInt32
// or
LENGTH := UInt8 | '0xFF' UInt32

I haven't really attempted to do any sort of extensive bit hacking to really trim down the size, so suggestions there are well appreciated.
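To sketch the escape-based variant (assuming the wide form stays big-endian, as above), decoding a length might look like this; it returns the decoded value plus the number of bytes consumed:

// Sketch: decode a length that is one byte in the common case, with 0xFF
// escaping to a 4-byte big-endian UInt32.
func readLength(_ bytes: [UInt8], at offset: Int) -> (length: UInt32, consumed: Int) {
    let first = bytes[offset]
    if first != 0xFF {
        return (UInt32(first), 1)
    }
    var value: UInt32 = 0
    for i in 1...4 {
        value = (value << 8) | UInt32(bytes[offset + i])
    }
    return (value, 5)
}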

Scalars

SCALAR := 'c'|'s'|'l'|'L'|'r'|'N'|'n'|'W'|'u'|'w'|'b'|'B'|'o'|'f'|'x'

For our purposes, scalars are types that are not composed of other types. In Swift, this is mostly POD types such as ints, plus reference types. For a scalar, we need to know its size and how it should be copied and released.

We use the POD codes for types that are just plain data. These need to be copied, but never released or retained. We also need to track them to figure out addresses and alignments in their parent types.

I8 = 'c',   // char
I16 = 's',  // short
I32 = 'l',  // long
I64 = 'L',  // long long

(perhaps b[yte],w[ord],d[ouble],q[uad word] would also be appropriate?)

We also have reference types. While they are all pointer-sized (64 bits here), we need to differentiate between them because they are retained and released in different ways.

ErrorReference = 'r'
NativeStrongReference = 'N'
NativeUnownedReference = 'n'
NativeWeakReference = 'W'
UnknownUnownedReference = 'u'
UnknownWeakReference = 'w'
BlockReference = 'b'
BridgeReference = 'B'
ObjCReference = 'o'
ExistentialReference = 'x',

Closures, a.k.a. thick functions, are 128 bits: the first 64 bits are a function pointer, and the second 64 bits are an optional reference-counted pointer to the closure's captured data. I tried to lower these to an aligned group instead, but I wasn't able to do so without causing a bunch of crashes, so they remain here for now. They could likely be removed in a complete implementation.

ThickFunc = 'f',
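For reference, the two-word shape that this opcode covers would look something like the following sketch (the field names are mine):

// Sketch of the thick function representation described above.
struct ThickFunction {
    var function: UnsafeRawPointer  // the function pointer half
    var context: UnsafeRawPointer?  // optional refcounted capture context
}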

Aligned Groups

Structs (and enum cases with multiple fields) are expressed as a group of values that each have a required alignment.

For structs, we're mainly interested in having enough information to determine the offset of each field, which we can derive from the alignments, using each nested byte string to determine the corresponding field's size.

   SIZE := UInt32
   // a numFields (alignment,fieldLength,field)+
   ALIGNED_GROUP := 'a' SIZE (ALIGNMENT SIZE VALUE)+
   ALIGNMENT := '1'|'2'|'4'|'8'

The alignment attached to each field indicates the number of bytes the field should be aligned to.

Note that because the alignment of an aligned group is the largest alignment of its subfields, aligned groups need to nest and sadly cannot be flattened.

For example:

   struct {
      i: UInt8
      struct {
         j: UInt8
         n: SomeClass
      }
   }

Will 64-bit align j, while

   struct {
      i: UInt8
      j: UInt8
      n: SomeClass
   }

Will 8-bit align j.

For an example of the layout, given the struct

   struct {
      i: UInt8
      struct {
         j: UInt8
         n: SomeClass
      }
   }

We produce the type layout:
a<0x00000002>1<0x00000001>c8<0x00000011>a<0x00000002>1<0x00000001>c8<0x00000001>N
// Which is

a                // Aligned group
<0x00000002>     // 2 fields in the group
    1            // Field is byte aligned
    <0x00000001> // Byte string length is 1
    c            // Byte string for UInt8 (I8)

    8            // Field is 8 byte aligned
    <0x00000011> // Byte string is 17 bytes long
    a            // We have a nested aligned group
    <0x00000002>
        1<0x00000001>c  // byte-aligned UInt8
        8<0x00000001>N  // 8-byte-aligned native reference
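To make the offset computation concrete, here's a minimal sketch of an interpreter walking this encoding. It's Swift for readability rather than actual runtime code, and it only handles the POD scalars, native references, and nested groups used in the examples above; enums, archetypes, and resilient types are elided. It computes sizes, which is the same walk a destroy implementation would do to find each field's offset:

func readUInt32(_ bytes: [UInt8], _ i: Int) -> Int {
    (Int(bytes[i]) << 24) | (Int(bytes[i + 1]) << 16) |
        (Int(bytes[i + 2]) << 8) | Int(bytes[i + 3])
}

func alignUp(_ offset: Int, to alignment: Int) -> Int {
    (offset + alignment - 1) & ~(alignment - 1)
}

// Size in bytes of the value a layout string describes.
func sizeOf(_ layout: ArraySlice<UInt8>) -> Int {
    let bytes = Array(layout) // rebase indices at 0
    switch Character(UnicodeScalar(bytes[0])) {
    case "c": return 1
    case "s": return 2
    case "l": return 4
    case "L", "N": return 8 // other reference kinds elided for brevity
    case "a":
        let numFields = readUInt32(bytes, 1)
        var cursor = 5 // past 'a' and the field count
        var offset = 0
        for _ in 0..<numFields {
            let alignment = Int(Character(UnicodeScalar(bytes[cursor])).wholeNumberValue!)
            let fieldLen = readUInt32(bytes, cursor + 1)
            let field = bytes[(cursor + 5)..<(cursor + 5 + fieldLen)]
            // The align-up rule from above: pad to the field's alignment,
            // then advance by the field's own size.
            offset = alignUp(offset, to: alignment) + sizeOf(field)
            cursor += 5 + fieldLen
        }
        return offset
    default:
        fatalError("opcode not handled in this sketch")
    }
}

Running sizeOf over the example string above yields 24: 1 byte for i, 7 bytes of padding to 8-byte align the nested group, then 16 bytes for the nested group itself.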

Single Payload Enums

Single-payload enums are encoded as:

   SIZE := UInt32

   // e numEmptyPayloads lengthOfPayload payload
   SINGLE_ENUM := 'e' SIZE SIZE VALUE

For single-payload enums we need enough information to determine the overall size of the enum, where its extra inhabitants are (if any), and how to release/retain it. Even though we don't need to release or retain the no-payload cases, we still need to know how many there are, because we may have to add extra tag bytes to the enum if there aren't enough extra inhabitants to represent them all.

For example, the following enum would produce:

// A single enum with 3 no payload cases, a payload length of 1, and a payload
// of a single Native Pointer
   enum E {
     case a(c: SomeClass)
     case b
     case c
     case d
   }

e<0x00000003><0x00000001>N

e            // enum indicator
<0x00000003> // 3 No payload cases
<0x00000001> // Payload bytestring length of 1
N            // Payload is a single native reference
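A sketch of how destroy might consume this record, reusing readUInt32 from the aligned group sketch above. The tag computation itself is abstracted behind the tagOf parameter, a stand-in for the runtime's single-payload tag logic, and I'm assuming the convention that tag 0 denotes the payload case:

// Sketch: destroy for a single-payload enum. No-payload cases hold nothing
// to destroy, so we only recurse into the payload's byte string.
func destroySinglePayloadEnum(
    _ layout: [UInt8],
    _ addr: UnsafeMutableRawPointer,
    tagOf: (UnsafeMutableRawPointer, Int) -> Int,
    destroyPayload: (ArraySlice<UInt8>, UnsafeMutableRawPointer) -> Void
) {
    let numEmptyCases = readUInt32(layout, 1)
    let payloadLen = readUInt32(layout, 5)
    let payload = layout[9..<(9 + payloadLen)]
    // Assumed convention: tag 0 is the payload case, 1...numEmptyCases
    // are the no-payload cases.
    if tagOf(addr, numEmptyCases) == 0 {
        destroyPayload(payload, addr)
    }
}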

Multi Payload Enums

   // E numEmptyPayloads numPayloads lengthOfEachPayload payloads
   MULTI_ENUM := 'E' SIZE SIZE SIZE+ VALUE+

For multi-payload enums we need enough information to determine the overall size of the enum from each payload, and how to release/retain each payload. We can use the nested payload byte strings to determine the largest payload (which gives the overall size of the enum), as well as the location of each payload's spare bits, so we can determine whether we need to add extra tag bytes.

For example with the following enum:

   struct MyStruct {
     let a: SomeClass
     let b: SomeClass
   }
   enum E {
     case a(c: SomeClass)
     case b(c: MyStruct)
     case c
     case d
     case e
   }

// A Multi enum with 3 no payload cases, two payloads, one of a struct, the
// other of just a Native pointer
'E<0x00000003><0x00000002><0x00000001><0x00000011>Na<0x00000002>8<0x00000001>N8<0x00000001>N'

E                // Multi enum indicator
<0x00000003>     // 3 no payload cases
<0x00000002>     // 2 Payload Cases
    <0x00000001> // Payload 1 is len 1
    <0x00000011> // Payload 2 has len 17

    N            // SomeClass is a Native Reference    

    a<0x00000002>8<0x00000001>N8<0x00000001>N // Struct becomes an aligned group
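One thing the length prefixes buy us here, as mentioned in the design section: once the payload tag is known (spare-bit extraction is elided, since it's the part I'm least sure how to encode), we can jump directly to the chosen payload's byte string. A sketch, reusing readUInt32 from earlier:

// Returns the layout substring for payload case `tag` (0-based among the
// payload cases), skipping the length table and earlier payloads.
func payloadLayout(_ layout: [UInt8], tag: Int) -> ArraySlice<UInt8> {
    let numPayloads = readUInt32(layout, 5) // past 'E' and numEmptyPayloads
    var start = 9 + 4 * numPayloads         // first byte after the length table
    for i in 0..<tag {
        start += readUInt32(layout, 9 + 4 * i)
    }
    return layout[start..<(start + readUInt32(layout, 9 + 4 * tag))]
}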

Archetypes

Archetypes are the type variables of generic types. For example, in the struct

   struct Example<T> { a: T}

T is the archetype. We don't know T's size statically unless it gets specialized, but we can find it out at runtime through the passed-in type metadata pointer.

INDEX := UInt32
ARCHETYPE := 'A' INDEX

The associated index is the index into the generic argument vector of the passed-in type. From there, we can get the archetype's type metadata, which gives us the size, alignment, and extra inhabitants we need for copying and destroying.

Archetypes in particular I'm less certain about. GenericTypeParamDecl contains a depth in addition to an index, and I'm not sure whether there's a case in which I need the depth as well. How archetypes get lowered in the current IRGen is somewhat opaque (pun somewhat intended).
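Under the simplest reading (a flat vector of metadata pointers at some offset into the parent's metadata), the lookup might look like the sketch below. genericArgsOffset is a placeholder, since where the generic arguments actually live depends on the metadata kind:

// Speculative sketch: fetch the metadata for archetype `index` out of the
// parent type's generic argument vector.
func archetypeMetadata(_ metadata: UnsafeRawPointer,
                       index: Int,
                       genericArgsOffset: Int) -> UnsafeRawPointer {
    metadata.load(
        fromByteOffset: genericArgsOffset + index * MemoryLayout<UnsafeRawPointer>.size,
        as: UnsafeRawPointer.self)
}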

Resilient Types

For resilient types, we need a way to know how to call their value witnesses. Given their mangled name, we can figure that out using the runtime function swift_getTypeByMangledName.

// 'R' mangledNameLength mangledName
RESILIENT := 'R' UInt32 mangledName
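Decoding the record itself is straightforward; the metadata lookup is elided in the sketch below, since swift_getTypeByMangledName takes additional context arguments I won't reproduce here:

// Sketch: slice the mangled name out of an 'R' record (readUInt32 as
// sketched earlier). The result would be handed to the runtime's
// mangled-name lookup to recover metadata and its value witness table.
func resilientMangledName(_ layout: [UInt8]) -> String {
    let nameLen = readUInt32(layout, 1)
    return String(decoding: layout[5..<(5 + nameLen)], as: UTF8.self)
}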
12 Likes

Nice! Great to see this work progress!

1 Like

This is great! There's a lot of potential here to improve code size and runtime performance for generic code. A few comments:

  • I think your bytecode design preserves more structure than it needs to—to be able to perform value witness operations, we don't generally need to preserve the precise structure of the type, we only need to know the offsets of where nontrivial parts of the value are that need retaining, releasing, or other custom treatment, as well as where enum tags are that conditionalize the presence of nontrivial parts. Value witnesses can assume they work on well-aligned values, so they shouldn't need to worry about incrementally reconstructing the layout of the type at a field-by-field level; we only need to know the offsets of the nontrivial parts. A more abstract representation should allow for a shorter bytecode representation, improve the performance of interpreting bytecode strings, allow for more opportunities for sharing layouts between types with the same effective layout, and hopefully need less expansion in the future, as it's unlikely we'll add many more kinds of retain/release, and a catch all "use a value witness table here" escape hatch ought to cover new cases in the future.

  • Instead of trying to represent generic parameters and resilient types indirectly, have you tried instantiating the layout bytecode for generic and resilient types when their metadata is instantiated? Another benefit of a bytecode representation is that we can generate new ones at runtime, and one of the big performance sinks for unspecialized code is deeply-nested unspecialized value witness calls through many layers of generic or resilient type expansion. We could instead flatten all of that indirection at metadata instantiation time.

  • A type layout computation is also useful for a bunch of compile-time code size optimizations, since we can use the type layout as the uniquing key for the specialized value witness tables we do generate, allowing us to more effectively share value witness tables among same-layout types. We currently special-case a handful of common situations where value witness tables can be shared, but we can be much more systematic about it with a compile-time type layout representation mirroring the runtime bytecode representation. We could also potentially cache generated value witness tables in a lookup table for the runtime, so that when it instantiates type metadata, it can pick up a specialized value witness table the compiler generated that matches the layout of the type it just instantiated.

7 Likes

I think this is a very desirable long term goal. I think the need for a template representation that references the generic parameters/resilient types remains in that world. An initial implementation does not need to be able to instantiate byte code at runtime and should be extendable to do so in the future.

For runtime allocation reasons you might not always want to allocate specialized witness tables/byte strings. I think the generic implementation will not be throw-away work, and is a good start to get this off the ground. What do you think?

Having a template representation would allow for an easier shared instantiation function to be implemented, but I think you could still construct a value witness bytecode string from components during type layout even if you didn't have it.

OTOH, we already do a significant amount of allocation during metadata instantiation, including allocating the value witness table, and including a string that should be fairly small in the common case may not add that much additional overhead. Instantiating the layout string gives us an opportunity in some circumstances to avoid allocating a new value witness table if there's one to share. I agree we can do it later, but I think the performance benefits of instantiating the witness table have a good chance of making it always worth doing once we can.

1 Like

This effort seems like it could form a good basis for eventually supporting static structural reflection (since that will need a way to represent the layout of a type in a binary), which is why I disagree with @Joe_Groff's point here:

It may make sense to do a quick design pass of that to make sure we don't end up with two similar-but-incompatible things.

We already have enough reflection information to recover the structure of types, such as what fields they have. Value witnesses could in theory just interpret that, but it would be very slow, because the information is too detailed and not really designed for the purpose. That reflection metadata is also designed to be optional, and by its nature, it cannot be shared among types that would otherwise have the same layout from a value witness perspective. The reason to design a bytecode is to make something more efficient and sharable between same-layout types.

3 Likes

@Joe_Groff:
I think your bytecode design preserves more structure than it needs to—to be able to perform value witness operations, we don't generally need to preserve the precise structure of the type, we only need to know the offsets of where nontrivial parts of the value are that need retaining, releasing, or other custom treatment, as well as where enum tags are that conditionalize the presence of nontrivial parts.

Agreed that it ought to have as minimal structure as possible. Did you have something specific in mind? Else some possibilities:

  • Structs really ought to be a list/mapping of [offset, referenceType] pairs. That way we can drop encoding PODs altogether. This should also allow us to flatten nested structs in many cases.
  • Enums could be a mapping of [index -> [offset, referenceType] pairs] plus some spare bit information. In cases where the payload is a single nullable reference, or we know the spare bits are in a place that doesn't conflict with retains/releases, we can do a bit better or elide the enum altogether. I can't think of a way to completely flatten the enum structure -- we'd need some hierarchical information about spare bits for each nested enum.

The difficulty in my mind is representing spare bits, or having enough information to derive their location. There's not a lot of documentation that I can find on the precise layout algorithm for spare bits, so I'm a little uncertain about what kinds of patterns and guarantees they have, as well as what information we need to compute/represent them.

Are we able to represent spare bits as an offset/mask/numBits triple? Are spare bits guaranteed to be grouped together in an at most 32-bit stretch?

re: runtime instantiation: I think that would be a useful long term goal that once the base machinery is added and working, would be a useful following step.

re: witness table sharing/caching: This piggybacks off of the type layout work @Arnold did a while back, which to my understanding aimed to at least partially tackle some of those situations, as it already does some uniquing and caching. Right now the bytecode is mostly a serialization of the existing type layout generation. This is definitely an opportunity to take it further.

Hmm interesting, yes, I had not considered that you could just chain together (and then further optimize) the individual byte code strings for the type layout fields (type layout as in TypeLayout metadata that metadata instantiation uses) -- which I think you are alluding to.

We only need spare bit information insofar as it is used to support the single-payload value witnesses/extra inhabitants. The max number of extra inhabitants is capped at 0x7FFFFFFF.

Extra-inhabitants can come from:

  • Spare bits of a value, e.g. Bool has 7 spare bits because its storage size is bigger than the bits required to represent the value (see IRGenModule::getSpareBitsForType).
  • Spare bits because of padding in structs — the only way we get to use those spare bits is when the struct is a payload of a multi-payload enum. Multi-payload enums can only use spare bits if they are fixed size in all resilience domains. For example:
% cat t.swift
struct Inner {
    var u8: UInt8 = 0
    var u16: UInt16 = 0
}

struct Outer {
    var a = Inner()
    var b = Inner()
}

enum MultiPayload {
    case a(Outer)
    case b(Outer)
    case c(Outer)
    case d
}
% swiftc -Xllvm -debug-only="enum-layout" t.swift -c
Layout for enum MultiPayload:
  no-payload mask:      11111111_11111111_11111111_11111111_11111111_11111111_11111111_11111111
  spare bits mask:      00000000_00000000_00111111_00000000_00000000_00000000_11111111_00000000
  no-payload case d:    00000000_00000000_11000000_00000000_00000000_00000000_00000000_00000000
  payload tag bits:     00000000_00000000_11000000_00000000_00000000_00000000_00000000_00000000
  • Special treatment of pointers and the like: types that have known disallowed bit patterns.

A struct can only use the extra-inhabitants of one of its fields.

num-extra-inhabitants = max(f1, ... fn) // if there is a tie select the first field that is a max
struct A {
    var f1 ...
    var fn
}

Yeah, for non-enum types, I think that's exactly what we want—either a list of [offset,operation] pairs, or maybe a sequence of [skip N bytes] [operation] [skip N bytes] [operation] opcodes, where the operation opcodes imply advancing N bytes. That would make it so that types with their reference fields front-loaded would be able to just encode [operation] [operation] and then skip the POD tail.
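Concretely, such a stream for the struct S from the original post might be nothing more than the following (a sketch, with invented opcode names):

// Sketch of the flattened encoding for the original post's
// struct S { let a: SomeClass; let b: UInt32; let c: SomeOtherClass }:
enum LayoutOp {
    case destroyNative // release the reference at the cursor, advance 8
    case skip(Int)     // advance the cursor N bytes with no work
}
// a at offset 0, then skip b plus its padding, then c at offset 16.
// A trailing POD tail would need no opcodes at all.
let sDestroy: [LayoutOp] = [.destroyNative, .skip(8), .destroyNative]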

Enums are interesting. Here, too, I think we want to figure out the least specific encoding that allows for the value witness operation to be encoded accurately and efficiently, without having to precisely replicate the compiler's current implementation model for enums. If we can do that in a way that keeps normalizing the state machine decidable, so we can still compare type layouts for equality, that'd be a nice bonus too, but that's of course tricky to do in the face of conditional computation.

For extra inhabitants, we know they're only ever used to optimize single-payload enums, so I think the primary operation we need for them is, "if this is an extra inhabitant value, then skip N operations". It might be sufficient for the opcode to be "if the [8/16/32/64]-bit value at [offset] is [greater than/less than] [8/16/32-bit immediate], then skip N opcodes". You could then AND multiple conditions, such as a less-than and greater-than check for a range, or a test against a 64-bit immediate (which should be rare in practice), by doing multiple checks in sequence. Another optimization we could do here is make it so that the refcounting operations used by bytecoded witnesses ignore pointer extra inhabitant values, so that the common case of an Optional<ClassType>, possibly with trivial additional matter doesn't require any additional encoding.
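As a sketch, such a conditional opcode needs to carry only a handful of fields (names invented):

// "If the 8-bit value at `offset` compares greater than `immediate`,
// skip the next `opsToSkip` opcodes."
struct SkipIfGreater8 {
    var offset: Int
    var immediate: UInt8
    var opsToSkip: Int
}

func shouldSkip(_ op: SkipIfGreater8, _ value: UnsafeRawPointer) -> Bool {
    value.load(fromByteOffset: op.offset, as: UInt8.self) > op.immediate
}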

With spare bits, you potentially need more intricate branching, since you can have different sets of value witness operations for different payloads, and you need to be able to match an arbitrary number of bits at an arbitrary bit offset, instead of at byte resolution. However, we at least never use spare bits when there is dynamic layout, so generic parameters or resilient types never have spare bits. That means we should always know exactly where the spare bits are and how they are used at compile time. Possibly we could have some sort of switch opcode that takes a bit offset, an N-bit immediate, and a jump table of 2^N destinations for the corresponding value witness operations for each tag value.

I was also thinking about how you all could make incremental progress on bytecoded witnesses, and realize some of the gains, without us having to commit to a perfect finalized bytecode all at once. Within the framework of the ABI, I think it should be possible to deploy a value witness bytecode for a binary without any help from the OS runtime, by linking the bytecode interpreter statically into the binary, and emitting value witness tables that include a slot for a bytecode string and witness implementations that all call into the bytecode interpreter. That approach would mean that we lose some of the code size win from statically linking the interpreter, but hopefully it isn't that big to begin with, and it would allow us to iterate on the bytecode design and experiment with different approaches before committing to a format to integrate into the system runtime.

2 Likes

A bit of a late thought, but the compact format doesn’t necessarily have to implement everything. You could say multi-enums call a function that returns their current value-layout, rather than encoding the discrimination and masking of the multi-enum in the format. It’s not as tidy but it’s way simpler, and allows for more format changes in the future.

4 Likes