Generic Specialization Binary Size Cost

I've been measuring the Swift binary size of one of our mobile apps, and have been specifically investigating the costs of specializing generic functions. Approximately 5% of the app's Swift binary size consists of specializations of generic functions defined in the standard library, and another 2% is specializations of our own generic functions.

From reading the SIL pass, it doesn't seem to me that specialization behaves much differently under -Osize than under -O, aside from respecting @_semantics annotations.

For example, it seems like every concrete type that conforms to OptionSet and calls any of these @inlinable methods ends up generating a specialized version of each method. In a small example, I experimented with marking some of these methods @_semantics("optimize.sil.specialize.generic.size.never"), but incurred a small binary size regression, I believe as a result of a new lazy protocol witness table cache variable and accessor.
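
For concreteness, here's a minimal sketch of that experiment, marking a small generic function with the attribute so the optimizer won't create size-motivated specializations of it (combined(_:_:) is a hypothetical function, not stdlib code; the semantics string is the one the SIL pass checks):

@_semantics("optimize.sil.specialize.generic.size.never")
func combined<T: OptionSet>(_ a: T, _ b: T) -> T {
    // Stays generic: callers go through the unspecialized entry point
    // and a lazily cached protocol witness table, as shown below.
    return a.union(b)
}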

It seems to me that this cost could be reduced at scale, especially for stdlib generics. The cache variables themselves live in __bss, and as far as I've seen, the accessor functions all do the same thing for different conformances:

define linkonce_odr hidden ptr @"$s4main12MyConcreteOptionSetVACs9OptionSetAAWl"() local_unnamed_addr #3 {
entry:
  %0 = load ptr, ptr @"$s4main12MyConcreteOptionSetVACs9OptionSetAAWL", align 8
  %1 = icmp eq ptr %0, null
  br i1 %1, label %cacheIsNull, label %cont

cacheIsNull:                                      ; preds = %entry
  %2 = tail call ptr @swift_getWitnessTable(ptr nonnull @"$s4main12MyConcreteOptionSetVs9OptionSetAAMc", ptr nonnull getelementptr inbounds (<{ ptr, ptr, i64, ptr, i32, [4 x i8] }>, ptr @"$s4main12MyConcreteOptionSetVMf", i64 0, i32 2), ptr undef) #10
  store atomic ptr %2, ptr @"$s4main12MyConcreteOptionSetVACs9OptionSetAAWL" release, align 8
  br label %cont

cont:                                             ; preds = %cacheIsNull, %entry
  %3 = phi ptr [ %0, %entry ], [ %2, %cacheIsNull ]
  ret ptr %3
}

In particular, they check a cache variable, return its contents if non-null, and otherwise call swift_getWitnessTable and cache the result. I've seen the function merger consolidate some of these accessors, but perhaps we can do better. Even a single master lazy protocol witness table accessor, parameterized on the conformance and cache variable and shipped in the runtime, might work.
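
To make the idea concrete, here's a rough sketch in Swift of what a single shared accessor could look like. The real thing would live in the runtime and be written in C++; the function names, signatures, and the @_silgen_name shim are all hypothetical, and a real implementation would use an atomic release store as in the IR above:

// Hypothetical binding to the existing runtime entry point, for
// illustration only; @_silgen_name is not a supported stable interface.
@_silgen_name("swift_getWitnessTable")
func swift_getWitnessTable(
    _ conformance: UnsafeRawPointer,
    _ type: UnsafeRawPointer,
    _ instantiationArgs: UnsafeRawPointer?
) -> UnsafeRawPointer

// One shared accessor, parameterized over the cache variable and the
// conformance, replacing the per-conformance copies shown above.
func sharedLazyWitnessTableAccessor(
    _ cache: UnsafeMutablePointer<UnsafeRawPointer?>,
    _ conformance: UnsafeRawPointer,
    _ type: UnsafeRawPointer
) -> UnsafeRawPointer {
    if let cached = cache.pointee {
        return cached                 // fast path: cache already filled
    }
    let table = swift_getWitnessTable(conformance, type, nil)
    cache.pointee = table             // real code: atomic store, release ordering
    return table
}

Each per-conformance accessor would then reduce to a tail call passing its own cache variable and conformance descriptor, or disappear entirely if callers invoke the shared accessor directly.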

Does this seem like a reasonable direction? I've been looking at pretty contrived examples, so perhaps I'm missing the big picture.

I assume this will come at some performance cost, but I'm not sure we're too concerned, and we could even selectively specialize only hot functions.
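
For what it's worth, there's already a per-declaration knob in that direction: @_specialize lets you request specialization for particular concrete types while everything else stays on the generic path. A hypothetical example:

// Ask for a specialized entry point only for the hot RawValue type;
// other instantiations use the unspecialized generic code.
@_specialize(where T == UInt8)
func lowestSetOption<T: FixedWidthInteger>(_ bits: T) -> T {
    return bits & (0 &- bits)   // isolate the lowest set bit
}

This doesn't change the default specialization heuristics, but it's an existing mechanism for the "only hot functions" direction.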

cc: @Erik_Eckstein @Joe_Groff


Introducing a common implementation function for boilerplatey accessors is a good idea, since these metadata accessors are on already-slow paths. It's something we do in IRGen for similar cases, where we emit a thunk function to reduce repetitive code patterns.

I'm sure you have other examples, but at least in the case you linked to for OptionSet.union and related methods, it seems like the bigger problem for code size is that those methods aren't always getting inlined, since the net effect in the caller should be almost zero compared to invoking the underlying mutating method once inlining and specialization in context occur. All they're doing is copying the value (in a weird way; I'm not sure why it's var r = Self(rawValue: self.rawValue) instead of just var r = self), applying a mutating method, and returning it.
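
For reference, the methods in question read approximately like this (paraphrasing the stdlib code being discussed; shown for illustration, not meant to be compiled alongside the real conformance):

extension OptionSet {
    @inlinable
    public func union(_ other: Self) -> Self {
        var r = Self(rawValue: self.rawValue)  // the copy in question
        r.formUnion(other)
        return r
    }
}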


Introducing a common implementation function for boilerplatey accessors is a good idea

OK, I'll see what I can figure out, thanks. If you have the time and know of a code pointer to get me started, that'd be appreciated.

In this case, it does look like failing to inline the specializations is the problem. In fact, this small test program compiles to a smaller binary under -O than under -Osize, at least partially due to inlining differences.

public struct ConcreteOptionSet: OptionSet {
    public let rawValue: UInt8
    public init(rawValue: UInt8) {
        self.rawValue = rawValue
    }
    public static let optionA = ConcreteOptionSet(rawValue: 1 << 0)
    public static let optionB = ConcreteOptionSet(rawValue: 1 << 1)
}

let options1: ConcreteOptionSet = .optionA
let options2: ConcreteOptionSet = .optionB
let mergedOptions = options1.union(options2)
print(mergedOptions.rawValue) // keep the union result alive
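
(For reference, the symbol lists below were produced with something like the following; the exact flags and file names are my reconstruction:)

swiftc -Osize main.swift -o main-size
swiftc -O     main.swift -o main-perf
nm -j main-size | sort > size-symbols.txt
nm -j main-perf | sort > perf-symbols.txt
diff size-symbols.txt perf-symbols.txt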

Comparing the symbols, there's all sorts of specializations that -O manages to inline away, but -Osize can't:

~/scratch/generics-investigation/option-set-repro diff size-symbols.txt perf-symbols.txt
68,76c68
< _$sSa13_adoptStorage_5countSayxG_SpyxGts016_ContiguousArrayB0CyxGn_SitFZyp_Tgm5
< _$ss10SetAlgebraPs7ElementQz012ArrayLiteralC0RtzrlE05arrayE0xAFd_tcfC4main014ConcreteOptionA0V_Tgq5
< _$ss10SetAlgebraPsE10isDisjoint4withSbx_tF4main014ConcreteOptionA0V_Tgq5
< _$ss10SetAlgebraPsE10isSuperset2ofSbx_tF4main014ConcreteOptionA0V_Tgq5
< _$ss10SetAlgebraPsE11subtractingyxxF4main014ConcreteOptionA0V_Tgq5
< _$ss10SetAlgebraPsE7isEmptySbvg4main014ConcreteOptionA0V_Tgq5
< _$ss10SetAlgebraPsE8isSubset2ofSbx_tF4main014ConcreteOptionA0V_Tgq5
< _$ss10SetAlgebraPsE8subtractyyxF4main014ConcreteOptionA0V_Tgq5
< _$ss12_ArrayBufferV7_buffer19shiftedToStartIndexAByxGs011_ContiguousaB0VyxG_SitcfCyp_Tgm5
---
> _$ss10SetAlgebraPs7ElementQz012ArrayLiteralC0RtzrlE05arrayE0xAFd_tcfC4main014ConcreteOptionA0V_Tgq5Tf4gd_n
78,90d69
< _$ss2eeoiySbx_xtSYRzSQ8RawValueRpzlF4main17ConcreteOptionSetV_Tgq5
< _$ss9OptionSetPs7ElementQzRszrlE6insertySb8inserted_x17memberAfterInserttxF4main08ConcreteaB0V_Tgq5
< _$ss9OptionSetPs7ElementQzRszrlE6removeyxSgxF4main08ConcreteaB0V_Tgq5
< _$ss9OptionSetPs7ElementQzRszrlE6update4withxSgx_tF4main08ConcreteaB0V_Tgq5
< _$ss9OptionSetPsE12intersectionyxxF4main08ConcreteaB0V_Tgq5
< _$ss9OptionSetPsE19symmetricDifferenceyxxF4main08ConcreteaB0V_Tgq5
< _$ss9OptionSetPsE5unionyxxF4main08ConcreteaB0V_Tgq5
< _$ss9OptionSetPss17FixedWidthInteger8RawValueRpzrlE16formIntersectionyyxF4main08ConcreteaB0V_Tgq5
< _$ss9OptionSetPss17FixedWidthInteger8RawValueRpzrlE23formSymmetricDifferenceyyxF4main08ConcreteaB0V_Tgq5
< _$ss9OptionSetPss17FixedWidthInteger8RawValueRpzrlE9formUnionyyxF4main08ConcreteaB0V_Tgq5
< _$ss9OptionSetPss17FixedWidthInteger8RawValueRpzrlExycfC4main08ConcreteaB0V_Tgq5
< _OUTLINED_FUNCTION_0
< _OUTLINED_FUNCTION_1

Even if inlining is the solution here, I wonder whether we can apply it as a solution generally. Inlining only really works for size in this case because the specialized function is small. It feels like there are still likely to be cases where inlining can't save you from overeager specialization, particularly since the generic implementation is in the stdlib but the user application has to pay for the specialized version.

Again, if you could point me to where SIL inlining happens, that'd be great.

in a weird way, I'm not sure why it's var r = Self(rawValue: self.rawValue) instead of just var r = self

The former looks to me like an explicit deep copy, whereas the latter looks like it could be a shallow copy if Self is a reference type. I tried changing the stdlib, though, and they behaved the same, so I'm not sure either 🤷‍♂️
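
A minimal illustration of the distinction being drawn, independent of OptionSet (Box is a made-up class):

final class Box {
    var raw: UInt8 = 0
}

let original = Box()
let copy = original    // "var r = self" style: copies the reference only
copy.raw |= 0b01
print(original.raw)    // 1 — the shallow copy aliased the original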

I totally agree. There unfortunately isn't any one magic solution, because the tradeoffs of inlining vs. specialization vs. sharing one unspecialized implementation are so situational. There are a lot of different opportunities for improving generic code size. The way we currently generate unspecialized functions unfortunately tends to lead to relatively huge code, as well as the side effect of additional metadata getting forced to remain in the program. That tends to push the overall balance toward eating the cost of the specializations more often than I'd like.

One of my blue-sky ideas for a project would be to adopt a bytecode for unspecialized code, since it's going to be slow no matter what, so it should favor compactness. On a smaller scale, looking for opportunities to eliminate redundant code generation like you're doing seems worthwhile. You can look at irgen::emitGenericTypeMetadataAccessFunction as an example of somewhere we emit helper thunks to reduce duplication in the final machine code for type metadata accessors.

That was the only justification I could come up with in my head too, but it doesn't seem particularly compelling in the case of OptionSet, where pretty much every conformer in practice is going to be a wrapper around an integer type used as a bitfield. The folks on the standard library team I asked couldn't remember why it was written this way either. If it doesn't make a difference in the compiler's behavior, I guess we can leave it alone, though.


In -Osize the compiler doesn't inline into thunks: swift/lib/SILOptimizer/Transforms/PerformanceInliner.cpp at d6871edc839adec2fad42412d33de80023e6ee28 · apple/swift · GitHub

There was probably a good reason that we added this restriction at the time, but it doesn't help in this case. When inlining into thunks is enabled, the -Osize result is smaller than the -O result.


IIRC, the intent of that rule was to prevent methods with multiple entry-point thunks from getting duplicated wholesale, like if you have a public method that overrides a superclass method and fulfills a protocol requirement, so you have the direct entry point, a thunk for the vtable, and a thunk for the protocol witness table. Maybe the criteria for when we apply that rule could be tuned to apply only to specific kinds of thunks?
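
In code, the situation described is something like this (hypothetical names): one method body reachable through a direct entry point, a vtable slot, and a protocol witness table entry, so inlining it into each thunk can duplicate the body three ways:

protocol Shape {
    func area() -> Double
}

class Base {
    func area() -> Double { return 0 }
}

class Circle: Base, Shape {
    let r: Double
    init(r: Double) { self.r = r }
    // One body, three entry points: direct calls, Base's vtable slot,
    // and the witness table thunk for Shape.area().
    override func area() -> Double { return .pi * r * r }
}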


I'll have to spend some time reading the implementation, but I think we've at least experimented with inlining heuristics that consider call-site counts. We'd maybe only be able to operate on functions with module-internal SIL linkage, but perhaps in single-call-site cases, the callee could be inlined into all callers, thunk or otherwise.


Yeah, if there's only a single call site for a non-externally-visible function, that seems like a stronger signal that the function ought to be inlined than most, if not all, of the heuristics against inlining.