I've been measuring the Swift binary size of one of our mobile apps, and have been specifically investigating the costs of specializing generic functions. Approximately 5% of the Swift binary size consists of specializations of generic functions defined in the standard library, and another 2% is specializations of our own generic functions.
In reading the SIL pass, it doesn't seem to me that specialization behaves much differently under -Osize vs. -O, aside from respecting @_semantics annotations.
For example, it seems like every concrete type conforming to OptionSet that calls any of these @inlinable methods ends up generating a specialized version of each method. In a small example I experimented with marking some of these methods with @_semantics("optimize.sil.specialize.generic.size.never"), but incurred a small binary size regression, I believe as a result of a new lazy protocol witness table cache variable and accessor.
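For reference, this is roughly the shape of the small examples I've been testing. The Flags type and the merged helper below are my own stand-ins, not the app's real code; the generic helper is marked with the attribute to keep the optimizer from emitting a size-increasing specialization for it:

```swift
// A minimal concrete OptionSet, standing in for the real types.
struct Flags: OptionSet {
    let rawValue: UInt8
    static let a = Flags(rawValue: 1 << 0)
    static let b = Flags(rawValue: 1 << 1)
}

// Hypothetical generic helper; the attribute asks the optimizer not to
// create a per-type specialization for binary-size reasons.
@_semantics("optimize.sil.specialize.generic.size.never")
func merged<T: OptionSet>(_ x: T, _ y: T) -> T {
    return x.union(y)
}

let f = merged(Flags.a, Flags.b)
assert(f.contains(.a) && f.contains(.b))
```

With the attribute present, the call goes through the unspecialized generic entry point, which is where the lazy witness table cache variable and accessor come from.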
It seems to me that this cost could be minimized at scale, especially for stdlib generics. The cache variables themselves live in __bss, and as far as I've seen, the accessor functions all do the same thing for different conformances:
define linkonce_odr hidden ptr @"$s4main12MyConcreteOptionSetVACs9OptionSetAAWl"() local_unnamed_addr #3 {
entry:
  %0 = load ptr, ptr @"$s4main12MyConcreteOptionSetVACs9OptionSetAAWL", align 8
  %1 = icmp eq ptr %0, null
  br i1 %1, label %cacheIsNull, label %cont

cacheIsNull:                                      ; preds = %entry
  %2 = tail call ptr @swift_getWitnessTable(ptr nonnull @"$s4main12MyConcreteOptionSetVs9OptionSetAAMc", ptr nonnull getelementptr inbounds (<{ ptr, ptr, i64, ptr, i32, [4 x i8] }>, ptr @"$s4main12MyConcreteOptionSetVMf", i64 0, i32 2), ptr undef) #10
  store atomic ptr %2, ptr @"$s4main12MyConcreteOptionSetVACs9OptionSetAAWL" release, align 8
  br label %cont

cont:                                             ; preds = %cacheIsNull, %entry
  %3 = phi ptr [ %0, %entry ], [ %2, %cacheIsNull ]
  ret ptr %3
}
In particular, each one loads a cache variable, returns the cached pointer if it's non-null, and otherwise calls swift_getWitnessTable and stores the result. I've seen the function merger consolidate some of these accessors, but perhaps we can do better. Even a single master lazy protocol witness table accessor, parameterized on the conformance and cache variable and stored in the runtime, might work.
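To sketch the shape of what I mean: the per-conformance accessor bodies above differ only in which globals they reference, so they could collapse into one shared routine that takes those globals as parameters. This is Swift pseudocode for illustration only; the real entry point would live in the runtime (C++) and use an atomic release store, and every name here is my own invention:

```swift
// Hypothetical shared lazy witness-table accessor, parameterized on the
// conformance descriptor, the type metadata, and the per-conformance
// cache slot. All names are illustrative, not real runtime API.
typealias WitnessTable = UnsafeRawPointer
typealias ConformanceDescriptor = UnsafeRawPointer

func sharedWitnessTableAccessor(
    cache: UnsafeMutablePointer<WitnessTable?>,
    conformance: ConformanceDescriptor,
    type: UnsafeRawPointer,
    slowPath: (ConformanceDescriptor, UnsafeRawPointer) -> WitnessTable
) -> WitnessTable {
    // Fast path: the witness table was already instantiated and cached.
    if let cached = cache.pointee {
        return cached
    }
    // Slow path: instantiate (swift_getWitnessTable in the real runtime)
    // and publish the result. Real code would use a release-ordered
    // atomic store, matching the IR above.
    let table = slowPath(conformance, type)
    cache.pointee = table
    return table
}
```

Each conformance would then carry only its conformance descriptor and a cache slot in __bss, rather than its own linkonce_odr accessor body, leaving less for the function merger to clean up after the fact.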
Does this seem like a reasonable direction? I've been looking at pretty contrived examples, so perhaps I'm missing the big picture.
I assume this will come at some performance cost, but I'm not sure we're too concerned, and we could even selectively specialize only hot functions.