Definite initialization: accidentally or essentially quadratic?

consider code of the following form:

class C {
  var prop1: Int
  var prop2: Int
  var prop3: Int
  // ...
  var propN: Int
}

now suppose we implement the init method to assign default values to all the properties like so:

init() {
  prop1 = 0
  prop2 = 0
  prop3 = 0
  // ...
  propN = 0
}

if we let the number of properties grow, how does the compile time scale? here are some empirical results for examples of this form, doubling the property count at each step[1]:

| property count | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| avg compile time (s) | 0.17 | 0.17 | 0.18 | 0.18 | 0.19 | 0.21 | 0.26 | 0.41 | 1.2 | 23 | 272 |
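for reference, test files of this shape can be generated mechanically. here's a small script along those lines (this is just an illustrative generator i'm sketching here, not necessarily how the measured files were produced):

```python
# emit a Swift class with N stored properties, each assigned a
# default value in init(), matching the examples above
import sys

def generate(n: int) -> str:
    props = "\n".join(f"  var prop{i}: Int" for i in range(1, n + 1))
    assigns = "\n".join(f"    prop{i} = 0" for i in range(1, n + 1))
    return f"class C {{\n{props}\n\n  init() {{\n{assigns}\n  }}\n}}\n"

if __name__ == "__main__":
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 8
    sys.stdout.write(generate(n))
```

piping the output to a file and timing `swiftc` on it reproduces the table above.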

if we profile the compiler process while compiling one of these long-running examples, the trace shows that the vast majority of compilation time is spent within the getLivenessAtInst method. digging into the source reveals that this method resides in DefiniteInitialization.cpp, which (unsurprisingly) appears responsible for the 'definite initialization' logic that ensures all values are initialized before use.

in the case of a class initializer with many stored properties, the getLivenessAtInst method seems to perform a 'look back' that loops backward over all the instructions produced by SILGen within the initializer's 'basic block'. this inner loop doesn't appear to terminate until reaching the very beginning of the block. so in the limiting case, it makes sense that the compilation time would become dominated by this inner looping logic, since it presumably confers quadratic complexity in its current form.
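to make the suspected cost concrete, here's a toy model (my simplification, not the compiler's actual code) of what a per-assignment backward scan would cost: for each of the m assignments, we visit every earlier instruction in the block, so doubling the property count roughly quadruples the work:

```python
# toy cost model of the suspected behavior: each 'assign' triggers a
# backward scan over ALL earlier instructions in its basic block
def liveness_scan_cost(instructions):
    cost = 0
    for i, inst in enumerate(instructions):
        if inst == "assign":
            cost += i  # the backward scan visits every earlier instruction
    return cost

# each property contributes a handful of SILGen instructions, one of
# which is the assign (10 per property is an arbitrary stand-in here)
def block_for(num_props, insts_per_prop=10):
    block = []
    for _ in range(num_props):
        block.extend(["other"] * (insts_per_prop - 1) + ["assign"])
    return block

# doubling the property count roughly quadruples the modeled cost
ratio = liveness_scan_cost(block_for(1024)) / liveness_scan_cost(block_for(512))
```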

to test the theory that compilation time scales with the initializer's maximum basic block length, we can alter the initializer code to force additional basic blocks to be produced. one way i found to do this is to introduce a switch statement that has only a default case, e.g.

init() {
  switch () {
  default:
    prop1 = 0
  }
  switch () {
  default:
    prop2 = 0
  }
  // ...
  switch () {
  default:
    propN = 0
  }
}

applying this technique to the 2048-property example from above reduced the average compile time from ~4.5 minutes to ~1.8 seconds.

so this finally brings us to my (jumble of) questions:

  1. what is the backward scanning logic in getLivenessAtInst doing exactly?
  2. if that logic is necessary, could it be altered to improve the runtime performance?
  3. if it cannot be meaningfully improved, what's the best way to convince SILGen to emit new basic blocks so that the performance bottleneck can be worked around?
  4. what/where is the best documentation on the definite init algorithm? i tried searching the codebase for design docs or something that discuss it in more depth, but have yet to find anything super relevant-seeming.

  1. data was collected via `hyperfine -w 1 -r 4 'swiftc <file>'` using the Swift 5.10 compiler shipped with Xcode 15.4.


This seems like a classic dynamic programming problem. It should be straightforward to memoize the liveness of any value V at instruction I. But your stats show the juice likely isn’t worth the squeeze until you get to several hundred instance variables.

Given a storage location, we're trying to determine whether the location is stored to before being loaded from within this basic block. If this is not the case, then the value is "live" on entry to the block, meaning it must be known to be initialized. After computing this "local" information, we run an analysis on the control flow graph to determine whether the value is in fact initialized on entry to the block, regardless of which control flow path was taken to get there.

There are several ways to improve it, I think. The fundamental issue here is that instead of computing the liveness of all storage locations simultaneously, we're performing a separate dataflow analysis for each location. So even if the dataflow analysis requires no fixpoint iteration (so it's O(n) in the number of basic blocks) you're still going to get O(mn) run time where n is the number of blocks and m is the number of storage locations. This is completely unnecessary and could be eliminated with better book-keeping.
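As an illustration of what "all locations simultaneously" means, here is a toy model (my own sketch, not the compiler's data structures) that computes entry availability for every storage location at once with set operations, so each block is scanned once rather than once per location:

```python
# Toy model: for each basic block, compute the set of storage locations
# that are definitely initialized on entry, for ALL locations at once.
def local_gen(block):
    """Locations stored to somewhere in this block (loads ignored here)."""
    return {loc for op, loc in block if op == "store"}

def entry_availability(blocks, preds):
    """Forward dataflow: a location is available on entry to a block iff
    it is available on exit of EVERY predecessor (meet = intersection)."""
    all_locs = set()
    for block in blocks.values():
        all_locs |= local_gen(block)
    # optimistic initial state; block 0 is the entry, where nothing is set
    avail_in = {b: set(all_locs) for b in blocks}
    avail_in[0] = set()
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == 0 or not preds[b]:
                continue
            new = set.intersection(
                *(avail_in[p] | local_gen(blocks[p]) for p in preds[b])
            )
            if new != avail_in[b]:
                avail_in[b] = new
                changed = True
    return avail_in
```

With real bit vectors instead of Python sets, each meet is a handful of word-wide AND operations, so the per-location factor m drops out of the scan entirely.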

A smaller change would be to perhaps amend LiveOutBlockState to record the indices of the store instructions that appear in the block, instead of a single HasNonLoadUse boolean. After all, we scan each basic block to compute the LiveOutBlockState originally, and then we're scanning each basic block again, for each value.

As you observed, in the limiting case where each 'basic block' is a single instruction there's no bottleneck, so there's nothing inherent about this algorithm that forces a quadratic amount of computation. It's just an artifact of the implementation.

As far as I know, neither the implementation nor the user-visible behavior of definite initialization is documented anywhere except the source code. However, it is an instance of a dataflow analysis, which is a topic covered in any compiler textbook, such as *Engineering a Compiler*. Dataflow analyses are often used to eliminate redundant loads and stores, detect loads of uninitialized memory, and so on.


FWIW the numbers presented look much worse than quadratic, probably closer to O(x^4).


thank you (as usual) for the particularly informative reply! i'm going to try and 'talk through' some of the aspects you've touched on as i understand them to relate to the specific class of examples presented here.

the SILGen output for the init on an instance with 4 properties looks like this:

// C.init()
// Isolation: unspecified
sil hidden [ossa] @$s2di1CCACycfc : $@convention(method) (@owned C) -> @owned C {
// %0 "self"                                      // users: %2, %1
bb0(%0 : @owned $C):
  debug_value %0 : $C, let, name "self", argno 1  // id: %1
  %2 = mark_uninitialized [rootself] %0 : $C      // users: %44, %43, %33, %23, %13, %3
  %3 = begin_borrow %2 : $C                       // users: %12, %8
  %4 = integer_literal $Builtin.IntLiteral, 0     // user: %7
  %5 = metatype $@thin Int.Type                   // user: %7
  // function_ref Int.init(_builtinIntegerLiteral:)
  %6 = function_ref @$sSi22_builtinIntegerLiteralSiBI_tcfC : $@convention(method) (Builtin.IntLiteral, @thin Int.Type) -> Int // user: %7
  %7 = apply %6(%4, %5) : $@convention(method) (Builtin.IntLiteral, @thin Int.Type) -> Int // user: %10
  %8 = ref_element_addr %3 : $C, #C.prop1         // user: %9
  %9 = begin_access [modify] [dynamic] %8 : $*Int // users: %11, %10
  assign %7 to %9 : $*Int                         // id: %10
  end_access %9 : $*Int                           // id: %11
  end_borrow %3 : $C                              // id: %12
  %13 = begin_borrow %2 : $C                      // users: %22, %18
  %14 = integer_literal $Builtin.IntLiteral, 0    // user: %17
  %15 = metatype $@thin Int.Type                  // user: %17
  // function_ref Int.init(_builtinIntegerLiteral:)
  %16 = function_ref @$sSi22_builtinIntegerLiteralSiBI_tcfC : $@convention(method) (Builtin.IntLiteral, @thin Int.Type) -> Int // user: %17
  %17 = apply %16(%14, %15) : $@convention(method) (Builtin.IntLiteral, @thin Int.Type) -> Int // user: %20
  %18 = ref_element_addr %13 : $C, #C.prop2       // user: %19
  %19 = begin_access [modify] [dynamic] %18 : $*Int // users: %21, %20
  assign %17 to %19 : $*Int                       // id: %20
  end_access %19 : $*Int                          // id: %21
  end_borrow %13 : $C                             // id: %22
  %23 = begin_borrow %2 : $C                      // users: %32, %28
  %24 = integer_literal $Builtin.IntLiteral, 0    // user: %27
  %25 = metatype $@thin Int.Type                  // user: %27
  // function_ref Int.init(_builtinIntegerLiteral:)
  %26 = function_ref @$sSi22_builtinIntegerLiteralSiBI_tcfC : $@convention(method) (Builtin.IntLiteral, @thin Int.Type) -> Int // user: %27
  %27 = apply %26(%24, %25) : $@convention(method) (Builtin.IntLiteral, @thin Int.Type) -> Int // user: %30
  %28 = ref_element_addr %23 : $C, #C.prop3       // user: %29
  %29 = begin_access [modify] [dynamic] %28 : $*Int // users: %31, %30
  assign %27 to %29 : $*Int                       // id: %30
  end_access %29 : $*Int                          // id: %31
  end_borrow %23 : $C                             // id: %32
  %33 = begin_borrow %2 : $C                      // users: %42, %38
  %34 = integer_literal $Builtin.IntLiteral, 0    // user: %37
  %35 = metatype $@thin Int.Type                  // user: %37
  // function_ref Int.init(_builtinIntegerLiteral:)
  %36 = function_ref @$sSi22_builtinIntegerLiteralSiBI_tcfC : $@convention(method) (Builtin.IntLiteral, @thin Int.Type) -> Int // user: %37
  %37 = apply %36(%34, %35) : $@convention(method) (Builtin.IntLiteral, @thin Int.Type) -> Int // user: %40
  %38 = ref_element_addr %33 : $C, #C.prop4       // user: %39
  %39 = begin_access [modify] [dynamic] %38 : $*Int // users: %41, %40
  assign %37 to %39 : $*Int                       // id: %40
  end_access %39 : $*Int                          // id: %41
  end_borrow %33 : $C                             // id: %42
  %43 = copy_value %2 : $C                        // user: %45
  destroy_value %2 : $C                           // id: %44
  return %43 : $C                                 // id: %45
} // end sil function '$s2di1CCACycfc'

after adding some debug print statements into the DI logic, the control flow for processing this block appears to be something like:

  1. each assign instruction ends up within the handleStoreUse method
  2. the 'liveness' at that instruction is then computed
  3. the BB is scanned backward starting at the instruction under consideration looking for stores
  4. most instructions are ignored in this loop, as they are deemed 'unrelated' to the memory being accessed
  5. this process eventually terminates (without finding any stores) when it reaches the beginning of the BB
  6. the handleStoreUse method then considers the storage location for each of these instructions to be 'fully uninitialized', and determines that each memory access should be classified as an 'init'
  7. after all stored properties are handled, there are 2 additional calls to getLivenessAtInst – one from isInitializedAtUse and one from processNonTrivialRelease. these checks both end up terminating once reaching the first assign instruction as they hit this branch.
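my mental model of steps 3–5, in simplified pseudocode (my own reconstruction, not the actual compiler source):

```python
# simplified model of the backward scan in steps 3-5: walk from the
# instruction under test toward the block entry, looking for an earlier
# store to the same storage location
def liveness_at_inst(block, index, location):
    for op, loc in reversed(block[:index]):
        if op == "store" and loc == location:
            return "initialized"    # an earlier store covers this location
        # every other instruction is 'unrelated' and skipped (step 4)
    return "uninitialized"          # reached the block entry (step 5)
```

in the initializers above no earlier store is ever found, so every scan runs all the way back to the block entry.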

within this context, what exactly is a 'storage location'? in this example, the first instruction that is checked for liveness is:

assign %7 to %9 : $*Int

which (i think) corresponds to the self.prop1 = 0 assignment. is the storage location 'prop1' in this case?

where in particular does the dataflow analysis occur here? is it within this getOutAvailability() method? that method never appears to be reached in these examples.

is the idea here that we could alternatively do an 'up front' computation of the liveness, perhaps in the code where the initial data structures are configured? curious if you have more specific thoughts on how the book-keeping might be improved.

doing this would then allow us to just consider the non-load instructions in the backward liveness scan, rather than iterating over everything? or am i misinterpreting the implications of what that would enable us to do differently?
