Checking out-of-bounds access at compile time in SIL

I was working on diagnosing out-of-bounds subscript access as part of parsing subscripts inside array literals. However, @jrose told me it would be better to have a more general, dataflow-based check in SIL that could catch out-of-bounds access in other cases too, for example:

func test() {
  let a = [1, 2, 3]
  print(a[3])
}

This check would only apply to subscripting array literals.

Is this something worth adding to the compiler? Also, is DataflowDiagnostics the best place to do this or should there be a special pass added to SILOptimizer? cc @Michael_Gottesman @Joe_Groff any thoughts?

Examples of code that should be flagged with a warning (or an error, if source compatibility isn't an issue):

let array1 = [1, 2, 3]
print(array1[3]) // warn that 3 is out of bounds

let array2 = [[1, 2, 3][4]] // warn that 4 is out of bounds

I think this compile-time diagnostic would be really helpful in catching mistakes/bugs at compile time, rather than discovering them at run time.

One challenge with this is that array literals can resolve to multiple types, which may have different semantics attached to their

Hit send too soon, sorry! Your commit addresses this correctly, never mind the above. I think this is a valuable warning to add.

cc @ravikandhadai

We could do something similar to what we do with overflow checks, where we see, at compile time during transparent inlining (or eventually constexpr evaluation), that a literal expression overflows the resolved concrete type. That seems like a good general model to me: turning constant-evaluated assertion failures into compiler errors.
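As a rough model of that overflow check (the names here are invented; this is not the compiler's actual code, which works on arbitrary-precision integers), the question asked of each folded literal is simply whether the constant fits the resolved type's range:

```cpp
#include <cassert>

// Toy model of the overflow-style diagnostic: does a constant fit the
// range of a signed two's-complement type of the given bit width?
// (Valid here for widths 1 through 63 only.)
bool literalFitsSigned(long long value, unsigned bitWidth) {
    long long lo = -(1LL << (bitWidth - 1));
    long long hi = (1LL << (bitWidth - 1)) - 1;
    return value >= lo && value <= hi;
}
```

For instance, `let x: Int8 = 200` is rejected today because 200 does not fit in 8 signed bits; the out-of-bounds diagnostic would follow the same shape, with the array length playing the role of the type's range.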

This is definitely a valuable diagnostic if it's not limited to immediate subscripts of literals. I agree with Jordan and Joe about the right technical approach being SIL-based and reliant on transparent inlining and the like.

As @Joe_Groff mentioned, this is useful as a SIL-level check. I think the constant evaluator is a good candidate for this. There is a long-standing PR on the constant evaluator, https://github.com/apple/swift/pull/22772, that tries to add array support to it. There are some comments there which need to be addressed; I haven't prioritized that yet, but it can be done soon. It can model array literals. (Note that creating an array literal actually becomes multiple SIL instructions with indirect stores at the SIL level; the constant evaluator can handle this already.) So it can take away a lot of the complexity from a dataflow analysis that would otherwise have to do this itself, and it may be worthwhile checking whether that will be useful. My guess is that, if the constant evaluator is used, the diagnostic pass has to identify the SIL code for array literal creation, run the evaluator to see if it can extract the array as a constant, and then run the evaluator on the SIL instructions corresponding to the array subscript expression, which will itself return an error when there is an out-of-bounds access, so the pass can emit a diagnostic for it.

Would it be possible to implement it in SIL for now and switch to the constant evaluator when it's stable enough for use with arrays? If so, what would be the steps to implement this diagnostic in SIL? I suppose we would need two key bits of information: the instruction accessing the array literal (the subscript) and the instruction pointing to the array literal (the storage). Would this be done inside MandatoryInlining or be part of DataflowDiagnostics?

(sorry for all the questions, I have very limited experience with SIL)

I would think it is best to put it in a new SIL compiler pass in the diagnostic pipeline. That would also enable it to progressively become more powerful, so that it can go beyond array literals, for example, or beyond array subscripts (in principle, we only need an upper bound on the size of the array, not its contents). It will also have to run after mandatory inlining, as you may rely on some functions being inlined.

As far as the algorithm itself goes, there can be many approaches here. The dataflow-analysis approach would maintain a dataflow "state" (e.g. a mapping from SILValues of array type to their constant lengths). On seeing an array initialization instruction, it adds a new entry to the map. On seeing an array subscript, it checks whether the array is in the map (i.e., its length is known) and, if the index is not within the array's bounds, reports it. The map also needs to be invalidated: an entry must be removed when the length of a tracked array can no longer be kept track of, e.g. when some (unknown) mutating function is called on the array or it is passed inout to another function. A conservative dataflow "merge" would have to be defined to merge states coming along two different branches. In the simplest case, the merge could just drop all entries in the map that are not identical along both branches.
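A minimal sketch of that conservative merge (using plain ints as stand-ins for SILValues; none of this is real compiler API) could look like:

```cpp
#include <cstddef>
#include <map>

// Dataflow state from the description above: a map from array values
// (ints standing in for SILValues) to their known constant lengths.
using State = std::map<int, std::size_t>;

// Conservative merge at a control-flow join: keep only the entries that
// are identical along both incoming branches; everything else is dropped.
State merge(const State &a, const State &b) {
    State out;
    for (const auto &entry : a) {
        auto it = b.find(entry.first);
        if (it != b.end() && it->second == entry.second)
            out.insert(entry);
    }
    return out;
}
```

A smarter merge could keep an upper bound on the length instead of dropping disagreeing entries, which is enough for bounds checking, but the intersection above is the simplest sound choice.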

One complexity here is analyzing the array initialization instructions, as they will be a sequence of instructions in SIL that update an internal pointer in the array. The constant evaluator PR may give some ideas on how to handle this.

In principle, you can progressively make this more powerful by modeling array.append, making the merge smarter, etc.

If we want to use the constant evaluator infrastructure like this, it may be good /not/ to invent a new pass to emit diagnostics. My hesitation here is around compile time. If the constant evaluation infrastructure will be complete within the next six months to a year, it is not worth the time/complexity to create a whole new pass (even if in the short term it is easier).

Okay, I guess it's better to wait for constant evaluator to be ready then!

So here is some more information to clarify my initial post on using the constant evaluator. The evaluator is more of a utility that needs to be used by another pass in the compiler. The constant evaluator itself is not a compiler pass and will never be used unless some compiler pass uses it. So we need a driver for this array out-of-bounds check, even if the constant evaluator is ready.

Second, the constant evaluator is not an alternative to a dataflow analysis, at least not in a straightforward way. It certainly can be useful within one. Off the top of my head, the way one could use it for this analysis is as follows: whenever the "array creation" instructions are seen during the analysis, the constant evaluator could be invoked to interpret those instructions and extract the length of the array. Similarly, when the "array subscript" instructions are seen, the evaluator can be used to compute the index as a constant, if possible, and check for out-of-bounds access. In a way, the constant evaluator can be seen as a "transfer function", i.e., a small-step state-transformation function for the dataflow analysis. The dataflow analysis then runs the fixpoint loop, invoking the evaluator on selected code portions within the loop.
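That division of labor can be sketched with a toy instruction set (everything here is invented for illustration; no real SIL or evaluator APIs): the loop plays the analysis, and the per-instruction updates play the role the evaluator would, turning creation instructions into known lengths and checking subscripts against them.

```cpp
#include <map>
#include <vector>

// Stand-ins for the two instruction groups the analysis cares about.
enum class Kind { MakeArray, Subscript };
struct Inst {
    Kind kind;
    int array;          // id of the array being created or indexed
    long long constant; // element count (MakeArray) or index (Subscript)
};

// Small-step analysis loop: each instruction transforms the abstract
// state (known array lengths); subscripts with constant indices are
// checked against it, and offending instruction positions are collected.
std::vector<int> findOutOfBounds(const std::vector<Inst> &code) {
    std::map<int, long long> lengths;
    std::vector<int> bad;
    for (std::size_t pc = 0; pc < code.size(); ++pc) {
        const Inst &i = code[pc];
        if (i.kind == Kind::MakeArray) {
            lengths[i.array] = i.constant;
        } else {
            auto it = lengths.find(i.array);
            if (it != lengths.end() &&
                (i.constant < 0 || i.constant >= it->second))
                bad.push_back(static_cast<int>(pc));
        }
    }
    return bad;
}
```

In the real setting, each branch of the `if` would be a call into the evaluator over the corresponding group of SIL instructions rather than a one-line update.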

Having said that, the constant evaluator can be used to create a more lightweight, somewhat ad hoc analysis, more in the style of the experimental "pound_assert" check that is implemented as part of the DataflowDiagnostics.cpp file. IMO, for an analysis as important/open-ended as array-bounds checking, a dataflow analysis provides a very general, extensible framework.

So, as @suyashsrijan had commented earlier, it is possible to create a dataflow analysis first, and it would be possible to migrate later to the constant evaluator. But, the evaluator would certainly save some time (and reduce code duplication) especially with respect to analyzing array-literal initialization instructions. (I do think the PR can be merged in due time and the code can be used by this pass, if we get to that. So it is more of an implementation detail at this point).

To summarize, I think a SIL diagnostics pass will be needed for catching array out-of-bounds accesses statically. If we are going to do a dataflow analysis, the constant evaluator is only an implementation detail that can be used to make the dataflow analysis better. So, I would say the first questions to resolve are whether this pass is important enough to justify the increase in compilation time that @Michael_Gottesman raised, and whether we think a dataflow analysis is the right approach here. (The constant evaluator addresses neither question, though it could open up some more lightweight, ad hoc solutions.)

This might also be interesting for an optimization pass that automatically inserts appropriate reserveCapacity calls when we can pre-determine the size a result collection will be.

How badly is this going to affect compile time? If the literal is immutable then we don't have to do much work, and if we see any mutation we can invalidate the check and bail out. We can keep this check lightweight and simple.

In fact, we can add a simple check now for immutable literals only, as a start, and then build on top of it.

Will that work?
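The immutable-literals-only first step could look roughly like this (a toy operation stream over a single array; all names are invented for illustration): track the literal's length and give up entirely on the first mutation.

```cpp
#include <vector>

// Toy operation stream on a single array value.
enum class OpKind { MakeLiteral, Mutate, Subscript };
struct Op {
    OpKind kind;
    long long value; // element count, unused, or index respectively
};

// Return positions of provably out-of-bounds subscripts. Any mutation
// invalidates the known length, so the check simply stops there.
std::vector<int> checkImmutableLiteral(const std::vector<Op> &ops) {
    long long length = -1; // -1 means the length is unknown
    std::vector<int> bad;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        switch (ops[i].kind) {
        case OpKind::MakeLiteral:
            length = ops[i].value;
            break;
        case OpKind::Mutate:
            return bad; // bail out: we no longer know the length
        case OpKind::Subscript:
            if (length >= 0 &&
                (ops[i].value < 0 || ops[i].value >= length))
                bad.push_back(static_cast<int>(i));
            break;
        }
    }
    return bad;
}
```

Bailing out on the first mutation is deliberately crude; a later version could model known mutations such as append instead of giving up.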

That is not what I am getting at. What I am getting at is that I am worried about adding new passes without good reason, especially if the work would naturally fit into another piece of work, obviating the extra loop over the IR. That being said, I would be fine if you were to put this into Diagnostic Constant Propagation in the short term and move it to the constant evaluator later, when it is available. The constant propagation pass is a place where we generally already perform these types of analyses, so I think it would fit well.

How does that sound?

Okay understood! Would it be okay to do this in constantFoldCompare()? Because we perform the comparison with the count when subscripting:

%12 = struct_element_addr %11 : $*_SwiftArrayBodyStorage, #_SwiftArrayBodyStorage.count
%13 = struct_element_addr %12 : $*Int, #Int._value
%14 = load %13 : $*Builtin.Int64
%15 = builtin "assumeNonNegative_Int64"(%14 : $Builtin.Int64) : $Builtin.Int64
%16 = builtin "cmp_slt_Int64"(%7 : $Builtin.Int64, %15 : $Builtin.Int64) : $Builtin.Int1

We could emit a diagnostic here if the comparison fails and we're comparing against the count (I need to figure out how to check whether the count comes from the array literal - maybe follow the parent instructions until I find one related to an array?).

@suyashsrijan Something like that. I would need to read the code. Put up a PR?

I will do that shortly - I am still trying to figure out a way to detect when an instruction is (indirectly) referring to _SwiftArrayBodyStorage.count. I don't want to hardcode the order of instructions (compare < assumeNonNegative < load < addr < addr) to do this. Do you have any tips? Is there a helper method that I don't know about?

I think you would pattern match at a higher level using semantics functions. Let's anchor our conversation to some examples. Consider the following slightly modified Swift (I just returned the value, since it makes the SIL more compact):

func test() -> Int {
  let a = [1, 2]
  return a[2]
}

This gives me the following SIL at -Onone:

// test()
sil hidden @$s4main4testSiyF : $@convention(thin) () -> Int {
bb0:
  %0 = metatype $@thin Array<Int>.Type            // user: %16
  %1 = integer_literal $Builtin.Word, 2           // user: %3
  // function_ref _allocateUninitializedArray<A>(_:)
  %2 = function_ref @$ss27_allocateUninitializedArrayySayxG_BptBwlF : $@convention(thin) <τ_0_0> (Builtin.Word) -> (@owned Array<τ_0_0>, Builtin.RawPointer) // user: %3
  %3 = apply %2<Int>(%1) : $@convention(thin) <τ_0_0> (Builtin.Word) -> (@owned Array<τ_0_0>, Builtin.RawPointer) // users: %5, %4
  %4 = tuple_extract %3 : $(Array<Int>, Builtin.RawPointer), 0 // user: %16
  %5 = tuple_extract %3 : $(Array<Int>, Builtin.RawPointer), 1 // user: %6
  %6 = pointer_to_address %5 : $Builtin.RawPointer to [strict] $*Int // users: %9, %11
  %7 = integer_literal $Builtin.Int64, 1          // user: %8
  %8 = struct $Int (%7 : $Builtin.Int64)          // user: %9
  store %8 to %6 : $*Int                          // id: %9
  %10 = integer_literal $Builtin.Word, 1          // user: %11
  %11 = index_addr %6 : $*Int, %10 : $Builtin.Word // user: %14
  %12 = integer_literal $Builtin.Int64, 2         // user: %13
  %13 = struct $Int (%12 : $Builtin.Int64)        // user: %14
  store %13 to %11 : $*Int                        // id: %14
  // function_ref specialized Array.init(arrayLiteral:)
  %15 = function_ref @$sSa12arrayLiteralSayxGxd_tcfCSi_Tg5 : $@convention(method) (@owned Array<Int>, @thin Array<Int>.Type) -> @owned Array<Int> // user: %16
  %16 = apply %15(%4, %0) : $@convention(method) (@owned Array<Int>, @thin Array<Int>.Type) -> @owned Array<Int> // users: %22, %17, %26
  debug_value %16 : $Array<Int>, let, name "a"    // id: %17
  %18 = integer_literal $Builtin.Int64, 2         // user: %19
  %19 = struct $Int (%18 : $Builtin.Int64)        // user: %22
  %20 = alloc_stack $Int                          // users: %23, %24, %25
  // function_ref specialized Array.subscript.getter
  %21 = function_ref @$sSayxSicigSi_Tg5 : $@convention(method) (Int, @guaranteed Array<Int>) -> Int // user: %22
  %22 = apply %21(%19, %16) : $@convention(method) (Int, @guaranteed Array<Int>) -> Int // user: %23
  store %22 to %20 : $*Int                        // id: %23
  %24 = load %20 : $*Int                          // user: %27
  dealloc_stack %20 : $*Int                       // id: %25
  release_value %16 : $Array<Int>                 // id: %26
  return %24 : $Int                               // id: %27
} // end sil function '$s4main4testSiyF'

I think this is most likely how you are going to see these sorts of things. We should pattern match on semantic functions. Here is how I would do this with this example:

  1. If constant propagation manages to constant fold the array access down to a constant, it will call constant fold instruction on Array.subscript.getter. That, I think, is your entry point. (In this example, it will process %18 and then %22 for sure.)
  2. Once you have found that, track back up the guaranteed argument to see if it comes from an Array.init(arrayLiteral:). Then look for the allocateUninitializedArray from that, which should give you the Builtin.Word, 2. Then see whether you have a constant for the other parameter and compare it against the size. If it is within the size, you could also try to find what the actual value was and constant propagate that.
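The backward walk in step 2 can be modeled with toy instruction records (not real SIL APIs; `operand` stands in for the def of the instruction's array operand): from the subscript, follow defining instructions until the allocation, then compare the index against the recorded element count.

```cpp
#include <optional>
#include <string>
#include <vector>

// Toy instructions: `operand` is the index (into the instruction list)
// of the instruction defining this instruction's array operand, or -1.
struct Inst {
    std::string kind; // "alloc_array", "init_literal", "subscript"
    int operand;
    long long constant; // element count (alloc) or index (subscript)
};

// Walk up from a subscript, through e.g. the literal initializer, to the
// allocation; report whether the constant index is out of bounds, or
// nothing if we never reach a known allocation.
std::optional<bool> subscriptOutOfBounds(const std::vector<Inst> &insts,
                                         int subscript) {
    const Inst &sub = insts[subscript];
    if (sub.kind != "subscript")
        return std::nullopt;
    int cur = sub.operand;
    while (cur != -1 && insts[cur].kind != "alloc_array")
        cur = insts[cur].operand;
    if (cur == -1)
        return std::nullopt; // no known allocation: cannot diagnose
    return sub.constant < 0 || sub.constant >= insts[cur].constant;
}
```

In the SIL above, the chain would run from the subscript apply (%22) through the Array.init(arrayLiteral:) apply (%16) back to the allocateUninitializedArray call (%3) carrying the Builtin.Word capacity.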

All of this can, I think, be done with the PatternMatch infrastructure, like this:

IntegerLiteralInst *capacity;
if (!match(I,
           m_Apply(m_SemanticsCall("array.init.literal"),
                   m_TupleExtract(
                       m_SemanticsCall("array.allocateUninitializedArray",
                                       m_IntegerLiteral(capacity)),
                       1)))) {
  return ...; // fail
}

NOTE: I am not sure whether we have semantics for these specific calls right now, but I imagine it wouldn't be hard to add them. Also, I forget whether I added the semantics matcher, but if I didn't, I can show you how to do it.

Thank you for the great explanation! I can't find m_SemanticsCall, but there's ArraySemanticsCall, which ABCAnalysis uses - might that be handy here?

It seems like ArraySemanticsCall doesn't support Array.init(arrayLiteral:) or Array.subscript.getter, but that should be easy to add. It does support array.uninitialized, though not allocateUninitializedArray.

You want something like this:
