Pitch: enable bounds-checking for BufferPointers

scanon · January 30, 2024, 8:38pm

Currently Unsafe[Mutable]BufferPointer only does bounds-checks in Debug builds. They are elided in Release builds. @glessard and others have been experimenting with enabling them unconditionally, and the performance impact of doing so is pretty small in most cases.

Most usage of BufferPointer is as a "contiguous memory" abstraction, or for the purposes of transient interoperation with C and C++ or low-level memory management API, where the performance impact of bounds checks is usually acceptable (either because bulk operations are being done which can amortize the check, or the access pattern is regular so the optimizer can eliminate most of them) and the memory safety wins are significant. However, we also know that UnsafeBufferPointer is occasionally used in a performance-sensitive context specifically to avoid bounds checks on Array, and we want to be sure that this continues to be possible. Therefore I would like to propose:

Make all existing BufferPointer API (in particular, the Collection conformances) enforce bounds-checks in release as well as debug when building in Swift 6 mode.
Provide new [unchecked: Int] subscripts that do not do the check, and back-deploy them to provide an escape hatch when the performance impact of checking is onerous but you know that the access is safe.

tera · January 30, 2024, 8:47pm

That's good.

Alternative syntax for unchecked escape hatch: pointer[index, .unchecked] where unchecked is the only case of some SubscriptMode enumeration. Or similar but simpler: pointer[index, checked: false] with "checked: true" would match the new default behaviour.

Paul_Cantrell · January 30, 2024, 8:50pm

+1 to the goal, the approach, and the syntax.

glessard · January 30, 2024, 8:51pm

Predicating the checking on a parameter makes it difficult to eliminate the branch. This is in fact one of the issues with the debug-mode "only" approach, where in complex interactions of framework+stdlib+user-code, the branch isn't eliminated.

The static approach we are pitching eliminates that branch entirely.

scanon · January 30, 2024, 8:53pm

This is an important note--making the check unconditional actually makes debug code faster!

But, in @tera's example, that API would only support .unchecked, so it would also be branch-free.

Douglas_Gregor · January 30, 2024, 9:11pm

I love this pitch; it closes an unnecessary bounds-safety hole.

I think it would be best to stage this in via an upcoming feature (UnsafePointerBoundsSafety?) that is also enabled in Swift 6, to help with incremental adoption.

Doug

ksluder · January 30, 2024, 9:14pm

Do we have reliable numbers about how small, and in what circumstances it isn’t so small?

Douglas_Gregor · January 30, 2024, 9:26pm

Here's the standard benchmark suite: [donotmerge] always bounds check in UnsafeBufferPointer by glessard · Pull Request #70794 · apple/swift · GitHub

Doug

Karl · January 30, 2024, 9:30pm

I'm strongly opposed to this, and to the idea of trying to make the explicitly-unsafe pointer types "half-safe".

The unsafe types exist as lowest-level abstractions. People do overuse them, and so we should explore safer alternatives such as the upcoming BufferView (which will include both lifetime guarantees and, I suspect, bounds-checking by default) and steer people to those instead, but still, having this lowest-level type is still extremely important.

Moreover, I think there is a significant danger that developers come away believing that the unsafe pointer types are safe now, which is flat-out wrong. They do not guarantee memory lifetimes.

The problem with this attempted mitigation is that generic collection algorithms use the unlabelled subscript which is part of the collection protocol.

Developers who need to opt-out of bounds-checking will essentially be forced to either rewrite the Unsafe*BufferPointer types as they are now (with a greater opportunity to make mistakes and introduce their own memory safety issues), or to rewrite basic collection algorithms like map to use the unchecked subscript.

The actual problem is overuse of the unsafe types, not the unsafe types themselves. Rather than taking away this valuable, low-level tool, we should address their overuse as general-purpose contiguous memory abstractions and in other places where developers don't really intend to opt-out of bounds checking:

scanon · January 30, 2024, 9:30pm

The worst regression we've observed, by wide margin, is in the rawBufferInitializeMemoryExecute benchmark, which is 65% slower with this change (it does nothing except load through an UBP). However, if we do this, we will also change the way the stdlib function that it ends up calling is implemented, which makes the regression go away entirely.

More generally, the regressions that we expect to not be trivially resolvable are when you have an access pattern that skips around within a UBP, which you can prove always stays in bounds but the compiler may not be able to (something like searching a densely-stored binary tree or heap comes to mind). In some of these cases, it would probably be necessary to adopt the new subscript, but most other usage should not need it.

scanon · January 30, 2024, 9:44pm

We are also doing this.

This is already necessary when optimizing for performance above everything else, because the generic collection algorithms cannot take full advantage of what we know about contiguous layout of UBP except in the simplest cases. I'm open to being convinced, though; do you have examples where the optimizer is able to produce very good output from generic [RandomAccess]Collection algorithms, but introducing bounds-checks would have unacceptable overhead? I'm interested to study them to better understand what other affordances we need (which might just take the form of a tweaked RAC implementation).

scanon · January 30, 2024, 9:48pm

None of this should be construed as making them "safe". They are still explicitly UnsafeBufferPointers, and do not control memory lifetime or exclusivity. That does not mean that we can't make them safer. We should make every API as safe as we can without unacceptable performance tradeoffs. Having spent a good chunk of time studying the problem, we are convinced that we can do that here.

ksluder · January 30, 2024, 10:12pm

Are these from a different benchmark run than the one that @Douglas_Gregor linked to? That one seems to show a 76% regression in SIMD performance.

Karl · January 30, 2024, 10:13pm

except types which explicitly opt-out from those safety checks, is my point.

This isn't a regular API. It isn't something everybody should use, and it is regrettable that Swift's history has led it to be used far too often in places where it isn't appropriate. We are in full agreement that that should be corrected, but I do not think it should come at the cost of this tool.

Over the many years I've spent analysing and micro-optimising the output of the Swift compiler, I've seen plenty of examples where bounds-checking introduces unnecessary overhead in generic collection algorithms. Even the simple cases (firstIndex(where:), lastIndex(where:), etc) are valuable - you don't need to be doing anything particularly fancy with contiguous storage to see benefits.

I don't have the time to go hunting for examples, but I don't even think that's a relevant consideration. This is a valuable tool even for the fact that it allows engineers to test the overhead of bounds-checking. Sometimes there will be no benefit, or it will vary based on nebulous other factors, but you don't know that unless you have this kind of tool.

And if that makes it sound like a very specialised API for very specific expert use-cases, well that's because it is! And in my opinion that needs to be the basis of any discussion about the API of the unsafe pointer types. Ease of use is not (should not be) the primary design consideration in this particular instance.

David_Smith · January 30, 2024, 10:22pm

Since that only occurred at -O and not at -Osize I strongly suspect that's something we can win back by tweaking the implementation or inliner. My first guess would be that the tiny bit of extra code pushed something above the threshold for being inlined.

(Ideally we'll discover we can apply the same tweak to get a huge perf win at -Osize, but I'll try not to pile too many layers of speculation up here)

scanon · January 30, 2024, 10:31pm

It’s a regression in the benchmark, which isn’t generally a regression in SIMD performance. I.e. in a synthetic benchmark we’re interested in isolating one thing as best we can, and that might require adopting the unchecked: label, but that doesn’t necessarily correlate to any regression we would see in “more normal” SIMD code.

scanon · January 30, 2024, 10:36pm

Those are great examples where there’s no good reason why the optimizer can’t eliminate most bounds checks, though, even in the most generic collection case! Those are exactly the sort of API that should not be effected by this change at all, and if it is, that’s a bug to be fixed, either in the optimizer or the stdlib or both, and fixing it can make all Collections faster.

ksluder · January 30, 2024, 11:26pm

I get nervous about relying on the optimizer. If you make a change that is suddenly too complex for the optimizer, you’re stuck guessing what memory accesses you need to augment with unchecked to get your performance back. In pointer-heavy code this can be a non-trivial guessing game.

There’s something very attractive about a model where UnsafePointer and friends always have a consistent cost that doesn’t rely on keeping the optimizer happy.

scanon · January 30, 2024, 11:31pm

Everything in Swift relies on the optimizer. Even without this change, you’re depending on the optimizer to turn the subscript from a function call into a load or store. We can tweak heuristics to make this more dependable if we have to, but we need to define the semantics we want first.

ksluder · January 30, 2024, 11:47pm

Isn’t that determined by the @_semantics anttribute attached to the subscript? The implementation of @_semantics may happen to live in the optimizer, but that’s very different from relying on the optimizer to detect a loop invariant and determine it can elide bounds checks within my loop.

I pretty much never have to worry that a C compiler will compile foo[i] to something more complex than a load and an indexed load. The rare exception is when foo is an extern, in which case I have to worry about a potential chain of loads, and it is obvious from the disassembly which symbols are being dereferenced.

Systems programming is Swift is sorely missing such reliable baseline performance characteristics.