`MutableSpan` comes with performance penalties compared to `UnsafePointer`s

Mostly to redirect some eyes to this issue: Don't impose a performance penalty when using `MutableSpan` · Issue #89954 · swiftlang/swift · GitHub

What happened is that I created this PR, and there I noticed usage of MutableSpan makes the IPAddressBenchs:IPv4_String_Encoding_Localhost_15M (encode localhost 15 million times) benchmark go from ~200ms to ~350ms which is a +75%.

From what I've noticed, MutableSpan is in all cases essentially unusable if you really want to go for maximum performance and compete with C. There are also no escape hatches from what I can tell.
To be clear Span is perfectly fine. I'm talking only about MutableSpan.

Optimally we can make sure even Swift 6.4 doesn't come with this penalty when it's released so we don't have to wait another 6-9 months.
I can work on the changes if the compiler folks agree with me.

1 Like

Just so the issue is more clear, here's a link to the commit which successfully reverted the perf degradation:

Neither this post nor the issue say anything about what these expensive unnecessary checks actually are.

Ah right, sorry. For once I was trying to go for a quick issue, but then I managed to get myself into a forums thread.

What I bumped into is when you call mutableSpan on Unsafe Pointer types, is when you bump into those checks. But also from a quick look at stdlib, it appears pretty much every place that returns such a MutableSpan goes through those (not 100% sure).

Here's a link to what exactly my code was calling:

which calls:

I can already see the checks do not really look that expensive, but well, in practice they're adding 75% to the runtime of the benchmark.
I'd presume MutableSpan is supposed to be the safe alternative to mutable buffer pointers, but then I'd expect it does also perform the same as well. Like Span does.

And again, someone else also bumped into the same issue some months ago when working on some crypto package, so I know I'm not the only one that is not happy with this penalty.

The precondition seems important for correctness, though. Otherwise you can create misaligned spans that (on some CPUs) crash later on, in “safe” code, when dereferenced.

I would hope that the check gets optimized out for single-byte types like UInt8, which is by far the most common type of span I use.

I expected more detailed explanations and concrete evidence with such a title and proclamation. Using MutableSpan instead of unsafe pointers in your project did not affect benchmark performance on my local Linux machine (7.0.11-arch1-1).

I use MutableSpan in my high performance libraries extensively, which include a league scheduling engine and networking library, with no runtime backdraws to such an extent, if any, you observe.

1 Like

I can provide actual benchmark cases if asked to, but I thought (think?) the bottleneck is observable to a compiler engineer already, so hopefully I can save me some time.
Again, if there is a need, I can provide that. I've already mentioned the benchmark name, and the commit with which the degradation is reversed. So the info is already here, although I understand I should not expect people to go after them themselves and optimally just clearly provide them.

The backdraw isn't actually too much, the library is over-magnifying it.

It's only noticeable when you're pixel-peeping into the performance of your library, and that there are no other visible performance bottlenecks already. Specially none of the more important types, such as algorithmic bottlenecks, or not having chosen the correct data structures etc...

If you think about it, the whole performance degradation is a couple light (e.g. no division) integer arithmetic (although there might be more to the performance issue that MutableSpan causes, that I haven't noticed).
The benchmark which it degraded, is also only a bunch of light arithmetic for the most part, to parse IPv4 from the bytes it's been handed.

2 reasons for the fact that I'm noticing this at all are:
1- The library has benchmarks to compare against native C libraries like Darwin/glibc, and it performs even better than them in most cases. So you know the implementation is very (too?) optimized.
2- There were already benchmarks with better performance, so it was easy to notice there is a degradation.

So essentially If you're not aiming for very high performance or comparing to a previous baseline, you likely won't notice it.

Right. What I'm hoping is that perhaps due to semantics of specific types or where they're being used, the compiler engineers can already infer that the checks are redundant and they can skip those via an unchecked initializer that simply just puts the type together.

Also in any case, I expected stdlib to provide escape hatches for such a type, even if they are annotated with a hundred "unchecked" "unsafe"s. Sometimes, e.g. when you've just allocated memory for a number of UInt8s, you already know you don't need such checks.

I'd expect that too. Makes me think if there is possibly any other thing going on in here that we haven't noticed.

So the exact performance overhead you're observing is mutable span initializers that use _precondition, which will impact performance. So you are right, but the title does misrepresent the impact of using MutableSpan.

side note regarding "very high performance"

Furthermore, this quote may be generalizing to help the average reader, which is understandable, however, you're assuming I don't know how to properly benchmark or write objectively optimal code.

Looking at your library again, I've optimized your writeTextualRepresentation to perform objectively better (since I won't try rewriting the whole project :wink: ), if you actually want very high performance:

    @inlinable
    @inline(__always)
    func writeTextualRepresentation2(into buffer: UnsafeMutableBufferPointer<UInt8>) -> Int {
        var resultIdx = 0
        var addressByte = address.byteSwapped
        UInt8(addressByte & 0xFF).asDecimal(
            writeUTF8Byte: {
                buffer[resultIdx] = $0
                resultIdx &+= 1
            }
        )
        for _ in 1..<4 {
            buffer[resultIdx] = .asciiDot
            resultIdx &+= 1

            addressByte >>= 8
            UInt8(addressByte & 0xFF).asDecimal(
                writeUTF8Byte: {
                    buffer[resultIdx] = $0
                    resultIdx &+= 1
                }
            )
        }
        return resultIdx
    }
1 Like
Response to "side note regarding "very high performance""

I appreciate the time, but I don't think the code matters too much here. As a matter of fact, GodBolt shows that the 2 different codes are compiled to almost the same code anyway:

The code is too simple for the compiler!

Also IPv4 parsing is around the simpler things that the package does.
Performant IPv6 parsing is exponentially more complex due to the compression sign (::), and even more complicated than that we have the IDNA-compatibility functions where simple integer arithmetic and stack allocations are no longer enough, and off of memory, I remember UniqueArray/RigidArray/custom-"TinyArray"/"SmallArray"-implementations/lifetime-annotations had to be used (short of using raw UnsafePointers) to reach the desired performance which was to be able to compete against ICU.

Hmm I can't decide if it does or it doesn't which means It should be more clear anyway. I'll try to adjust the title. To be fair I am using "relatively" in the title.

In my mind it was/is justified as I had already seen other people struggle with it, and also my expectations of Swift are high, which means off-hand I wouldn't expect MutableSpan to come with the aforementioned penalty.

I can see most codes are much more tolerant to such checks compared to swift-endpoint or a crypto package, but then I also expect MutableSpan to be the type for performance-sensitive code and have no downsides compared to just using UnsafePointers.

Edit: title is now "`MutableSpan` comes with performance penalties compared to `UnsafePointer`s".
Previously (IIRC): "`MutableSpan` comes with relatively big performance penalties"

1 Like

I agree that’s a good direction for (Mutable)Span. Specifically, is there a particular code snippet where you’d expect the compiler to use a type’s “semantics” to eliminate a precondition check?

1 Like

What I meant by "semantics", is that in a type like Array, I think Array already guarantees the buffer is aligned, so for Array the precondition is not necessary and stdlib should be able to just use an "unchecked" initializer of MutableSpan.

More generally, it appears all places in stdlib where there is a MutableSpan initialization (outside MutableSpan's own implementation), such as in most implementations of public var mutableSpan, the stdlib code goes through those preconditions that could be skipped (some might be optimized out already, not sure).

Search for `public var mutableSpan`
rg -lF 'public var mutableSpan' ./stdlib | xargs basename | nl
     1  UnsafeBufferPointer.swift.gyb
     2  ContiguousArray.swift
     3  ArraySlice.swift
     4  Array.swift
     5  UniqueBox.swift
     6  InlineArray.swift
     7  OutputSpan.swift
     8  CollectionOfOne.swift
     9  UniqueArray.swift

All aside from UnsafeBufferPointer should be able to skip the alignment checks I'd assume.

For my specific case I expect the compiler to be able to optimize-out that precondition for 1-byte types like UInt8/Int8/Bools as there can't be any misalignment issues for them.

UInt8 itself is the most important one. Int8 might also help when working with CChars, or maybe someone just prefers to use Int8 instead of UInt8, but I personally haven't bumped into that.
I could propose a few ways to possibly/maybe achieve this, for example using the @_specialize attribute, but then I know you'd already know better than me which way to go and what reasonable ways there are to skip that precondition.

Outside my specific case, I expect the compiler to be smart enough to totally optimize-out that specific precondition anyway. But I'm not a compiler engineer so I'm not sure what is blocking such an optimization and what could be done for that (hopefully without making it take another year to propagate changes through LLVM or such. Again, not a compiler engineer, these are just my vague understandings).

1 Like

I appreciate the title change, however:

"very high performance" rebuttal

Regardless on how the assembly looks, the real-world performance is the only thing that objectively matters. My improvement uses 180 less instructions (~5%) to do the same work (at least on my machine), which improves throughput by 2 samples using the project's throughput benchmark (30 million more operations in the same time-frame on my machine). The same methodology can apply to other parts of the project, but as I previously stated, I won't rewrite it :wink: .

When I say very high performance, I am talking about reducing total instructions executed and improving runtime performance for everything.

Unfortunately, this is another case of "skill issue". :waving_hand:

Response to ""very high performance" rebuttal"

My improvement uses 180 less instructions (~5%)

180? I haven't checked but that sounds suspiciously too high considering godbolt shows essentially the same number of assembly lines.

at least on my machine

Well, that's the keyword. You need to prove the significance of your machine. Does it use a CPU with similar properties compared to a typical server? Or ...? I know swift-endpoint's CI benchmarking machine does.

which improves throughput by 2 samples using the project's throughput benchmark (30 million more operations in the same time-frame on my machine).

I had already tried your changes, yesterday. It didn't make a visible difference in CI: Try Swift Forums suggestion by MahdiBM · Pull Request #21 · swift-dns/swift-endpoint · GitHub

Some months ago, I tried fully inlining that loop altogether. It didn't make a visible difference either, so that's why the loop is still there. Not because I didn't think it's faster, but because I didn't want such a weird-looking code for such low perf gains (to be clear, weird-looking is not true for your code). So the thing is, it needs to be visible in the benchmarks. Otherwise the change isn't sustainable and the minor improvements can be accidentally reverted in the future.

I'm assuming the reason for "less instructions" in your code is likely because of the byteSwapped, which means you don't need to reverse and index each time. If the inlined-loop was there, the byteSwapped would have only been some additional instructions and wouldn't have helped.

Also so we're clear, swift-endpoint has a custom UInt128 implementation (backed by 2xUInt64 to be able to dodge Swift's UInt128 macOS 15 requirements for IPv6's storage), so you can't be thinking that I didn't know of existence of byteSwapped or such.

I give it to you that your idea was good. But I'm still not convinced that it's worth a change.
Also you already had a pretty decent code, if I may say so myself, to start with. You didn't have to work it out from the beginning, which could have possibly resulted in you going a totally different/worse way. So even if you're proud of your code there, I'm inclined to take part of the credit :)

After all, there is also a compiler in between the code that we write. At any point it could have taken a decision to inline the whole loop. Or maybe it does in a future version. Then the current code will be looking better since it skips one byteSwapped. You're also using bitshifts compared to simply loading the values from the memory which the current code does. Now the compiler is likely smart enough to know what's going on in here, but I doubt the bitshifts have anything to do with a performance improvement, considering what the current code does is to directly load from memory.

When I say very high performance, I am talking about reducing total instructions executed

less instructions executed does not mean better performance

The point here is that we can argue about this all-day long. I think there has been a misunderstanding about what actually I meant by "very high performance". In these kinds of situations, I always take at least part of the blame for not being clear enough. For the most part, it was referring to the fact that the implementations beat the counterpart C implementations.

Can they be improved? yes. But how far should I go? Do I go for inlining assembly in Swift files and call it "Swift"? Or how far do I go for looking into what the compiler likes to produce given some code? I have gone pretty deep into this, but still, there is a point that I need to stop.

Realistically too, I'm also under real-world restrictions. For example about how to benchmark the code. The CI machine already costs $18-20/month out of pocket. I can't simply just add all possible benchmark types to the CI, that'd take too long, and will also take more time to maintain. I had already considered adding instructions count to the CI since it's a fine unit of measurement. I can't exactly recall why that didn't happen, but probably it was to save some CI time. I remember at some point I was struggling with benchmark times and that's partially why swift-idna was moved to its dedicated repository.

I see @ Alejandro has already done the work in [stdlib] Collection span getters can use the unchecked initializer by Azoy · Pull Request #90002 · swiftlang/swift · GitHub. Very appreciated. I'm assuming it'll get cherry-picked into 6.4 as well.

I wonder what it takes to have public unchecked initializers too. Wouldn't want to be forced into not using MutableSpan in the future just because I can't express that I'm already sure the alignment is all-good. I think we'll need an "amendment" to the proposal (Including its implications, for example a new evolution forums thread)?
This is not without precedence, for example UTF8Span has init(unchecked: Span<UInt8>).

I wonder what folks here think about such a public unchecked initializer.

2 Likes
More response to ""very high performance" rebuttal"

I just did a check and while it makes sense now, I didn't know that a mem load is slower than a bitshift.

Again, not that I hadn't thought of using bitshifts (I did not think of byteSwapped-ing though), the initial code likely was using bitshift before I move it to mem load (it's likely somewhere in the git as well), but there is the compiler in between which is probably why the 2 codes don't have much of a difference in performance anyway.

So I guess I should accept defeat in that sense, and accept the "skill issue" :wink:

1 Like