How to explain this counterintuitive increase in execution speed of a generic initializer?

The following program demonstrates that the generic initializer is slower when called "normally" than when it is passed as a closure to a nested function. I would have expected the opposite difference, if any.

Is it an optimizer bug (or an opportunity for optimizer improvement), or can it be understood in some other way?

import AppKit


extension FixedWidthInteger where Self: UnsignedInteger {
    
    /// Creates a value by mapping `source` from the full range of its type
    /// `S` to the full range of `Self`.
    ///
    ///
    /// ## Scaling Up
    ///
    /// The method used when scaling up to a higher bit width can be
    /// described by:
    ///
    ///     dstVal = (srcVal * dstMax) / srcMax
    ///
    /// (But it is efficiently computed without integer division and without
    /// the risk of overflowing.)
    ///
    /// So converting, for example, a (hypothetical) 2-bit type to a 4-bit
    /// type, ie mapping from the range [0, 3] to the range [0, 15]:
    ///
    ///         0       1       2       3
    ///     +-------+-------+-------+-------+
    ///         |       |       |       |
    ///      .--'      .'       '.      '--.
    ///      ↓         ↓         ↓         ↓
    ///     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    ///      0 1 2 3 4 5 6 7 8 9 A B C D E F
    ///
    /// The implementation uses the fact that the bit pattern of the
    /// destination will always be the bit pattern of the source repeated,
    /// starting with the most significant bit (left). Here is a conversion
    /// from a (hypothetical) 2-bit type to a 5-bit type:
    ///
    ///      2-bit              5-bit
    ///      source           destination
    ///     (0)  00    -->    00000   (0)
    ///     (1)  01    -->    01010  (10)
    ///     (2)  10    -->    10101  (21)
    ///     (3)  11    -->    11111  (31)
    ///
    ///
    /// ## Scaling down
    ///
    /// When converting from a larger to a smaller bit width, the source is
    /// simply shifted to discard the lower bits. So the opposite conversion,
    /// from 4 to 2 bits looks like this:
    ///
    ///      0 1 2 3 4 5 6 7 8 9 A B C D E F
    ///     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    ///      ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
    ///     +-------+-------+-------+-------+
    ///         0       1       2       3
    ///
    /// - Parameter source: The value to be mapped from the full range of
    ///   its type `S` to the full range of `Self`.
    ///
    init<S>(rangeConverted source: S)
        where S: FixedWidthInteger, S: UnsignedInteger
    {
        let srcByteCount = MemoryLayout<S>.size
        let dstByteCount = MemoryLayout<Self>.size
        if srcByteCount < dstByteCount {
            // Scaling up: replicate the source bit pattern by repeated
            // shift-and-or doubling until every destination bit is filled.
            let dstBitCount = dstByteCount &* 8
            self = Self(truncatingIfNeeded: source)
            var shifts = srcByteCount &* 8
            while shifts < dstBitCount {
                self |= self << shifts
                shifts = shifts &* 2
            }
        } else if srcByteCount > dstByteCount {
            // Scaling down: shift right to discard the low-order source bits.
            let shifts = (srcByteCount &- dstByteCount) &* 8
            let a = source >> shifts
            self = Self(truncatingIfNeeded: a)
        } else {
            // Same width: the bit pattern is unchanged.
            self = Self(truncatingIfNeeded: source)
        }
    }
}

/*
// NOTE: I am including this specialized version only because it can be used
// to easily verify that the "Normal test" will be as fast as the "Faster test"
// if it uses this version. Simply change the "Normal test" so that it calls
// this version instead of the generic one (note that the "Faster test" should
// still use the generic initializer, and it will still be as fast as this
// specialized version).
extension UInt64 {
    init(rangeConvertedSpecialized value: UInt8) {
        self = UInt64(truncatingIfNeeded: value)
        self |= self << 8
        self |= self << 16
        self |= self << 32
    }
}
*/

func test() {
    let num = 100_000_000
    var randomBytes = [UInt8](repeating: 0, count: num)
    for trial in 0 ..< 5 {

        print("#\(trial)")

        // Set new random bytes for each trial:
        randomBytes.withUnsafeMutableBytes { (ptr) -> Void in
            let rc = SecRandomCopyBytes(nil, ptr.count, ptr.baseAddress!)
            precondition(rc == errSecSuccess)
        }

        // ---- First a test that calls the initializer normally ----
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for e in randomBytes {
            let dst = UInt64(rangeConverted: e)
            checksum = checksum ^ dst
        }
        let t1 = CACurrentMediaTime()
        print("  Normal test:", t1 - t0, "seconds (checksum: \(checksum))")

        // ---- Then the same test, only formulated differently ----
        func unexpectedlyFaster(initializer: (UInt8) -> (UInt64)) {
            var checksum = UInt64(0)
            let t0 = CACurrentMediaTime()
            for e in randomBytes {
                let dst = initializer(e)
                checksum = checksum ^ dst
            }
            let t1 = CACurrentMediaTime()
            print("  Faster test:", t1 - t0, "seconds (checksum: \(checksum))")
        }
        unexpectedlyFaster(initializer: UInt64.init(rangeConverted:))
    }
}

test()
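
(As an aside, the mapping itself is easy to sanity-check. For example, these spot checks, which are not part of the benchmark and shown here only to illustrate the mapping, all hold for the generic initializer above:)

// Spot checks (not part of the benchmark). precondition is used rather
// than assert so that the checks survive -O builds.
precondition(UInt64(rangeConverted: UInt8(0xAB)) == 0xABAB_ABAB_ABAB_ABAB) // pattern repeated
precondition(UInt8(rangeConverted: UInt64(0x1234_5678_9ABC_DEF0)) == 0x12) // high bits kept
precondition(UInt16(rangeConverted: UInt16.max) == UInt16.max)             // same width: identity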

Here is an example run on my MBP (with Xcode 9.3 beta 2, default toolchain; the same results with recent dev snapshots):

› swiftc --version
Apple Swift version 4.1 (swiftlang-902.0.38 clang-902.0.30)
Target: x86_64-apple-darwin17.4.0
› swiftc -O -whole-module-optimization -gnone -static-stdlib test.swift 
› ./test
#0
  Normal test: 0.13397653499851 seconds (checksum: 13744632839234567870)
  Faster test: 0.0593490870087408 seconds (checksum: 13744632839234567870)
#1
  Normal test: 0.129204154014587 seconds (checksum: 14974415777481871311)
  Faster test: 0.0549142229720019 seconds (checksum: 14974415777481871311)
#2
  Normal test: 0.129486935969908 seconds (checksum: 14757395258967641292)
  Faster test: 0.054901993018575 seconds (checksum: 14757395258967641292)
#3
  Normal test: 0.1290586279938 seconds (checksum: 16131858542891098079)
  Faster test: 0.0576003830065019 seconds (checksum: 16131858542891098079)
#4
  Normal test: 0.133851692953613 seconds (checksum: 15263776468834178003)
  Faster test: 0.0581462570116855 seconds (checksum: 15263776468834178003)

Oh, and I noticed another counterintuitive thing: adding @_transparent to the nested func of the “Faster test” slows it down(!) to the same speed as the “Normal test”, ie:

...
// ---- Then the same test, only formulated differently ----
@_transparent // <-- Adding this slows it down to speed of "Normal test"
func unexpectedlyFaster(initializer: (UInt8) -> (UInt64)) {
    var checksum = UInt64(0)
    let t0 = CACurrentMediaTime()
    for e in randomBytes {
        let dst = initializer(e)
        checksum = checksum ^ dst
    }
    let t1 = CACurrentMediaTime()
    print("  Faster test:", t1 - t0, "seconds (checksum: \(checksum))")
}
unexpectedlyFaster(initializer: UInt64.init(rangeConverted:))
...

And, as it turns out, there is an even simpler way of formulating the “Faster Test”. Just wrap the “Normal Test” in a func:

...
// ---- This is the same Normal Test as before ----
var checksum = UInt64(0)
let t0 = CACurrentMediaTime()
for e in randomBytes {
    let dst = UInt64(rangeConverted: e)
    checksum = checksum ^ dst
}
let t1 = CACurrentMediaTime()
print("  Normal test:", t1 - t0, "seconds (checksum: \(checksum))")

// ---- Faster Test by simply wrapping the above test in a func ----
func fasterTestByJustCopyPastingTheNormalTestInsideMe() {
    var checksum = UInt64(0)
    let t0 = CACurrentMediaTime()
    for e in randomBytes {
        let dst = UInt64(rangeConverted: e)
        checksum = checksum ^ dst
    }
    let t1 = CACurrentMediaTime()
    print("  Normal test:", t1 - t0, "seconds (checksum: \(checksum))")
}
fasterTestByJustCopyPastingTheNormalTestInsideMe()
...

So I guess this demonstrates that the compiler fails to perform, in the “Normal Test”, an optimization that it is capable of performing on the exact same code within the nested func used in the “Faster Test” (for some reason)?

Reported this as SR-7023.

I'm bumping this thread since it's been a while and I'm (still) running into various forms of this issue (SR-7023) while writing performance-sensitive code.

It's really hard to avoid running into this issue, and hard to know whether you have, even though the performance hit is significant: the only way (at least for me, in more complex scenarios) to get a hint that your code might be affected is to compare its performance against a version of your code that you know is unaffected, but the only way to know that a version is unaffected is (at least for me, in more complex ...) ...

My code is full of workarounds and comments having to do with this optimizer bug.

Also, more importantly, I'd be surprised if mine is the only code affected by it (though I might be the only one noticing it), so perhaps fixing it will result in performance improvements all over the place, including in parts of the standard library.
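
(For the curious, the workarounds are all variations on the wrapping trick shown earlier in this thread. Here is a minimal sketch of the pattern, with the hypothetical hotPath(_:) standing in for real code:)

// A minimal sketch of the typical workaround (hotPath is a hypothetical
// name): the hot section is wrapped in an immediately called closure so
// that the optimizer treats it as a separate function body.
func hotPath(_ bytes: [UInt8]) -> UInt64 {
    return { () -> UInt64 in
        var checksum: UInt64 = 0
        for e in bytes {
            checksum ^= UInt64(rangeConverted: e)
        }
        return checksum
    }()
}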

I took a quick look at this. What is happening here is that the vectorizer fails when the function has been "hand inlined". This appears to be due to a failing bounds check in the loop. When I run with checks disabled, the slowdown is gone.

I am not yet sure which bounds check it is, though. You can see it if you look at the two pieces of code in Instruments.
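
(For anyone wanting to reproduce that: "checks disabled" means compiling without safety checks, presumably with something like:)

› swiftc -Ounchecked -whole-module-optimization -gnone -static-stdlib test.swift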


Thank you for looking into this! As I’d like to take the opportunity to educate myself, I’ll ask some questions, if you don’t mind:


Question 1:
I assume you are referring to the demonstration program of the bug report (rather than any of the code in this thread)?


Question 2:

Are you referring to the (seemingly meaninglessly wrapping) function fasterTestBySimplyCopyPastingTheAboveNormalTestInsideMe() here? And by "hand inlined", do you mean when the (same) code is not wrapped in such a (seemingly meaningless) function?


Question 3:

By "the two pieces of code", do you mean this (The Normal Test):

    // ---- The Normal Test ----
    var checksum = UInt64(0)
    let t0 = CACurrentMediaTime()
    for e in randomBytes {
        let dst = UInt64(rangeConverted: e)
        checksum = checksum ^ dst
    }
    let t1 = CACurrentMediaTime()
    print("  Normal Test:", t1 - t0, "seconds (checksum: \(checksum))")

and this (The Faster Test):

    // ---- The Faster Test ----
    func fasterTestBySimplyCopyPastingTheAboveNormalTestInsideMe() {
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for e in randomBytes {
            let dst = UInt64(rangeConverted: e)
            checksum = checksum ^ dst
        }
        let t1 = CACurrentMediaTime()
        print("  Faster Test:", t1 - t0, "seconds (checksum: \(checksum))")
    }
    fasterTestBySimplyCopyPastingTheAboveNormalTestInsideMe()

?


Question 4 (assuming the answer is yes to all previous questions):

From what I can understand by profiling The Normal Test and The Faster Test in Time Profiler (using high frequency), it looks like the UInt64(rangeConverted: e) call gets inlined (and further optimized) in The Faster Test, whereas in The Normal Test it does not get inlined, and this is what makes The Normal Test slower.

If this analysis and conclusion are wrong, I'd appreciate learning why.
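
(One way to test this hypothesis, which I haven't tried, would be to route The Normal Test through an always-inline wrapper; a hypothetical sketch:)

// Hypothetical experiment: force inlining of a wrapper around the
// generic initializer, and call this from The Normal Test's loop
// instead of calling the initializer directly.
@inline(__always)
func rangeConvertedForced(_ e: UInt8) -> UInt64 {
    return UInt64(rangeConverted: e)
}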

A detailed description of what I did: I Time Profiled two slightly modified versions of the demonstration program in SR-7023 (using the default toolchain of Xcode 9.3.1, not disabling checks):

  • One version in which I commented out The Faster Test.
  • One version in which I commented out The Normal Test.

Here's an annotated comparison of them:

[annotated Time Profiler screenshots]

So, do you mean that the failing bounds check that you are referring to is somewhere within init<S>(rangeConverted source: S)?

(Or perhaps you profiled the program with and without disabling safety checks, and that's how you noticed the failing bounds check?)

NOTE: I hit enter too fast, so if you read by email you are going to miss most of this.

I am going to respond in more detail later, but the reason why I know it is an issue is by looking at the assembly itself. Keep in mind I did this with:

Apple Swift version 4.1 (swiftlang-902.0.48 clang-902.0.37.1)
Target: x86_64-apple-darwin17.4.0

Fast:

[disassembly screenshot]

Slow:

[disassembly screenshot]

Notice how in the slow version we can leave the loop by falling through to +0x337, which jumps to a ud2. Notice how in the fast loop no such exit exists. (ud2 is the trap instruction that Swift's runtime safety checks, such as array bounds checks, branch to on failure; a conditional exit like that from inside the loop body is what blocks vectorization.)
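
(For reference, the disassembly can also be inspected outside of Instruments, e.g. by letting swiftc emit assembly directly:)

› swiftc -O -emit-assembly test.swift -o test.s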

When I'm looking at the disassembly for the unmodified program in SR-7023, I see the same pieces of asm as you, but some lines are underlined, and when hovering the mouse over them, a tooltip with a function name appears. What do they mean?

Below are the two screenshots corresponding to yours, but also showing those tooltips. Note that I've included two tooltips in the screenshot for Slow, because the last underlined line shows a different function than the rest. (In Fast, all underlined lines show the same function.)

Apple Swift version 4.1 (swiftlang-902.0.48 clang-902.0.37.1)
Target: x86_64-apple-darwin17.5.0

Fast:

[screenshot: SR-7023_asm-fast]

Slow:

[screenshot: SR-7023_asm-slow]


In Slow, all the underlined lines except the last one show a tooltip containing my init<S>(rangeConverted source: S). Does this mean that what I speculated about in my previous post is true: that in the Slow version, the optimizer never inlines the UInt64(rangeConverted: e) call, but it does in the Fast version?

Also, why are some of the lines in Slow indented?

And why does your Fast start at +0x70 while mine starts at +0x520, and your Slow start at +0x300 while mine starts at +0x330? Is this because you profiled a modified version of the demo program, where you extracted the function of The Faster Test, while I profiled the unmodified demo program?

EDIT: I've verified that my guess above was correct, as I get the exact same numbers as you when I extract it like this:

func fasterTestBySimplyCopyPastingTheAboveNormalTestInsideMe2(_ randomBytes: [UInt8]) {
    var checksum = UInt64(0)
    let t0 = CACurrentMediaTime()
    for e in randomBytes {
        let dst = UInt64(rangeConverted: e)
        checksum = checksum ^ dst
    }
    let t1 = CACurrentMediaTime()
    print("  Faster Test:", t1 - t0, "seconds (checksum: \(checksum))")
}

Did you see my comment in the bug report mentioning that if you remove the trials-loop (in the unmodified demo program), then The Normal Test becomes as fast as The Faster Test?

That's why extracting the wrapping function is also fast. But if you put all of the code of your extracted function within a for-in loop or a while loop (even if it performs just one iteration), it becomes slow(!), unless you wrap the code within the loop in a func or an immediately called closure:

Variant A:

func fasterTestBySimplyCopyPastingTheAboveNormalTestInsideMe2(_ randomBytes: [UInt8]) {
    // This single iteration loop magically reverses the results, making
    // this (The Faster Test) slow, and The Normal Test fast!
    for _ in 0 ... 0 {
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for e in randomBytes {
            let dst = UInt64(rangeConverted: e)
            checksum = checksum ^ dst
        }
        let t1 = CACurrentMediaTime()
        print("  Faster Test:", t1 - t0, "seconds (checksum: \(checksum))")
    }
}

Variant B:

func fasterTestBySimplyCopyPastingTheAboveNormalTestInsideMe2(_ randomBytes: [UInt8]) {
    // This single iteration loop magically reverses the results, making
    // this (The Faster Test) slow, and The Normal Test fast!
    for _ in 0 ... 0 {
        let _ = { // <-- Immediately called closure dispels the above magic.
            var checksum = UInt64(0)
            let t0 = CACurrentMediaTime()
            for e in randomBytes {
                let dst = UInt64(rangeConverted: e)
                checksum = checksum ^ dst
            }
            let t1 = CACurrentMediaTime()
            print("  Faster Test:", t1 - t0, "seconds (checksum: \(checksum))")
        }()
    }
}

As explained in the comments: not only will Variant A slow down The Faster Test, it will also make The Normal Test fast (ie reverse the results). And wrapping the contents of the loop in a func or an immediately called closure removes the strange effect of the loop statement.

So, going back to the unmodified demo program of SR-7023. That program was meant to pose the following question: given this exact program, how come the optimizer misses some optimization opportunities (eg inlining the UInt64(rangeConverted: e) call and vectorizing) unless we wrap the relevant section of code in a func (or an immediately called closure)?

And, as shown by my comment about removing the trials-loop, and variants A & B of your extracted wrapped function, it seems like this issue can perhaps be described more clearly like this:

Putting a loop statement around this particular code section will cause some otherwise performed optimizations to be missed, but (surprisingly) this can be worked around by wrapping the code section in an immediately called closure (or a func).

Question: next time, instead of editing the post, can you just make an additional post with the changes? Otherwise I don't get it in email, so I can't answer your questions line by line. I am just going to copy/paste your post into a comment so I can do that.

Or actually maybe I figured out how to do it... sigh I am still learning how to use this software.

Ah, sorry about adding so many edits. I tend to re-read and edit my posts sometimes, trying to improve my attempts at expressing myself clearly about topics I don't really understand, in English ...

No problem. = ).

I am just learning to use Discourse like many others. I found a way to get the output that I want to make it easier to make my points.

Michael

I was referring to a small test program that I wrote based on your program.

Yes.

I used my own slightly changed version of this.

I noticed the failing bounds check by looking at the loop that I mentioned: I noticed that it wasn't vectorized, and then I looked to see why. And I saw that it had the jump to the ud2 from inside the loop. That being said, I wonder if this wouldn't be a good place for an opt-remark that tells you why a bounds check was not eliminated, or something like that.
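
(Side note: LLVM's remark machinery can sometimes be reached through swiftc's -Xllvm flag; I believe something like the following will print missed-vectorization remarks, though I haven't verified it against this exact toolchain:)

› swiftc -O -Xllvm -pass-remarks-missed=loop-vectorize test.swift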


Response to the next one.

Instruments uses debug info (which contains inlining information) to provide you with that pretty name of where the instructions came from /before/ inlining.

Instruments tries to detect loops. If it detects a loop, it indents the assembly. I don't remember if it is Instruments itself that does this, though.

I modified the program slightly so that I could eliminate extraneous variables.

I did see that.

Yes. There is pretty clearly a missing optimization here. Want to try to figure out why? ; ) I can provide you with some pointers.

My head is spinning. Please do (unless your point was that you already did ...).