When I'm looking at the disassembly for the unmodified program in SR-7023, I see the same pieces of asm as you, but some lines are underlined, and when hovering the mouse over them, a tooltip with a function name appears. What do they mean?
Below are the two screenshots corresponding to yours, but also showing those tooltips. Note that I've included two tooltips in the screenshot for Slow, because the last underlined line shows a different function than the rest. (In Fast, all underlined lines show the same function.)
Apple Swift version 4.1 (swiftlang-902.0.48 clang-902.0.37.1)
Target: x86_64-apple-darwin17.5.0
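Fast:
[screenshot: disassembly of Fast, with tooltip]

Slow:
[screenshot: disassembly of Slow, with two tooltips]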
In Slow, all the underlined lines except the last one show a tooltip containing my init<S>(rangeConverted source: S). Does this mean that what I speculated about in my previous post is true: that in The Slow Test, the optimizer never inlines the UInt64(rangeConverted: e) call, but in The Faster Test it does?
Also, why are some of the lines in Slow indented?
And why does your Fast start at +0x70 while mine starts at +0x520, and your Slow start at +0x300 while mine starts at +0x330? Is this because you profiled a modified version of the demo program, in which you extracted The Faster Test into its own function, while I profiled the unmodified demo program?
EDIT: I've verified that my above guess was the case, as I get the exact same numbers as you when I extract it like this:
func fasterTestBySimplyCopyPastingTheAboveNormalTestInsideMe2(_ randomBytes: [UInt8]) {
    var checksum = UInt64(0)
    let t0 = CACurrentMediaTime()
    for e in randomBytes {
        let dst = UInt64(rangeConverted: e)
        checksum = checksum ^ dst
    }
    let t1 = CACurrentMediaTime()
    print(" Faster Test:", t1 - t0, "seconds (checksum: \(checksum))")
}
Did you see my comment in the bug report mentioning that if you remove the trials-loop (in the unmodified demo program), then The Normal Test becomes as fast as The Faster Test?
That's why the extracted function is also fast. But if you put all of the code of your extracted function inside a for-in loop or a while loop (even one that performs just a single iteration), it becomes slow(!), unless you wrap the code within the loop in a func or an immediately called closure:
Variant A:
func fasterTestBySimplyCopyPastingTheAboveNormalTestInsideMe2(_ randomBytes: [UInt8]) {
    // This single iteration loop magically reverses the results, making
    // this (The Faster Test) slow, and The Normal Test fast!
    for _ in 0 ... 0 {
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for e in randomBytes {
            let dst = UInt64(rangeConverted: e)
            checksum = checksum ^ dst
        }
        let t1 = CACurrentMediaTime()
        print(" Faster Test:", t1 - t0, "seconds (checksum: \(checksum))")
    }
}
Variant B:
func fasterTestBySimplyCopyPastingTheAboveNormalTestInsideMe2(_ randomBytes: [UInt8]) {
    // This single iteration loop magically reverses the results, making
    // this (The Faster Test) slow, and The Normal Test fast!
    for _ in 0 ... 0 {
        let _ = { // <-- Immediately called closure dispels the above magic.
            var checksum = UInt64(0)
            let t0 = CACurrentMediaTime()
            for e in randomBytes {
                let dst = UInt64(rangeConverted: e)
                checksum = checksum ^ dst
            }
            let t1 = CACurrentMediaTime()
            print(" Faster Test:", t1 - t0, "seconds (checksum: \(checksum))")
        }()
    }
}
As explained in the comments: not only does Variant A slow down The Faster Test, it also makes The Normal Test fast (i.e., it reverses the results). Wrapping the contents of the loop in a func or an immediately called closure, as in Variant B, removes the strange effect of the loop statement.
So, going back to the unmodified demo program of SR-7023: that program was meant to pose the following question: Given this exact program, how come the optimizer misses some optimization opportunities (e.g., inlining the UInt64(rangeConverted: e) call and vectorizing) unless we wrap the relevant section of code in a func (or an immediately called closure)?
And, as shown by my comment about removing the trials-loop, and variants A & B of your extracted wrapped function, it seems like this issue can perhaps be described more clearly like this:
Putting a loop statement around this particular code section will cause some otherwise performed optimizations to be missed, but (surprisingly) this can be worked around by wrapping the code section in an immediately called closure (or a func).
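To make that description concrete, here is a minimal, self-contained sketch of the two shapes being compared. It is not the SR-7023 demo program: the init(rangeConverted:) below is a hypothetical stand-in (a plain widening conversion) for the real initializer, the function names checksumDirectlyInLoop, checksumViaClosure and timed are made up for this illustration, and Date() replaces CACurrentMediaTime() so that only Foundation is needed. Whether the two shapes actually differ in speed will depend on the compiler version and optimization flags (the behaviour discussed here was observed with Swift 4.1 in an optimized build), so no timings are claimed.

```swift
import Foundation

// Hypothetical stand-in for the initializer from the demo program.
// Assumption: a plain widening conversion is enough to show the call shape.
extension UInt64 {
    init<S: FixedWidthInteger & UnsignedInteger>(rangeConverted source: S) {
        self = UInt64(truncatingIfNeeded: source)
    }
}

// Shape 1: the code section sits directly inside a loop statement.
func checksumDirectlyInLoop(_ bytes: [UInt8], trials: Int) -> UInt64 {
    var last = UInt64(0)
    for _ in 0 ..< trials {
        var checksum = UInt64(0)
        for e in bytes {
            checksum ^= UInt64(rangeConverted: e)
        }
        last = checksum
    }
    return last
}

// Shape 2: the same code section, wrapped in an immediately called closure.
func checksumViaClosure(_ bytes: [UInt8], trials: Int) -> UInt64 {
    var last = UInt64(0)
    for _ in 0 ..< trials {
        last = { () -> UInt64 in
            var checksum = UInt64(0)
            for e in bytes {
                checksum ^= UInt64(rangeConverted: e)
            }
            return checksum
        }()
    }
    return last
}

// Small timing helper; returns elapsed seconds and the computed checksum.
func timed(_ body: () -> UInt64) -> (seconds: Double, checksum: UInt64) {
    let t0 = Date()
    let checksum = body()
    return (Date().timeIntervalSince(t0), checksum)
}

// Deterministic pseudo-random-looking input, so runs are comparable.
let bytes: [UInt8] = (0 ..< 1_000_000).map { UInt8(truncatingIfNeeded: $0 &* 31 &+ 7) }

let direct = timed { checksumDirectlyInLoop(bytes, trials: 5) }
let wrapped = timed { checksumViaClosure(bytes, trials: 5) }
print(" Directly in loop:", direct.seconds, "seconds (checksum: \(direct.checksum))")
print(" Closure in loop: ", wrapped.seconds, "seconds (checksum: \(wrapped.checksum))")
```

Both functions compute the same checksum, so any timing difference between them in an optimized build would come from how the optimizer treats the two shapes rather than from the work itself.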