Here is a program that measures the time it takes to run an identity closure and an identity func on a hundred million random input values. It measures each of them twice; the first time they are equally fast (as expected), but the second time the closure is slower. I'd like to understand why.
The program:
```swift
import AppKit

func measureExecutionTime(label: String, subject: (Int) -> Int) {
    let operationsPerTrial = 100_000_000
    var randomIntValues = [Int](repeating: 0, count: operationsPerTrial)
    var minTime = Double.infinity
    print("\(label):")
    for _ in 0 ..< 5 {
        randomIntValues.withUnsafeMutableBytes { (ptr) -> Void in
            let r = SecRandomCopyBytes(nil, ptr.count, ptr.baseAddress!)
            precondition(r == errSecSuccess)
        }
        var checksum = Int(0)
        let t0 = CACurrentMediaTime()
        for randomInput in randomIntValues {
            let r = subject(randomInput)
            checksum ^= r
        }
        let t1 = CACurrentMediaTime()
        let time = t1 - t0
        print(" time:", time, "seconds (checksum: \(checksum))")
        minTime = min(minTime, time)
    }
    print(" minimum time:", minTime, "seconds\n")
}

let identityClosure: (Int) -> Int = { return $0 }
func identityFunc(input: Int) -> Int { return input }

// Note that both have the same execution time here:
measureExecutionTime(label: "1st use of closure", subject: identityClosure)
measureExecutionTime(label: "1st use of func", subject: identityFunc)

// Note that identityClosure is slower this time:
measureExecutionTime(label: "2nd use of closure", subject: identityClosure)
measureExecutionTime(label: "2nd use of func", subject: identityFunc)

// Why?
```
Wow, there are some really amazing (de-)optimization effects to be found here!
For example:

- If you add `@inline(never)` to `measureExecutionTime`, the effect actually disappears: the second trial on the closure speeds up.
- If you put all the top-level code (starting with `let identityClosure:…`) inside `func testme() { … }` and then invoke that, the effect also disappears.
- If you pull the definitions of `identityClosure` and `identityFunc` out of the body of `testme`, both invocations of the closure are slow.
- If you measure the speed of the func first, the closure will be slow.
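A minimal sketch of the first variation above (marking the measuring function `@inline(never)`). It uses Foundation's `Date` instead of `CACurrentMediaTime` and a much smaller iteration count, so the absolute timings won't match the original program, and whether the effect reproduces at all depends on the compiler version and optimization level:

```swift
import Foundation

// Measuring harness with @inline(never) applied, as in the first bullet
// above. The attribute keeps the optimizer from inlining the harness
// into each call site.
@inline(never)
func measure(label: String, subject: (Int) -> Int) -> Int {
    var checksum = 0
    let t0 = Date()
    for input in 0 ..< 1_000_000 {
        checksum ^= subject(input)
    }
    print("\(label): \(Date().timeIntervalSince(t0)) seconds")
    return checksum
}

let identityClosure: (Int) -> Int = { $0 }
func identityFunc(input: Int) -> Int { input }

// Both paths must produce the same checksum; only the timing can differ.
let c1 = measure(label: "1st use of closure", subject: identityClosure)
let f1 = measure(label: "1st use of func", subject: identityFunc)
let c2 = measure(label: "2nd use of closure", subject: identityClosure)
precondition(c1 == f1 && c1 == c2)
```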
The usual explanation for something like this is, “the compiler wraps some kind of abstraction penalty, such as a thunk, around identityClosure, and the differences in context of where/how it is called are just enough complexity that the optimizer gives up before eliminating this penalty.” As for the specifics, you can analyze the generated assembly and/or debug the optimizer, but the best way to find out more (and make the world a better place at the same time) would be to file a bug report. The people who work on the optimizer are wizards at this kind of analysis.
(If anyone can point me to some good beginner-level material on how to analyze the generated assembly, that'd be appreciated. I'm at the level where I can't even locate the relevant pieces of asm the few times I've tried. But it'd be very interesting to see/read how someone who knows what they're doing would go about analyzing the asm for, e.g., this program.)
Nice! But it fails to compile this particular program because the required libraries are missing. (The program imports AppKit, though it could manage with only Security and QuartzCore; unfortunately, that didn't work either.)
I guess there must be a way to accomplish something similar (highlighting the asm corresponding to the current line of Swift) using Xcode or other tools locally?
In Xcode, adding breakpoints and enabling Debug -> Debug Workflow -> Always Show Disassembly doesn't seem to work for optimized builds.
I'd also like to ask a couple of more general questions (about performance issues and the benchmark suite) for which this particular issue serves as a good example.
First: as noted in SR-6983, this issue has gotten worse (with the default toolchain of Xcode 9.4). The "2nd use of closure" case is now 10 times slower than the others, instead of 3 times:
Click to see the same example compile & run as before but with Xcode 9.4
So “2nd use of closure” is now 0.66 seconds. It was 0.18 seconds when I reported the issue (see compiler version in the example run in the report). The other ones are still 0.06 seconds.
Click to see test.swift (same as in OT, repeated here for completeness)
```swift
import AppKit

func measureExecutionTime(label: String, subject: (Int) -> Int) {
    let operationsPerTrial = 100_000_000
    var randomIntValues = [Int](repeating: 0, count: operationsPerTrial)
    var minTime = Double.infinity
    print("\(label):")
    for _ in 0 ..< 5 {
        randomIntValues.withUnsafeMutableBytes { (ptr) -> Void in
            let r = SecRandomCopyBytes(nil, ptr.count, ptr.baseAddress!)
            precondition(r == errSecSuccess)
        }
        var checksum = Int(0)
        let t0 = CACurrentMediaTime()
        for randomInput in randomIntValues {
            let r = subject(randomInput)
            checksum ^= r
        }
        let t1 = CACurrentMediaTime()
        let time = t1 - t0
        print(" time:", time, "seconds (checksum: \(checksum))")
        minTime = min(minTime, time)
    }
    print(" minimum time:", minTime, "seconds\n")
}

let identityClosure: (Int) -> Int = { return $0 }
func identityFunc(input: Int) -> Int { return input }

// Note that both have the same execution time here:
measureExecutionTime(label: "1st use of closure", subject: identityClosure)
measureExecutionTime(label: "1st use of func", subject: identityFunc)

// Note that identityClosure is slower this time:
measureExecutionTime(label: "2nd use of closure", subject: identityClosure)
measureExecutionTime(label: "2nd use of func", subject: identityFunc)

// Why?
```
Questions (about "non-benchmark-able" special cases):
We'll need some background in the form of a conversation that took place in another thread some time ago:
OK:
I'd say most of the performance issues I'm affected by are "non-benchmark-able special cases" as @moiseev calls them above, just like the one described in this thread.
I keep running into them, and most of the time it turns out to be one that I have already reported. But isolating them takes time, and it's impossible (as a programmer) to build a strategy for avoiding them; they're just too complex, especially since in real code there are often several of them interacting.
Also, if you are working on something where you don't know what the performance could be, then you have no idea whether you are affected by these issues or not; a safe guess is that you probably are. So if you care about performance (I guess most people don't, but I have to, or the project will be unusable), you'll develop a habit of writing the same thing in many different ways, using fewer abstractions, etc., just so that you might discover that what you have now is actually 100 times slower than it could be, provided the code is written in this magic way. And who knows, maybe you're still affected by some other issue and your computation could actually be 1000 times faster!
I am often unable to reason about or develop an intuition for the performance of my code, and these inconsistent performance issues easily eat up 80% of my work day.
And according to the conversation I quoted, the way forward seems to be to just continue like this: writing my performance-sensitive code in multiple ways to see if I manage to hit or miss any of these issues; if so, trying to isolate them, checking whether I've already reported them, and reporting them if not. And then (while I continue like this), I should trust that …
… or if it's non-benchmark-able, then …
(Please note that the following questions are sincere, even though I might seem a little bit upset :) .)
@palimondo: Is there any way that issues like this (i.e. SR-6983) can be added to the benchmark suite, now or in the future? I suppose it would require that the test be compiled to, and run as, a standalone executable (since a harness would inevitably modify the context of, and possibly interfere with, the tested code), and that the test interchange information (compiler flags to use, and the result) with some test driver (the program which compiles, runs, and analyzes the test) via some protocol that is expressive enough for handling issues like SR-6983.
Also note that SR-6983 might, afaics, be affecting the benchmark suite itself, since the test is passed as a closure, depending on exactly how the harness and the test happen to be written (please correct me if I'm wrong).
One common theme I see between this test case and your other test case is that you're testing using top-level code, which is going to get emitted using global variables, and this likely leads to less predictable optimization behavior than wrapping the test case in a function. From @dabrahams' comment it sounds like this might be at play here too. You might get more predictable behavior, more representative of what you'd see in real code, writing these tests as functions instead of as top-level code. There's no good fundamental reason for top-level code to be different from a function, so it's still great that you're uncovering these issues, but if you're getting frustrated trying to understand the optimizer's behavior, avoiding top-level code might at least eliminate one variable for now.
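For illustration, here is a self-contained sketch of the contrast being suggested: the same measurement driven once from top-level code (through a global closure) and once from inside a function (through a local one). The names (`globalClosure`, `timeIt`, `wrapped`) are mine, timing uses Foundation's `Date` for portability, and whether the two timings actually diverge depends on the compiler version and optimization level:

```swift
import Foundation

// Top-level `let`s like this one are emitted as global variables,
// which is the less predictable case discussed above.
let globalClosure: (Int) -> Int = { $0 }

func timeIt(_ subject: (Int) -> Int) -> (checksum: Int, seconds: Double) {
    var checksum = 0
    let t0 = Date()
    for input in 0 ..< 1_000_000 {
        checksum ^= subject(input)
    }
    return (checksum, Date().timeIntervalSince(t0))
}

// Top-level use: the subject comes from a global.
let topLevel = timeIt(globalClosure)

// Function-wrapped use: the subject is a plain local value.
func wrapped() -> (checksum: Int, seconds: Double) {
    let localClosure: (Int) -> Int = { $0 }
    return timeIt(localClosure)
}
let inFunction = wrapped()

print("top-level:", topLevel.seconds, "in-function:", inFunction.seconds)
precondition(topLevel.checksum == inFunction.checksum)
```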
Yes, I'm aware of that, and I usually avoid top-level code, as in SR-7023 for example. But as I just realized, an (ugly but interesting) workaround that worked for SR-7023 also works for SR-6983, so they are related; they might well have originated from two different real-world situations (followed by isolation/reduction sessions) where the underlying issue was the same.
In the other test case, it doesn't matter for the result if you stick it in a function:
```swift
import AppKit

func test() {
    let a = Double(1)
    var checksum = UInt64(0)
    for _ in 0 ..< 5 {
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
}

test()
```
It is still slow. I think I first reduced the original issue to this form, and then checked whether it behaved the same with the enclosing func removed; it did, so I reported it that way because it was an even simpler program.
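One hedged side note on the reduced loop: since `a` never changes, `a.nextDown` is loop-invariant and can be hoisted by hand, which sidesteps the question of whether the optimizer hoists it. A minimal sketch (`testHoisted` is my own variant, not from the report; it uses Foundation's `Date` for portability and a smaller iteration count, and whether this actually recovers the lost speed depends on the compiler version):

```swift
import Foundation

func testHoisted() -> UInt64 {
    let a = Double(1)
    var checksum = UInt64(0)
    let hoisted = a.nextDown  // computed once instead of per iteration
    let t0 = Date()
    for _ in 0 ..< 1_000_000 {
        checksum = checksum &+ hoisted.bitPattern
    }
    print(Date().timeIntervalSince(t0), "seconds, checksum:", checksum)
    return checksum
}

let _ = testHoisted()
```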