Why is this closure slower the second time it is used?

Here is a program that measures the time it takes to run an identity closure and an identity function on a hundred million random input values. It measures each of them twice; they are equally fast the first time (as expected), but the second time the closure is slower. I'd like to understand why.

The program:

import AppKit

func measureExecutionTime(label: String, subject: (Int) -> Int) {
    let operationsPerTrial = 100_000_000
    var randomIntValues = [Int](repeating: 0, count: operationsPerTrial)
    var minTime = Double.infinity
    print("\(label):")
    for _ in 0 ..< 5 {
        randomIntValues.withUnsafeMutableBytes { (ptr) -> Void in
            let r = SecRandomCopyBytes(nil, ptr.count, ptr.baseAddress!)
            precondition(r == errSecSuccess)
        }
        var checksum = Int(0)
        let t0 = CACurrentMediaTime()
        for randomInput in randomIntValues {
            let r = subject(randomInput)
            checksum ^= r
        }
        let t1 = CACurrentMediaTime()
        let time = t1 - t0
        print("  time:", time, "seconds (checksum: \(checksum))")
        minTime = min(minTime, time)
    }
    print("  minimum time:", minTime, "seconds\n")
}

let identityClosure: (Int) -> Int = { return $0 }
func identityFunc(input: Int) -> Int { return input }

// Note that both have the same execution time here:
measureExecutionTime(label: "1st use of closure", subject: identityClosure)
measureExecutionTime(label: "1st use of func", subject: identityFunc)

// Note that identityClosure is slower this time:
measureExecutionTime(label: "2nd use of closure", subject: identityClosure)
measureExecutionTime(label: "2nd use of func", subject: identityFunc)

// Why?

Compiling and running it:

> swiftc --version
Apple Swift version 4.1 (swiftlang-902.0.38 clang-902.0.30)
Target: x86_64-apple-darwin17.4.0
> swiftc -O -static-stdlib test.swift
> ./test
1st use of closure:
  time: 0.0631774560024496 seconds (checksum: -5969152766063222341)
  time: 0.0609615579887759 seconds (checksum: 1436576802479868292)
  time: 0.0627239710011054 seconds (checksum: 7897566598316776330)
  time: 0.0592733160010539 seconds (checksum: -1676386092480912224)
  time: 0.0597947819915134 seconds (checksum: 7356946459760327795)
  minimum time: 0.0592733160010539 seconds

1st use of func:
  time: 0.0596228250069544 seconds (checksum: 6666500747078896187)
  time: 0.059224077995168 seconds (checksum: 4253268501060393296)
  time: 0.0597734379989561 seconds (checksum: 3079996028968815475)
  time: 0.059408948989585 seconds (checksum: -39057913403395343)
  time: 0.0593238050059881 seconds (checksum: -1275339559973029455)
  minimum time: 0.059224077995168 seconds

2nd use of closure:
  time: 0.180851501005236 seconds (checksum: -297275842764426349)
  time: 0.191041134006809 seconds (checksum: -2297966356982257278)
  time: 0.181297669012565 seconds (checksum: 4242728420682360232)
  time: 0.19395082097617 seconds (checksum: 4169899910790849794)
  time: 0.183614853012841 seconds (checksum: -7725848402084691842)
  minimum time: 0.180851501005236 seconds

2nd use of func:
  time: 0.0590045459975954 seconds (checksum: -1243511994543017911)
  time: 0.0606653710128739 seconds (checksum: 4379620549680557500)
  time: 0.0616895709827077 seconds (checksum: -3147320019667911741)
  time: 0.0594128520169761 seconds (checksum: 3804868237876071121)
  time: 0.0593436420022044 seconds (checksum: 138671714869980974)
  minimum time: 0.0590045459975954 seconds

(Note that "2nd use of closure" is 0.18 seconds while all the other measurements are 0.06 seconds.)

Wow, there are some really amazing (de-)optimization effects to be found here!

For example,

  • if you add @inline(never) to measureExecutionTime, it actually makes the effect disappear by speeding up the second trial of the closure.
  • if you put all the top-level code (starting with let identityClosure:…) inside func testme() { … } and then invoke that, it also makes the effect disappear.
  • if you pull the definitions of identityClosure and identityFunc out of the body of testme, both invocations of the closure are slow.
  • if you measure the speed of the func first, the closure will be slow.
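The second variation above can be sketched like this (a minimal sketch: `testme` is just the name used in the bullet, the measureExecutionTime calls are elided, and the Bool return exists only so the sketch has something checkable):

```swift
// Sketch of the second variation above: the same definitions moved from
// top level into a function body, so they become locals instead of
// globals, which reportedly makes the slowdown disappear.
func testme() -> Bool {
    let identityClosure: (Int) -> Int = { return $0 }
    func identityFunc(input: Int) -> Int { return input }
    // The measureExecutionTime calls would go here; a trivial
    // equivalence check stands in for them:
    return identityClosure(42) == identityFunc(input: 42)
}

let closuresAgree = testme()
print(closuresAgree)  // true
```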

The usual explanation for something like this is, “the compiler wraps some kind of abstraction penalty, such as a thunk, around identityClosure, and the differences in the context of where/how it is called are just enough complexity that the optimizer gives up before eliminating this penalty.” As for the specifics, you can analyze the generated assembly and/or debug the optimizer, but the best way to find out more (and make the world a better place at the same time) is to file a bug report. The people who work on the optimizer are wizards at this kind of analysis.
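To make the suspected abstraction concrete: when a `func` is used where a function *value* of type `(Int) -> Int` is expected, it is converted to such a value, just like a closure literal, and any thunk the compiler needs would be inserted at that conversion point, invisibly in the source. A minimal sketch (names are illustrative):

```swift
// Both a func and a closure literal become values of type (Int) -> Int
// when bound to a variable or passed as an argument; if the compiler
// inserts a reabstraction thunk, it happens at this conversion and
// cannot be seen in the source code.
func identityFunc(_ x: Int) -> Int { return x }

let funcAsValue: (Int) -> Int = identityFunc
let closureLiteral: (Int) -> Int = { $0 }

print(funcAsValue(7) == closureLiteral(7))  // true
```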


Thanks! Filed [SR-6983] Unexpected drop in performance of (identity-)closure the second time it is used · Issue #49531 · apple/swift · GitHub .

(If anyone can point me to some good beginner-level material on how to analyze the generated assembly, that'd be appreciated. I'm at the level where I can't even locate the relevant pieces of asm the few times I've tried. But it'd be very interesting to see/read about how someone who knows what they are doing would go about analyzing the asm for eg this program.)

Not sure if this helps, but Godbolt (Compiler Explorer) is a nice way of looking at the asm of Swift code.


Nice! But it fails to compile this particular program because the required libraries are missing (the program imports AppKit; it could make do with only Security and QuartzCore, but that didn't work either).

I guess there must be a way to accomplish something similar (highlighting the asm corresponding to the current line of Swift) using Xcode or other tools locally?

In Xcode, adding breakpoints and checking
Debug -> Debug Workflow -> Always Show Disassembly
doesn't seem to work for optimized builds.

You don't have to depend on Security and QuartzCore. Here's a version that only depends on Foundation.

Ah, thanks! (Although I can conclude that I still fail to analyze the asm, even with this nice way of looking at it.)

I'm bumping this thread because

  • This is an interesting performance issue
  • which still affects me
  • and has gotten worse with Xcode 9.4,
  • and I'd also like to ask a couple more general questions (about performance issues and the benchmark suite) for which this particular issue serves as a good example.

First: as noted in SR-6983, this issue has gotten worse (with the default toolchain of Xcode 9.4):

The "2nd use of closure" is now 10 instead of 3 times slower than the others:

The same example, compiled & run as before but with Xcode 9.4:
› swiftc --version
Apple Swift version 4.1.2 (swiftlang-902.0.54 clang-902.0.39.2)
Target: x86_64-apple-darwin17.5.0
› swiftc -O -static-stdlib test.swift
› ./test
1st use of closure:
  time: 0.0632649020117242 seconds (checksum: 581423683991283777)
  time: 0.0617133129999274 seconds (checksum: -7306424433621360502)
  time: 0.0625146849924931 seconds (checksum: 7081978004570742532)
  time: 0.0610871280077845 seconds (checksum: -4065774701487485870)
  time: 0.0624201809987426 seconds (checksum: -2784397570979284854)
  minimum time: 0.0610871280077845 seconds

1st use of func:
  time: 0.0610684939892963 seconds (checksum: 264423982952719367)
  time: 0.0615423239942174 seconds (checksum: 2237211590060124107)
  time: 0.061124823987484 seconds (checksum: -4267020249627113289)
  time: 0.0619161030044779 seconds (checksum: 1792111283221657060)
  time: 0.063269748003222 seconds (checksum: -1589300676560776483)
  minimum time: 0.0610684939892963 seconds

2nd use of closure:
  time: 0.699463352997554 seconds (checksum: -3918483693787300236)
  time: 0.667777561000548 seconds (checksum: 1137324794268155767)
  time: 0.664340454997728 seconds (checksum: -7811258734802726978)
  time: 0.661137435992714 seconds (checksum: 1048546319275511909)
  time: 0.666032235996681 seconds (checksum: 1052028149280526971)
  minimum time: 0.661137435992714 seconds

2nd use of func:
  time: 0.0614374390133889 seconds (checksum: 6639092839694586210)
  time: 0.0611787349916995 seconds (checksum: -8533491229019487789)
  time: 0.060995906998869 seconds (checksum: -4113116250669794077)
  time: 0.0605206390027888 seconds (checksum: -4439758878194388265)
  time: 0.0613778190017911 seconds (checksum: 1964471734991009996)
  minimum time: 0.0605206390027888 seconds

So “2nd use of closure” is now 0.66 seconds. It was 0.18 seconds when I reported the issue (see compiler version in the example run in the report). The other ones are still 0.06 seconds.


test.swift (same as in the original post, repeated here for completeness):
import AppKit

func measureExecutionTime(label: String, subject: (Int) -> Int) {
    let operationsPerTrial = 100_000_000
    var randomIntValues = [Int](repeating: 0, count: operationsPerTrial)
    var minTime = Double.infinity
    print("\(label):")
    for _ in 0 ..< 5 {
        randomIntValues.withUnsafeMutableBytes { (ptr) -> Void in
            let r = SecRandomCopyBytes(nil, ptr.count, ptr.baseAddress!)
            precondition(r == errSecSuccess)
        }
        var checksum = Int(0)
        let t0 = CACurrentMediaTime()
        for randomInput in randomIntValues {
            let r = subject(randomInput)
            checksum ^= r
        }
        let t1 = CACurrentMediaTime()
        let time = t1 - t0
        print("  time:", time, "seconds (checksum: \(checksum))")
        minTime = min(minTime, time)
    }
    print("  minimum time:", minTime, "seconds\n")
}

let identityClosure: (Int) -> Int = { return $0 }
func identityFunc(input: Int) -> Int { return input }

// Note that both have the same execution time here:
measureExecutionTime(label: "1st use of closure", subject: identityClosure)
measureExecutionTime(label: "1st use of func", subject: identityFunc)

// Note that identityClosure is slower this time:
measureExecutionTime(label: "2nd use of closure", subject: identityClosure)
measureExecutionTime(label: "2nd use of func", subject: identityFunc)

// Why?

Questions (about "non-benchmark-able" special cases):

We'll need some background in the form of a conversation that took place in another thread some time ago:

OK:

I'd say most of the performance issues I'm affected by are "non-benchmark-able special cases" as @moiseev calls them above, just like the one described in this thread.

I keep running into them, and most of the time it turns out to be one I have already reported, but isolating them takes time, and it's impossible to build a strategy (as a programmer) for avoiding them; they're just too complex, especially since in real code there are often several of them interacting.

Also, if you are working on something where you don't know what the performance could be, then you have no idea whether you are affected by these issues or not. A safe guess is that you probably are. So if you care about performance (I guess most people don't, but I have to, or the project will be unusable), you'll develop a habit of writing the same thing in many different ways, using fewer abstractions, etc., just so that you might discover that what you have now is actually 100 times slower than it could be, provided the code is written in this magic way. And who knows, maybe you're still affected by some other issue and your computation could actually be 1000 times faster!

I am often unable to reason about or develop an intuition for the performance of my code, and these inconsistent performance issues easily eat up 80% of my work day.

And according to the conversation I quoted, the way forward seems to be to just continue like this: writing my performance-sensitive code in multiple ways to see if I manage to hit or miss any of these issues; if so, trying to isolate them, checking whether I have already reported them, and if not, reporting them. And then (while I continue like this), I should trust that …

… or if it's non-benchmark-able, then …

(Please note that the following questions are sincere, even though I might seem a little bit upset :) .)

@moiseev: What are these other ways?

@palimondo: Is there any way that issues like this (i.e. SR-6983) can be added to the benchmark suite, now or in the future? I suppose it would require that the test be compiled to, and run as, a stand-alone executable (since a harness would inevitably modify the context of, and possibly interfere with, the tested code), and that the test exchange information (which compiler flags to use, and the result) with some test driver (the program that compiles, runs and analyzes the test) via a protocol expressive enough to handle issues like SR-6983.

Also note that, as far as I can see, SR-6983 might be affecting the benchmark suite itself, since each test is passed as a closure, depending on exactly how the harness and the test happen to be written (please correct me if I'm wrong).
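The pattern in question can be sketched like this (a hedged illustration: the names below are made up for this sketch, not the real benchmark-suite API):

```swift
// Illustrative harness shape (not the actual Swift benchmark suite API):
// each test is stored and invoked as a function value, so any per-call
// abstraction penalty on closures would apply to every measured call.
struct Benchmark {
    let name: String
    let run: (Int) -> Void
}

var total = 0
let benchmarks = [
    Benchmark(name: "SumRange", run: { n in total = (0 ..< n).reduce(0, +) }),
]
for b in benchmarks {
    b.run(100)
}
print(total)  // 4950
```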


One common theme I see between this test case and your other test case is that you're testing using top-level code, which is going to get emitted using global variables, and this likely leads to less predictable optimization behavior than wrapping the test case in a function. From @dabrahams' comment it sounds like this might be at play here too. You might get more predictable behavior, more representative of what you'd see in real code, writing these tests as functions instead of as top-level code. There's no good fundamental reason for top-level code to be different from a function, so it's still great that you're uncovering these issues, but if you're getting frustrated trying to understand the optimizer's behavior, avoiding top-level code might at least eliminate one variable for now.


Yes, I'm aware of that, and I usually avoid top-level code, as for example in SR-7023. But as I just realized, an (ugly but interesting) workaround that worked for SR-7023 also works for SR-6983, so they are related; they might well have originated from two different real-world situations (followed by isolation/reduction sessions) where the underlying issue was the same.

In the other test case, it doesn't matter for the result if you stick it in a function:

import AppKit

func test() {
    let a = Double(1)
    var checksum = UInt64(0)
    for _ in 0 ..< 5 {
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
}
test()

It is still slow. I think I reduced the original issue to this form first, and then checked whether it behaved the same with the enclosing func removed. It did, so I reported it in that form since it was an even simpler program.


But it does make a difference for this test case.


Hm, that test case completely constant-folds away with a top-of-tree compiler:

  ...
  %8 = tail call double @CACurrentMediaTime()
  %9 = add nuw nsw i64 %6, 1
  %10 = add i64 %7, 6917529026641081856
  %11 = tail call double @CACurrentMediaTime()
  ...
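
As a sanity check on the folded constant (my own arithmetic, not from the IR dump): the magic immediate is exactly one billion wrapping additions of `1.0.nextDown.bitPattern` collapsed into a single wrapping multiply:

```swift
// The inner loop adds Double(1).nextDown.bitPattern to the checksum
// 1_000_000_000 times with wrapping arithmetic; the optimizer folds
// that into a single wrapping multiply, which yields the add-immediate
// seen in the IR above.
let bits = Double(1).nextDown.bitPattern   // 0x3FEFFFFFFFFFFFFF
let folded = bits &* 1_000_000_000
print(folded == 6917529026641081856)       // true: matches the constant in %10
```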

Yes, but not if the code is at top level, or if you just move let a = Double(1) out of test() to top level.

I just posted about this in the other thread.