Missed optimization opportunity?

Optimization of Swift global var/lets into static initializers is done at the SIL level, though, isn't it?

Yes (as it is done in this example)

I'm not sure if I'm misunderstanding something, but here's what I see (using the program below, modified and compiled according to the diagram):

// Times are in seconds (min over 5 trials):
//                         |    within test func    |       top-level        |
//                         |                        |                        |
//                         | .nextDown : .magnitude | .nextDown : .magnitude |
// ------------------------+-----------+------------+-----------+------------+
// Xcode 9.4 def toolchain |     9     :   4e-08    |     9     :   4e-08    |
// ------------------------+-----------+------------+-----------+------------+
// Dev Snapshot 2018-05-30 |   4e-08   :   4e-08    |    0.08   :   4e-08    |
// ------------------------+-----------+------------+-----------+------------+
//
// My interpretation:
//
// 9 seconds means something needs fixing.
// 0.08 seconds means it has been *partly* fixed.
// 4e-08 seconds means it has been fixed.

import AppKit

//func test() {
    let a = Double(1)
    
    var checksum = UInt64(0)
    var minTime = Double.infinity
    for trial in 0 ..< 5 {
        print("#\(trial)")
        let start = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown // <-- Replace with `a.magnitude`, see time diff.
            checksum = checksum &+ v.bitPattern
        }
        let stop = CACurrentMediaTime()
        let time = stop - start
        print("  time:", time, "seconds (checksum: \(checksum)")
        minTime = min(time, minTime)
    }
    print("\nmin time:", minTime, "seconds")
//}
//test()

Note that:

  • 9 seconds is the time for the .nextDown case using the Xcode 9.4 default toolchain, whether at top level or in the func.
  • 0.08 seconds is about 100 times faster than that (9 / 0.08 == 112.5).
  • 4e-08 seconds is still millions of times faster than that (0.08 / 4e-08 == 2_000_000).

So 0.08 seconds (most recent snapshot, the .nextDown case, code at top level) is still millions of times slower than it should be if the issue were entirely fixed, ie the same as the .magnitude case: 4e-08 seconds. (My guess is that while the a.nextDown calculation is being hoisted out of the loop, the further optimization of turning the billion &+ operations into a single multiplication isn't performed, though it is in the .magnitude case.)
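
To be concrete: a fully optimized build effectively collapses the inner loop into a single wrapping multiply. A sketch of what I believe the optimizer produces in the .magnitude case:

let a = Double(1)
var checksum = UInt64(0)
// The billion &+ additions collapse into one wrapping multiply
// (&* binds tighter than &+):
checksum = checksum &+ a.magnitude.bitPattern &* 1_000_000_000
print(checksum)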

This is why I wrote:

And by "equally fast", I mean both should be like .magnitude, which is 4e-08 seconds, not 0.08 seconds

The above (and everything having to do with these issues) is too complex to communicate effectively.

So I'll try to simplify the current situation by starting from scratch, and only focus on:

  • the .nextDown case, and
  • the most recent Development Snapshot.

So make sure you are using Development Snapshot 2018-05-30 and follow these exact steps, because, as you'll see, every little detail matters.


Step 1

This program will be fast (around 4e-08 seconds for me):

import AppKit

func test() {
    for _ in 0 ..< 5 {
        let a = Double(1)
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
}
test()

Step 2

Now, with the following modification, the program will still be fast (4e-08 s), which might lead us to believe that the issue has been fixed even for the top-level case! (Remember that we are only talking about the most recent snapshot here.)

import AppKit

// func test() {
    for _ in 0 ..< 5 {
        let a = Double(1)
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
// }
// test()

Step 3

But if we move let a = Double(1) out of the outer (trials) for-in loop, the program becomes millions of times slower (0.08 seconds for me).

import AppKit

// func test() {
    let a = Double(1) // <-- moved this 
    for _ in 0 ..< 5 {
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
// }
// test()

Step 4

If we now enclose it in the func again, it will be fast (4e-08 s).

import AppKit

func test() {
    let a = Double(1) // <-- this is still here, but moving it back inside the for-in loop doesn't matter when enclosed in func.
    for _ in 0 ..< 5 {
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
}
test()

Step 5

However: If we move let a = Double(1) to top level, it will be millions of times slower again (0.08 s)!

import AppKit

let a = Double(1) // <-- constant moved to top level

func test() {
    for _ in 0 ..< 5 {
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
}
test()

And it doesn't matter if a is declared as a static constant within a type, ie having it like this:

struct S {
    static let a = Double(1)
}

will still be slow (0.08 s).
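
For completeness, here is the full static-constant variant of Step 5 (only the declaration and lookup of a change; everything else is as before):

import AppKit

struct S {
    static let a = Double(1) // <-- static constant instead of a top-level let
}

func test() {
    for _ in 0 ..< 5 {
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = S.a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
}
test()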

This demo (Step 5), especially with a static constant in a type like that, is representative of situations that are very common in real code. Yes, the billion-iteration loop looks a bit contrived, but that's just to keep the demo short, and you'll have to trust me that the underlying issue(s) come up again and again (in different shapes) in non-contrived real code.


So, Step 3 and Step 5 demonstrate that some optimization is still missed for the .nextDown case (while not being missed for eg .magnitude).

Hopefully, this makes it clear that the issue still remains in dev snapshot 2018-05-30.

EDIT: I've since noted that there is a tiny modification that can be made to the Step 5 program to make it fast: replace let a = Double(1) with a formulation that avoids the initializer call and the integer literal, ie any of these:

let a = 1 as Double
let a: Double = 1
let a = 1.0

I think I will try to sum all of this up into a new simplified bug report since SR-7094 has become rather messy.


Btw, you can also use eg .binade instead of .nextDown to get an even bigger time difference when comparing to .magnitude, so this is not specific to .nextDown. The reported issue is meant to be just one specific example of a more general problem, and it would be nice if it ended up being perceived and fixed as such, so that "resolved" would mean that I don't have to repeat this effort of bumping into, isolating, and reporting every single variant (.nextUp, .nextDown, .binade, etc.). This is also why I'm still wondering if/how issues like this are being tested/benchmarked in order to prevent regressions. Just like @xwu wondered some time ago:

Yeah, that looks like a different issue, that the SIL constant folder doesn't look through the Double(Int) constructor when it could, so we don't optimize the global initializer. I think there's another optimizer limitation worth its own issue here: Even if a global let doesn't get its initializer constant-folded, we still ought to treat loads from the global as invariant, since the compiler can assume that a let doesn't change values. That would make the performance behavior less brittle.
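
In the meantime, a workaround sketch: hoist the load by hand by copying the global into a local let (Step 5 with one added line; whether this recovers the fully folded 4e-08 behavior presumably still depends on the LLVM issue mentioned below, so treat it as an experiment rather than a fix):

import AppKit

let a = Double(1) // top-level global, as in Step 5

func test() {
    let localA = a // hand-hoisted: read the global once, outside the loops
    for _ in 0 ..< 5 {
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = localA.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
}
test()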

Would you be able to share examples with more context, if not in this thread, then in another thread? My gut feeling is that this particular example with nextDown/magnitude is too reduced to be representative, since the checksum loop in its ideal form is completely constant-foldable, and, as Erik noted, the variations in behavior are largely LLVM's fault rather than Swift's, since LLVM is doing most of the floating-point analysis work. You're seeing a million-times difference because you're comparing almost no work (when LLVM eliminates the loop by replacing it with checksum = checksum &+ a.<operation>.bitPattern &* 1_000_000_000) against a real loop (when it doesn't).

Before answering in more detail, I just want to make sure that we are on the same page regarding the following observations (using dev snapshot 2018-05-30, and referring to the 5 programs in my last post):

  1. The only two of the above 5 programs that are not "fully optimized" (which includes eliminating the loop in these particular programs) are Step 3 and Step 5.

  2. However, all of those programs (including Step 3 and Step 5) are fully optimized (including eliminating the loop) if we replace .nextDown with .magnitude.

  3. So Step 3 and Step 5 are or are not fully optimized (including eliminating the loop) depending on whether we use .nextDown or .magnitude.

(I failed to mention point 2 in the above post.)

Are you saying that this has to do with LLVM rather than Swift?

There are at least three different compiler shortcomings I see:

  • When you use Double(1) instead of 1.0 as a global initializer, the Swift optimizer fails to turn this into a static initialization. That's a Swift bug. (See the sketch after this list.)
  • When a global or static let property has an initialization expression that doesn't get optimized, then we apparently fail to mark loads from the let as being invariant. That's a Swift bug too.
  • LLVM fails to recognize that the nextDown computation on a floating-point number can be hoisted out of a loop, and instead vectorizes the loop. That's an LLVM bug.
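
A minimal way to see the first of these side by side (the names here are mine; the difference can be confirmed by comparing the SIL emitted by swiftc -O -emit-sil for the two spellings):

// Two spellings of the same global constant:
let viaInitializer = Double(1) // goes through an Int-taking initializer;
                               // per the first point, not turned into a
                               // static initializer today
let viaLiteral = 1.0           // float literal; statically initialized

print(viaInitializer == viaLiteral) // true; they differ only to the optimizer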

But if so, how come eg Step 1 is "fully optimized" (including eliminating the loop):

// Step 1 repeated here for completeness:
import AppKit

func test() {
    for _ in 0 ..< 5 {
        let a = Double(1)
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
}
test()

Doesn't the .nextDown computation have to be hoisted out of the loop in order to eliminate the loop?

In this case, a.nextDown probably gets constant-folded. The LLVM issue appears when a doesn't get constant-propagated, as you've observed when it's made into a global variable. With magnitude, LLVM still appears to observe that adding <something>.magnitude a billion times is equivalent to multiplying it by a billion and adding the product, but for some reason it doesn't reach that conclusion with nextDown, or at least not before it has already vectorized the loop beyond recognition.

Thank you for taking the time to explain this!

So that little 15-line program, and some seemingly insignificant changes to it, can expose at least two Swift bugs and one LLVM bug.

I look forward to these two getting fixed:

Sorry to keep going on about this, but I'd like to take the opportunity to educate myself. So just in case you (or anyone else, cc @scanon) would feel like answering these three quick questions:


Question 1:

Is the following correct:

This is the implementation for .nextUp (in dev snapshot 2018-05-30).

And this is the implementation for .nextDown, defined as -(-self).nextUp.

?
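
(A quick numeric sanity check of that definition:)

let x = 1.0
assert(x.nextDown == -(-x).nextUp) // the quoted definition of nextDown
print(x.nextDown) // 0.9999999999999999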


Question 2:

.nextDown is marked as being both @inlinable and @_transparent, but .nextUp is marked as just @inlinable. Why isn't .nextUp marked as @_transparent?

Your answer to essentially the same question above was that:

Does this mean that all @_transparent attributes can be removed, ie @_transparent is redundant?

You also say that:

If I understand you correctly, this is what turned out to be the case. If so, what are these underlying LLVM primitives?


Question 3:

(Assuming the answer to Question 1 is yes, and that .nextUp should be marked as @_transparent, even though it currently isn't, or that @inlinable isn't working as expected without being paired with @_transparent:)

Since .nextDown is little more than a call to .nextUp (ie return -(-self).nextUp), does that mean .nextDown is effectively as non-@_transparent as .nextUp?


nextUp is effectively the C nextafter function IIUC, and if it were implemented by just forwarding to nextafter and being inlined, then LLVM would probably be able to optimize it. It's not inlinable or transparent otherwise, because I'd expect a full implementation to be too large for inlining to be profitable. Maybe we could still stick an @_effects(readnone) attribute on it (and any other FloatingPoint methods that don't have it yet) to allow it to be optimized as a pure function.
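
A sketch of the forwarding implementation described above (the property name is hypothetical; nextafter comes from the C library via Darwin):

import Darwin

extension Double {
    // Forward to C's nextafter(3), moving toward +infinity,
    // as suggested above.
    var nextUpViaC: Double {
        return nextafter(self, .infinity)
    }
}

print((1.0).nextUpViaC) // 1.0000000000000002
print((1.0).nextUp)     // same value, via the stdlib implementation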

Yeah. nextUp is actually slightly more efficient than the C nextafter function, because it doesn't need to figure out what direction to move. It's right around the size threshold where inlining would make sense in general contexts. And, yes, we should be able to mark it readnone.

I guess eg .binade and .ulp, and a lot of the initializers, should have it. But why limit this improvement to the floating-point types? I'd think most types in the std lib have many properties, methods and initializers that should similarly be marked @_effects(readnone).

I'm really looking forward to trying out a Swift version with @_effects(readnone) added everywhere it makes sense in the std lib.
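
For reference, the annotation looks like this (a sketch with a made-up function; @_effects is an underscored, unofficial attribute, and the optimizer trusts the promise rather than verifying it):

// Promise the optimizer that this function has no side effects and
// reads no global state, so calls to it can be folded or hoisted:
@_effects(readnone)
func scaledUlp(_ x: Double, _ n: Int) -> Double {
    return x.ulp * Double(n)
}

print(scaledUlp(1.0, 3)) // 3 * (1.0).ulp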

The reason for my enthusiasm / eagerness

I'm a bit obsessed with this because almost every time I bump into and investigate significant differences in performance caused by seemingly irrelevant changes to source code, at least part of the explanation turns out to be Swift's current inability to see and utilize the fact that some initializers, properties or methods have ...

... no side effects and no dependencies on the state of the program. They always return an identical result given identical inputs.

For example, I might have some heavily used function that has a loop of only a few (not billions of) iterations, the number of which I expect to be statically knowable, as in eg

for i in 0 ..< MemoryLayout<T>.size {
    ...
}

And I notice that this loop (including its contents) gets optimized to various degrees depending on seemingly irrelevant changes to the loop, its contents, or its context.

An attempt at preventing off-topic discussion

By "seemingly irrelevant", I mean that you cannot understand how they are relevant unless you:

  1. Know an unreasonable amount about the implementation of (your current version of) Swift, including inconsistently marked and/or inefficiently implemented low-level operations in the std lib, shortcomings of the optimizer, etc.

  2. Understand exactly how the above inconsistencies will interact in the context of your code.

This is of course true for all languages in a way, but the number of performance-related inconsistencies and bugs likely to be at play in any given piece of Swift is still so high that it's impossible to develop an intuition for performance; the only way forward seems to be a lot of experiments and bug reports. And further improvements to the benchmark suite.


But perhaps the optimizer isn't taking any advantage of the @_effects attribute yet?
