Missed optimization opportunity?

I filed the performance regression as SR-7094, asking about such tests, and rambling about my wish for a more predictable compiler/optimizer, which I guess boils down to a proper way of marking and tracking some notion of purity or side effects.

1 Like

I'm bumping this thread as part of my little mission to try and bring more focus to performance issues in general.

I'd also like to ask two questions regarding this particular issue:


Question 1:

What exactly is the difference between .nextDown and .magnitude that makes such a big difference to the execution speed of the test program below?


Question 2:

It seems like the optimizer is able to hoist the calculation of

let v = a.magnitude

out of the loop, as expected, while for some reason (ie the answer to Question 1) it is unable to do so for

let v = a.nextDown

If I'm correct about this, is it hard to fix for .nextDown and all other properties in the std lib that are affected by the same issue?


import AppKit

let a = Double(1)

var checksum = UInt64(0)
var minTime = Double.infinity
for trial in 0 ..< 5 {
    print("#\(trial)")
    let start = CACurrentMediaTime()
    for _ in 0 ..< 1_000_000_000 {
        let v = a.nextDown // <-- Replace with `a.magnitude` and see time diff.
        checksum = checksum &+ v.bitPattern
    }
    let stop = CACurrentMediaTime()
    let time = stop - start
    print("  time:", time, "seconds (checksum: \(checksum)")
    minTime = min(time, minTime)
}
print("\nmin time:", minTime, "seconds")

Example compile and run on my late-2013 MBP (2 GHz i7), default toolchain:

$ swiftc --version
Apple Swift version 4.1 (swiftlang-902.0.48 clang-902.0.37.1)
Target: x86_64-apple-darwin17.5.0
$ swiftc -O test.swift
$ ./test
#0
  time: 13.9918197880033 seconds (checksum: 6917529026641081856)
#1
  time: 14.0257235940007 seconds (checksum: 13835058053282163712)
#2
  time: 13.988855939002 seconds (checksum: 2305843006213693952)
#3
  time: 14.0062829500021 seconds (checksum: 9223372032854775808)
#4
  time: 14.0546091610013 seconds (checksum: 16140901059495857664)

min time: 13.988855939002 seconds

So that's a min time of 14 seconds. Now, replacing .nextDown with .magnitude will result in a dramatic speed increase:

...

min time: 4.00032149627805e-08 seconds

That's about 350 million times faster (14 / 4e-08 ≈ 3.5e8).


Another little rant:

If I compile the program with a recent dev snapshot (2018-05-26 in this case) instead of with the default toolchain of Xcode 9.3.1, then it will be around 120 times faster for .nextDown. So things have moved in the right direction since the default toolchain, but 120 times faster is nowhere near hundreds of millions times faster, so .nextDown is still causing some work in the loop when using the recent snapshot.

Of course I could manually hoist v out of the for loop in cases like this. But, apart from this being a typical task for an optimizer and not a human, you have to remember that this is just a very simple example program designed to make this issue super obvious. In real code, the underlying cause of this particular issue shows up in different ways and in much more involved cases. I realize that I have to take care of some "low level" optimizations myself, like designing my algorithms so that they avoid cache thrashing etc, but I really wish I could trust the optimizer with seemingly simple (for a computer) tasks like this.
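
For reference, manually hoisting v in the program above would look like this (ie what the optimizer ought to do automatically):

let v = a.nextDown // computed once, before the loop
for _ in 0 ..< 1_000_000_000 {
    checksum = checksum &+ v.bitPattern
}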

I really believe that increasing the priority of performance (fixing bugs like these, coming up with and adding tests like these to the test suite, etc.) would turn out to be well worth the effort, benefiting not just people like me (who try to write performance-sensitive code in Swift rather than giving up and using C), but Swift users in general.

As it currently stands, programs compiled with the latest official release of the compiler can be hundreds of millions of times less efficient than they would be if there had been more focus on performance, which I'd expect by now, especially from a language called Swift.

1 Like

It sounds like nextDown must've been annotated with purity attributes between 4.1 and now, telling the compiler that it's safe to hoist it out of loops. C and C++ would've had this exact same issue if, for some reason, nextafter() had not been annotated as __attribute__((pure)) (or if LLVM hadn't been hardcoded with assumptions about the semantics of the C standard library).

But how come using .magnitude instead of .nextDown makes the program millions of times faster, even when using the most recent snapshot (Question 1 above)?

(The improvements in the recent snapshot make the .nextDown case about a hundred times faster, but .magnitude will still be millions of times faster than that. So it seems like there's still something different between them in the recent snapshot. If v were hoisted out of the loop in both the .magnitude and the .nextDown case, then the reported times would be equally fast.)

cc @Erik_Eckstein. It looks like the optimizer is failing to do LICM (loop-invariant code motion) in either case. With magnitude, it's instead turning the addition loop into a multiplication by 1 billion (though it still computes the magnitude):

  %152 = tail call double @CACurrentMediaTime()
  %153 = load double, double* getelementptr inbounds (%TSd, %TSd* @"$S3foo1aSdvp", i64 0, i32 0), align 8
  %154 = tail call double @llvm.fabs.f64(double %153)
  %155 = bitcast double %154 to i64
  %.promoted = load i64, i64* getelementptr inbounds (%Ts6UInt64V, %Ts6UInt64V* @"$S3foo8checksums6UInt64Vvp", i64 0, i32 0), align 8
  %156 = mul i64 %155, 1000000000
  %157 = add i64 %.promoted, %156
  store i64 %157, i64* getelementptr inbounds (%Ts6UInt64V, %Ts6UInt64V* @"$S3foo8checksums6UInt64Vvp", i64 0, i32 0), align 8
  %158 = tail call double @CACurrentMediaTime()

On the other hand, with nextDown, it appears to be trying to vectorize the loop for no good reason:

  %152 = tail call double @CACurrentMediaTime()
  %153 = load double, double* getelementptr inbounds (%TSd, %TSd* @"$S3foo1aSdvp", i64 0, i32 0), align 8
  %154 = fsub double 0.000000e+00, %153
  %155 = fcmp olt double %154, 0x7FF0000000000000
  %156 = bitcast double %154 to i64
  %157 = ashr i64 %156, 63
  %158 = or i64 %157, 1
  %159 = add i64 %158, %156
  %160 = bitcast i64 %159 to double
  %161 = select i1 %155, double %160, double %154
  %162 = fsub double -0.000000e+00, %161
  %163 = bitcast double %162 to i64
  %.promoted = load i64, i64* getelementptr inbounds (%Ts6UInt64V, %Ts6UInt64V* @"$S3foo8checksums6UInt64Vvp", i64 0, i32 0), align 8
  %broadcast.splatinsert136 = insertelement <2 x i64> undef, i64 %163, i32 0
  %broadcast.splat137 = shufflevector <2 x i64> %broadcast.splatinsert136, <2 x i64> undef, <2 x i32> zeroinitializer
  %broadcast.splatinsert138 = insertelement <2 x i64> undef, i64 %163, i32 0
  %broadcast.splat139 = shufflevector <2 x i64> %broadcast.splatinsert138, <2 x i64> undef, <2 x i32> zeroinitializer
  %164 = insertelement <2 x i64> <i64 undef, i64 0>, i64 %.promoted, i32 0
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %"$SSa13_adoptStorage_5countSayxG_SpyxGts016_ContiguousArrayB0CyxGn_SitFZSS_Tg5Tf4nnd_n.exit"
  %index = phi i64 [ 0, %"$SSa13_adoptStorage_5countSayxG_SpyxGts016_ContiguousArrayB0CyxGn_SitFZSS_Tg5Tf4nnd_n.exit" ], [ %index.next.7, %vector.body ]
  %vec.phi = phi <2 x i64> [ %164, %"$SSa13_adoptStorage_5countSayxG_SpyxGts016_ContiguousArrayB0CyxGn_SitFZSS_Tg5Tf4nnd_n.exit" ], [ %179, %vector.body ]
  %vec.phi134 = phi <2 x i64> [ zeroinitializer, %"$SSa13_adoptStorage_5countSayxG_SpyxGts016_ContiguousArrayB0CyxGn_SitFZSS_Tg5Tf4nnd_n.exit" ], [ %180, %vector.body ]
  %165 = add <2 x i64> %vec.phi, %broadcast.splat137
  %166 = add <2 x i64> %vec.phi134, %broadcast.splat139
  %167 = add <2 x i64> %165, %broadcast.splat137
  %168 = add <2 x i64> %166, %broadcast.splat139
  %169 = add <2 x i64> %167, %broadcast.splat137
  %170 = add <2 x i64> %168, %broadcast.splat139
  %171 = add <2 x i64> %169, %broadcast.splat137
  %172 = add <2 x i64> %170, %broadcast.splat139
  %173 = add <2 x i64> %171, %broadcast.splat137
  %174 = add <2 x i64> %172, %broadcast.splat139
  %175 = add <2 x i64> %173, %broadcast.splat137
  %176 = add <2 x i64> %174, %broadcast.splat139
  %177 = add <2 x i64> %175, %broadcast.splat137
  %178 = add <2 x i64> %176, %broadcast.splat139
  %179 = add <2 x i64> %177, %broadcast.splat137
  %180 = add <2 x i64> %178, %broadcast.splat139
  %index.next.7 = add nuw nsw i64 %index, 32
  %181 = icmp eq i64 %index.next.7, 1000000000
  br i1 %181, label %middle.block, label %vector.body, !llvm.loop !45

middle.block:                                     ; preds = %vector.body
  %bin.rdx = add <2 x i64> %180, %179
  %rdx.shuf = shufflevector <2 x i64> %bin.rdx, <2 x i64> undef, <2 x i32> <i32 1, i32 undef>
  %bin.rdx140 = add <2 x i64> %bin.rdx, %rdx.shuf
  %182 = extractelement <2 x i64> %bin.rdx140, i32 0
  store i64 %182, i64* getelementptr inbounds (%Ts6UInt64V, %Ts6UInt64V* @"$S3foo8checksums6UInt64Vvp", i64 0, i32 0), align 8
  %183 = tail call double @CACurrentMediaTime()

which is arguably an LLVM bug. There's still a bug on the Swift side that we aren't propagating the constant value of a. This appears to be specific to global code. If I factor the loop out into a local function, I get more reasonable-looking IR, in which both the constant value of a gets propagated and the addition loop gets completely constant folded:

/*
let a = 1.0

func loop(checksum: inout UInt64) {
  for _ in 0 ..< 1_000_000_000 {
    let v = a.magnitude // <-- Replace with `a.nextDown` and you get the same thing
    checksum = checksum &+ v.bitPattern
  }
}
*/

define hidden swiftcc void @"$S3foo4loop8checksumys6UInt64Vz_tF"(%Ts6UInt64V* nocapture dereferenceable(8)) local_unnamed_addr #0 {
entry:
  %._value = getelementptr inbounds %Ts6UInt64V, %Ts6UInt64V* %0, i64 0, i32 0
  %._value.promoted = load i64, i64* %._value, align 8
  %1 = add i64 %._value.promoted, 6917529027641081856
  store i64 %1, i64* %._value, align 8
  ret void
}

Possibly, because we model the variables in top-level code as global variables, the compiler is being unnecessarily conservative with them?

2 Likes

It looks like on master it's already fixed. LLVM can constant propagate everything with a top-of-master compiler.

This optimization is done by LLVM and not the SIL optimizer. Floating point constant propagation is pretty limited in the SIL optimizer. Which is okay as long as no other SIL optimizations are blocked by this.

Possibly, because we model the variables in top-level code as global variables

Yes, those are global variables.

Optimization of Swift global var/lets into static initializers is done at the SIL level, though, isn't it?

Yes (as it is done in this example)

1 Like

I'm not sure if I'm misunderstanding something, but here's what I see (using the program below, modified and compiled according to the table):

//                         |    within test func    |       top-level        |
//                         |                        |                        |
//                         | .nextDown : .magnitude | .nextDown : .magnitude |
// ------------------------+-----------+------------+-----------+------------+
// Xcode 9.4 def toolchain |     9     :   4e-08    |     9     :   4e-08    |
// ------------------------+-----------+------------+-----------+------------+
// Dev Snapshot 2018-05-30 |   4e-08   :   4e-08    |    0.08   :   4e-08    |
// ------------------------+-----------+------------+-----------+------------+
//
// My interpretation:
//
// 9 seconds means something needs fixing.
// 0.08 seconds means it has been *partly* fixed.
// 4e-08 seconds means it has been fixed.

import AppKit

//func test() {
    let a = Double(1)
    
    var checksum = UInt64(0)
    var minTime = Double.infinity
    for trial in 0 ..< 5 {
        print("#\(trial)")
        let start = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown // <-- Replace with `a.magnitude`, see time diff.
            checksum = checksum &+ v.bitPattern
        }
        let stop = CACurrentMediaTime()
        let time = stop - start
        print("  time:", time, "seconds (checksum: \(checksum)")
        minTime = min(time, minTime)
    }
    print("\nmin time:", minTime, "seconds")
//}
//test()

Note that:

  • 9 seconds is the time for the .nextDown case, using Xcode 9.4 def toolchain, no matter if top level or in func.
  • 0.08 seconds is about 100 times faster than that (9 / 0.08 == 112.5).
  • 4e-08 seconds is still millions of times faster than that (0.08 / 4e-08 == 2_000_000).

So 0.08 seconds (most recent snapshot, for the .nextDown case, code at top level) is still millions of times slower than it should be if the issue were entirely fixed, ie the same as the .magnitude case: 4e-08 seconds. (My guess: while the a.nextDown calculation is now being hoisted out of the loop, the further optimization of turning all the &+ operations into a single multiplication isn't being performed, though it is in the .magnitude case.)
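
In other words, my guess is that full optimization of this loop amounts to something like the following (a sketch, not actual compiler output):

let v = a.nextDown // hoisted out of the loop
// A billion &+ operations strength-reduced into a single wrapping multiply:
checksum = checksum &+ v.bitPattern &* 1_000_000_000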

This is why I wrote:

And by "equally fast", I mean both should be like .magnitude, which is 4e-08 seconds, not 0.08 seconds

1 Like

The above (and everything having to do with these issues) is too complex to communicate effectively.

So I'll try to simplify the current situation by starting from scratch, and only focus on:

  • the .nextDown case, and
  • the most recent Development Snapshot.

So make sure you are using Development Snapshot 2018-05-30 and follow these exact steps, because as you'll see, every little detail matters.


Step 1

This program will be fast (around 4e-08 seconds for me):

import AppKit

func test() {
    for _ in 0 ..< 5 {
        let a = Double(1)
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
}
test()

Step 2

Now, with the following modification, the program will still be fast (4e-08 s), which might lead us to believe that the issue has been fixed, even for the top-level case! (Remember that we are only talking about the most recent snapshot here.)

import AppKit

// func test() {
    for _ in 0 ..< 5 {
        let a = Double(1)
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
// }
// test()

Step 3

But if we move let a = Double(1) out of the outer (trials) for-in loop, then it will be millions of times slower (0.08 seconds for me).

import AppKit

// func test() {
    let a = Double(1) // <-- moved this 
    for _ in 0 ..< 5 {
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
// }
// test()

Step 4

If we now enclose it in the func again, it will be fast (4e-08 s).

import AppKit

func test() {
    let a = Double(1) // <-- this is still here, but moving it back inside the for-in loop doesn't matter when enclosed in func.
    for _ in 0 ..< 5 {
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
}
test()

Step 5

However, if we move let a = Double(1) to the top level, it will be millions of times slower again (0.08 s)!

import AppKit

let a = Double(1) // <-- constant moved to top level

func test() {
    for _ in 0 ..< 5 {
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
}
test()

And it doesn't matter if a is declared as a static constant within a type, ie having it like this:

struct S {
    static let a = Double(1)
}

will still be slow (0.08 s).
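
That is, this variant of the Step 5 program (a sketch; only the declaration and the loop body change) is also slow:

import AppKit

struct S {
    static let a = Double(1) // <-- static constant in a type: still 0.08 s
}

func test() {
    for _ in 0 ..< 5 {
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = S.a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
}
test()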

This demo (Step 5), especially with a static constant in a type like that, is representative of situations that are very common in real code. Yes, the billion iteration loop is looking a bit contrived, but it's just to make the demo short, and you'll have to trust me that the underlying issue(s) come(s) up again and again (in different shapes) in non-contrived non-stupid real code.


So, Step 3 and Step 5 demonstrate that some optimization is still missed for the .nextDown case (while not being missed for eg .magnitude).

Hopefully, this makes it clear that the issue still remains in dev snapshot 2018-05-30.

EDIT: I've since noted that there is a tiny modification that can be made to the program of Step 5 in order to make it fast: Replace let a = Double(1) with a formulation that avoids the initializer and integer literal, ie any of these:

let a = 1 as Double
let a: Double = 1
let a = 1.0

I think I will try to sum all of this up into a new simplified bug report since SR-7094 has become rather messy.


Btw, you can also use eg .binade instead of .nextDown to get an even bigger difference in time when comparing to .magnitude, so this is not specific to .nextDown. This issue is meant to be just one specific example of a more general issue, and it would be nice if it ended up being perceived and fixed as such, so that "resolved" would mean that I don't have to repeat this effort of bumping into, isolating, and reporting every single one (.nextUp, .nextDown, .binade, etc.). This is also why I'm still wondering if/how issues like this are being tested/benchmarked, in order to prevent regressions. Just like @xwu wondered some time ago:

3 Likes

Yeah, that looks like a different issue: the SIL constant folder doesn't look through the Double(Int) constructor when it could, so we don't optimize the global initializer. I think there's another optimizer limitation worth its own issue here: even if a global let doesn't get its initializer constant-folded, we still ought to treat loads from the global as invariant, since the compiler can assume that a let doesn't change values. That would make the performance behavior less brittle.

Would you be able to share examples with more context, if not in this thread, then in another thread? My gut feeling is that this particular example with nextDown/magnitude is too reduced to be representative, since the checksum loop in its ideal form is completely constant-foldable, and as Erik noted, the variations in behavior are largely LLVM's fault and not Swift's, since LLVM is doing most of the floating-point analysis work. You're seeing a million-fold difference because, when LLVM eliminates the loop (replacing it with checksum = checksum &+ a.<operation>.bitPattern &* 1_000_000_000), you're benchmarking almost no work, versus a real loop when it doesn't.

2 Likes

Before answering in more detail, I just want to make sure that we are on the same page regarding the following observations (using dev snapshot 2018-05-30, and referring to the 5 programs in my last post):

  1. The only two of the above 5 programs that are not "fully optimized" (which includes eliminating the loop in these particular programs) are Step 3 and Step 5.

  2. However, all of those programs (including Step 3 and Step 5) are fully optimized (including eliminating the loop) if we replace .nextDown with .magnitude.

  3. So Step 3 and Step 5 are either fully optimized or not (including eliminating the loop), depending on whether we use .nextDown or .magnitude.

(I failed to mention point 2 in the above post.)

Are you saying that this has to do with LLVM rather than Swift?

There are at least three different compiler shortcomings I see:

  • When you use Double(1) instead of 1.0 as a global initializer, the Swift optimizer fails to turn this into a static initialization. That's a Swift bug (see the minimal pair after this list).
  • When a global or static let property has an initialization expression that doesn't get optimized, then we apparently fail to mark loads from the let as being invariant. That's a Swift bug too.
  • LLVM fails to recognize that the nextDown computation on a floating-point number can be hoisted out of a loop, and instead vectorizes the loop. That's an LLVM bug.
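
To make the first point concrete, here's the minimal pair (identical values, different optimization outcomes on the snapshot discussed above; the names are just illustrative):

// Goes through Double.init(_: Int) and an integer literal; the SIL constant
// folder doesn't look through the initializer, so this global doesn't get a
// static initializer and stays opaque to later optimization.
let slow = Double(1)

// Built directly from a float literal; this gets folded into a static
// initializer, so its value can be constant-propagated.
let fast = 1.0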
3 Likes

But if so, how come eg Step 1 is "fully optimized" (including eliminating the loop):

// Step 1 repeated here for completeness:
import AppKit

func test() {
    for _ in 0 ..< 5 {
        let a = Double(1)
        var checksum = UInt64(0)
        let t0 = CACurrentMediaTime()
        for _ in 0 ..< 1_000_000_000 {
            let v = a.nextDown
            checksum = checksum &+ v.bitPattern
        }
        let t1 = CACurrentMediaTime()
        print(t1 - t0, "seconds, checksum:", checksum)
    }
}
test()

Doesn't the .nextDown computation have to be hoisted out of the loop in order to eliminate the loop?

In this case, a.nextDown probably gets constant-folded. The LLVM issue appears when a doesn't get constant propagated, as you've observed when it's made into a global variable. With magnitude, LLVM still appears to observe that adding <something>.magnitude a billion times is equivalent to multiplying it by a billion and adding the product, but for some reason it doesn't reach that conclusion with nextDown, or at least not before it's already vectorized the loop beyond recognition.

Thank you for taking the time to explain this!

So that little 15-line program, and some seemingly insignificant changes to it, can expose at least 2 Swift bugs and 1 LLVM bug.

I look forward to these two getting fixed:

Sorry to keep going on about this, but I'd like to take the opportunity to educate myself. So just in case you (or anyone else, cc @scanon) would feel like answering these three quick questions:


Question 1:

Is the following correct:

This is the implementation for .nextUp (in dev snapshot 2018-05-30).

And this is the implementation for .nextDown, defined as -(-self).nextUp.

?
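
(For reference, the nextDown definition being asked about, paraphrased from the stdlib sources with attributes omitted:)

extension FloatingPoint {
    // nextDown is defined in terms of nextUp by negating twice.
    public var nextDown: Self {
        return -(-self).nextUp
    }
}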


Question 2:

.nextDown is marked as being both @inlinable and @_transparent, but .nextUp is marked as just @inlinable. Why isn't .nextUp marked as @_transparent?

Your answer to essentially the same question above was that:

Does this mean that all @_transparent attributes can be removed, ie @_transparent is redundant?

You also say that:

If I understand you correctly, this is what turned out to be the case. If so, what are these underlying LLVM primitives?


Question 3:

(Assuming the answer to Question 1 is yes, and that .nextUp should be marked as @_transparent, even though it currently isn't, or that @inlinable isn't working as expected without being paired with @_transparent:)

Since .nextDown is little more than a call to .nextUp (ie return -(-self).nextUp), does that mean .nextDown is effectively as non-@_transparent as .nextUp?


nextUp is effectively the C nextafter function IIUC, and if it were implemented by just forwarding to nextafter and being inlined, then LLVM would probably be able to optimize it. It's not inlinable or transparent otherwise because I'd expect a full implementation to be too large for inlining to be profitable. Maybe we could still stick an @_effects(readnone) attribute on it (and any other FloatingPoint methods that don't have it yet) to allow it to be optimized as a pure function.
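
A sketch of what that might look like (hypothetical; @_effects is an underscored, unofficial attribute, a wrong effects promise can miscompile, and the real stdlib declaration may differ):

// Hypothetical free function, not the actual stdlib declaration:
// @_effects(readnone) promises the optimizer that the result depends only on
// the arguments and that no memory is read or written, so calls can be
// hoisted out of loops or folded away.
@_effects(readnone)
func pureNextUp(_ x: Double) -> Double {
    return x.nextUp
}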

1 Like

Yeah. nextUp is actually slightly more efficient than the C nextafter function, because it doesn't need to figure out what direction to move. It's right around the size threshold where inlining would make sense in general contexts. And, yes, we should be able to mark it readnone.
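
(For comparison, the C function as called from Swift; unlike nextUp, nextafter takes a second argument giving the direction in which to move:)

import Darwin

let x = 1.0
let viaC = nextafter(x, .infinity) // must determine the direction at runtime
let viaSwift = x.nextUp            // direction is fixed: upward
assert(viaC == viaSwift)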

4 Likes

I guess eg .binade and .ulp, and a lot of the initializers, should have it. But why limit this improvement to the floating-point types? I'd think most types in the std lib have many properties, methods, and initializers that should similarly be marked @_effects(readnone).

I'm really looking forward to trying out a Swift version with @_effects(readnone) added everywhere it makes sense in the std lib.

The reason for my enthusiasm / eagerness

I'm a bit obsessed by this because almost every time I bump into and investigate significant differences in performance that are caused by seemingly irrelevant changes to source code, at least part of the explanation will be due to Swift's current inability to see and utilize the fact that some initializers, properties or methods have ...

... no side effects and no dependencies on the state of the program. They always return an identical result given identical inputs.

For example, I might have some heavily used function that has a loop of only a few (not billions of) iterations, the number of which should definitely be statically knowable, as in eg

for i in 0 ..< MemoryLayout<T>.size {
    ...
}

And I notice that this loop (including its contents) will get optimized to various degrees depending on seemingly irrelevant changes to the loop, its contents or its context.
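
A concrete (hypothetical) shape of such a function, just for illustration; the function and its body are mine, not from any real codebase:

// MemoryLayout<T>.size is statically knowable for any concrete T, so one
// would hope this loop gets fully unrolled (or folded) once specialized.
func bytes<T>(of value: T) -> [UInt8] {
    var v = value
    return withUnsafeBytes(of: &v) { (raw: UnsafeRawBufferPointer) -> [UInt8] in
        var result: [UInt8] = []
        for i in 0 ..< MemoryLayout<T>.size {
            result.append(raw[i])
        }
        return result
    }
}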

An attempt at preventing off-topic discussion

By "seemingly irrelevant", I mean that you cannot understand how they are relevant unless you:

  1. Know an unreasonable amount about the implementation of (your current version of) Swift, including inconsistently marked and/or inefficiently implemented low-level operations in the std lib, shortcomings of the optimizer, etc.

  2. Understand exactly how the above inconsistencies will interact in the context of your code.

This is of course true for all languages in a way, but the number of performance-related inconsistencies and bugs that are likely to be at play in any given piece of Swift is still so high that it is impossible to develop an intuition for performance. The only way forward seems to be a lot of experiments and bug reports, and further improvements to the benchmark suite.


But perhaps the optimizer isn't taking any advantage of the @_effects attribute yet?

3 Likes