Missed optimization opportunity?

Swift will unnecessarily compute v a billion times here, even though a is a statically knowable constant value, and thus I'd expect the compiler to be smart enough to compute v only once.

let a = Double(1)
var checksum = UInt64(0)
for _ in 0 ..< 1_000_000_000 {
    let v = a.nextDown
    checksum = checksum &+ v.bitPattern
}
print(checksum)

So this program takes about 5 seconds to execute on my machine:

› /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-02-26-a.xctoolchain/usr/bin/swiftc \
-O -whole-module-optimization -gnone -static-stdlib test.swift
› time ./test
6917529026641081856

real  0m5.170s
user  0m5.155s
sys   0m0.008s

Note that I've chosen Double and .nextDown just as an example; the same is true for e.g. .nextUp, .binade, .ulp and .significand.

But .magnitude, for example, does seem to be optimizable and is not recalculated every iteration:

let a = Double(1)
var checksum = UInt64(0)
for _ in 0 ..< 1_000_000_000 {
    let v = a.magnitude
    checksum = checksum &+ v.bitPattern
}
print(checksum)

As can be seen here:

› time ./test
6917529027641081856

real      0m0.018s
user      0m0.007s
sys       0m0.005s

And .squareRoot(), for example, is also optimizable, so changing .magnitude to .squareRoot() in the example program also results in a binary that takes approximately zero seconds to execute.

It seems like magnitude and squareRoot() (the examples that are optimizable as expected) are annotated with @_transparent, while the unoptimizable examples are not.

Question: Why aren't all of these examples optimized? Will they be soon?

(I'm asking because I frequently find myself having to identify and work around unoptimizable properties like the ones mentioned here in order to keep heavily used code from being way too inefficient. I end up spending a lot of time investigating and implementing manual optimizations that I wish I could trust the compiler to do for me.)


This is only slow in my Xcode when using a debug build; a release build is lightning fast.

I'm 99.999% sure you are mistaken here (see the command-line compilation flags I used in the original post). I have tested this in Xcode 9.3 beta 3, with both the default toolchain and the 2018-02-25 dev snapshot.

Start a new command-line app project in Xcode and replace the contents of main.swift with:

import AppKit
    
let a = Double(1)
var checksum = UInt64(0)
for _ in 0 ..< 5 {
    let t0 = CACurrentMediaTime()
    for _ in 0 ..< 1_000_000_000 {
        let v = a.nextDown
        checksum = checksum &+ v.bitPattern
    }
    let t1 = CACurrentMediaTime()
    print(t1 - t0, "seconds, checksum:", checksum)
}

Run it (a release build of course : ) ) and note that each of the 5 trials will take (probably, if your machine is anything like my MBP) a couple of seconds.

If this is really blazing fast for you, what Xcode (and Swift) version are you using?

And also, to make the point of this thread clear, here is a "manually optimized" version that works the way I'd expect the original to work if the compiler optimized it for me:

import AppKit

let a = Double(1)
var checksum = UInt64(0)
for _ in 0 ..< 5 {
    let t0 = CACurrentMediaTime()
    let v = a.nextDown // <-- Moved this outside the loop.
    for _ in 0 ..< 1_000_000_000 {
        checksum = checksum &+ v.bitPattern
    }
    let t1 = CACurrentMediaTime()
    print(t1 - t0, "seconds, checksum:", checksum)
}

This will be very fast.

We ought to consistently mark all these low-level floating point operations as being inlinable. If you haven't yet, please file bugs for the ones you've observed are not inlined. cc @scanon

Both on a Mac Pro (2013), 3.5 GHz 6-core, with Xcode Version 9.2 (9C40b):

release:
0.012738665973302 seconds, checksum: 6917529026641081856
0.0127187250182033 seconds, checksum: 13835058053282163712
0.0127166239544749 seconds, checksum: 2305843006213693952
0.0128470309427939 seconds, checksum: 9223372032854775808
0.0123814719845541 seconds, checksum: 16140901059495857664
Program ended with exit code: 0

debug:
32.576377587975 seconds, checksum: 6917529026641081856
32.639348156983 seconds, checksum: 13835058053282163712
32.5265893120086 seconds, checksum: 2305843006213693952
32.6720230139908 seconds, checksum: 9223372032854775808
32.6116706570028 seconds, checksum: 16140901059495857664
Program ended with exit code: 0


Many of the non-optimized ones (e.g. .nextUp, .binade, .ulp) seem to already be marked with:

@_inlineable // FIXME(sil-serialize-all)

While the optimized examples (e.g. .magnitude, .squareRoot()) seem to also be marked with @_transparent:

@_inlineable // FIXME(sil-serialize-all)
@_transparent

(I'm just looking here and here and am not particularly sure how to interpret things.)
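
For what it's worth, my current reading of those two attributes: @_inlineable only serializes the body so that the optimizer may inline it across module boundaries, while @_transparent additionally makes the inlining mandatory at the SIL level, so the optimizer is guaranteed to see through to the underlying bit operations. A minimal sketch of the pattern, using hypothetical names (the real stdlib definitions are more involved):

extension Double {
    // Hypothetical stand-in for .magnitude: @_transparent makes inlining
    // mandatory, so the optimizer always sees a pure bit operation that it
    // can hoist or constant-fold.
    @_transparent
    var transparentMagnitude: Double {
        return Double(bitPattern: bitPattern & 0x7fff_ffff_ffff_ffff)
    }

    // Hypothetical stand-in for .nextDown: only @_inlineable, so whether
    // the body ever gets inlined (and then hoisted) is up to heuristics.
    @_inlineable
    var inlineableNextDown: Double {
        return -((-self).nextUp)
    }
}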

Wow! I have 9.2 here, so I just tested the same code:

import AppKit

let a = Double(1)
var checksum = UInt64(0)
for _ in 0 ..< 5 {
    let t0 = CACurrentMediaTime()
    for _ in 0 ..< 1_000_000_000 {
        let v = a.nextDown
        checksum = checksum &+ v.bitPattern
    }
    let t1 = CACurrentMediaTime()
    print(t1 - t0, "seconds, checksum:", checksum)
}

in Xcode 9.2 (9C40b), and I get the same result as you! So something has clearly regressed as of the Xcode 9.3 beta 3 default toolchain (or earlier), and the issue remains in the latest dev snapshots. I don't have anything between Xcode 9.2 and Xcode 9.3 beta 3 available here to test with.

However, note that there still is a significant difference (although not as huge) when moving

    let v = a.nextDown

out of the loop, and I can't see why there should be any difference at all. So what I'm saying is that even in Xcode 9.2 (when things weren't as bad as they are in e.g. Xcode 9.3 beta 3), moving that line out of the one-billion-iteration for loop changes the time from 0.1 seconds to the order of 1.0e-08 seconds on my system. I'd expect the compiler to produce a binary (the same one, actually) that is equally fast whether that line is inside or before the loop.


That might indicate an optimizer bug then. Either @_inlineable or @_transparent ought to enable inlining the operation. It's possible that we don't correctly expose the semantics of the underlying LLVM primitives these build on. cc @Erik_Eckstein


I guess there are tests in place that catch the sort of performance regressions that must have happened somewhere between Xcode 9.2 and Xcode 9.3 beta 3, so I assume that this is a known and separate issue (which happens to make my examples here more dramatic in Xcode 9.3 beta 3 than in Xcode 9.2).


But I'd like to know more about the main issue of this thread. So, assuming that the optimizer is working as intended, and we are looking at a release build of this program:

import AppKit

let a = Double(1)
var checksum = UInt64(0)
for _ in 0 ..< 5 {
    let t0 = CACurrentMediaTime()
    for _ in 0 ..< 1_000_000_000 {
        let v = a.nextUp // This line located here ==> Slow Version
        checksum = checksum &+ v.bitPattern
    }
    let t1 = CACurrentMediaTime()
    print(t1 - t0, "seconds, checksum:", checksum)
}

Should it make any difference to the execution time (or the binary, for that matter) whether the line let v = a.nextUp is written inside the for loop (Slow Version) or before it (Fast Version)?

import AppKit

let a = Double(1)
var checksum = UInt64(0)
for _ in 0 ..< 5 {
    let t0 = CACurrentMediaTime()
    let v = a.nextUp // This line located here ==> Fast Version
    for _ in 0 ..< 1_000_000_000 {
        checksum = checksum &+ v.bitPattern
    }
    let t1 = CACurrentMediaTime()
    print(t1 - t0, "seconds, checksum:", checksum)
}

It does make a huge difference in Xcode 9.3 beta 3, and a big difference in Xcode 9.2.

But isn't this a typical example of an optimization that I should be able to trust the compiler to do for me?

You probably know all this, so I'm not clear exactly what you're asking, but if it can prove that a.nextUp is loop-invariant then sure, everyone can agree that it should be reliably hoisted in these examples. But that can be difficult or impossible if the definition of .nextUp comes from another module (the optimisation is unsound) or if the compiler can't prove it is a pure function (possibly true in this case, if it doesn't understand the LLVM primitives), etc.


In principle, the optimizer ought to hoist constant expressions with no side effects like this out of hot loops. Swift doesn't really have a formal notion of "pure" or "constant expression" at all yet, though. We do our best to try to communicate the pure-ness of low level things as much as we can in an ad-hoc fashion, so that the LLVM optimizer at least can clean up things well, but we probably missed some spots. Our annotations might be good enough for the standard library, but for general user code, you're probably best off hoisting expensive computations out of loops yourself, since the optimizer's effectiveness will likely be spotty.
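
To make "ad-hoc fashion" a bit more concrete: one such mechanism is the unofficial, underscored @_effects attribute, which the standard library uses internally to promise the optimizer that a function has no side effects. A minimal sketch with a hypothetical helper (note the attribute is unchecked, so the optimizer simply trusts the promise, and lying to it can miscompile your program):

// Sketch only: @_effects(readnone) asserts that the result depends solely
// on the arguments and that the call reads and writes no memory, making
// calls to it candidates for hoisting and folding.
@_effects(readnone)
func bitPatternOfNextDown(_ x: Double) -> UInt64 {
    return x.nextDown.bitPattern
}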


Thanks! I did not know all that, and I asked because I'd like to learn more about these things, so that I can better understand and communicate my wish for them to be improved. My productivity really suffers when I cannot "reason about" the efficiency of my code, and I think I'm not the only one who feels this way.

(
Warning: I feel that I have to let out a small rant here. Nothing personal, I love Swift and everyone helping to make it better; I just have to release some of the frustration somewhere, and I might edit this away after a good night's sleep. Anyway, here it comes:

I spend more time running into and isolating these sorts of efficiency oddities/issues/bugs, and trying to understand the (current version of the) compiler, than writing actual algorithms/apps. It seems like I have two options:

  1. Go back to using C/C++ for performance critical code, and miss all the good stuff from Swift.

  2. Continue using, or rather exploring, the mystical bleeding-edge rabbit hole of Swift; don't expect to release an app, instead just try to learn and file as many bugs as possible for another three years, and see if maybe Swift 8 is more mature/predictable.

)


Well, at a high level, C and C++ have the same general optimization problem Swift does: the compiler doesn't have much information, in the general case, about when these kinds of optimizations that rely on a lack of side effects are possible. C-family compilers are of course also more mature and better at optimizing the known semantics of their more mature standard libraries. If you need the performance you're used to now, C/C++ will definitely be your best bet, though working with Swift and filing bugs when you don't see the optimizations you expect would help us out a lot, and maybe help Swift reach parity with C sooner.


Are there such tests, though? Looking through bugs.swift.org, there's nothing I'm aware of that reflects this performance regression. Could you file a bug?

Filed the performance regression as SR-7094, asking about such tests and rambling about my wish for a more predictable compiler/optimizer, which I guess boils down to a proper way of marking and tracking some notion of purity or side effects.


I'm bumping this thread as part of my little mission to try to bring more focus to performance issues in general.

I'd also like to ask two questions regarding this particular issue:


Question 1:

Exactly what difference between .nextDown and .magnitude is it that makes such a big difference for the execution speed of the test program below?


Question 2:

It seems like the optimizer is able to hoist the calculation of

let v = a.magnitude

out of the loop, as expected, while for some reason (i.e. the answer to Question 1) it is unable to do so for

let v = a.nextDown

If I'm correct about this, is it hard to fix for .nextDown and all the other properties in the stdlib that are affected by the same issue?


import AppKit

let a = Double(1)

var checksum = UInt64(0)
var minTime = Double.infinity
for trial in 0 ..< 5 {
    print("#\(trial)")
    let start = CACurrentMediaTime()
    for _ in 0 ..< 1_000_000_000 {
        let v = a.nextDown // <-- Replace with `a.magnitude` and see time diff.
        checksum = checksum &+ v.bitPattern
    }
    let stop = CACurrentMediaTime()
    let time = stop - start
    print("  time:", time, "seconds (checksum: \(checksum)")
    minTime = min(time, minTime)
}
print("\nmin time:", minTime, "seconds")

Example compile and run on my MBP (2 GHz i7, late 2013), default toolchain:

$ swiftc --version
Apple Swift version 4.1 (swiftlang-902.0.48 clang-902.0.37.1)
Target: x86_64-apple-darwin17.5.0
$ swiftc -O test.swift
$ ./test
#0
  time: 13.9918197880033 seconds (checksum: 6917529026641081856)
#1
  time: 14.0257235940007 seconds (checksum: 13835058053282163712)
#2
  time: 13.988855939002 seconds (checksum: 2305843006213693952)
#3
  time: 14.0062829500021 seconds (checksum: 9223372032854775808)
#4
  time: 14.0546091610013 seconds (checksum: 16140901059495857664)

min time: 13.988855939002 seconds

So that's a min time of 14 seconds. Now, replacing .nextDown with .magnitude will result in a dramatic speed increase:

...

min time: 4.00032149627805e-08 seconds

That's roughly 350 million times faster (14 / 4.0e-8 ≈ 3.5e8).


Another little rant:

If I compile the program with a recent dev snapshot (2018-05-26 in this case) instead of with the default toolchain of Xcode 9.3.1, it will be around 120 times faster for .nextDown. So things have moved in the right direction since the default toolchain, but 120 times faster is nowhere near hundreds of millions of times faster, so .nextDown is still causing some work in the loop when using the recent snapshot.

Of course I could manually hoist v out of the for loop in cases like this. But, apart from this being a typical task for an optimizer and not a human, you have to remember that this is just a very simple example program designed to make this issue super obvious. In real code, the underlying cause of this particular issue shows up in different ways and in much more involved cases. I realize that I have to take care of some "low level" optimizations myself, like designing my algorithms so that they avoid cache thrashing etc, but I really wish I could trust the optimizer with seemingly simple (for a computer) tasks like this.

I really believe that increasing the priority of performance work (fixing bugs like these, coming up with tests like these and adding them to the test suite, etc.) would turn out to be well worth the effort, benefiting more than just the people like me who try to write performance-sensitive code in Swift rather than giving up and using C.

As it currently stands, programs compiled with the latest official release of the compiler can be hundreds of millions of times less efficient than they would be if there had been more focus on performance, which I'd have expected by now, especially from a language called Swift.


It sounds like nextDown must've been annotated with purity attributes, somewhere between 4.1 and now, to tell the compiler that it's safe to hoist out of loops. C and C++ would've had this exact same issue if, for some reason, nextafter() had not been annotated as __attribute__((pure)) (or if LLVM hadn't been hardcoded with assumptions about the semantics of the C standard library).

But how come using .magnitude instead of .nextDown makes the program millions of times faster, even when using the most recent snapshot (Question 1 above)?

(The improvements in the recent snapshot make the .nextDown case about a hundred times faster, but .magnitude will still be millions of times faster than that. So it seems like there's still something different between them in the recent snapshot. If v were hoisted out of the loop in both the .magnitude and the .nextDown cases, the reported times would be equally fast.)

cc @Erik_Eckstein. It looks like the optimizer is failing to do LICM (loop-invariant code motion) in either case. With magnitude, it's instead turning the addition loop into a multiplication by 1 billion (though it still computes the magnitude):

  %152 = tail call double @CACurrentMediaTime()
  %153 = load double, double* getelementptr inbounds (%TSd, %TSd* @"$S3foo1aSdvp", i64 0, i32 0), align 8
  %154 = tail call double @llvm.fabs.f64(double %153)
  %155 = bitcast double %154 to i64
  %.promoted = load i64, i64* getelementptr inbounds (%Ts6UInt64V, %Ts6UInt64V* @"$S3foo8checksums6UInt64Vvp", i64 0, i32 0), align 8
  %156 = mul i64 %155, 1000000000
  %157 = add i64 %.promoted, %156
  store i64 %157, i64* getelementptr inbounds (%Ts6UInt64V, %Ts6UInt64V* @"$S3foo8checksums6UInt64Vvp", i64 0, i32 0), align 8
  %158 = tail call double @CACurrentMediaTime()
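
In Swift terms, that IR amounts to something like this (a paraphrase, not actual compiler output):

let a = Double(1)
var checksum = UInt64(0)
// The loop-invariant value is computed once (the llvm.fabs.f64 call), and
// the billion wrapping additions collapse into a single wrapping multiply.
let v = a.magnitude
checksum = checksum &+ (v.bitPattern &* 1_000_000_000)
print(checksum)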

On the other hand, with nextDown, it appears to be trying to vectorize the loop for no good reason:

  %152 = tail call double @CACurrentMediaTime()
  %153 = load double, double* getelementptr inbounds (%TSd, %TSd* @"$S3foo1aSdvp", i64 0, i32 0), align 8
  %154 = fsub double 0.000000e+00, %153
  %155 = fcmp olt double %154, 0x7FF0000000000000
  %156 = bitcast double %154 to i64
  %157 = ashr i64 %156, 63
  %158 = or i64 %157, 1
  %159 = add i64 %158, %156
  %160 = bitcast i64 %159 to double
  %161 = select i1 %155, double %160, double %154
  %162 = fsub double -0.000000e+00, %161
  %163 = bitcast double %162 to i64
  %.promoted = load i64, i64* getelementptr inbounds (%Ts6UInt64V, %Ts6UInt64V* @"$S3foo8checksums6UInt64Vvp", i64 0, i32 0), align 8
  %broadcast.splatinsert136 = insertelement <2 x i64> undef, i64 %163, i32 0
  %broadcast.splat137 = shufflevector <2 x i64> %broadcast.splatinsert136, <2 x i64> undef, <2 x i32> zeroinitializer
  %broadcast.splatinsert138 = insertelement <2 x i64> undef, i64 %163, i32 0
  %broadcast.splat139 = shufflevector <2 x i64> %broadcast.splatinsert138, <2 x i64> undef, <2 x i32> zeroinitializer
  %164 = insertelement <2 x i64> <i64 undef, i64 0>, i64 %.promoted, i32 0
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %"$SSa13_adoptStorage_5countSayxG_SpyxGts016_ContiguousArrayB0CyxGn_SitFZSS_Tg5Tf4nnd_n.exit"
  %index = phi i64 [ 0, %"$SSa13_adoptStorage_5countSayxG_SpyxGts016_ContiguousArrayB0CyxGn_SitFZSS_Tg5Tf4nnd_n.exit" ], [ %index.next.7, %vector.body ]
  %vec.phi = phi <2 x i64> [ %164, %"$SSa13_adoptStorage_5countSayxG_SpyxGts016_ContiguousArrayB0CyxGn_SitFZSS_Tg5Tf4nnd_n.exit" ], [ %179, %vector.body ]
  %vec.phi134 = phi <2 x i64> [ zeroinitializer, %"$SSa13_adoptStorage_5countSayxG_SpyxGts016_ContiguousArrayB0CyxGn_SitFZSS_Tg5Tf4nnd_n.exit" ], [ %180, %vector.body ]
  %165 = add <2 x i64> %vec.phi, %broadcast.splat137
  %166 = add <2 x i64> %vec.phi134, %broadcast.splat139
  %167 = add <2 x i64> %165, %broadcast.splat137
  %168 = add <2 x i64> %166, %broadcast.splat139
  %169 = add <2 x i64> %167, %broadcast.splat137
  %170 = add <2 x i64> %168, %broadcast.splat139
  %171 = add <2 x i64> %169, %broadcast.splat137
  %172 = add <2 x i64> %170, %broadcast.splat139
  %173 = add <2 x i64> %171, %broadcast.splat137
  %174 = add <2 x i64> %172, %broadcast.splat139
  %175 = add <2 x i64> %173, %broadcast.splat137
  %176 = add <2 x i64> %174, %broadcast.splat139
  %177 = add <2 x i64> %175, %broadcast.splat137
  %178 = add <2 x i64> %176, %broadcast.splat139
  %179 = add <2 x i64> %177, %broadcast.splat137
  %180 = add <2 x i64> %178, %broadcast.splat139
  %index.next.7 = add nuw nsw i64 %index, 32
  %181 = icmp eq i64 %index.next.7, 1000000000
  br i1 %181, label %middle.block, label %vector.body, !llvm.loop !45

middle.block:                                     ; preds = %vector.body
  %bin.rdx = add <2 x i64> %180, %179
  %rdx.shuf = shufflevector <2 x i64> %bin.rdx, <2 x i64> undef, <2 x i32> <i32 1, i32 undef>
  %bin.rdx140 = add <2 x i64> %bin.rdx, %rdx.shuf
  %182 = extractelement <2 x i64> %bin.rdx140, i32 0
  store i64 %182, i64* getelementptr inbounds (%Ts6UInt64V, %Ts6UInt64V* @"$S3foo8checksums6UInt64Vvp", i64 0, i32 0), align 8
  %183 = tail call double @CACurrentMediaTime()

which is arguably an LLVM bug. There's still a bug on the Swift side, too: we aren't propagating the constant value of a. This appears to be specific to global code. If I factor the loop out into a local function, I get more reasonable-looking IR, in which the constant value of a gets propagated and the addition loop gets completely constant-folded:

/*
let a = 1.0

func loop(checksum: inout UInt64) {
  for _ in 0 ..< 1_000_000_000 {
    let v = a.magnitude // <-- Replace with `a.nextDown` and you get the same thing
    checksum = checksum &+ v.bitPattern
  }
}
*/

define hidden swiftcc void @"$S3foo4loop8checksumys6UInt64Vz_tF"(%Ts6UInt64V* nocapture dereferenceable(8)) local_unnamed_addr #0 {
entry:
  %._value = getelementptr inbounds %Ts6UInt64V, %Ts6UInt64V* %0, i64 0, i32 0
  %._value.promoted = load i64, i64* %._value, align 8
  %1 = add i64 %._value.promoted, 6917529027641081856
  store i64 %1, i64* %._value, align 8
  ret void
}

Possibly, because we model the variables in top-level code as global variables, the compiler is being unnecessarily conservative with them?
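
If that's right, a practical workaround until it's fixed would be to keep hot loops out of top-level code. A sketch of the same benchmark with the loop wrapped in a function, so that a and checksum are locals the optimizer can reason about rather than globals:

func runTrial(iterations: Int) -> UInt64 {
    // Locals instead of top-level globals; the optimizer no longer has to
    // assume other code might observe or mutate them mid-loop.
    let a = 1.0
    var checksum: UInt64 = 0
    for _ in 0 ..< iterations {
        checksum = checksum &+ a.nextDown.bitPattern
    }
    return checksum
}
print(runTrial(iterations: 1_000_000_000))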


It looks like this is already fixed on master: LLVM can constant-propagate everything with a top-of-master compiler.

This optimization is done by LLVM, not the SIL optimizer. Floating-point constant propagation is pretty limited in the SIL optimizer, which is okay as long as no other SIL optimizations are blocked by it.

Possibly, because we model the variables in top-level code as global variables

Yes, those are global variables.