How to analyze the performance of, for example, this little program?

As previously mentioned in other threads, I'd like to get better at understanding and predicting the performance of Swift code. Apart from being interesting in itself, it would also make me better/quicker at identifying potential compiler issues and writing bug reports.

So I'm hoping that someone will find it worthwhile to contribute to the community by describing their process of analyzing the example program below, in order to answer the question at the end.

(I've tried looking at the disassembly and the SIL, using Instruments in various ways, checking whether there are any differences when commenting out various pieces of the code or annotating funcs with @_transparent, @inline(…) etc., but I think I need to see some step-by-step practical examples in order to really understand what I'm doing. If it's easier/better to explain using something like Hopper, that would also be interesting to see.)
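
(For anyone who wants to follow along: by "looking at the SIL and the disassembly" I mean roughly the following, i.e. reusing essentially the same swiftc invocation as in the compile command further down but with a different -emit option, and/or disassembling the built binary with otool; swiftc also accepts -emit-ir for the LLVM IR:)

› swiftc <same -sdk and optimization flags as below> -emit-sil test.swift > test.sil
› swiftc <same -sdk and optimization flags as below> -emit-assembly test.swift > test.s
› otool -tV ./test > test.disasm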


The example program:

import AppKit


// The following FixedWidthInteger extension adds an initializer that will
// be the subject of our performance analysis.
//
// The purpose of the initializer is to offer full range conversion between
// any two signed/unsigned fixed width integers, ie mapping the source value
// from the full range of the source type to the full range of the destination
// type (without using unnecessarily slow operations or risking overflow).
//
// So for example:
//
//   Int8(rangeConverted: UInt8(  0)) == Int8(-128)
//   Int8(rangeConverted: UInt8(128)) == Int8(   0)
//   Int8(rangeConverted: UInt8(255)) == Int8( 127)
//
//   UInt32(rangeConverted: UInt8(0x00)) == UInt32(0x00_00_00_00)
//   UInt32(rangeConverted: UInt8(0x80)) == UInt32(0x80_80_80_80)
//   UInt32(rangeConverted: UInt8(0xff)) == UInt32(0xff_ff_ff_ff)
//
//   UInt16(rangeConverted: Int64.min) == UInt16.min
//   UInt16(rangeConverted: Int64(0) ) == UInt16((Double(UInt16.max) / 2).rounded())
//   UInt16(rangeConverted: Int64.max) == UInt16.max

extension FixedWidthInteger {

    /// Creates a new instance by mapping `source` from the full range of
    /// `Src` to the full range of `Self`.
    init<Src>(rangeConverted source: Src) where Src: FixedWidthInteger {
        let srcBW = Src.bitWidth
        let dstBW = Self.bitWidth
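        // For signed sources, flip the sign bit (wrapping-add 2^(srcBW - 1)) so that
        // Src.min ... Src.max maps onto the unsigned bit-pattern order 0 ... all-ones.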
        let source = Src.isSigned
            ? source &+ (Src(1) << (srcBW &- 1))
            : source
        if srcBW < dstBW {
            self = Self(truncatingIfNeeded: source)
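            // Widen by repeatedly OR-ing in shifted copies of the pattern, doubling
            // its width each iteration (e.g. 0x80 -> 0x8080 -> 0x8080_8080 -> ...).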
            var shifts = srcBW
            while shifts < dstBW {
                self |= self << shifts
                shifts = shifts << 1
            }
        } else if srcBW > dstBW {
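            // Narrow by shifting down so that only the top dstBW bits of the
            // (sign-bit-adjusted) source remain.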
            let shifts = srcBW &- dstBW
            self = Self(truncatingIfNeeded: source >> shifts)
        } else {
            self = Self(truncatingIfNeeded: source)
        }
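        // For signed destinations, flip the sign bit so that the lowest pattern
        // maps to Self.min and the highest to Self.max.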
        if Self.isSigned {
            self = self &+ (Self(1) << (dstBW &- 1))
        }
    }
}

// This type is only a simple example context for this particular benchmark.
final class Buffer<Element: FixedWidthInteger> {
    var data: UnsafeMutableBufferPointer<Element>
    init(count: Int) {
        precondition(count > 0)
        data = UnsafeMutableBufferPointer<Element>.allocate(capacity: count)
        data.initialize(repeating: 0)
    }
    func setAllElementsToRandomBitPatterns() {
        let byteCount = data.count * MemoryLayout<Element>.stride
        let rc = SecRandomCopyBytes(nil, byteCount, data.baseAddress!)
        precondition(rc == errSecSuccess)
    }
    func setElements<E>(from other: Buffer<E>, transform: (E) -> Element) {
        precondition(data.count <= other.data.count) // we read other.data[i] for every i in data.indices
        for i in data.indices {
            data[i] = transform(other.data[i])
        }
    }
    func checksum() -> Element {
        var checksum = Element(0)
        for e in data { checksum = checksum ^ e }
        return checksum
    }
    deinit {
        data.deallocate()
    }
}

func benchmark<S, D>(source _: S.Type, destination _: D.Type)
    where S: FixedWidthInteger, D: FixedWidthInteger
{
    let num = 100_000_000
    print("\nRange converting \(num) values from \(S.self) to \(D.self):")
    let srcBuf = Buffer<S>(count: num)
    let dstBuf = Buffer<D>(count: num)
    for trial in 0 ..< 5 {
        srcBuf.setAllElementsToRandomBitPatterns()
        let t0 = CACurrentMediaTime()
        dstBuf.setElements(from: srcBuf, transform: D.init(rangeConverted:))
        let t1 = CACurrentMediaTime()
        print("  trial \(trial):",
            t1 - t0, "seconds (checksum: \(dstBuf.checksum()))")
    }
}

// This function is just here for completeness, in case anyone should find
// it interesting to see the benchmarks for all (8 * 8 = 64) conversions.
func benchmarkAll() {
    func benchAllSources<D: FixedWidthInteger>(withDestination: D.Type) {
        benchmark(source: UInt8.self, destination: D.self)
        benchmark(source: UInt16.self, destination: D.self)
        benchmark(source: UInt32.self, destination: D.self)
        benchmark(source: UInt64.self, destination: D.self)
        benchmark(source: Int8.self, destination: D.self)
        benchmark(source: Int16.self, destination: D.self)
        benchmark(source: Int32.self, destination: D.self)
        benchmark(source: Int64.self, destination: D.self)
    }
    benchAllSources(withDestination: UInt8.self)
    benchAllSources(withDestination: UInt16.self)
    benchAllSources(withDestination: UInt32.self)
    benchAllSources(withDestination: UInt64.self)
    benchAllSources(withDestination: Int8.self)
    benchAllSources(withDestination: Int16.self)
    benchAllSources(withDestination: Int32.self)
    benchAllSources(withDestination: Int64.self)
}

// But for the purpose of this thread, it suffices to look at
// this subset of conversions:
func benchmark8To64() {
    benchmark(source: UInt8.self, destination: UInt64.self)
    benchmark(source: Int8.self, destination: UInt64.self)
    benchmark(source: UInt8.self, destination: Int64.self)
    benchmark(source: Int8.self, destination: Int64.self)
}

benchmark8To64()

Compiling and running (on my MBP late 2013, 2GHz i7):

› /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-03-03-a.xctoolchain/usr/bin/swiftc \
--version
Apple Swift version 4.2-dev (LLVM c4ec2ab808, Clang af436f313e, Swift 95e1558e4a)
Target: x86_64-apple-darwin17.4.0

› /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-03-03-a.xctoolchain/usr/bin/swiftc \
-sdk /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk \
-O -whole-module-optimization -gnone -static-stdlib test.swift
› ./test

Range converting 100000000 values from UInt8 to UInt64:
  trial 0: 0.145914474007441 seconds (checksum: 9114861777597660798)
  trial 1: 0.111783446976915 seconds (checksum: 17651002172490708212)
  trial 2: 0.115817560988944 seconds (checksum: 4268070197446523707)
  trial 3: 0.110095688985894 seconds (checksum: 12659530246663417775)
  trial 4: 0.115379397000652 seconds (checksum: 14033993530586874562)

Range converting 100000000 values from Int8 to UInt64:
  trial 0: 0.111972735001473 seconds (checksum: 7234017283807667293)
  trial 1: 0.122052905004239 seconds (checksum: 7016996765293437226)
  trial 2: 0.123373145994265 seconds (checksum: 5280832617179597075)
  trial 3: 0.120153477007989 seconds (checksum: 10199964370168810929)
  trial 4: 0.122060198977124 seconds (checksum: 17651002172490708124)

Range converting 100000000 values from UInt8 to Int64:
  trial 0: 0.40651530600735 seconds (checksum: 868082074056920076)
  trial 1: 0.409130525978981 seconds (checksum: -1085102592571150096)
  trial 2: 0.405297498014988 seconds (checksum: 7234017283807667300)
  trial 3: 0.411088274006033 seconds (checksum: -5280832617179597130)
  trial 4: 0.404385382978944 seconds (checksum: -3906369333256140343)

Range converting 100000000 values from Int8 to Int64:
  trial 0: 0.412243063008646 seconds (checksum: 2314885530818453528)
  trial 1: 0.417125052015763 seconds (checksum: -3255307777713450296)
  trial 2: 0.416680544993142 seconds (checksum: 72340172838076760)
  trial 3: 0.414291215012781 seconds (checksum: 1663823975275763531)
  trial 4: 0.419294328981778 seconds (checksum: -3544668469065757043)
› 

Question:

Why are the last two conversions

  • Range converting 100000000 values from UInt8 to Int64:

  • Range converting 100000000 values from Int8 to Int64:

almost 4 times slower than the first two conversions (0.4 seconds compared to 0.1 seconds)?


Note 1: Even if we comment out everything in the initializer that has to do with checking whether a type is signed (.isSigned), the results are the same, i.e. the last two conversions will still take around 0.4 s and the first two will still take around 0.1 s. (Edit: clarified this note after reading this reply.)

Note 2: It seems to be the while loop (in the initializer) that is taking more time for the slower conversions, even though the loop does 3 iterations for all four of these conversions. Perhaps it is unrolled only for the fast conversions? I have not managed to verify whether this (or anything else, really) is the case.
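
(A sketch of how that hypothesis could be tested, in case someone wants to try: the initializer below is hypothetical, not part of the program above; it hard-codes the three doublings for 8-bit sources and 64-bit destinations, so passing D.init(rangeConvertedUnrolled:) as the transform in benchmark8To64 should show whether manually unrolling the loop makes the Int64 destinations as fast as the UInt64 ones.)

extension FixedWidthInteger {
    /// Hypothetical variant of `init(rangeConverted:)` for this experiment only:
    /// same computation, but with the widening while loop unrolled by hand for
    /// the 8-bit-source / 64-bit-destination case.
    init<Src>(rangeConvertedUnrolled source: Src) where Src: FixedWidthInteger {
        precondition(Src.bitWidth == 8 && Self.bitWidth == 64)
        let source = Src.isSigned ? source &+ (Src(1) << 7) : source
        var v = Self(truncatingIfNeeded: source)
        v |= v << 8   //  8 -> 16 bits
        v |= v << 16  // 16 -> 32 bits
        v |= v << 32  // 32 -> 64 bits
        if Self.isSigned { v = v &+ (Self(1) << 63) }
        self = v
    }
}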


EDIT:

I thought that I had already checked this, but I just realized that the last two conversions are slower only when I'm compiling with these recent development snapshots:

  • swift-DEVELOPMENT-SNAPSHOT-2018-02-27-a
  • swift-DEVELOPMENT-SNAPSHOT-2018-03-02-a
  • swift-DEVELOPMENT-SNAPSHOT-2018-03-03-a
  • swift-DEVELOPMENT-SNAPSHOT-2018-03-07-a

But when I'm compiling with the default toolchains of Xcode 9.2 or 9.3b3, all four conversions are equally fast (about 0.1 seconds).

The above versions of the compiler are the only ones I have access to, so all I can say is that the regression must have been introduced somewhere between the default Xcode 9.3b3 toolchain and the 2018-02-27 dev snapshot.

I think this (the cause being a regression) only makes any potential step-by-step guide more interesting!

By how much?

The same, i.e. still almost 4 times slower. (Please note the edit of the original post: this is only the case with the recent dev snapshots.)

The timings will in fact be identical (apart from the usual small fluctuations, of course) when I run the test using the following version of the initializer, where everything having to do with checking .isSigned has been commented out:

/// Creates a new instance by mapping `source` from the full range of
/// `Src` to the full range of `Self`.
init<Src>(rangeConverted source: Src) where Src: FixedWidthInteger {
    let srcBW = Src.bitWidth
    let dstBW = Self.bitWidth
    /*
    let source = Src.isSigned
        ? source &+ (Src(1) << (srcBW &- 1))
        : source
    */
    if srcBW < dstBW {
        self = Self(truncatingIfNeeded: source)
        var shifts = srcBW
        while shifts < dstBW {
            self |= self << shifts
            shifts = shifts << 1
        }
    } else if srcBW > dstBW {
        let shifts = srcBW &- dstBW
        self = Self(truncatingIfNeeded: source >> shifts)
    } else {
        self = Self(truncatingIfNeeded: source)
    }
    /*
    if Self.isSigned {
        self = self &+ (Self(1) << (dstBW &- 1))
    }
    */
}

Since all the conditional tests are on static constant values only, I guess the optimizer gets rid of them and reduces the code to something that is neither faster nor slower than when the ternary operator and the if statement are there.

But for some reason, the while loop (which does the same number of iterations for all four conversions) seems to be 4 times slower when the destination type (Self) is Int64 than when it is UInt64, but only in the recent dev snapshots.
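
(For completeness, this is the kind of reduced test case I have in mind when trying to isolate the loop. It is only a sketch: the names Dst, widen and runReducedTest are made up, the input is filled deterministically instead of via SecRandomCopyBytes, and I have not verified that it reproduces the slowdown outside the generic Buffer/transform machinery.)

import QuartzCore  // CACurrentMediaTime

typealias Dst = Int64  // change to UInt64 to compare the two destination types

func widen(_ source: UInt8) -> Dst {
    // Same widening loop as in the initializer, without the signedness handling.
    var v = Dst(truncatingIfNeeded: source)
    var shifts = 8
    while shifts < Dst.bitWidth {
        v |= v << shifts
        shifts = shifts << 1
    }
    return v
}

func runReducedTest() {
    let num = 100_000_000
    // Deterministic "random-ish" input, so this file does not depend on Security.
    var input = [UInt8](repeating: 0, count: num)
    for i in input.indices { input[i] = UInt8(truncatingIfNeeded: i &* 31 &+ 7) }
    var checksum: Dst = 0
    let t0 = CACurrentMediaTime()
    for b in input { checksum ^= widen(b) }
    let t1 = CACurrentMediaTime()
    print(Dst.self, t1 - t0, "seconds (checksum: \(checksum))")
}

runReducedTest()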

Filed SR-7150, but it would still be very interesting to see a step-by-step guide showing a deeper analysis of this. Such practical examples would be a valuable learning resource, because they would be in a form, or at a level, that is (in my experience) otherwise very hard to find.

EDIT: For anyone interested, @Andrew_Trick has written a detailed description of his investigation of this bug (in the comments of the bug report), thanks!