# How to analyze the performance of, for example, this little program?

As previously mentioned in other threads, I’d like to get better at understanding and predicting the performance of Swift code. Apart from being interesting in itself, it would also make me better/quicker at identifying potential compiler issues and writing bug reports.

So perhaps someone would find it worthwhile to contribute to the community by describing their process of analyzing the example program below, in order to answer the question at the end?

(I’ve tried looking at the disassembly and the SIL, using Instruments in various ways, checking for differences when commenting out various pieces of the code, annotating funcs with @_transparent, @inline(…), etc., but I think I need to see some step-by-step practical examples in order to really understand what it is I’m doing. If it’s easier/better to explain using something like Hopper, that would also be interesting to see.)
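For reference, these are the kinds of commands I’ve been using so far (a sketch only; exact toolchain paths and flag spellings may vary between snapshots):

``````shell
# Dump the optonized (canonical) SIL for inspection:
swiftc -O -emit-sil test.swift > test.sil

# Dump the generated assembly for the whole module:
swiftc -O -S test.swift > test.s

# Or disassemble the built binary and demangle the Swift symbols (macOS):
otool -tvV ./test | xcrun swift-demangle > test.disasm
``````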

### The example program:

``````swift
import AppKit

// The following FixedWidthInteger extension adds an initializer that will
// be the subject of our performance analysis.
//
// The purpose of the initializer is to offer full range conversion between
// any two signed/unsigned fixed width integers, ie mapping the source value
// from the full range of the source type to the full range of the destination
// type (without using unnecessarily slow operations or risking overflow).
//
// So for example:
//
//   Int8(rangeConverted: UInt8(  0)) == Int8(-128)
//   Int8(rangeConverted: UInt8(128)) == Int8(   0)
//   Int8(rangeConverted: UInt8(255)) == Int8( 127)
//
//   UInt32(rangeConverted: UInt8(0x00)) == UInt32(0x00_00_00_00)
//   UInt32(rangeConverted: UInt8(0x80)) == UInt32(0x80_80_80_80)
//   UInt32(rangeConverted: UInt8(0xff)) == UInt32(0xff_ff_ff_ff)
//
//   UInt16(rangeConverted: Int64.min) == UInt16.min
//   UInt16(rangeConverted: Int64(0) ) == UInt16((Double(UInt16.max) / 2).rounded())
//   UInt16(rangeConverted: Int64.max) == UInt16.max

extension FixedWidthInteger {

    /// Creates a new instance by mapping `source` from the full range of
    /// `Src` to the full range of `Self`.
    init<Src>(rangeConverted source: Src) where Src: FixedWidthInteger {
        let srcBW = Src.bitWidth
        let dstBW = Self.bitWidth
        let source = Src.isSigned
            ? source &+ (Src(1) << (srcBW &- 1))
            : source
        if srcBW < dstBW {
            self = Self(truncatingIfNeeded: source)
            var shifts = srcBW
            while shifts < dstBW {
                self |= self << shifts
                shifts = shifts << 1
            }
        } else if srcBW > dstBW {
            let shifts = srcBW &- dstBW
            self = Self(truncatingIfNeeded: source >> shifts)
        } else {
            self = Self(truncatingIfNeeded: source)
        }
        if Self.isSigned {
            self = self &+ (Self(1) << (dstBW &- 1))
        }
    }
}

// This type is only a simple example context for this particular benchmark.
final class Buffer<Element: FixedWidthInteger> {
    var data: UnsafeMutableBufferPointer<Element>
    init(count: Int) {
        precondition(count > 0)
        data = UnsafeMutableBufferPointer<Element>.allocate(capacity: count)
        data.initialize(repeating: 0)
    }
    func setAllElementsToRandomBitPatterns() {
        let rc = SecRandomCopyBytes(nil, data.count, data.baseAddress!)
        precondition(rc == errSecSuccess)
    }
    func setElements<E>(from other: Buffer<E>, transform: (E) -> Element) {
        precondition(other.data.count <= data.count)
        for i in data.indices {
            data[i] = transform(other.data[i])
        }
    }
    func checksum() -> Element {
        var checksum = Element(0)
        for e in data { checksum = checksum ^ e }
        return checksum
    }
    deinit {
        data.deallocate()
    }
}

func benchmark<S, D>(source _: S.Type, destination _: D.Type)
    where S: FixedWidthInteger, D: FixedWidthInteger
{
    let num = 100_000_000
    print("\nRange converting \(num) values from \(S.self) to \(D.self):")
    let srcBuf = Buffer<S>(count: num)
    let dstBuf = Buffer<D>(count: num)
    for trial in 0 ..< 5 {
        srcBuf.setAllElementsToRandomBitPatterns()
        let t0 = CACurrentMediaTime()
        dstBuf.setElements(from: srcBuf, transform: D.init(rangeConverted:))
        let t1 = CACurrentMediaTime()
        print("  trial \(trial):",
              t1 - t0, "seconds (checksum: \(dstBuf.checksum()))")
    }
}

// This function is just here for completeness, in case anyone should find
// it interesting to see the benchmarks for all (8 * 8 = 64) conversions.
func benchmarkAll() {
    func benchAllSources<D: FixedWidthInteger>(withDestination: D.Type) {
        benchmark(source: UInt8.self, destination: D.self)
        benchmark(source: UInt16.self, destination: D.self)
        benchmark(source: UInt32.self, destination: D.self)
        benchmark(source: UInt64.self, destination: D.self)
        benchmark(source: Int8.self, destination: D.self)
        benchmark(source: Int16.self, destination: D.self)
        benchmark(source: Int32.self, destination: D.self)
        benchmark(source: Int64.self, destination: D.self)
    }
    benchAllSources(withDestination: UInt8.self)
    benchAllSources(withDestination: UInt16.self)
    benchAllSources(withDestination: UInt32.self)
    benchAllSources(withDestination: UInt64.self)
    benchAllSources(withDestination: Int8.self)
    benchAllSources(withDestination: Int16.self)
    benchAllSources(withDestination: Int32.self)
    benchAllSources(withDestination: Int64.self)
}

// But for the purpose of this thread, it suffices to look at
// this subset of conversions:
func benchmark8To64() {
    benchmark(source: UInt8.self, destination: UInt64.self)
    benchmark(source: Int8.self, destination: UInt64.self)
    benchmark(source: UInt8.self, destination: Int64.self)
    benchmark(source: Int8.self, destination: Int64.self)
}

benchmark8To64()
``````
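To make the hot path concrete: for an 8-bit source and a 64-bit destination, the while loop simply replicates the low byte across the whole word, doubling the number of replicated bits each iteration (8 → 16 → 32 → 64). This can be checked in isolation; `replicate8` below is a hypothetical standalone helper, not part of the benchmark:

``````swift
// Starting from the low 8 bits, each `v |= v << shifts` doubles the number
// of replicated bits, so three iterations fill a UInt64.
func replicate8(_ byte: UInt8) -> UInt64 {
    var v = UInt64(truncatingIfNeeded: byte)
    var shifts = 8
    while shifts < 64 {
        v |= v << shifts
        shifts <<= 1
    }
    return v
}

assert(replicate8(0x00) == 0x00_00_00_00_00_00_00_00)
assert(replicate8(0x80) == 0x80_80_80_80_80_80_80_80)
assert(replicate8(0xff) == 0xff_ff_ff_ff_ff_ff_ff_ff)
``````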

### Compiling and running (on my MBP late 2013, 2GHz i7):

``````text
› /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-03-03-a.xctoolchain/usr/bin/swiftc \
--version
Apple Swift version 4.2-dev (LLVM c4ec2ab808, Clang af436f313e, Swift 95e1558e4a)
Target: x86_64-apple-darwin17.4.0

› /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-03-03-a.xctoolchain/usr/bin/swiftc \
-sdk /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk \
-O -whole-module-optimization -gnone -static-stdlib test.swift
› ./test

Range converting 100000000 values from UInt8 to UInt64:
  trial 0: 0.145914474007441 seconds (checksum: 9114861777597660798)
  trial 1: 0.111783446976915 seconds (checksum: 17651002172490708212)
  trial 2: 0.115817560988944 seconds (checksum: 4268070197446523707)
  trial 3: 0.110095688985894 seconds (checksum: 12659530246663417775)
  trial 4: 0.115379397000652 seconds (checksum: 14033993530586874562)

Range converting 100000000 values from Int8 to UInt64:
  trial 0: 0.111972735001473 seconds (checksum: 7234017283807667293)
  trial 1: 0.122052905004239 seconds (checksum: 7016996765293437226)
  trial 2: 0.123373145994265 seconds (checksum: 5280832617179597075)
  trial 3: 0.120153477007989 seconds (checksum: 10199964370168810929)
  trial 4: 0.122060198977124 seconds (checksum: 17651002172490708124)

Range converting 100000000 values from UInt8 to Int64:
  trial 0: 0.40651530600735 seconds (checksum: 868082074056920076)
  trial 1: 0.409130525978981 seconds (checksum: -1085102592571150096)
  trial 2: 0.405297498014988 seconds (checksum: 7234017283807667300)
  trial 3: 0.411088274006033 seconds (checksum: -5280832617179597130)
  trial 4: 0.404385382978944 seconds (checksum: -3906369333256140343)

Range converting 100000000 values from Int8 to Int64:
  trial 0: 0.412243063008646 seconds (checksum: 2314885530818453528)
  trial 1: 0.417125052015763 seconds (checksum: -3255307777713450296)
  trial 2: 0.416680544993142 seconds (checksum: 72340172838076760)
  trial 3: 0.414291215012781 seconds (checksum: 1663823975275763531)
  trial 4: 0.419294328981778 seconds (checksum: -3544668469065757043)
›
``````

### Question:

Why are the last two conversions

• `Range converting 100000000 values from UInt8 to Int64:`

• `Range converting 100000000 values from Int8 to Int64:`

almost 4 times slower than the first two conversions (0.4 seconds compared to 0.1 seconds)?

Note 1: Even if we comment out everything in the initializer that has to do with checking whether a type `.isSigned`, the results are the same, i.e. the last two conversions will still take around 0.4 s, and the first two will still take around 0.1 s. (Edit: clarified this note after reading this reply.)

Note 2: It seems to be the while loop (in the initializer) that is taking more time for the slower conversions, even though all four of these conversions do 3 iterations. Perhaps the loop is unrolled only for the fast conversions? I have not managed to verify whether this (or anything else, really) is the case.
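The iteration count depends only on the two bit widths, so it is easy to confirm that all four 8-to-64-bit conversions run the loop exactly three times, regardless of signedness. `iterationCount` below is a hypothetical helper mirroring the loop’s control flow, written just to demonstrate this:

``````swift
// Mirrors the control flow of the replication loop: shifts goes
// srcBW -> 2*srcBW -> ... until it reaches dstBW.
func iterationCount(srcBW: Int, dstBW: Int) -> Int {
    var shifts = srcBW
    var n = 0
    while shifts < dstBW {
        n += 1
        shifts <<= 1
    }
    return n
}

assert(iterationCount(srcBW: 8, dstBW: 64) == 3)  // 8 -> 16 -> 32 -> 64
assert(iterationCount(srcBW: 8, dstBW: 16) == 1)
``````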

## EDIT:

I thought that I had already checked this, but I just realized that the two last conversions are slower only when I’m compiling with these recent development snapshots:

• swift-DEVELOPMENT-SNAPSHOT-2018-02-27-a
• swift-DEVELOPMENT-SNAPSHOT-2018-03-02-a
• swift-DEVELOPMENT-SNAPSHOT-2018-03-03-a
• swift-DEVELOPMENT-SNAPSHOT-2018-03-07-a

But when I’m compiling with the default toolchains of Xcode 9.2 or 9.3b3, all four conversions are equally fast (about 0.1 seconds).

The above versions of the compiler are the only ones I have access to, so I can only conclude that the regression must have been introduced somewhere between the default Xcode 9.3b3 toolchain and dev snapshot 2018-02-27.

I think this (the cause being a regression) only makes any potential step-by-step guide(s) more interesting!

By how much?

The same, i.e. still almost 4 times slower. Please note the edit of the original post about this being true only in recent dev snapshots!

The timings will in fact be identical (apart from the usual small fluctuations of course) when I run the test using the following version of the initializer, where anything having to do with checking `.isSigned` has been commented out:

``````swift
/// Creates a new instance by mapping `source` from the full range of
/// `Src` to the full range of `Self`.
init<Src>(rangeConverted source: Src) where Src: FixedWidthInteger {
    let srcBW = Src.bitWidth
    let dstBW = Self.bitWidth
    /*
    let source = Src.isSigned
        ? source &+ (Src(1) << (srcBW &- 1))
        : source
    */
    if srcBW < dstBW {
        self = Self(truncatingIfNeeded: source)
        var shifts = srcBW
        while shifts < dstBW {
            self |= self << shifts
            shifts = shifts << 1
        }
    } else if srcBW > dstBW {
        let shifts = srcBW &- dstBW
        self = Self(truncatingIfNeeded: source >> shifts)
    } else {
        self = Self(truncatingIfNeeded: source)
    }
    /*
    if Self.isSigned {
        self = self &+ (Self(1) << (dstBW &- 1))
    }
    */
}
``````

Since all the conditional tests are only on static constant values, the optimizer probably eliminates them, reducing the code to something that is neither faster nor slower than when the ternary operator and if-statement are present, I guess.

But for some reason, the while loop (which has the same number of iterations for all four conversions) seems to be 4 times slower when the destination type (`Self`) is `Int64` than when it is `UInt64`, but only in recent dev snapshots.
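To convince myself that the extra cost is not inherent in the algorithm, here is a hand-specialized sketch of the two hot paths (hypothetical helper names; this is what the generic initializer should reduce to after specialization, not what the compiler necessarily emits). Up to the final wrapping add for the signed destination, the loop bodies are bit-for-bit the same work:

``````swift
// UInt8 -> UInt64: replicate the byte, no signedness adjustment needed.
func toUInt64(_ s: UInt8) -> UInt64 {
    var v = UInt64(truncatingIfNeeded: s)
    v |= v << 8; v |= v << 16; v |= v << 32
    return v
}

// UInt8 -> Int64: same replication, plus the wrapping add of 1 << 63
// that maps the unsigned range onto the signed one.
func toInt64(_ s: UInt8) -> Int64 {
    var v = Int64(truncatingIfNeeded: s)
    v |= v << 8; v |= v << 16; v |= v << 32
    return v &+ (Int64(1) << 63)
}

assert(toUInt64(0xff) == .max)
assert(toInt64(0xff) == .max)
assert(toInt64(0x00) == .min)
``````

If both helpers compile to comparable code, the 4x gap in the generic version points at specialization/unrolling in the optimizer rather than at the extra `&+`.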

Filed SR-7150, but it would still be very interesting to see a step-by-step guide showing a deeper analysis of this. Such practical examples would be a valuable learning resource, because it would be in a form, or at a level, that is (in my experience) otherwise very hard to find.

EDIT: For anyone interested, @Andrew_Trick has written a detailed description of his investigation of this bug (in the comments of the bug report). Thanks!