I'd say you are seeing only two substantially different time measurements there: one is about 15 ms, the other is about 30 ms. The `for` test with `Array` might be a bit slower than the `while` tests, but we'd need to see more runs and look at the disassembly to draw that conclusion.
(For anyone following this thread, note that the times have changed because the limit for the highest prime is now `4_000_037` instead of `1_000_000`, as it was in previous examples.)
Here are a couple of runs of the same program on my machine (which seems to be a bit slower than yours):
```
for Array min time: 19.320964813232422 ms
for ContiguousArray min time: 46.321988105773926 ms
lazy Array min time: 46.383023262023926 ms
lazy ContiguousArray min time: 46.33796215057373 ms
while Array min time: 18.932104110717773 ms
while ContiguousArray min time: 18.826007843017578 ms
for Array min time: 19.291043281555176 ms
for ContiguousArray min time: 46.21601104736328 ms
lazy Array min time: 46.48399353027344 ms
lazy ContiguousArray min time: 46.28896713256836 ms
while Array min time: 18.828988075256348 ms
while ContiguousArray min time: 18.849968910217285 ms
for Array min time: 20.66493034362793 ms
for ContiguousArray min time: 50.183892250061035 ms
lazy Array min time: 50.8190393447876 ms
lazy ContiguousArray min time: 50.486087799072266 ms
while Array min time: 20.4010009765625 ms
while ContiguousArray min time: 19.833087921142578 ms
```
I am also using Xcode 10 beta 1, default toolchain, release build (`-O`; there's no `-Ofast`, only `-Ounchecked`, `-O`, `-Osize` and `-Onone`, and `-O` is the one that optimizes for speed).
Guess why the measurements of the third run are slower than the first two?
It's because I was moving the mouse cursor like crazy while the test was running. :)
Note that `ContiguousArray` is not always slower than `Array`.
Also, look at what happens if I run the same program with safety checks disabled (`-Ounchecked`):
```
for Array min time: 20.627975463867188 ms
for ContiguousArray min time: 20.71702480316162 ms
lazy Array min time: 56.8850040435791 ms
lazy ContiguousArray min time: 46.7149019241333 ms
while Array min time: 17.348051071166992 ms
while ContiguousArray min time: 17.39799976348877 ms
for Array min time: 20.56705951690674 ms
for ContiguousArray min time: 20.666956901550293 ms
lazy Array min time: 57.30998516082764 ms
lazy ContiguousArray min time: 46.66697978973389 ms
while Array min time: 17.3569917678833 ms
while ContiguousArray min time: 17.35401153564453 ms
for Array min time: 20.643949508666992 ms
for ContiguousArray min time: 20.751953125 ms
lazy Array min time: 57.5789213180542 ms
lazy ContiguousArray min time: 46.80800437927246 ms
while Array min time: 17.465949058532715 ms
while ContiguousArray min time: 17.420053482055664 ms
```
Now, both `for` cases are equally fast, which suggests that there might have been a bounds check or something similar that wasn't optimized away for `for` with `ContiguousArray`. (I haven't looked at the disassembly.)
It might seem surprising that one test case (`lazy` with `Array`) has become significantly slower when disabling safety checks. Also, it seems like the `for` tests have become slightly slower, and the `while` tests have become slightly faster. My expectation would be that the tests could become faster, but never slower, with `-Ounchecked`. I have seen (entirely different) code become slower with `-Ounchecked` before, and I'd like to understand why that is.
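To make the bounds-check hypothesis concrete, here is a minimal sketch (my own illustration, not the code under test) of the kind of loop where the flag matters; with `-O` the compiler must prove the subscript is in bounds before it can drop the check, while `-Ounchecked` simply removes the precondition:

```swift
func sum(_ a: ContiguousArray<Int>) -> Int {
    var s = 0
    var i = 0
    while i < a.count {
        s &+= a[i] // with -O, a bounds check may remain here unless the
                   // optimizer proves i < a.count at the access;
                   // with -Ounchecked, the check is removed unconditionally
        i &+= 1
    }
    return s
}
```

One way to find out which case applies is to look at the SIL or assembly (e.g. `swiftc -O -emit-sil` or `swiftc -O -emit-assembly`) and search for the trapping bounds-check branch.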
Anyway, my point is that it's far from straightforward to draw conclusions like:

- For any test, `ContiguousArray` is slower than `Array`.
- `stride` seems to be slow.

without trying out various contexts, rerunning the program (even though it performs hundreds of trials and reports min, max, mean, median and whatever), seeing significant, consistent time differences, and analyzing the disassembly.
As shown in the first test runs of this post, using `ContiguousArray` is not always slower than using `Array` for `lazy` and `while` (at least not on my machine, even though all other things are the same). The time differences are too small and too fluctuating.
And the reason why `lazy` is slow need not be only because of `stride`. It might just as well be because it passes code through closures extensively, together with various shortcomings of the optimizer given the specific context (e.g. for `stride`), etc. But don't get me wrong: it is possible that the current implementation of `stride` makes it unnecessarily inefficient.
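To make the closure point concrete, here is a hypothetical reconstruction (my own sketch, not the actual code from SR-7637) of what a `lazy` + `stride` trial-division formulation might look like; note that every candidate flows through the `filter` closure, and every divisor test flows through the `contains` closure:

```swift
// Hypothetical sketch only: each of the ~4 million candidates is pushed
// through two layers of closures, which the optimizer may or may not be
// able to specialize and inline in a given context.
func primesLazy(upTo limit: Int) -> [Int] {
    return Array((2 ... limit).lazy.filter { candidate in
        !stride(from: 2, through: Int(Double(candidate).squareRoot()), by: 1)
            .contains { candidate % $0 == 0 }
    })
}
```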
By the way, the tiny bit of intuition I've been able to build for the performance of Swift code would not let me trust the harness used for measuring the time. Since it passes the tested function around as a closure, it might itself introduce some of the performance inconsistencies we are trying to pinpoint, i.e. it might not be the least performance-degrading context in which to call the test functions:
```swift
import Foundation // for Date

func timeOneTest(_ name: String, _ storageType: String, _ function: (Int) -> [Int]) -> (Double, Int) {
    let start = Date()
    let last = function(4_000_037).last ?? 0
    let end = Date()
    let time = end.timeIntervalSince(start)
    //print("\(name) \(storageType) biggest: \(last) time: \(time)")
    return (time, last)
}

func test(_ name: String, _ storageType: String, _ function: @escaping (Int) -> [Int]) {
    let n = 47
    let minimum = (0 ..< n).lazy.map { _ in
        timeOneTest(name, storageType, function)
    }.reduce(Double.greatestFiniteMagnitude) { (r, t) in
        r > t.0 ? t.0 : r
    }
    print("\(name) \(storageType) min time: \(minimum * 1000.0) ms")
}
```
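If one wanted to rule out the `Date` wall clock as a source of noise, a monotonic clock could be substituted. Here is a minimal sketch (the function name is my own, and it only changes the clock, not the closure-passing concern):

```swift
import Dispatch

// Same shape as timeOneTest, but DispatchTime is monotonic (it can't jump
// when the wall clock is adjusted) and is cheap to read.
func timeOneTestMonotonic(_ name: String, _ storageType: String,
                          _ function: (Int) -> [Int]) -> (Double, Int) {
    let start = DispatchTime.now().uptimeNanoseconds
    let last = function(4_000_037).last ?? 0
    let end = DispatchTime.now().uptimeNanoseconds
    return (Double(end - start) / 1_000_000_000.0, last) // seconds
}
```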
(Also, if I'm not mistaken, the code is written in a way such that the resulting primes (or the `last` resulting prime) are never used, and if the optimizer were more aggressive, it could probably remove all of the prime-finding code by dead code elimination.)
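A common way to guard against that (similar in spirit to the `blackHole` helper in the Swift benchmark suite; the name and shape here are my own sketch) is to funnel each result through a call the optimizer can't see into:

```swift
// @inline(never) keeps the call opaque, so the optimizer must assume the
// value is used and can't dead-code-eliminate the computation behind it.
@inline(never)
func blackHole<T>(_ value: T) {
    // deliberately empty
}

// e.g. inside timeOneTest:
// blackHole(function(4_000_037).last)
```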
But I did a quick check (changing 4_000_037 to 1_000_000) and saw that the results for `for` and `while` are the same as in the demo program of SR-7637, so, at least for these particular test functions, the harness is probably not incurring any loss in performance.
But please have a look at e.g. SR-7023, SR-7094 and SR-6983 to get a feel for how unpredictable and non-intuitive these things are.
As for the `lazy` variant, I haven't really looked closely at the implementation, and I can't say that I understand exactly how and why it does what it does, or whether it could be reformulated to be more efficient (while still deserving the name lazy, no pun intended).