For loop or while loop - which one is more efficient?

After reading this post by @Karl, the question popped into my mind, and I decided to find out.

Given:

// Let U be some type

func f (_ u: U) {...}

let uv: [U] = ...

let M = uv.count

Which one of these loops is more efficient?

// (1)
for u in uv {
   f (u)
}
// (2)
for i in 0..<M {
   f (uv[i])
}
// (3)
var i: Int = 0
while i < M {
   f (uv[i])
   i += 1
}
Measurements
// Built with Xcode 16.0, optimize for Speed [-O]
// On a MacMini 3.2 GHz 6-Core Intel Core i7

T0() trigger lazy initialization of uv...
initialize uv...
M = 1000000 T0 took 0.112677642 seconds   // Trigger lazy init of uv

M = 1000000 T1 took 0.000593867 seconds   // for loop - for u in uv {
M = 1000000 T2 took 0.000543171 seconds   // for loop - for i in 0..<M {
M = 1000000 T3 took 0.000509566 seconds   // while loop
M = 1000000 T4 took 0.00048985 seconds    // reduce
M = 1000000 T5 took 0.000427039 seconds   // while loop inside withUnsafeBufferPointer

Code
import Foundation

@main
enum ForOrWhileLoop {
    static func main ()  {
        print (type(of: Self.self), "...")
        
        measure (prefix: "T0", f: T0)
        measure (prefix: "T1", f: T1)
        measure (prefix: "T2", f: T2)
        measure (prefix: "T3", f: T3)
        measure (prefix: "T4", f: T4)
        measure (prefix: "T5", f: T5)
    }
}

let M = 1_000_000

let uv: [Int] = .init (unsafeUninitializedCapacity: M) { buffer, initializedCount in
    print ("initialize uv...")
    for i in 0..<M {
        buffer [i] = Int.random(in: 0..<1024)
    }
    initializedCount = M
}

func measure (prefix: String, f: () -> Int) {
    let c = ContinuousClock ()
    let d = c.measure {
        _ = f ()
    }
    print ("M = \(M)", prefix, "took", d)
}

func T0 () -> Int {
    print (#function, "trigger lazy initialization of uv...")
    
    var sum: Int = 0
    for u in uv {
        sum += u
    }
    return sum
}

// T1: for-in loop over the array
func T1 () -> Int {
    var sum: Int = 0
    for u in uv {
        sum += u
    }
    return sum
}

// T2: for loop over the indices 0..<M
func T2 () -> Int {
    var sum: Int = 0
    for i in 0..<M {
        sum += uv [i]
    }
    return sum
}

// T3: while loop over the indices
func T3 () -> Int {
    var sum: Int = 0
    var i: Int = 0
    
    while i < M {
        sum += uv [i]
        i += 1
    }
    return sum
}

// T4: reduce
func T4 () -> Int {
    let sum = uv.reduce (0) { partialResult, u in
        partialResult + u
    }
    return sum
}

// T5: while loop inside withUnsafeBufferPointer
func T5 () -> Int {
    var sum: Int = 0
    uv.withUnsafeBufferPointer {
        var i: Int = 0
        while i < M {
            sum += $0 [i]
            i += 1
        }
    }
    return sum
}

Looks like the good old while loop is the winner. :slight_smile:

Edit: My initial observation was incorrect as I wasn't aware of the effect of the lazy initialization of the global variable uv. See @ahti's post below. I have modified the test code and the results. Looks like there isn't a big difference after all. @jrose, @Alejandro, @Karl, @asdf_bro, @David_Smith. Thank you all.

An interesting result, worth filing a bug over! I can’t think of any reason why the for-in loop should be slower here.

Compiler Explorer: the for .. in loop has retains and releases for the array it's iterating over. The for loop over the indices and the while loop generate the same assembly.

Right, that's because the for loop gets expanded into an iterator and a while loop:

func test1(uv: [Int]) {
  for u in uv {
    test(u)
  }
}

func test1_expanded(uv: [Int]) {
  var iter = uv.makeIterator()
  while let u = iter.next() {
    test(u)
  }
}

In the Godbolt output, the implementations get merged because they are the same.

The optimiser is able to eliminate the retain/release pair in both implementations if you compile with -Xfrontend -enable-ossa-modules. I don't know what's preventing that from being enabled by default, and in the case of Vector I don't know if it would be enough to save the Sequence conformance (Array copies do not depend on the element type, but Vector copies do; can this flag still remove the copy of a Vector of resilient types, where the compiler doesn't know whether the copies have side effects?).
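For anyone who wants to experiment with that flag from a SwiftPM package, here is a minimal, hypothetical manifest sketch; the package and target names are placeholders, and the flag can equally be passed straight to swiftc:

// swift-tools-version:5.9
// Package.swift - hypothetical manifest that forwards the frontend flag.
import PackageDescription

let package = Package(
    name: "LoopBenchmark",          // placeholder name
    targets: [
        .executableTarget(
            name: "LoopBenchmark",
            swiftSettings: [
                // Forward the flag discussed above to the Swift frontend.
                .unsafeFlags(["-Xfrontend", "-enable-ossa-modules"])
            ]
        )
    ]
)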

In your godbolt example, if you change Int to String, it's clear that all three looping methods have identical ARC traffic inside the loop.
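For reference, a sketch of the String variants one could compare in Compiler Explorer; the function names and the opaque test sink are illustrative, not taken from the linked example:

// An opaque sink so the optimiser can't simply discard the element.
@inline(never)
func test(_ s: String) {
    _ = s.count
}

func test1(uv: [String]) {
    for u in uv {
        test(u)
    }
}

func test2(uv: [String]) {
    for i in 0..<uv.count {
        test(uv[i])
    }
}

func test3(uv: [String]) {
    var i = 0
    while i < uv.count {
        test(uv[i])
        i += 1
    }
}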

Something is wrong with @ibex10's test, unless a single retain/release pair actually takes 100ms on their machine.

I don't have a new Apple silicon machine; my MacMini is quite ancient.

I am curious about what the numbers would look like on Apple silicon.

Typically I would expect one retain-release pair to be in the ballpark of 30ns (note: not ms) on x86, and 6ns on Apple Silicon. A big relative difference, but not by itself enough to explain the result.

In your benchmark code, uv is a global variable. IIRC, those are lazy, so the time measured for T1 includes the time spent creating the array.
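For completeness, a minimal sketch of the warm-up pattern that keeps the array's creation out of the measured region; it assumes the uv global from the code above and is essentially what the T0 run in the edited results does:

// Touch the global once so its lazy initialization runs
// outside the measured region.
_ = uv.count

let clock = ContinuousClock()
var sum = 0
let elapsed = clock.measure {
    for u in uv {
        sum += u
    }
}
print("sum =", sum, "took", elapsed)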

Spot on. Thank you, @ahti.

I will edit my post to take into account the effect of lazy initialization.

You're using top-level code, so it's not lazy, but since script variables are still globally mutable, they are optimized much more conservatively than local variables would be. When benchmarking, it's a good idea to at least wrap your variable declarations and code in a do block so that they get compiled the way ordinary local variables normally would be.
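A minimal sketch of that suggestion, assuming a top-level main.swift script; all names are placeholders:

// main.swift - wrapping the declarations and the timed code in a `do` block
// makes them genuine locals instead of globally mutable script variables.
do {
    let m = 1_000_000
    let values = (0..<m).map { _ in Int.random(in: 0..<1024) }

    let clock = ContinuousClock()
    var sum = 0
    let elapsed = clock.measure {
        for v in values {
            sum += v
        }
    }
    print("sum =", sum, "took", elapsed)
}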
