For loop or while loop - which one is more efficient?

After reading this post by @Karl, the question popped into my mind, and I decided to find out.

Given:

// Let U be some type

func f (_ u: U) {...}

let uv: [U] = ...

let M = uv.count

Which one of these loops is more efficient?

// (1)
for u in uv {
   f (u)
}
// (2)
for i in 0..<M {
   f (uv[i])
}
// (3)
var i: Int = 0
while i < M {
   f (uv[i])
   i += 1
}
Measurements
// Built with Xcode 16.0, optimize for Speed [-O]
// On a MacMini 3.2 GHz 6-Core Intel Core i7

T0() trigger lazy initialization of uv...
initialize uv...
M = 1000000 T0 took 0.112677642 seconds   // Trigger lazy init of uv

M = 1000000 T1 took 0.000593867 seconds   // for loop - for u in uv {
M = 1000000 T2 took 0.000543171 seconds   // for loop - for i in 0..<M {
M = 1000000 T3 took 0.000509566 seconds   // while loop
M = 1000000 T4 took 0.00048985 seconds    // reduce
M = 1000000 T5 took 0.000427039 seconds   // while loop inside withUnsafeBufferPointer

Code
import Foundation

@main
enum ForOrWhileLoop {
    static func main ()  {
        print (type(of: Self.self), "...")
        
        measure (prefix: "T0", f: T0)
        measure (prefix: "T1", f: T1)
        measure (prefix: "T2", f: T2)
        measure (prefix: "T3", f: T3)
        measure (prefix: "T4", f: T4)
        measure (prefix: "T5", f: T5)
    }
}

let M = 1_000_000

let uv: [Int] = .init (unsafeUninitializedCapacity: M) { buffer, initializedCount in
    print ("initialize uv...")
    for i in 0..<M {
        buffer [i] = Int.random(in: 0..<1024)
    }
    initializedCount = M
}

func measure (prefix: String, f: () -> Int) {
    let c = ContinuousClock ()
    let d = c.measure {
        _ = f ()
    }
    print ("M = \(M)", prefix, "took", d)
}

func T0 () -> Int {
    print (#function, "trigger lazy initialization of uv...")
    
    var sum: Int = 0
    for u in uv {
        sum += u
    }
    return sum
}

// T1: for-in loop over the array
func T1 () -> Int {
    var sum: Int = 0
    for u in uv {
        sum += u
    }
    return sum
}

// T2: for loop over the indices 0..<M
func T2 () -> Int {
    var sum: Int = 0
    for i in 0..<M {
        sum += uv [i]
    }
    return sum
}

// T3: while loop over the indices
func T3 () -> Int {
    var sum: Int = 0
    var i: Int = 0
    
    while i < M {
        sum += uv [i]
        i += 1
    }
    return sum
}

// T4: reduce
func T4 () -> Int {
    let sum = uv.reduce (0) { partialResult, u in
        partialResult + u
    }
    return sum
}

// T5: while loop inside withUnsafeBufferPointer
func T5 () -> Int {
    var sum: Int = 0
    uv.withUnsafeBufferPointer {
        var i: Int = 0
        while i < M {
            sum += $0 [i]
            i += 1
        }
    }
    return sum
}

Looks like the good old while loop is the winner. :slight_smile:

Edit: My initial observation was incorrect as I wasn't aware of the effect of the lazy initialization of the global variable uv. See @ahti's post below. I have modified the test code and the results. Looks like there isn't a big difference after all. @jrose, @Alejandro, @Karl, @asdf_bro, @David_Smith. Thank you all.

An interesting result, worth filing a bug over! I can’t think of any reason why the for-in loop should be slower here.

Compiler Explorer: the for .. in loop has retains and releases for the array it's iterating over. The for loop over the indices and the while loop generate the same assembly.

Right, that's because the for loop gets expanded into an iterator and a while loop:

func test1(uv: [Int]) {
  for u in uv {
    test(u)
  }
}

func test1_expanded(uv: [Int]) {
  var iter = uv.makeIterator()
  while let u = iter.next() {
    test(u)
  }
}

In the Godbolt output, the implementations get merged because they are the same.

The optimiser is able to eliminate the retain/release pair in both implementations if you compile with -Xfrontend -enable-ossa-modules. I don't know what's preventing that from being enabled by default, and in the case of Vector I don't know if it would be enough to save the Sequence conformance (Array copies do not depend on the element type, but Vector copies do; can this flag still remove the copy of a Vector of resilient types, where the compiler doesn't know whether the copies have side effects?).
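For anyone who wants to experiment with that flag from a SwiftPM package, here is a minimal, hypothetical manifest sketch; the package and target names are placeholders, and the flag can equally be passed straight to swiftc:

// swift-tools-version:5.9
// Package.swift - hypothetical manifest that forwards the frontend flag.
import PackageDescription

let package = Package(
    name: "LoopBenchmark",          // placeholder name
    targets: [
        .executableTarget(
            name: "LoopBenchmark",
            swiftSettings: [
                // Forward the flag discussed above to the Swift frontend.
                .unsafeFlags(["-Xfrontend", "-enable-ossa-modules"])
            ]
        )
    ]
)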

In your godbolt example, if you change Int to String, it's clear that all three looping methods have identical ARC traffic inside the loop.
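For reference, a sketch of the String variants one could compare in Compiler Explorer; the function names and the opaque test sink are illustrative, not taken from the linked example:

// An opaque sink so the optimiser can't simply discard the element.
@inline(never)
func test(_ s: String) {
    _ = s.count
}

func test1(uv: [String]) {
    for u in uv {
        test(u)
    }
}

func test2(uv: [String]) {
    for i in 0..<uv.count {
        test(uv[i])
    }
}

func test3(uv: [String]) {
    var i = 0
    while i < uv.count {
        test(uv[i])
        i += 1
    }
}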

Something is wrong with @ibex10's test, unless a single retain/release pair actually takes 100ms on their machine.

I don't have a new Apple silicon machine; my MacMini is quite ancient.

I am curious about what the numbers would look like on Apple silicon.

Typically I would expect one retain-release pair to be in the ballpark of 30ns (note: not ms) on x86, and 6ns on Apple Silicon. A big relative difference, but not by itself enough to explain the result.

In your benchmark code, uv is a global variable. IIRC, those are lazy, so the time measured for T1 includes the time spent creating the array.
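For completeness, a minimal sketch of the warm-up pattern that keeps the array's creation out of the measured region; it assumes the uv global from the code above and is essentially what the T0 run in the edited results does:

// Touch the global once so its lazy initialization runs
// outside the measured region.
_ = uv.count

let clock = ContinuousClock()
var sum = 0
let elapsed = clock.measure {
    for u in uv {
        sum += u
    }
}
print("sum =", sum, "took", elapsed)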

Spot on. Thank you, @ahti.

I will edit my post to take into account the effect of lazy initialization.

You're using top-level code, so it's not lazy, but since script variables are still globally mutable, they are optimized much more conservatively than local variables would be. When benchmarking, it's a good idea to at least wrap your variable declarations and code in a do block so that they get compiled the way ordinary local variables normally would be.
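A minimal sketch of that suggestion, assuming a top-level main.swift script; all names are placeholders:

// main.swift - wrapping the declarations and the timed code in a `do` block
// makes them genuine locals instead of globally mutable script variables.
do {
    let m = 1_000_000
    let values = (0..<m).map { _ in Int.random(in: 0..<1024) }

    let clock = ContinuousClock()
    var sum = 0
    let elapsed = clock.measure {
        for v in values {
            sum += v
        }
    }
    print("sum =", sum, "took", elapsed)
}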
