Loop unrolling optimization

#1

Calculating the sum of the elements of a numeric sequence is a common task. This post is motivated by this article and a recent answer to this SO question.

A vanilla way of summing is:

let a = Array(1...1_000_000)
var sum = 0
var i = 0
while i < a.count {
	sum += a[i]
	i += 1
}

I would expect the optimizer to apply loop unrolling to such a basic loop. But in these benchmarks, manual unrolling is quicker:

while i < a.count - 4 {
	sum += a[i] + a[i+1] + a[i+2] + a[i+3]
	i += 4
}

for j in i..<a.count {
	sum += a[j]
}

I've experimented with different unrolling sizes; 4 to 6 elements seems like a sweet spot.
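To make the unrolled fragment above self-contained, here is a minimal sketch packaged as a function. The name `unrolledSum` is mine, not from the thread; the loop bodies match the post.

```swift
// Hypothetical helper: sum with a manual unroll factor of 4.
func unrolledSum(_ a: [Int]) -> Int {
    var sum = 0
    var i = 0
    // Main unrolled loop: four additions per iteration.
    while i + 4 <= a.count {
        sum += a[i] + a[i+1] + a[i+2] + a[i+3]
        i += 4
    }
    // Scalar tail loop for the remaining 0–3 elements.
    for j in i..<a.count {
        sum += a[j]
    }
    return sum
}

let a = Array(1...1_000_000)
print(unrolledSum(a))  // 500000500000
```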

It clocks the same as, or slightly better than, reduce:

sum = a.reduce(0, +)

I was wondering: why is reduce optimized while a normal loop isn't? Are my benchmarks valid at all? And is this related?


Unrelated side note:

Vectorization with cblas_dasum is twice as fast as the other approaches but doesn't work with integers:

import Accelerate

var x = a.map { Double($0) }
let sum = cblas_dasum(Int32(x.count), &x, 1)

(Benchmarks were run on my local machine in the terminal with optimizations)

(Joe Groff) #2

It's a known issue that Swift's overflow checks can block LLVM loop unrolling passes from happening to the extent they should. Is it better if you use non-overflow-checking operators like &+, etc., in your loop condition arithmetic? e.g.:

let a = Array(1...1_000_000)
var sum = 0
var i = 0
while i < a.count {
	sum += a[i]
	i = i &+ 1 // avoid overflow-checking the loop variable
}

Standard library operations like reduce have already been optimized by hand to avoid unnecessary safety checks, which might account for the difference.

#3

That would explain it, but I couldn't measure any improvement using the same benchmarks.

(Tellow Krinkle) #4

I think the bigger issue is not the incrementing of the loop counter (the compiler is good at proving bounds on that) but the summing of the array elements. In my test here, all four ways of looping compiled to nearly identical (non-unrolled) code, with the manual unroll being the only one that differed.

If you really need speed (at the cost of overflow checks), sum the array elements with &+; then the compiler will unroll and vectorize your code, as seen here. There's a bit too much code here for me to compare the different versions, but at least the for item in a loop compiled so identically to the reduce that the compiler decided to make its function call the other one. Remember that you're giving up overflow checks for this, though.

(Joe Groff) #5

Ah, I would've hoped the overflow checks on things other than the induction variable would not block unrolling, but I was mistaken. It'd be good to file a bug for this.

(Steve Canon) #6

Yeah, these are a bit over-unrolled (we certainly don't need more than 4 paddqs per loop iteration, and actually 2 should be fine), but they are all more or less equivalent.

1 Like
(Tellow Krinkle) #7

Filed as SR-10518

1 Like
(Tellow Krinkle) #8

Also, I'm not sure how you were benchmarking this, but if you were getting different results for your loop than for reduce, I would guess it's because you were running it in a top-level Swift file. Because the variables in a top-level Swift file are technically globals, they defeat a lot of compiler optimizations. When you use reduce, you're no longer working with global variables for the loop counter and sum, so those don't have as many issues. When benchmarking, always put your code in a do {} block or a function.

Demo:
Top-level Swift file: the hot loop is L53-L64, from what I can tell
do block: the hot loop is L38-L44
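A minimal sketch of that advice, with the benchmark wrapped in a function so the counter and accumulator are locals rather than top-level globals (the name `benchmarkSum` is mine):

```swift
import Foundation

// Wrapping the loop in a function keeps `sum` and `i` local,
// so the optimizer can keep them in registers.
func benchmarkSum(_ a: [Int]) -> (sum: Int, seconds: TimeInterval) {
    let start = Date()
    var sum = 0
    var i = 0
    while i < a.count {
        sum += a[i]
        i += 1
    }
    return (sum, Date().timeIntervalSince(start))
}

let a = Array(1...1_000_000)
let result = benchmarkSum(a)
print(result.sum, "\t", result.seconds)
```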

3 Likes
#9

It seems that you're correct: it's a top-level optimization issue, from what I can tell (and confirmed by the compiler code).

I've just run these new benchmarks on my local machine (Swift 5, terminal with optimizations):

do {
    let a = Array(1...1_000_000)
    let start = Date()
    
    var sum = 0
    var i = 0
    while i < a.count {
        sum += a[i]
        i += 1
    }
    
    let end = Date()
    print(sum, "\t", end.timeIntervalSince(start))
}

do {
    let a = Array(1...1_000_000)
    let start = Date()
    
    var sum = 0
    var i = 0
    while i < a.count - 4 {
        sum += a[i] + a[i+1] + a[i+2] + a[i+3]
        i += 4
    }
    
    for j in i..<a.count {
        sum += a[j]
    }
    
    let end = Date()
    print(sum, "\t", end.timeIntervalSince(start))
}

do {
    let a = Array(1...1_000_000)
    let start = Date()
    
    var sum = 0
    var i = 0
    while i < a.count - 6 {
        sum += a[i] + a[i+1] + a[i+2] + a[i+3] + a[i+4] + a[i+5]
        i += 6
    }
    
    for j in i..<a.count {
        sum += a[j]
    }
    
    let end = Date()
    print(sum, "\t", end.timeIntervalSince(start))
}

do {
    let a = Array(1...1_000_000)
    let start = Date()
    
    let sum = a.reduce(0,+)
    
    let end = Date()
    print(sum, "\t", end.timeIntervalSince(start))
}

Pretty much all the loops and reduce clocked the same, within a slight margin of error. Replacing all occurrences of + with &+ in the loops gained up to 10µs for an array of 1,000,000 elements.

Thank you all for your time and replies.

1 Like