Loop unrolling optimization

#1

Calculating the sum of the elements of a numeric sequence is a common task. This post is motivated by this article and a recent answer to this SO question.

A vanilla way of summing is:

let a = Array(1...1_000_000)
var sum = 0
var i = 0
while i < a.count {
	sum += a[i]
	i += 1
}

I would expect the optimizer to apply loop unrolling to such a basic loop. But in these benchmarks, manual unrolling is quicker:

while i < a.count - 4 {
	sum += a[i] + a[i+1] + a[i+2] + a[i+3]
	i += 4
}

for j in i..<a.count {
	sum += a[j]
}

I've experimented with different unrolling sizes; 4 to 6 elements seems like a sweet spot.
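To make the unrolled fragment above self-contained, here is a minimal sketch packaged as a function. The name `unrolledSum` is mine, not from the thread; the loop bodies match the post.

```swift
// Hypothetical helper: sum with a manual unroll factor of 4.
func unrolledSum(_ a: [Int]) -> Int {
    var sum = 0
    var i = 0
    // Main unrolled loop: four additions per iteration.
    while i + 4 <= a.count {
        sum += a[i] + a[i+1] + a[i+2] + a[i+3]
        i += 4
    }
    // Scalar tail loop for the remaining 0–3 elements.
    for j in i..<a.count {
        sum += a[j]
    }
    return sum
}

let a = Array(1...1_000_000)
print(unrolledSum(a))  // 500000500000
```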

It clocks the same as, or slightly better than, reduce:

sum = a.reduce(0, +)

I was wondering: why is reduce optimized while a normal loop isn't? Are my benchmarks valid at all? And is this related?


Unrelated side note:

Vectorization with cblas_dasum is twice as fast as the other approaches but doesn't work with integers:

import Accelerate

var x = a.map { Double($0) }
let sum = cblas_dasum(Int32(x.count), &x, 1)

(Benchmarks were run on my local machine in the terminal with optimizations)

(Joe Groff) #2

It's a known issue that Swift's overflow checks can block LLVM loop unrolling passes from happening to the extent they should. Is it better if you use non-overflow-checking operators like &+, etc., in your loop condition arithmetic? e.g.:

let a = Array(1...1_000_000)
var sum = 0
var i = 0
while i < a.count {
	sum += a[i]
	i = i &+ 1 // avoid overflow-checking the loop variable
}

Standard library operations like reduce have already been optimized by hand to avoid unnecessary safety checks, which might account for the difference.

#3

That would explain it, but I couldn't measure any improvement using the same benchmarks.

(Tellow Krinkle) #4

I think the bigger issue is not the incrementing of the loop counter (the compiler is good at proving bounds on that) but the summing of the array elements. In my test here, all four ways of looping compiled to nearly identical (non-unrolled) code, with the manual unroll being the only one that differed.

If you really need speed (at the cost of overflow checks), sum the array elements with &+; then the compiler will unroll and vectorize your code, as seen here. There's a bit too much code here for me to compare the different versions, but at least the for item in a loop compiled so identically to the reduce that the compiler decided to make its function call the other one. Remember that you're giving up overflow checks for this, though.

(Joe Groff) #5

Ah, I would've hoped the overflow checks on things other than the induction variable would not block unrolling, but I was mistaken. It'd be good to file a bug for this.

(Steve Canon) #6

Yeah, these are a bit over-unrolled (we certainly don't need more than 4 paddqs per loop iteration, and actually 2 should be fine), but they are all more or less equivalent.

1 Like
(Tellow Krinkle) #7

Filed as SR-10518

1 Like
(Tellow Krinkle) #8

Also, I'm not sure how you were benchmarking this, but if you were getting different results for your loop than for reduce, I would guess it's because you were running it in a top-level Swift file. Because the variables in a top-level Swift file are technically globals, they defeat a lot of compiler optimizations. When you use reduce, you're no longer working with global variables for the loop counter and sum, so those don't have as many issues. When benchmarking, always put your code in a do {} block or a function.

Demo:
Top-level Swift file: the hot loop is L53-L64, from what I can tell
do block: the hot loop is L38-L44
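A minimal sketch of that advice, with the benchmark wrapped in a function so the counter and accumulator are locals rather than top-level globals (the name `benchmarkSum` is mine):

```swift
import Foundation

// Wrapping the loop in a function keeps `sum` and `i` local,
// so the optimizer can keep them in registers.
func benchmarkSum(_ a: [Int]) -> (sum: Int, seconds: TimeInterval) {
    let start = Date()
    var sum = 0
    var i = 0
    while i < a.count {
        sum += a[i]
        i += 1
    }
    return (sum, Date().timeIntervalSince(start))
}

let a = Array(1...1_000_000)
let result = benchmarkSum(a)
print(result.sum, "\t", result.seconds)
```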

3 Likes
#9

It seems that you're correct: it's a top-level optimization issue, from what I can tell (and confirmed by the compiler code).

I've just run these new benchmarks on my local machine (Swift 5, terminal with optimizations):

do {
    let a = Array(1...1_000_000)
    let start = Date()
    
    var sum = 0
    var i = 0
    while i < a.count {
        sum += a[i]
        i += 1
    }
    
    let end = Date()
    print(sum, "\t", end.timeIntervalSince(start))
}

do {
    let a = Array(1...1_000_000)
    let start = Date()
    
    var sum = 0
    var i = 0
    while i < a.count - 4 {
        sum += a[i] + a[i+1] + a[i+2] + a[i+3]
        i += 4
    }
    
    for j in i..<a.count {
        sum += a[j]
    }
    
    let end = Date()
    print(sum, "\t", end.timeIntervalSince(start))
}

do {
    let a = Array(1...1_000_000)
    let start = Date()
    
    var sum = 0
    var i = 0
    while i < a.count - 6 {
        sum += a[i] + a[i+1] + a[i+2] + a[i+3] + a[i+4] + a[i+5]
        i += 6
    }
    
    for j in i..<a.count {
        sum += a[j]
    }
    
    let end = Date()
    print(sum, "\t", end.timeIntervalSince(start))
}

do {
    let a = Array(1...1_000_000)
    let start = Date()
    
    let sum = a.reduce(0,+)
    
    let end = Date()
    print(sum, "\t", end.timeIntervalSince(start))
}

Pretty much all the loops and reduce clocked the same, within a slight margin of error. Replacing all occurrences of + with &+ in the loops gained up to 10µs for an array of 1,000,000 elements.

Thank you all for your time and replies.

1 Like