Shouldn't the optimizer make this manual loop-unrolling unnecessary?


(Jens Persson) #1

I've been doing a lot of performance testing related to generic value types
and SIMD lately, and I've built Swift from source to get an idea of what's
coming up in the optimizer. Things have improved, and the optimizer is
impressive overall. But I still see no improvement in the case exemplified
below.

Manually unrolling the simple for loop makes it about 4 times faster (and
exactly as fast as using SIMD float4):

struct V4<T> {
    var elements: (T, T, T, T)
    /.../
    subscript(index: Int) -> T { /.../ }
    /.../
    func addedTo(other: V4) -> V4 {
        var r = V4()
        // Manually unrolling makes code ~ 4 times faster:
        // for i in 0 ..< 4 { r[i] = self[i] + other[i] }
        r[0] = self[0] + other[0]
        r[1] = self[1] + other[1]
        r[2] = self[2] + other[2]
        r[3] = self[3] + other[3]
        return r
    }
    /.../
}
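
(To make this self-contained for anyone reproducing it: the parts I elided
with /.../ could be filled in roughly as below. This is just one possible
shape, written against a current toolchain; the AdditiveArithmetic
constraint, the switch-based subscript, and the extra conformances are my
choices, not requirements.)

struct V4<T: AdditiveArithmetic> {
    var elements: (T, T, T, T)

    init() {
        elements = (.zero, .zero, .zero, .zero)
    }

    subscript(index: Int) -> T {
        get {
            switch index {
            case 0: return elements.0
            case 1: return elements.1
            case 2: return elements.2
            case 3: return elements.3
            default: fatalError("index out of range")
            }
        }
        set {
            switch index {
            case 0: elements.0 = newValue
            case 1: elements.1 = newValue
            case 2: elements.2 = newValue
            case 3: elements.3 = newValue
            default: fatalError("index out of range")
            }
        }
    }

    // addedTo(other:) as shown above (the manually unrolled version).
    func addedTo(other: V4) -> V4 {
        var r = V4()
        r[0] = self[0] + other[0]
        r[1] = self[1] + other[1]
        r[2] = self[2] + other[2]
        r[3] = self[3] + other[3]
        return r
    }
}

// Tuples don't conform to Equatable, so spell out == by hand
// (the elementwise tuple == operators do exist).
extension V4: Equatable {
    static func == (lhs: V4, rhs: V4) -> Bool { lhs.elements == rhs.elements }
}

// This conformance lets V4 itself be used as an element type, e.g. V4<V4<Float>>.
extension V4: AdditiveArithmetic {
    static var zero: V4 { V4() }
    static func + (lhs: V4, rhs: V4) -> V4 { lhs.addedTo(other: rhs) }
    static func - (lhs: V4, rhs: V4) -> V4 {
        var r = V4()
        r[0] = lhs[0] - rhs[0]
        r[1] = lhs[1] - rhs[1]
        r[2] = lhs[2] - rhs[2]
        r[3] = lhs[3] - rhs[3]
        return r
    }
}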

Shouldn't the optimizer be able to handle that for loop and make the manual
unrolling unnecessary?

(I compiled the test with -O -whole-module-optimization; I also tried
-Ounchecked, with the same results.)

/Jens


(Jens Persson) #2

Correction: the test I'm running actually uses V4<V4<Float>>. Manually
unrolling the loop makes adding V4<V4<Float>> as fast as adding SIMD
float4x4; with the un-unrolled for loop it is about 4 times slower.
My question still stands: shouldn't the optimizer be able to handle that
for loop and make my manual unrolling unnecessary?
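
To give a sense of the measurement (an illustrative harness, assuming the
sketched V4 and conformances from my first mail; the iteration count and the
keep-alive prints are illustrative, not part of my actual test):

import Foundation
import simd  // Apple's simd module, for float4x4

let iterations = 1_000_000

// A nonzero step so the additions can't be trivially folded away.
var genericStep = V4<V4<Float>>.zero
for i in 0 ..< 4 {
    for j in 0 ..< 4 {
        genericStep[i][j] = Float(i * 4 + j + 1)
    }
}
var generic = V4<V4<Float>>.zero

let matrixStep = float4x4(diagonal: SIMD4<Float>(repeating: 1))
var matrix = float4x4()  // zero matrix

let t0 = Date()
for _ in 0 ..< iterations { generic = generic + genericStep }
let t1 = Date()
for _ in 0 ..< iterations { matrix = matrix + matrixStep }
let t2 = Date()

// Print the results so the optimizer can't discard the loops entirely.
print(generic[0][0], matrix[0][0])
print("generic: \(t1.timeIntervalSince(t0))s  simd: \(t2.timeIntervalSince(t1))s")
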
/Jens


(Mark Lacey) #3

> Shouldn't the optimizer be able to handle that for loop and make the manual unrolling unnecessary?

In theory, yes. In practice there are some fairly complex phase-ordering issues in the SIL optimizer, and certain optimizations (like general loop unrolling) are only done in the LLVM optimizer. The LLVM optimizer runs after all the SIL-level optimizations, which may mean that SIL-level optimization opportunities are exposed by the LLVM optimizer, but by then it is too late to do anything about them.
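
One way to see where things get stuck (a suggestion, and the file name here is a placeholder) is to dump the intermediate representations and check whether the loop survives each stage:

swiftc -O -emit-sil V4Test.swift > V4Test.sil   # SIL, after the SIL optimizer runs
swiftc -O -emit-ir  V4Test.swift > V4Test.ll    # LLVM IR, after the LLVM optimizer runs

If the for loop is still a loop in the .sil output but unrolled in the .ll output, the unrolling happened in LLVM, too late for the SIL-level passes to take advantage of it.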

> (compiled the test with -O -whole-module-optimization, also tried -Ounchecked but with same results.)

Would you mind opening an issue on https://bugs.swift.org with a small stand-alone test case that compiles successfully, and reporting your results there?

Mark


(Jens Persson) #4

Thank you. I've filed:
https://bugs.swift.org/browse/SR-203
