Lazy/computed property performance

I was wondering why I get different time measurements when running the following code:

import CoreFoundation
import Foundation

class Test {
    let w = 5000
    let h = 5000

    var computedProperty: Int { return w }
    lazy var lazyProperty: Int = { w }()

    func execute() {
        let buffer = UnsafeMutableRawBufferPointer.allocate(byteCount: w * h, alignment: MemoryLayout<UInt8>.alignment)
        buffer.initializeMemory(as: UInt8.self, repeating: 0)
        
        let a = computedProperty // replace with lazyProperty
        let startTime = CFAbsoluteTimeGetCurrent()
        
        for y in 0..<h {
            for x in 0..<w {
                let index = y * a + x

                let byte = buffer[index]
                buffer[index] = byte / 2
            }
        }
        
        let total = CFAbsoluteTimeGetCurrent() - startTime
        print("\(#function): " + String(format: "%.5f", total))
        
        buffer.deallocate()
    }
}

let test = Test()
test.execute()

Compiled with:

swiftc -O -whole-module-optimization main.swift
using Xcode 11.1 + Swift 5.1, or the latest Docker image (tag: 5.1).

If

let a = computedProperty

is replaced with:

let a = lazyProperty

the results are worse by an order of magnitude:

Sample averaged results:

Whole Module Optimization:
computedProperty: ~0.0025s
lazyProperty: ~0.025s

Without Whole Module Optimization:
computedProperty: ~0.025s
lazyProperty: ~0.025s

(always compiling with -O optimization level)

My assumption is that computedProperty gets inlined, whereas lazyProperty does not. If we add an annotation:

@inline(never) var computedProperty: Int { return w }

then the result is the same as lazyProperty.

Is there any other reason or am I missing something?

Just a quick note, the following:

    lazy var lazyProperty: Int = w

produces the same result, in both behavior and execution time, as what you wrote:

    lazy var lazyProperty: Int = { w }()

That suggests that the compiler is failing to inline the call to the closure for some reason. There shouldn't be any reason it couldn't. It's worth filing a bug about this.
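
For context, a lazy property is conceptually lowered to optional storage plus a getter that runs the initializer expression on first access. A hand-written sketch of that shape (the names here are hypothetical, not the actual compiler output):

    class TestLowered {
        let w = 5000

        // Roughly the shape `lazy var lazyProperty: Int = { w }()` lowers to:
        private var lazyProperty_storage: Int? = nil
        var lazyProperty: Int {
            if let value = lazyProperty_storage { return value }
            let value = { self.w }() // the initializer closure that apparently isn't inlined
            lazyProperty_storage = value
            return value
        }
    }

If the optimizer sees through this getter and the closure call, a is the constant 5000; if not, a is an opaque value.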

I played around some more with this code example and was surprised to find that the following is as slow as lazyProperty:

var storedProperty: Int = 5000
...
let a = storedProperty

i.e. about 10 times slower than computedProperty.


I do not understand how there can possibly be any difference at all.

The property, whether computed or lazy, is only accessed once, before the timing begins, and its value is stored to a local let.

What is actually going on here?


Also, note that changing the following:

                let index = y * a + x

to

                let index = y * a &+ x

will make all cases fast, no matter how/what a was assigned to.

Can anyone explain why/how?


PS: I have run into exactly this kind of unintuitive performance issue a number of times, so I suspect it's quite a common missed optimization that would prove well worth investigating and fixing.
SR-7150 is an old bug report of mine which might be a related performance instability.


Isn't that because the "+" operator does a bunch of overflow checking, so that the compiler inserts a lot of code, whereas the "&+" dispenses with all of that and becomes pretty close to a machine add instruction?
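
To make the difference concrete, here is a minimal sketch:

    let x = Int.max
    // let y = x + 1    // "+" is overflow-checked and would trap here
    let y = x &+ 1      // "&+" wraps around silently, so no check is emitted
    print(y == Int.min) // true

So with "+", the compiler has to keep a branch to a trap in the loop unless it can prove overflow is impossible.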

I know the difference between + and &+, but I thought it was interesting to note that it is a way to trigger or not trigger the strange optimization miss/glitch demonstrated by the code example.


Another observation: Turning the Test class into a struct (and making execute a mutating func) won't change anything by itself, but adding @inline(__always) to execute will also make let a = lazyProperty as fast as let a = computedProperty. This only works if Test is a struct, not if it's a class or final class.

The code example I use:
import CoreFoundation
import Foundation

struct Test {
    let w = 5000
    let h = 5000

    var computedProperty: Int { return w }
    lazy var lazyProperty: Int = w // { w }()
    var storedProperty: Int = 5000

    @inline(__always)
    mutating func execute() {
        let buffer = UnsafeMutableRawBufferPointer.allocate(byteCount: w * h, alignment: MemoryLayout<UInt8>.alignment)
        buffer.initializeMemory(as: UInt8.self, repeating: 0)

        //let a = computedProperty
        let a = lazyProperty
        //let a = storedProperty

        let startTime = CFAbsoluteTimeGetCurrent()

        for y in 0 ..< h {
            for x in 0 ..< w {
                let index = y * a + x
                let byte = buffer[index]
                buffer[index] = byte / 2
            }
        }

        let total = CFAbsoluteTimeGetCurrent() - startTime
        print("\(#function): " + String(format: "%.5f", total))

        buffer.deallocate()
    }
}

func test() {
    for _ in 0 ..< 10 {
        var test = Test()
        test.execute()
    }
}
test()

And, a third and final "magic trick" that will make all three cases (let a = computedProperty, let a = lazyProperty and let a = storedProperty) equally fast:
With Test as a struct (won't work if it's a class), the @inline(__always) is not needed if we wrap some of the code in an immediately executed closure. : )

Code example for that here:
import CoreFoundation
import Foundation

struct Test {
    let w = 5000
    let h = 5000

    var computedProperty: Int { return w }
    lazy var lazyProperty: Int = w // { w }()
    var storedProperty: Int = 5000

    mutating func execute() {
        let buffer = UnsafeMutableRawBufferPointer.allocate(byteCount: w * h, alignment: MemoryLayout<UInt8>.alignment)
        buffer.initializeMemory(as: UInt8.self, repeating: 0)
        //let a = computedProperty
        let a = lazyProperty
        //let a = storedProperty
        let _ = {
            let startTime = CFAbsoluteTimeGetCurrent()
            for y in 0 ..< h {
                for x in 0 ..< w {
                    let index = y * a + x
                    let byte = buffer[index]
                    buffer[index] = byte / 2
                }
            }
            let total = CFAbsoluteTimeGetCurrent() - startTime
            print("\(#function): " + String(format: "%.5f", total))
        }()
        buffer.deallocate()
    }
}

func test() {
    for _ in 0 ..< 10 {
        var t = Test()
        t.execute()
    }
}
test()

So, there are all sorts of strange and totally unintuitive ways to trigger or not trigger this missed optimization. It's exactly the same frustrating situation I always find myself in when trying to write performance-critical code in Swift. Note that these optimization instabilities are probably hiding everywhere in everybody's code, and it's only when you really have to care about efficiency that you'll notice them. It's been like this for years, and it's probably a big improvement opportunity for Swift performance. cc @Andrew_Trick


It seems that computedProperty and storedProperty are much faster because of LLVM auto-vectorization.

If we compile the let a = computedProperty variant with the -emit-ir flag:

swiftc -emit-ir -O -whole-module-optimization main.swift

for

Apple Swift version 5.1 (swiftlang-1100.0.270.13 clang-1100.0.33.7)
Target: x86_64-apple-darwin18.7.0

we can examine the LLVM IR:


; ModuleID = '-'
source_filename = "-"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.14.0"

%swift.type = type { i64 }
%swift.full_type = type { i8**, %swift.type }
%swift.protocol = type { i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i32, i32, i32, i32, i32, i32 }
%Ts26DefaultStringInterpolationV = type <{ %TSS }>
%TSS = type <{ %Ts11_StringGutsV }>
%Ts11_StringGutsV = type <{ %Ts13_StringObjectV }>
%Ts13_StringObjectV = type <{ %Ts6UInt64V, %swift.bridge* }>
%Ts6UInt64V = type <{ i64 }>
%swift.bridge = type opaque
%swift.opaque = type opaque
%swift.metadata_response = type { %swift.type*, i64 }
%swift.refcounted = type { %swift.type*, i64 }

...

; <label>:61:                                     ; preds = %"$sSw16initializeMemory2as9repeatingSryxGxm_xtlFs5UInt8V_Tg5Tf4dnx_n.exit", %.preheader.preheader
%62 = phi i64 [ 0, %"$sSw16initializeMemory2as9repeatingSryxGxm_xtlFs5UInt8V_Tg5Tf4dnx_n.exit" ], [ %63, %.preheader.preheader ]
%63 = add nuw nsw i64 %62, 1
%64 = call { i64, i1 } @llvm.smul.with.overflow.i64(i64 %62, i64 5000)
%65 = extractvalue { i64, i1 } %64, 0
%66 = extractvalue { i64, i1 } %64, 1
br i1 %66, label %128, label %vector.body

vector.body: ; preds = %61, %vector.body
%index = phi i64 [ %index.next.2, %vector.body ], [ 0, %61 ]
%67 = add nuw nsw i64 %65, %index
%68 = getelementptr inbounds i8, i8* %11, i64 %67
%69 = bitcast i8* %68 to <16 x i8>*
%wide.load = load <16 x i8>, <16 x i8>* %69, align 1
%70 = getelementptr inbounds i8, i8* %68, i64 16
%71 = bitcast i8* %70 to <16 x i8>*
%wide.load28 = load <16 x i8>, <16 x i8>* %71, align 1
%72 = lshr <16 x i8> %wide.load, <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
%73 = lshr <16 x i8> %wide.load28, <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
%74 = bitcast i8* %68 to <16 x i8>*
store <16 x i8> %72, <16 x i8>* %74, align 1
%75 = bitcast i8* %70 to <16 x i8>*
store <16 x i8> %73, <16 x i8>* %75, align 1
%index.next = add nuw nsw i64 %index, 32
%76 = add nuw nsw i64 %65, %index.next
%77 = getelementptr inbounds i8, i8* %11, i64 %76
%78 = bitcast i8* %77 to <16 x i8>*
%wide.load.1 = load <16 x i8>, <16 x i8>* %78, align 1
%79 = getelementptr inbounds i8, i8* %77, i64 16
%80 = bitcast i8* %79 to <16 x i8>*
%wide.load28.1 = load <16 x i8>, <16 x i8>* %80, align 1
%81 = lshr <16 x i8> %wide.load.1, <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
%82 = lshr <16 x i8> %wide.load28.1, <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
%83 = bitcast i8* %77 to <16 x i8>*
store <16 x i8> %81, <16 x i8>* %83, align 1
%84 = bitcast i8* %79 to <16 x i8>*
store <16 x i8> %82, <16 x i8>* %84, align 1
%index.next.1 = add nuw nsw i64 %index, 64
%85 = add nuw nsw i64 %65, %index.next.1
%86 = getelementptr inbounds i8, i8* %11, i64 %85
%87 = bitcast i8* %86 to <16 x i8>*
%wide.load.2 = load <16 x i8>, <16 x i8>* %87, align 1
%88 = getelementptr inbounds i8, i8* %86, i64 16
%89 = bitcast i8* %88 to <16 x i8>*
%wide.load28.2 = load <16 x i8>, <16 x i8>* %89, align 1
%90 = lshr <16 x i8> %wide.load.2, <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
%91 = lshr <16 x i8> %wide.load28.2, <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
%92 = bitcast i8* %86 to <16 x i8>*
store <16 x i8> %90, <16 x i8>* %92, align 1
%93 = bitcast i8* %88 to <16 x i8>*
store <16 x i8> %91, <16 x i8>* %93, align 1
%index.next.2 = add nuw nsw i64 %index, 96
%94 = icmp eq i64 %index.next.2, 4992
br i1 %94, label %.preheader.preheader, label %vector.body, !llvm.loop !36

.preheader.preheader: ; preds = %vector.body
%95 = add nuw nsw i64 %65, 4992
%96 = getelementptr inbounds i8, i8* %11, i64 %95
%97 = load i8, i8* %96, align 1
%98 = lshr i8 %97, 1
store i8 %98, i8* %96, align 1
%99 = add nuw nsw i64 %65, 4993
%100 = getelementptr inbounds i8, i8* %11, i64 %99
%101 = load i8, i8* %100, align 1
%102 = lshr i8 %101, 1
store i8 %102, i8* %100, align 1
%103 = add nuw nsw i64 %65, 4994
%104 = getelementptr inbounds i8, i8* %11, i64 %103
%105 = load i8, i8* %104, align 1
%106 = lshr i8 %105, 1
store i8 %106, i8* %104, align 1
%107 = add nuw nsw i64 %65, 4995
%108 = getelementptr inbounds i8, i8* %11, i64 %107
%109 = load i8, i8* %108, align 1
%110 = lshr i8 %109, 1
store i8 %110, i8* %108, align 1
%111 = add nuw nsw i64 %65, 4996
%112 = getelementptr inbounds i8, i8* %11, i64 %111
%113 = load i8, i8* %112, align 1
%114 = lshr i8 %113, 1
store i8 %114, i8* %112, align 1
%115 = add nuw nsw i64 %65, 4997
%116 = getelementptr inbounds i8, i8* %11, i64 %115
%117 = load i8, i8* %116, align 1
%118 = lshr i8 %117, 1
store i8 %118, i8* %116, align 1
%119 = add nuw nsw i64 %65, 4998
%120 = getelementptr inbounds i8, i8* %11, i64 %119
%121 = load i8, i8* %120, align 1
%122 = lshr i8 %121, 1
store i8 %122, i8* %120, align 1
%123 = add nuw nsw i64 %65, 4999
%124 = getelementptr inbounds i8, i8* %11, i64 %123
%125 = load i8, i8* %124, align 1
%126 = lshr i8 %125, 1
store i8 %126, i8* %124, align 1
%127 = icmp eq i64 %63, 5000
br i1 %127, label %15, label %61

; <label>:128:                                    ; preds = %61
call void asm sideeffect "", "n"(i32 5) #4
call void @llvm.trap()
unreachable
}

...

!llvm.module.flags = !{!0, !1, !2, !3, !4, !5, !6, !7, !8}
!swift.module.flags = !{!9}
!llvm.linker.options = !{!10, !11, !12, !13, !14, !15, !16, !17, !18, !19, !20, !21, !22, !23, !24, !25, !26, !27, !28, !29, !30, !31, !32, !33, !34}
!llvm.asan.globals = !{!35}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 10, i32 15]}
!1 = !{i32 1, !"Objective-C Version", i32 2}
!2 = !{i32 1, !"Objective-C Image Info Version", i32 0}
!3 = !{i32 1, !"Objective-C Image Info Section", !"__DATA,__objc_imageinfo,regular,no_dead_strip"}
!4 = !{i32 4, !"Objective-C Garbage Collection", i32 83953408}
!5 = !{i32 1, !"Objective-C Class Properties", i32 64}
!6 = !{i32 1, !"wchar_size", i32 4}
!7 = !{i32 7, !"PIC Level", i32 2}
!8 = !{i32 1, !"Swift Version", i32 7}
!9 = !{!"standard-library", i1 false}
!10 = !{!"-lswiftFoundation"}
!11 = !{!"-lswiftCore"}
!12 = !{!"-lswiftObjectiveC"}
!13 = !{!"-lswiftDarwin"}
!14 = !{!"-framework", !"Foundation"}
!15 = !{!"-lswiftCoreFoundation"}
!16 = !{!"-framework", !"CoreFoundation"}
!17 = !{!"-lswiftDispatch"}
!18 = !{!"-framework", !"Combine"}
!19 = !{!"-framework", !"ApplicationServices"}
!20 = !{!"-lswiftCoreGraphics"}
!21 = !{!"-framework", !"CoreGraphics"}
!22 = !{!"-lswiftIOKit"}
!23 = !{!"-framework", !"IOKit"}
!24 = !{!"-framework", !"ColorSync"}
!25 = !{!"-framework", !"ImageIO"}
!26 = !{!"-framework", !"CoreServices"}
!27 = !{!"-framework", !"Security"}
!28 = !{!"-framework", !"CFNetwork"}
!29 = !{!"-framework", !"DiskArbitration"}
!30 = !{!"-framework", !"CoreText"}
!31 = !{!"-lswiftXPC"}
!32 = !{!"-lobjc"}
!33 = !{!"-lswiftCompatibility50"}
!34 = !{!"-lswiftCompatibilityDynamicReplacements"}
!35 = distinct !{null, null, null, i1 false, i1 true}
!36 = distinct !{!36, !37}
!37 = !{!"llvm.loop.isvectorized", i32 1}

Additionally, this is the disassembled for loop in pseudo-code:

loc_10000161d:
    rbx = rbx + 0x1;
    rdx = 0x50;
    do {
            xmm1 = intrinsic_movdqu(xmm1, *(int128_t *)(rax + (rdx - 0x50)));
            xmm2 = intrinsic_movdqu(xmm2, *(int128_t *)(rax + (rdx - 0x40)));
            xmm3 = intrinsic_movdqu(xmm3, *(int128_t *)(rax + (rdx - 0x30)));
            xmm4 = intrinsic_movdqu(xmm4, *(int128_t *)(rax + (rdx - 0x20)));
            xmm1 = intrinsic_psrlw(xmm1, 0x1);
            xmm1 = intrinsic_pand(xmm1, xmm5);
            xmm2 = intrinsic_psrlw(xmm2, 0x1);
            xmm2 = intrinsic_pand(xmm2, xmm5);
            *(int128_t *)(rax + (rdx - 0x50)) = intrinsic_movdqu(*(int128_t *)(rax + (rdx - 0x50)), xmm1);
            *(int128_t *)(rax + (rdx - 0x40)) = intrinsic_movdqu(*(int128_t *)(rax + (rdx - 0x40)), xmm2);
            xmm3 = intrinsic_psrlw(xmm3, 0x1);
            xmm3 = intrinsic_pand(xmm3, xmm5);
            xmm4 = intrinsic_psrlw(xmm4, 0x1);
            xmm4 = intrinsic_pand(xmm4, xmm5);
            *(int128_t *)(rax + (rdx - 0x30)) = intrinsic_movdqu(*(int128_t *)(rax + (rdx - 0x30)), xmm3);
            *(int128_t *)(rax + (rdx - 0x20)) = intrinsic_movdqu(*(int128_t *)(rax + (rdx - 0x20)), xmm4);
            xmm1 = intrinsic_movdqu(xmm1, *(int128_t *)(rax + (rdx - 0x10)));
            xmm2 = intrinsic_movdqu(xmm2, *(int128_t *)(rax + rdx));
            xmm1 = intrinsic_psrlw(xmm1, 0x1);
            xmm1 = intrinsic_pand(xmm1, xmm5);
            xmm2 = intrinsic_psrlw(xmm2, 0x1);
            xmm2 = intrinsic_pand(xmm2, xmm5);
            *(int128_t *)(rax + (rdx - 0x10)) = intrinsic_movdqu(*(int128_t *)(rax + (rdx - 0x10)), xmm1);
            *(int128_t *)(rax + rdx) = intrinsic_movdqu(*(int128_t *)(rax + rdx), xmm2);
            rdx = rdx + 0x60;
    } while (rdx != 0x13d0);
    *(int8_t *)(r15 + rcx + 0x1380) = *(int8_t *)(r15 + rcx + 0x1380) >> 0x1;
    *(int8_t *)(r15 + rcx + 0x1381) = *(int8_t *)(r15 + rcx + 0x1381) >> 0x1;
    *(int8_t *)(r15 + rcx + 0x1382) = *(int8_t *)(r15 + rcx + 0x1382) >> 0x1;
    *(int8_t *)(r15 + rcx + 0x1383) = *(int8_t *)(r15 + rcx + 0x1383) >> 0x1;
    *(int8_t *)(r15 + rcx + 0x1384) = *(int8_t *)(r15 + rcx + 0x1384) >> 0x1;
    *(int8_t *)(r15 + rcx + 0x1385) = *(int8_t *)(r15 + rcx + 0x1385) >> 0x1;
    *(int8_t *)(r15 + rcx + 0x1386) = *(int8_t *)(r15 + rcx + 0x1386) >> 0x1;
    *(int8_t *)(r15 + rcx + 0x1387) = *(int8_t *)(r15 + rcx + 0x1387) >> 0x1;
    rax = rax + 0x1388;
    if (rbx != 0x1388) goto loc_100001610;

(notice the use of 128-bit XMM registers)

I will file a bug soon (as a missed optimization opportunity), but I still suspect that computedProperty and storedProperty being read-only might be crucial to triggering auto-vectorization.

I still don’t understand how this is possible. The computed (or stored, or lazy) property is accessed exactly once, before the loop begins.

Its value is an integer, which is stored to a local let declaration.

All subsequent uses of that value are through the local let. The original property is never accessed in the loop.

What am I missing?

The bug is this:
The code after the local let is either optimized or not, depending on some seemingly unrelated code before the local let.


I agree this is totally unintuitive and seems crazy, and that's why it is a bug, or at least a missed optimization, and it certainly makes reasoning about performance in Swift hard, if not impossible.

But it is happening, as can easily be verified by trying out the code examples in this thread.

It would be fantastic if someone dug into this and fixed it, because to repeat myself, I'm pretty sure this makes lots and lots of code unnecessarily inefficient (without anyone noticing) and it also makes it impossible to reason about performance in Swift.


If the performance difference comes down to vectorization, it seems likely that, in the fast case, the optimizer is recognizing that overflow is impossible based on the constant values of w and h and eliminating the overflow checks. Swift and LLVM currently don't work well together to vectorize in the face of overflow checks. Something about the formulations involving closures seems like it's impeding inlining, which would prevent the compiler from seeing the value of w and making that assumption. That would be consistent with the patterns described in this thread, since the fast case happens either when computedProperty/lazyProperty can be inlined or when the overflow checks in the loop are manually skipped.

One of the big issues with autovectorization is that it's an incredibly brittle optimization, even in C and C++. With small examples like this that aren't specifically engineered to be benchmarks, it's difficult to get consistent benchmark numbers from any language with a modern optimizing compiler, because small extra bits of static information can have a large downstream impact.
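
If that's what's happening here, one workaround (a sketch, assuming the caller guarantees the indices stay in bounds) is to pay the overflow check once per row and use wrapping addition in the hot loop, so no trap branch sits between LLVM and the vectorizer:

    // Hypothetical helper, reusing the names from the examples above.
    func halveAll(_ buffer: UnsafeMutableRawBufferPointer, w: Int, h: Int, a: Int) {
        for y in 0 ..< h {
            let rowBase = y * a            // one checked multiply per row
            for x in 0 ..< w {
                let index = rowBase &+ x   // wrapping add: no overflow branch per element
                buffer[index] = buffer[index] / 2
            }
        }
    }

This would vectorize regardless of where a came from, which matches the earlier observation that &+ makes all cases fast.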
