Expensive calls to "outlined init with take"

Karl · September 18, 2021, 6:58pm

While profiling my code, I've found a lot of calls to "outlined init with take" functions that are surprisingly expensive.

These calls really add up - for instance, the function shown writes a normalized URL string, including percent-encoding inputs, parsing and rewriting a path, etc, and instruments shows this one line amounts to around 8% of the time spent in the function (recorded in high frequency mode):

And that's just one function.

Now, info.schemeKind is a stored property of type Optional<WebURL.SchemeKind>, and WebURL.SchemeKind is an enum with 7 cases. I would expect Optional<WebURL.SchemeKind> to occupy exactly 1 byte, and printing MemoryLayout<Optional<WebURL.SchemeKind>>.size confirms that. Copying a single byte which is a stored property shouldn't require runtime calls.

I've seen this with other small enums, as well. I have a Sigil enum with only 2 cases, and I'm seeing the same expensive "init with take" functions all around my code.

I can't understand what's causing this - does anybody have any information? How can I avoid them?

David_Smith · September 18, 2021, 7:17pm

If memory serves, what's happening here is that the amount of generated code to initialize a WebURL.SchemeKind exceeded a threshold, so it was extracted out into a non-inline init function to reduce code size. If, as I suspect, this automatic outlining (reverse-inlining) is using the same mechanism as some of the other places, it may be running into the issue described in Reducing value witness size in struct composition - #3 by gmittertreiner where the codegen for the outlined function is suboptimal.

David_Smith · September 18, 2021, 7:21pm

Hm. @scanon is it expected that we'd see the Nehalem and Ivybridge variants of memmove in the same process?

Karl · September 18, 2021, 8:29pm

Aha!

That link was helpful, thanks. After prodding around some of the PRs, I realised that it doesn't actually depend on the size or complexity of the enum (at least, not necessarily). I'm actually able to force an "outlined init with take" with the following code (nightly, -O):

func test(_ input: HasAnEnum) -> Int {
    input.readValue()
}

enum MyEnum {
    case four
}

struct HasAnEnum {
    var something: String
    var somethingTwo: String
    var value: MyEnum?

    func readValue() -> Int {
        let x = value!
        if case .four = x { return 4 }
        return -1
    }
}

(Yes, the enum has one case)

This generates the following assembly (Godbolt):

output.test(output.HasAnEnum) -> Swift.Int:
        push    rbx
        sub     rsp, 16
        add     rdi, 32
        lea     rbx, [rsp + 8]
        mov     rsi, rbx
        call    (outlined init with take of output.MyEnum?)
        mov     rsi, rsp
        mov     rdi, rbx
        call    (outlined init with take of output.MyEnum?)
        cmp     byte ptr [rsp], 0
        jne     .LBB1_2
        mov     eax, 4
        add     rsp, 16
        pop     rbx
        ret
.LBB1_2:
        ud2

However, if you comment out the somethingTwo property from HasAnEnum, suddenly it becomes:

output.test(output.HasAnEnum) -> Swift.Int:
        test    dl, 1
        jne     .LBB1_2
        mov     eax, 4
        ret
.LBB1_2:
        ud2

So... it depends on the type which holds the enum, rather than the enum itself. That's very strange.

It's really costing a lot in terms of performance, so it would be great to have some way to opt out of this. Perhaps forcing some simpler layout on HasAnEnum so it becomes easier to access the property.

David_Smith · September 18, 2021, 9:30pm

I'd hazard a guess that if you added a third String rather than an enum you might also get this. Each String there is going to require emiting a two word copy and a retain call.

(edit) you may have found an interesting missed optimization there though: in the inline case it manages to just delete the entire thing, and I think the outlining is preventing that from working

Karl · September 19, 2021, 9:05am

It doesn't. It seems to be related to the optional (the force-unwrap also doesn't matter).

I changed the struct to hold 10 strings (and added a second case to the enum, so it doesn't get constant-folded). If value is a MyEnum, everything is fine; if it's a MyEnum?, we have to slog through runtime functions (Godbolt):

enum MyEnum {
    case one
    case four
}

struct HasAnEnum {
    var s1: String
    var s2: String
    var s3: String
    var s4: String
    var s5: String
    var s6: String
    var s7: String
    var s8: String
    var s9: String
    var s10: String
    var value: MyEnum // try making this an optional

    func readValue() -> Int {
        let x = value
        if case .four = x { return 4 }
        return -1
    }
}

Non-optional:

output.test(output.HasAnEnum) -> Swift.Int:
        movzx   eax, byte ptr [rdi + 160]
        lea     rax, [rax + 4*rax]
        add     rax, -1
        ret

Optional:

output.test(output.HasAnEnum) -> Swift.Int:
        push    rbx
        sub     rsp, 16
        add     rdi, 160
        lea     rbx, [rsp + 8]
        mov     rsi, rbx
        call    (outlined init with take of output.MyEnum?)
        mov     rsi, rsp
        mov     rdi, rbx
        call    (outlined init with take of output.MyEnum?)
        movzx   eax, byte ptr [rsp]
        mov     ecx, eax
        and     ecx, 1
        cmp     rax, 2
        lea     rcx, [rcx + 4*rcx - 1]
        mov     rax, -1
        cmovne  rax, rcx
        add     rsp, 16
        pop     rbx
        ret

I'm quite sure that this a bug now, so I've filed SR-15215 about it.

Karl · September 19, 2021, 11:25am

I've found a workaround. It's super ugly...

@inlinable @inline(__always)
internal var schemeKind: Optional<WebURL.SchemeKind> {
  withUnsafeBytes(of: self) {
    $0.load(fromByteOffset: MemoryLayout.offset(of: \Self._schemeKind)!, as: Optional<WebURL.SchemeKind>.self)
  }
}

Basically I renamed the stored properties and wrote my own accessors for them. I did this for like 3 or 4 stored properties, all optionals. I really, really wish I didn't have to do this, but now my benchmarks are showing the best results I have ever seen from this project, with common cases improving by 17-24%(!!!).

Summary

benchmark                                              column     results/path_9 results/path_10       %
--------------------------------------------------------------------------------------------------------
Constructor.SpecialNonFile.AverageURLs                 std                 26.08           63.28 -142.65
Constructor.SpecialNonFile.AverageURLs                 warmup         3829983.00      3466338.00    9.49
Constructor.SpecialNonFile.AverageURLs                 iterations       37796.00        44360.00  -17.37
Constructor.SpecialNonFile.AverageURLs                 time             33922.00        28055.00   17.30
Constructor.SpecialNonFile.AverageURLs filtered        std                 35.28           33.86    4.03
Constructor.SpecialNonFile.AverageURLs filtered        warmup         4742413.00      5451983.00  -14.96
Constructor.SpecialNonFile.AverageURLs filtered        iterations       28011.00        28872.00   -3.07
Constructor.SpecialNonFile.AverageURLs filtered        time             45356.00        43094.50    4.99
Constructor.SpecialNonFile.IPv4 host                   std                 35.11           34.60    1.45
Constructor.SpecialNonFile.IPv4 host                   warmup         3577949.00      2827990.00   20.96
Constructor.SpecialNonFile.IPv4 host                   iterations       36648.00        42892.00  -17.04
Constructor.SpecialNonFile.IPv4 host                   time             35112.00        27630.00   21.31
Constructor.SpecialNonFile.IPv4 host filtered          std                 50.86           36.97   27.31
Constructor.SpecialNonFile.IPv4 host filtered          warmup         4221503.00      3779503.00   10.47
Constructor.SpecialNonFile.IPv4 host filtered          iterations       32093.00        33531.00   -4.48
Constructor.SpecialNonFile.IPv4 host filtered          time             39854.00        36601.00    8.16
Constructor.SpecialNonFile.IPv6 host                   std                 32.64           37.21  -14.00
Constructor.SpecialNonFile.IPv6 host                   warmup         4345729.00      3213953.00   26.04
Constructor.SpecialNonFile.IPv6 host                   iterations       32983.00        44068.00  -33.61
Constructor.SpecialNonFile.IPv6 host                   time             37398.00        28309.50   24.30
Constructor.SpecialNonFile.IPv6 host filtered          std                 24.86           38.78  -56.02
Constructor.SpecialNonFile.IPv6 host filtered          warmup         5475037.00      3436541.00   37.23
Constructor.SpecialNonFile.IPv6 host filtered          iterations       30296.00        37909.00  -25.13
Constructor.SpecialNonFile.IPv6 host filtered          time             41760.00        34307.00   17.85
Constructor.SpecialNonFile.Percent-encoding components std                 37.57           32.74   12.86
Constructor.SpecialNonFile.Percent-encoding components warmup         1278889.00      1141999.00   10.70
Constructor.SpecialNonFile.Percent-encoding components iterations      104477.00       119079.00  -13.98
Constructor.SpecialNonFile.Percent-encoding components time             12318.00        11108.00    9.82
Constructor.SpecialNonFile.Percent-encoded hostnames   std                104.19           37.64   63.87
Constructor.SpecialNonFile.Percent-encoded hostnames   warmup         1214436.00      1039024.00   14.44
Constructor.SpecialNonFile.Percent-encoded hostnames   iterations      109287.00       131662.00  -20.47
Constructor.SpecialNonFile.Percent-encoded hostnames   time             11629.00         9915.00   14.74
Constructor.SpecialNonFile.Long paths                  std                 24.11           25.35   -5.14
Constructor.SpecialNonFile.Long paths                  warmup        10246773.00      9702339.00    5.31
Constructor.SpecialNonFile.Long paths                  iterations       12921.00        13470.00   -4.25
Constructor.SpecialNonFile.Long paths                  time             94723.00        93464.50    1.33
Constructor.SpecialNonFile.Complex paths 1             std                 47.32           28.06   40.70
Constructor.SpecialNonFile.Complex paths 1             warmup         2991180.00      2786803.00    6.83
Constructor.SpecialNonFile.Complex paths 1             iterations       47141.00        48935.00   -3.81
Constructor.SpecialNonFile.Complex paths 1             time             26821.00        26165.00    2.45
Constructor.SpecialNonFile.Complex paths 2             std                 35.28           26.18   25.79
Constructor.SpecialNonFile.Complex paths 2             warmup         3751616.00      3253327.00   13.28
Constructor.SpecialNonFile.Complex paths 2             iterations       37345.00        41286.00  -10.55
Constructor.SpecialNonFile.Complex paths 2             time             32966.00        32142.00    2.50
Constructor.SpecialNonFile.Long query 1                std                194.64           70.05   64.01
Constructor.SpecialNonFile.Long query 1                warmup          362005.00       328641.00    9.22
Constructor.SpecialNonFile.Long query 1                iterations      407376.00       476298.00  -16.92
Constructor.SpecialNonFile.Long query 1                time              3089.00         2738.00   11.36
Constructor.SpecialNonFile.Long query 2                std                 34.20           30.88    9.71
Constructor.SpecialNonFile.Long query 2                warmup         2504260.00      2733403.00   -9.15
Constructor.SpecialNonFile.Long query 2                iterations       52335.00        54263.00   -3.68
Constructor.SpecialNonFile.Long query 2                time             24316.00        23348.00    3.98
--------------------------------------------------------------------------------------------------------
                                                       std                                         14.02
                                                       warmup                                      11.72
                                                       iterations                                 -13.05
                                                       time                                        11.09

Wow. I'm ecstatic. I can't begin to describe the countless hours I've spent, trying to figure out which algorithmic changes I could make, so I could bring these benchmarks anywhere close to this level. In particular the first one ("AverageURLs") - it's the most important, the most representative of real URLs on the web, and I'd set myself the goal of getting it under 30ms. But no matter how much I tried, I just couldn't get it reliably under that level.

With one change, it has steam-rolled past that. As you can imagine, I've ran them all again multiple times to confirm. The best I've seen is 27.6, and the worst is 28.8. It's an arbitrary benchmark and there's random variance and all that, but still - it's huge. Even now I can't really believe I've seen those numbers.

We need this fixed, badly. I hate to think how many other developers are out there, wondering why their code isn't meeting their performance targets, spending who-knows-how-much time trying to think of clever tricks or novel algorithms, when there are things like this lurking in the compiler.

scanon · September 19, 2021, 4:04pm

If memory serves right, those two implementations share some of their core copy loops, with separate entry points and thresholds. So that's just perf tools misattributing samples based on the heuristic of "scan backwards until you find an exported symbol".