Performance problems with Swift 4.2

Hi,

I recently found a performance regression while checking one of my projects compiled with Swift 4.2. The same code runs more than 2 times slower when compiled with Swift 4.2 than when compiled with Swift 4.1.3.

Here's sample code from my library:

// main.swift
import Foundation

class ByteReader { // Making class final doesn't really affect performance

    let data: Data

    var offset: Int

    init(data: Data) {
        self.data = data
        self.offset = data.startIndex
    }

    func byte() -> UInt8 {
        precondition(self.offset < self.data.endIndex) // Removing this line changes the difference from 2.15x to 1.3x
        defer { self.offset += 1 }
        return self.data[self.offset]
    }

}

let byteReader = ByteReader(data: Data(count: 10_485_760)) // 10 MB

for _ in 0..<10_000_000 {
    _ = byteReader.byte()
}

Here are the results that I am seeing:

$ swiftc -O main.swift # Compile with Swift 4.2 (with optimizations)
$ time ./main

real	0m0.948s
user	0m0.920s
sys	0m0.022s
$ xcrun -toolchain org.swift.41320180727a swiftc -O main.swift # Compile with Swift 4.1.3 (with optimizations)
$ time ./main

real	0m0.450s
user	0m0.419s
sys	0m0.018s

I've also done a couple of experiments and determined the following:

  1. Removing the line precondition(self.offset < self.data.endIndex) improves performance (the difference drops from 2.15x to 1.3x), which leads me to conclude that, for some reason, either access to class properties or preconditions became much slower in Swift 4.2.
  2. Making the class final doesn't change anything in terms of performance.
  3. With the latest development snapshot of Swift (November 1), the situation is somewhat better, but the slowdown is still very noticeable.

I am also attaching a couple of screenshots from Instruments. (There, "main-swift-4-1-3" and "main-swift-4-2" are copies of the "main" executable compiled with the Swift 4.1.3 compiler and the Swift 4.2 compiler, respectively.)

So my question is:

Is this a bug or am I doing something wrong?


Related?

In the post that you've linked it seems they are discussing performance issues with Data in general (and I agree: there are problems with Data's performance).

Here, I am mostly concerned about the gigantic difference between Swift 4.2 and Swift 4.1.3 compilers.

We've noticed a number of issues where top-level code doesn't optimize as well as code in function bodies because of limitations around optimizations for global variables. Accessing globals and class properties also has some added overhead due to exclusive access enforcement. One thing to try when timing code is to put your timing code inside a function, where it will optimize better, and better reflect what the code will do in production:

func test() {
  let byteReader = ByteReader(data: Data(count: 10_485_760)) // 10 MB

  for _ in 0..<10_000_000 {
    _ = byteReader.byte()
  }
}

test()

Does that help at all?
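
If you also want to measure inside the process rather than with `time`, here is a minimal sketch of the same idea. The helper names measureSeconds and runBenchmark are made up for illustration, and the checksum is only there so the loop's result is actually used; treat it as a starting point, not a rigorous benchmark.

```swift
import Foundation
import Dispatch

class ByteReader {
    let data: Data
    var offset: Int

    init(data: Data) {
        self.data = data
        self.offset = data.startIndex
    }

    func byte() -> UInt8 {
        precondition(self.offset < self.data.endIndex)
        defer { self.offset += 1 }
        return self.data[self.offset]
    }
}

// Hypothetical helper: runs `body` once and returns the elapsed time in seconds.
func measureSeconds(_ body: () -> Void) -> Double {
    let start = DispatchTime.now().uptimeNanoseconds
    body()
    let end = DispatchTime.now().uptimeNanoseconds
    return Double(end - start) / 1_000_000_000
}

// Hypothetical helper: all benchmark state lives in locals, not in globals.
func runBenchmark(iterations: Int) -> (seconds: Double, checksum: Int) {
    let reader = ByteReader(data: Data(count: iterations))
    var checksum = 0
    let seconds = measureSeconds {
        for _ in 0..<iterations {
            // Accumulate the bytes so the reads contribute to a visible result.
            checksum &+= Int(reader.byte())
        }
    }
    return (seconds, checksum)
}

let result = runBenchmark(iterations: 1_000_000)
print("elapsed: \(result.seconds) s, checksum: \(result.checksum)")
```

The only top-level code left is the single call and the print, so the hot loop is compiled as an ordinary function body.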

You can also reduce the number of exclusivity checks by using a closure to group all of the updates to a class property into one access. For instance, this implementation of byte() may be more efficient:

    func byte() -> UInt8 {
      // Return the closure's result; offset is accessed once, as an inout argument.
      return { (offset: inout Int) -> UInt8 in
        precondition(offset < self.data.endIndex)
        defer { offset += 1 }
        return self.data[offset]
      }(&self.offset)
    }

as may turning ByteReader into a struct, if possible.
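
For reference, a struct version could look roughly like the sketch below. The name ByteReaderValue and the readAll function are placeholders I made up; whether this is actually faster would need to be measured on your toolchain.

```swift
import Foundation

// Hypothetical value-type counterpart of ByteReader.
struct ByteReaderValue {
    let data: Data
    var offset: Int

    init(data: Data) {
        self.data = data
        self.offset = data.startIndex
    }

    // `mutating` because reading a byte advances the offset stored in the struct.
    mutating func byte() -> UInt8 {
        precondition(offset < data.endIndex)
        defer { offset += 1 }
        return data[offset]
    }
}

// Hypothetical usage: the reader is a local `var`, so it can be mutated in place.
func readAll() -> [UInt8] {
    var reader = ByteReaderValue(data: Data([1, 2, 3]))
    var bytes: [UInt8] = []
    while reader.offset < reader.data.endIndex {
        bytes.append(reader.byte())
    }
    return bytes
}

let bytes = readAll()
print(bytes)
```

Keep in mind that a copy of the struct gets its own offset, so this only fits code that doesn't rely on several owners sharing one reader.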

Both of these options (removing top level code or reimplementing byte() with a closure) work in my example. Unfortunately, it seems like as soon as I go back to my initial setup with ByteReader in its own library both of these workarounds stop working.

Turning ByteReader into a struct is not an option for me because it is very important for me that ByteReader is a reference type.

I would hope that, as time goes on, this kind of optimization can be picked up more reliably by the compiler. I would not expect most Swift programmers to understand that doing this would lead to more performant code.

My guess is that what you're seeing there is the optimizer losing visibility to the body of the method, and thus losing visibility for doing inlining and others. There are some annotations that you can use to bring that back up, but you should read up on them.
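
As a rough illustration, and assuming the annotations meant here are @inlinable and @usableFromInline (my assumption; the post doesn't name them), a library version of the class might be marked up like this. Note that @inlinable makes the method body part of the module's public interface, which is one reason to read up on it first.

```swift
import Foundation

public class ByteReader {
    // @usableFromInline lets these internal properties be referenced
    // from the body of an @inlinable method.
    @usableFromInline let data: Data
    @usableFromInline var offset: Int

    public init(data: Data) {
        self.data = data
        self.offset = data.startIndex
    }

    // @inlinable exposes the body to client modules so the optimizer can inline it.
    @inlinable
    public func byte() -> UInt8 {
        precondition(self.offset < self.data.endIndex)
        defer { self.offset += 1 }
        return self.data[self.offset]
    }
}

// Stand-in for code that would normally live in a client module.
let reader = ByteReader(data: Data([10, 20]))
let first = reader.byte()
let second = reader.byte()
print(first, second)
```

This only shows where the attributes go; it doesn't guarantee that inlining happens or that it helps in a particular case.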

The optimizer continues to improve its understanding of exclusivity checks, and further improvements should be coming in Swift 5, enough to make us feel comfortable enabling them more aggressively—see Enabling run-time exclusivity checking in release mode for details.

I've actually considered using these attributes separately from this problem. My experiments have shown minor or no impact on performance.

I've repeated my experiments with @inlinable for this thread and @inlinable still changes nothing in terms of performance.

Do I understand correctly that there is not much else I can do apart from these two workarounds (which don't work in the general case) and waiting for Swift 5, which should bring some improvements to the optimizer?

The only way to know if your issue will be fixed is to file a bug. It's not clear what limitation you're hitting until someone does the performance analysis.

The exclusivity enforcement issue is important to keep in mind for Swift 5, but it's not what you're hitting today. You can verify by turning off runtime checks when benchmarking:
-O -enforce-exclusivity=unchecked. But please don't do this during testing, and hopefully you won't have to do it in production!

Hmm, -O -enforce-exclusivity=unchecked doesn't seem to change anything.

Anyway, I filed a bug: SR-9185

I'm guessing this is due to differences in how @inline(__always) works with the @inlinable changes. You can see here that the endIndex getter is marked @inline(__always) (but not @inlinable), yet the disassembly shows it (and the subscript getter) still being called:

Swift 4.2 Assembly Screenshot

(Demangler broke, _$S10Foundation4DataV8endIndexSivg is the endIndex getter, _$S10Foundation4DataVys5UInt8VSicig is the subscript getter)

On the other hand, the Swift 4.1 version properly inlines it, and you can see calls to _validateIndex in place of the subscript getter.

Swift 4.1 Assembly Screenshot

Interestingly, Swift 4.2 on Linux doesn't have this issue, which you can see on line 48 here or by comparing times on Linux.

After some additional experiments, I've managed to come up with a version of byte() that performs better with Swift 4.2 and keeps that improvement when moved into a separate library:

public func byte() -> UInt8 {
    return { (data: Data, offset: inout Int) -> UInt8 in
        precondition(offset < data.endIndex)
        defer { offset += 1 }
        return data[offset]
    } (self.data, &self.offset)
}

It uses the closure idea suggested by @Joe_Groff, but the closure also takes data as an argument. This leads me to conclude that the mutating access to the offset property is not the main problem here; accessing the data property is.
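
In case anyone wants to check that the workaround behaves the same as the original, here is a small sketch that runs both variants side by side. The method names byteDirect and byteViaClosure and the function variantsAgree are made up for this comparison; it checks behavior only, not speed.

```swift
import Foundation

class ByteReader {
    let data: Data
    var offset: Int

    init(data: Data) {
        self.data = data
        self.offset = data.startIndex
    }

    // Original version from the first post (renamed for the comparison).
    func byteDirect() -> UInt8 {
        precondition(self.offset < self.data.endIndex)
        defer { self.offset += 1 }
        return self.data[self.offset]
    }

    // Workaround version: both data and offset are passed into the closure.
    func byteViaClosure() -> UInt8 {
        return { (data: Data, offset: inout Int) -> UInt8 in
            precondition(offset < data.endIndex)
            defer { offset += 1 }
            return data[offset]
        }(self.data, &self.offset)
    }
}

// Hypothetical check: both variants return the same bytes and end at the same offset.
func variantsAgree(on payload: [UInt8]) -> Bool {
    let direct = ByteReader(data: Data(payload))
    let viaClosure = ByteReader(data: Data(payload))
    for _ in payload {
        if direct.byteDirect() != viaClosure.byteViaClosure() {
            return false
        }
    }
    return direct.offset == viaClosure.offset
}

let agree = variantsAgree(on: [0, 1, 2, 253, 254, 255])
print("variants agree:", agree)
```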

Interesting. Since data is immutable, I would expect that it wouldn’t need exclusivity checking overhead, and that loads should be forwarded by the optimizer. @Andrew_Trick may be able to comment.

If you want to see the effect of enabling exclusivity checking, you can benchmark with -O -enforce-exclusivity=checked in a recent snapshot of "trunk/master". That will be the default in upcoming snapshots after this week. That's likely to be the performance you see in Swift 5.0 modulo any additional optimizations.