Data enumeration speed

In the following test:

import Foundation

let data = Data(repeating: 0xAD, count: 1_000_000_000)

@discardableResult func measure<T>(_ title: String, execute: () -> T) -> T {
    let start = Date()
    let result = execute()
    let elapsed = Date().timeIntervalSince(start)
    print("\(title), \(elapsed), \(result)")
    return result
}

measure("withUnsafeBytes-1") { // 0.07s
    var result = 0
    data.withUnsafeBytes { bp in
        for byte in bp { result &+= Int(byte) }
    }
    return result
}

measure("withUnsafeBytes-2") { // 0.07s
    var result = 0
    data.withUnsafeBytes { bp in
        for i in 0 ..< bp.count { result &+= Int(bp[i]) }
    }
    return result
}

measure("withUnsafeBytes-3") { // 0.07s
    var result = 0
    data.withUnsafeBytes { bp in
        let p = bp.baseAddress!.assumingMemoryBound(to: UInt8.self)
        for i in 0 ..< bp.count { result &+= Int(p[i]) }
    }
    return result
}

measure("for i in 0 ..< data.count") { // 2.8s
    var result = 0
    for i in 0 ..< data.count { result &+= Int(data[i]) }
    return result
}

measure("for byte in data") { // 3.9s
    var result = 0
    for byte in data { result &+= Int(byte) }
    return result
}

measure("data.forEach") { // 4.9s
    var result = 0
    data.forEach { result &+= Int($0) }
    return result
}

measure("data.reduce") { // 4.9s
    data.reduce(0) { $0 &+ Int($1) }
}

I wasn't too surprised to see the withUnsafeBytes versions being ~40–70× faster than the safe versions (the inner loop is doing almost nothing, so the overhead of bounds checking is significant).

What is surprising, however, is that the fastest of the safe versions is the "for i in 0 ..< data.count" one – the one where bounds checking must be present! "for byte in data" is ~40% slower, and forEach/reduce are ~75% slower, yet in those three there is no obvious need for any bounds checking inside the loop. Why are they slower?

4 Likes

It'd probably be illuminating to look at the disassembly. These cases are all about the compiler optimising out boilerplate (like bounds checks, trampoline functions, etc). Evidently it's failing to do so to varying degrees in the 'safe' cases.

There's nothing about them which strikes me as intrinsically poor-performing. In fact the compiler has everything it needs (in theory) to optimise every implementation down to one instruction (load immediate with the result of wrapping summation of 0xAD a billion times). It's probably a question of which things break that by being semantically hidden from it (e.g. non-inlinable methods across module boundaries), and of course present limitations in its ability to reason about loops (even statically defined ones as in this case).

2 Likes

The main issue is that Foundation.Data has three different internal representations. Every loop iteration switches over all possible representations. The optimizer won't disentangle something like this without creating three copies of the loop, which it won't do for code size reasons.
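Given that diagnosis, one workaround is to pay the representation switch once rather than per byte, by copying the bytes into a contiguous array before looping. A sketch (my own, not from the thread; note the copy itself costs time and memory, so this only pays off when the loop body is this hot):

```swift
import Foundation

let data = Data(repeating: 0xAD, count: 1_000)

// Copying resolves Data's internal representation once up front;
// the loop below then iterates a plain contiguous [UInt8] with no
// per-byte representation dispatch.
let bytes = [UInt8](data)
var result = 0
for byte in bytes { result &+= Int(byte) }
```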

This post has enough detail for a GitHub issue. And it's an important problem. I'm afraid nothing can be done short of deprecating subscript on Foundation.Data and replacing iteration over Data with iteration over StorageView once that makes it into the standard library.

3 Likes
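In the meantime, a small extension can keep call sites safe while hoisting the unsafe access into one audited place. A sketch under my own naming (`fastReduce` is a hypothetical helper, not an API from the thread or from Foundation):

```swift
import Foundation

extension Data {
    // Reduce over the raw bytes inside a single withUnsafeBytes call,
    // so the representation switch happens once rather than on every
    // iteration of the caller's loop.
    func fastReduce<T>(_ initialResult: T, _ nextPartialResult: (T, UInt8) -> T) -> T {
        withUnsafeBytes { (buffer: UnsafeRawBufferPointer) -> T in
            var acc = initialResult
            for byte in buffer { acc = nextPartialResult(acc, byte) }
            return acc
        }
    }
}

let data = Data(repeating: 0xAD, count: 1_000)
let sum = data.fastReduce(0) { $0 &+ Int($1) }
```

The unsafety is confined to the extension body; callers get the same shape as the standard `reduce` without touching raw pointers themselves.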