Why does adding conformance to RandomAccessCollection cause a performance regression?

I have a struct called BitVector that could be : RandomAccessCollection (RAC). When I add RAC to the definition, I get a horrible performance regression in functions that use this class (but I'm not using any function that RAC requires, like startIndex or endIndex). This confuses me: why would declaring that my struct implements the property cause a slowdown?

I’m seeing this in another completely unrelated struct as well, though the regression isn’t 2x the way it was with BitVector - in this latest case, it’s “only” about 20%.

Can you post your timing code that exposes the regression, so we can see exactly what you’re measuring?

It’s here (that is, this is an update to main.swift in the current master branch, which has the 40-run version of this code):

let h = Graph<UInt32>(fromBinaryFile: "/Users/me/dev/swift/SwiftGraphs/data/indptrvecs-4m-30m.0based.bin")

for i in 1 ... 100_000 {
    start = DispatchTime.now()
    let bfs1 = h.BFS(from: 0)
    end = DispatchTime.now()
    let timediff = Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / ms
    times.append(timediff)
    let timeAvg = times.reduce(0, +) / Double(times.count)
    let sumOfSquaredAvgDiff = times.map { pow($0 - timeAvg, 2.0) }.reduce(0, +)
    let timeStd = sqrt(sumOfSquaredAvgDiff / Double(times.count - 1))
    print("\(i): curr: \(timediff.truncate(places: 3)), min: \(times.min()!.truncate(places: 3)), max: \(times.max()!.truncate(places: 3)), avg: \(timeAvg.truncate(places: 3)), std: \(timeStd.truncate(places: 3))")
}

Without the property, I’m in the 600-700ms range for a single BFS. With the property, I’m in the 1400-1500ms range.

The BFS code is here: https://github.com/sbromberger/SwiftGraphs/blob/dc4c9c0462e54fdb8dac84aa529882736b38ca9a/SwiftGraphs/Graph.swift#L182-L215

If you build and run the current master, you will get timings that look like this*:

Times: min: 681.877303, max: 769.144946, avg: 703.053950575, std: 22.2595559447778

if you make a single change and add : RandomAccessCollection to line 9 of BitVector.swift so that it looks like this:

public struct BitVector: RandomAccessCollection  {

you should see times like this*:

Times: min: 1209.338157, max: 1727.203483, avg: 1333.903255475, std: 125.676044887502

(* exact timings may differ based on machine, of course, but what’s important is the regression.)

In cases like this, I always try to reduce it into a minimal complete/self-contained (command line) program that clearly demonstrates the issue.

Your current test depends on an (to readers of this forum) unknown file called: "/Users/me/dev/swift/SwiftGraphs/data/indptrvecs-4m-30m.0based.bin". It would be better if the test just produced some random data instead of this, which would probably also mean that LineReader could be skipped.

Having all the code in a single post, being able to just copy paste it in order to reproduce the issue, makes it much easier for people who wish to look into it.

(Also, by “adding a a property” I assume you mean “adding conformance to a protocol (RandomAccessCollection)”?)


Below is such a test program, measuring the performance of your BitVector's testAndSet() method (which calls its set and get subscript internally).

import AppKit

public struct BitVector  {
    private var bits: Array<Int>
    
    private let blockSize = Int.bitWidth // NOTE: Should probably be static
    
    public init(repeating: Bool, count: Int) {
        let fillInt = repeating ? ~0 : 0
        let nBlocks: Int = (count / blockSize) + 1
        bits = [Int](repeating: fillInt, count: nBlocks)
    }

    // Added this for testing:
    init(randomBitBlocks blockCount: Int) {
        let byteCount = (Int.bitWidth / 8) * blockCount
        bits = [Int](repeating: 0, count: blockCount)
        let rc = SecRandomCopyBytes(nil, byteCount, &bits)
        precondition(rc == errSecSuccess)
    }

    private func getBlockAndOffset(of bitIndex: Int) -> (Int, Int) {
        return (bitIndex / blockSize, bitIndex % blockSize)
    }
    
    public var startIndex: Int { return 0 }
    public var endIndex: Int { return bits.count * blockSize }
    
    public subscript(_ bitIndex: Int) -> Bool {
        get {
            let (block, offset) = getBlockAndOffset(of: bitIndex)
            let mask = 1 << offset
            return mask & bits[block] != 0
        }
        set {
            let (block, offset) = getBlockAndOffset(of: bitIndex)
            let mask = 1 << offset
            if newValue {
                bits[block] |= mask
            } else {
                bits[block] &= ~mask
            }
        }
    }
    
    public mutating func testAndSet(_ bitIndex: Int) -> Bool {
        let (block, offset) = getBlockAndOffset(of: bitIndex)
        let mask = 1 << offset
        let oldval = mask & bits[block] != 0
        if !oldval {
            bits[block] |= mask
        }
        return oldval
    }
}

// Comment/uncomment to see if it makes any difference in timings of test below:
// extension BitVector : RandomAccessCollection {}

func test() {
    for _ in 0 ..< 5 {
        var bv = BitVector(randomBitBlocks: 1_000_000 / Int.bitWidth)
        var setCount = 0
        let t0 = CACurrentMediaTime()
        for i in bv.startIndex ..< bv.endIndex {
            setCount += bv.testAndSet(i) ? 1 : 0
        }
        let t1 = CACurrentMediaTime()
        print("time:", t1 - t0, "seconds. (\(setCount))")
    }
}
test()

Using Xcode 9.3, and compiling with -O will result in output like the following on my MBP:

time: 0.00946520201978274 seconds. (499910)
time: 0.00749993498902768 seconds. (500297)
time: 0.00683424001908861 seconds. (500176)
time: 0.00753337901551276 seconds. (500429)
time: 0.00689123698975891 seconds. (498780)

And the timings are the same no matter if I comment or uncomment this line:

// extension BitVector : RandomAccessCollection {}

So unfortunately this didn’t reproduce the slowdown that you’re seeing, but perhaps you can devise a similar little (command line) program that does?

4 Likes

I tried most of the weekend to get it down to a manageable example, but I had no luck. The graph file is located on dropbox, but please be aware it’s ~244MB.

Sorry this isn’t an easier test, but with that file in the correct directory, you should have my environment fully replicated.

Also: testAndSet is not the cause of the regression. If I merely use the getter followed by the setter inside the conditional, I get the same performance.

Inside what conditional?

Also, just to rule out any unrelated causes, you might try what happens if you use eg CACurrentMediaTime() instead of DispatchTime.now().

testAndSet does two things: it gets a value at a given index and sets it to true if it’s not set already. This is functionally equivalent to

if !visited[Int(neighbor)] {
    visited[Int(neighbor)] = true
    ...
}

but my intent with testAndSet was to see whether I could speed things up by getting rid of one modulo arithmetic operation since those can be expensive if not optimized. (In Swift’s case, the two approaches are pretty much identical.)

I’ll try CACurrentMediaTime(). Thanks for the tip.

CACurrentMediaTime() didn’t change the results:

without :RandomAccessCollection:
Times: min: 640.769736026414, max: 765.039346995763, avg: 676.587614505843, std: 33.5452987882788

with :RandomAccessCollection:
Times: min: 1102.71427500993, max: 1316.3051699521, avg: 1187.05889147823, std: 50.3028173711964

testAndSet is using subscript which is using getBlockAndOffset, and this is all of BitVector except the initializer. So do you think the cause of the regression might be the initializer or something unrelated to BitVector?

I think it’s unrelated to testAndSet, since when I replace that block with

I still have the performance issue when I include : RandomAccessCollection to BitVector.

I also just got rid of the RandomAccessCollection extension I had in there (it defined searchSortedIndex which is used on some graph initializers), and that made no difference to the results.

I don’t think the cause is the graph initializer, since the graph is already initialized by the time BFS is called.

Ok, I see what you mean, I just figured testAndSet covered about all of BitVectors functionality, so that it should show the regression if it was in there somewhere. Anyway, I have downloaded the data and am able to reproduce your results, on my machine:

  • With BitVector not conforming to RandomAccessCollection:
    Times: min: 751.859559, max: 960.927005, avg: 833.25661605, std: 53.2723263777442

  • With BitVector conforming to RandomAccessCollection:
    Times: min: 1328.528196, max: 1817.240632, avg: 1480.1516392, std: 122.247619056441

2 Likes

Thanks for sticking with me. I’m new to Swift, so if you find a way to make a MWE, please share!

This is still a lot of code (removed some) and it’s still dependent on the data file (in the dropbox link above):

import Foundation


/// Read text file line by line
public class LineReader {
    public let path: String
    
    fileprivate let file: UnsafeMutablePointer<FILE>!
    
    init?(path: String) {
        self.path = path
        file = fopen(path, "r")
        guard file != nil else { return nil }
    }
    
    public var nextLine: String? {
        var line: UnsafeMutablePointer<CChar>?
        var linecap: Int = 0
        defer { free(line) }
        return getline(&line, &linecap, file) > 0 ? String(cString: line!) : nil
    }
    
    deinit {
        fclose(file)
    }
}

public struct BitVector {
    private var bits: Array<Int>
    
    private let blockSize = Int.bitWidth
    
    public init(repeating: Bool, count: Int) {
        let fillInt = repeating ? ~0 : 0
        let nBlocks: Int = (count / blockSize) + 1
        bits = [Int](repeating: fillInt, count: nBlocks)
    }
    
    private func getBlockAndOffset(of bitIndex: Int) -> (Int, Int) {
        return (bitIndex / blockSize, bitIndex % blockSize)
    }
    
    public var startIndex: Int { return 0 }
    public var endIndex: Int { return bits.count * blockSize }
    
    public subscript(_ bitIndex: Int) -> Bool {
        get {
            let (block, offset) = getBlockAndOffset(of: bitIndex)
            let mask = 1 << offset
            return mask & bits[block] != 0
        }
        set {
            let (block, offset) = getBlockAndOffset(of: bitIndex)
            let mask = 1 << offset
            if newValue {
                bits[block] |= mask
            } else {
                bits[block] &= ~mask
            }
        }
    }
    
    public mutating func testAndSet(_ bitIndex: Int) -> Bool {
        let (block, offset) = getBlockAndOffset(of: bitIndex)
        let mask = 1 << offset
        let oldval = mask & bits[block] != 0
        if !oldval {
            bits[block] |= mask
        }
        return oldval
    }
}


public struct Edge<T: BinaryInteger> {
    let src: T
    let dst: T
    
    public init(_ s: T, _ d: T) {
        src = s
        dst = d
    }
    
    public var ordered: Edge {
        if src > dst {
            return Edge(dst, src)
        } else { return self }
    }
    
    public var reverse: Edge {
        return Edge(dst, src)
    }
}

extension Edge: CustomStringConvertible {
    public var description: String {
        return "\(src) -> \(dst)"
    }
}

extension Edge: Comparable {
    public static func == (lhs: Edge, rhs: Edge) -> Bool {
        return lhs.src == rhs.src && lhs.dst == rhs.dst
    }
    
    public static func < (lhs: Edge, rhs: Edge) -> Bool {
        if lhs.src < rhs.src { return true }
        if lhs.src > rhs.src { return false }
        return lhs.dst < rhs.dst
    }
}

public struct Graph<T: BinaryInteger> {
    let rowidx: Array<T>
    let colptr: Array<Array<T>.Index>
    
    static var eltype: Any.Type { return T.self }
    public var nv: T { return T((colptr.count - 1)) }
    public var ne: Int { return rowidx.count / 2 }
    
    public var vertices: StrideTo<T> {
        return stride(from: T(0), to: nv, by: +1)
    }

    public init(fromBinaryFile fileName: String) {
        let file = URL(fileURLWithPath: (fileName as NSString).expandingTildeInPath)
        let fileHandle = try! FileHandle(forReadingFrom: file)
        let magic = fileHandle.readData(ofLength: 4)
        guard magic.elementsEqual("GRPH".utf8) else {
            fatalError("\(file) was not a graph file")
        }
        let colSize = fileHandle.readData(ofLength: MemoryLayout<UInt32>.size).withUnsafeBytes { (ptr: UnsafePointer<UInt32>) -> Int in
            return Int(ptr.pointee.bigEndian)
        }
        colptr = fileHandle.readData(ofLength: MemoryLayout<UInt32>.size * colSize).withUnsafeBytes({ (ptr: UnsafePointer<UInt32>) -> [Int] in
            let bufferPointer = UnsafeBufferPointer(start: ptr, count: colSize)
            return [Int](bufferPointer.lazy.map { Int($0.bigEndian) })
        })
        let rowSize = fileHandle.readData(ofLength: MemoryLayout<UInt32>.size).withUnsafeBytes { (ptr: UnsafePointer<UInt32>) -> Int in
            return Int(ptr.pointee.bigEndian)
        }
        rowidx = fileHandle.readData(ofLength: MemoryLayout<UInt32>.size * rowSize).withUnsafeBytes({ (ptr: UnsafePointer<UInt32>) -> [T] in
            let bufferPointer = UnsafeBufferPointer(start: ptr, count: rowSize)
            return [T](bufferPointer.lazy.map { T($0.bigEndian) })
        })
    }

    private func vecRange(_ s: Array<T>.Index) -> CountableRange<Array<T>.Index> {
        let rStart = colptr[s]
        let rEnd = colptr[s + 1]
        return rStart ..< rEnd
    }
    public func neighbors(of vertex: T) -> ArraySlice<T> {
        let range = vecRange(Array<T>.Index(vertex))
        return rowidx[range]
    }

    public func BFS(from sourceVertex: Int) -> [T] {
        let numVertices = Int(nv)
        let maxT = ~T()
        var visited = BitVector(repeating: false, count: numVertices)
        var vertLevel = Array<T>(repeating: maxT, count: numVertices)
        var nLevel: T = 1
        var curLevel = [T]()
        curLevel.reserveCapacity(numVertices)
        var nextLevel = [T]()
        nextLevel.reserveCapacity(numVertices)
        
        visited[sourceVertex] = true
        vertLevel[sourceVertex] = 0
        curLevel.append(T(sourceVertex))
        
        while !curLevel.isEmpty {
            for vertex in curLevel {
                for neighbor in neighbors(of: vertex) {
                    if !visited.testAndSet(Int(neighbor)) {
                        nextLevel.append(neighbor)
                        vertLevel[Int(neighbor)] = nLevel
                    }
                }
            }
            nLevel += 1
            curLevel.removeAll(keepingCapacity: true)
            (curLevel, nextLevel) = (nextLevel, curLevel)
            curLevel.sort()
        }
        return vertLevel
    }
}

extension Graph: CustomStringConvertible {
    public var description: String {
        return "{\(nv), \(ne)} Graph"
    }
}

//-----------------------------------------------------------------------------
// Comment/uncomment to see difference in timings:
//-----------------------------------------------------------------------------
// extension BitVector : RandomAccessCollection {}

// (
// With BitVector *not* conforming to RandomAccessCollection:
// Times: min: 751.859559, max: 960.927005, avg: 833.25661605, std: 53.2723263777442
//
// With BitVector conforming to RandomAccessCollection:
// Times: min: 1328.528196, max: 1817.240632, avg: 1480.1516392, std: 122.247619056441
// )


var start = DispatchTime.now()
let h = Graph<UInt32>(fromBinaryFile: "~/Downloads/indptrvecs-4m-30m.0based.bin")

var end = DispatchTime.now()
print("graph read took \(Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000.0) ms")

print(h)
var times = [Double]()
var ms = 1_000_000.0
for i in 1 ... 40 {
    start = DispatchTime.now()
    let bfs1 = h.BFS(from: 0)
    end = DispatchTime.now()
    let timediff = Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / ms
    // let timeit = end2.uptimeNanoseconds - start2.uptimeNanoseconds
    print("Run \(i): BFS from vertex 0 to \(timediff) ms; sum = \(bfs1.reduce(0, +))")
    times.append(timediff)
}

let timeAvg = times.reduce(0, +) / Double(times.count)
let sumOfSquaredAvgDiff = times.map { pow($0 - timeAvg, 2.0) }.reduce(0, +)
let timeStd = sqrt(sumOfSquaredAvgDiff / Double(times.count - 1))
print("Times: min: \(times.min()!), max: \(times.max()!), avg: \(timeAvg), std: \(timeStd)")

If you check out the sbromberger/MWE branch, that might help.

I tried some things with the code (since I’ve ran into similar issues that I never managed to isolate), but the only thing I could find out was that the regression disappears if you annotate the get and set of BitVectors subscript with @_transparent. I don’t know why this is, but I guess it does demonstrate a bug (or at least an opportunity for improvement) in the optimizer. But it would be nice if someone more skilled could find a smaller reproducing example, cc @Slava_Pestov, @Joe_Groff, @Andrew_Trick.

I’m attaching the code (it’s the files of your MWE-branch, joined into one) in the state where I left it after finding this workaround (and after doing some other things that I tested but which made no difference, eg removing all public, sticking the test code in a func, fixing the warnings that then appeared about changing some vars to lets etc.).

import Dispatch
import QuartzCore

struct BitVector {
    private var bits: Array<Int>
    
    private let blockSize = Int.bitWidth
    
    init(repeating: Bool, count: Int) {
        let fillInt = repeating ? ~0 : 0
        let nBlocks: Int = (count / blockSize) + 1
        bits = [Int](repeating: fillInt, count: nBlocks)
    }
    
    private func getBlockAndOffset(of bitIndex: Int) -> (Int, Int) {
        return (bitIndex / blockSize, bitIndex % blockSize)
    }
    
    var startIndex: Int { return 0 }
    var endIndex: Int { return bits.count * blockSize }
    
    // NOTE: Adding @_transparent to the get and set of this subscript
    // removes the slowdown that is otherwise magically introduced by
    // conforming BitVector to RandomAccessCollection (see further down).
    // So if you comment out the two @_transparent and keep the conformance
    // to RandomAccessCollection, the slowdown will be there again.
    subscript(_ bitIndex: Int) -> Bool {
        @_transparent
        get {
            let (block, offset) = getBlockAndOffset(of: bitIndex)
            let mask = 1 << offset
            return mask & bits[block] != 0
        }
        @_transparent
        set {
            let (block, offset) = getBlockAndOffset(of: bitIndex)
            let mask = 1 << offset
            if newValue {
                bits[block] |= mask
            } else {
                bits[block] &= ~mask
            }
        }
    }
}

struct Edge<T: BinaryInteger> {
    let src: T
    let dst: T
    
    init(_ s: T, _ d: T) {
        src = s
        dst = d
    }
    
    var ordered: Edge {
        if src > dst {
            return Edge(dst, src)
        } else { return self }
    }
    
    var reverse: Edge {
        return Edge(dst, src)
    }
}

struct Graph<T: BinaryInteger> {
    let rowidx: Array<T>
    let colptr: Array<Array<T>.Index>
    
    static var eltype: Any.Type { return T.self }
    var nv: T { return T((colptr.count - 1)) }
    var ne: Int { return rowidx.count / 2 }
    
    var vertices: StrideTo<T> {
        return stride(from: T(0), to: nv, by: +1)
    }
    
    init(fromBinaryFile fileName: String) {
        let file = URL(fileURLWithPath: fileName)
        let fileHandle = try! FileHandle(forReadingFrom: file)
        let magic = fileHandle.readData(ofLength: 4)
        guard magic.elementsEqual("GRPH".utf8) else {
            fatalError("\(file) was not a graph file")
        }
        let colSize = fileHandle.readData(ofLength: MemoryLayout<UInt32>.size).withUnsafeBytes { (ptr: UnsafePointer<UInt32>) -> Int in
            return Int(ptr.pointee.bigEndian)
        }
        colptr = fileHandle.readData(ofLength: MemoryLayout<UInt32>.size * colSize).withUnsafeBytes({ (ptr: UnsafePointer<UInt32>) -> [Int] in
            let bufferPointer = UnsafeBufferPointer(start: ptr, count: colSize)
            return [Int](bufferPointer.lazy.map { Int($0.bigEndian) })
        })
        let rowSize = fileHandle.readData(ofLength: MemoryLayout<UInt32>.size).withUnsafeBytes { (ptr: UnsafePointer<UInt32>) -> Int in
            return Int(ptr.pointee.bigEndian)
        }
        rowidx = fileHandle.readData(ofLength: MemoryLayout<UInt32>.size * rowSize).withUnsafeBytes({ (ptr: UnsafePointer<UInt32>) -> [T] in
            let bufferPointer = UnsafeBufferPointer(start: ptr, count: rowSize)
            return [T](bufferPointer.lazy.map { T($0.bigEndian) })
        })
    }
    
    private func vecRange(_ s: Array<T>.Index) -> Range<Array<T>.Index> {
        let rStart = colptr[s]
        let rEnd = colptr[s + 1]
        return rStart ..< rEnd
    }
    
    func neighbors(of vertex: T) -> ArraySlice<T> {
        let range = vecRange(Array<T>.Index(vertex))
        return rowidx[range]
    }
    
    func BFS(from sourceVertex: Int) -> [T] {
        let numVertices = Int(nv)
        let maxT = ~T()
        var visited = BitVector(repeating: false, count: numVertices)
        var vertLevel = Array<T>(repeating: maxT, count: numVertices)
        var nLevel: T = 1
        var curLevel = [T]()
        curLevel.reserveCapacity(numVertices)
        var nextLevel = [T]()
        nextLevel.reserveCapacity(numVertices)
        
        visited[sourceVertex] = true
        vertLevel[sourceVertex] = 0
        curLevel.append(T(sourceVertex))
        
        while !curLevel.isEmpty {
            for vertex in curLevel {
                for neighbor in neighbors(of: vertex) {
                    if !visited[Int(neighbor)] {
                        visited[Int(neighbor)] = true
                        nextLevel.append(neighbor)
                        vertLevel[Int(neighbor)] = nLevel
                    }
                }
            }
            nLevel += 1
            curLevel.removeAll(keepingCapacity: true)
            (curLevel, nextLevel) = (nextLevel, curLevel)
            curLevel.sort()
        }
        return vertLevel
    }
}

extension Graph: CustomStringConvertible {
    var description: String {
        return "{\(nv), \(ne)} Graph"
    }
}

// NOTE: This conformance is not necessary for the test to succeed, but
// without the workaround of adding @_transparent to the get and set of
// BitVector's subscript (see above), this conformance will somehow cause
// a performance regression in the below test.
extension BitVector : RandomAccessCollection {}

func test() {
    let start = DispatchTime.now()
    let h = Graph<UInt32>(fromBinaryFile: "/Users/jens/Downloads/indptrvecs-4m-30m.0based.bin")
    let end = DispatchTime.now()
    print("graph read took \(Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000.0) ms")
    print(h)
    var times = [Double]()
    for i in 1 ... 40 {
        let start = CACurrentMediaTime()
        let bfs1 = h.BFS(from: 0)
        let end = CACurrentMediaTime()
        let timediff = (end - start) * 1000
        print("Run \(i): BFS from vertex 0 to \(timediff) ms; sum = \(bfs1.reduce(0, +))")
        times.append(timediff)
    }
    
    let timeAvg = times.reduce(0, +) / Double(times.count)
    let sumOfSquaredAvgDiff = times.map { pow($0 - timeAvg, 2.0) }.reduce(0, +)
    let timeStd = sqrt(sumOfSquaredAvgDiff / Double(times.count - 1))
    print("Times: min: \(times.min()!), max: \(times.max()!), avg: \(timeAvg), std: \(timeStd)")
}
test()

For anyone wanting to look into this: In order to run the above demo program, you also need the graph data file linked to here.

Thanks for narrowing this down, @Jens! It'd also be interesting to compare the -dump-ast output with and without the RandomAccessCollection conformance. It's likely that it's influencing overload resolution somewhere here in a way that's bad for performance; maybe it's picking a "better" overload that isn't really better in this case.

I'm happy to help narrow this down, but I'll need explicit steps. :slight_smile:

If you build the Swift file by replacing the normal output option flag (-c, -emit-library, etc.) with -dump-ast on the commandline, then you ought to get a dump of the Swift compiler's parse and semantic analysis of the program. It'd be interesting to diff the before and after dumps.