Performant byte-based String wrapper

FlorianPircher · May 27, 2024, 10:50pm

I am trying to implement a performant, immutable type ByteString that uses String as its backing store for UTF-8 data. ByteString bases its equality on the UTF-8 bytes, not Unicode equivalence as String does. My current attempt measures much less performant than the native String implementations of ==, <, and maybe hash(into:) (not sure how to best test hashing performance).

Here is my current attempt (demo package):

import Darwin

public struct ByteString: Codable, Sendable {
    public let value: String
    
    @inlinable
    public init(_ value: String) {
        var value = value
        value.makeContiguousUTF8()
        self.value = value
    }
}

extension ByteString: Hashable {
    @inlinable
    @_effects(readonly)
    public static func == (lhs: Self, rhs: Self) -> Bool {
        var lhs = lhs.value
        var rhs = rhs.value
        return lhs.withUTF8 { lhsUTF8 in
            rhs.withUTF8 { rhsUTF8 in
                guard lhsUTF8.count == rhsUTF8.count else {
                    return false
                }
                let lhsBaseAddress = lhsUTF8.baseAddress.unsafelyUnwrapped
                let rhsBaseAddress = rhsUTF8.baseAddress.unsafelyUnwrapped
                if lhsBaseAddress == rhsBaseAddress {
                    return true
                }
                return memcmp(lhsBaseAddress, rhsBaseAddress, lhsUTF8.count) == 0
            }
        }
    }
    
    @inlinable
    public func hash(into hasher: inout Hasher) {
        var value = value
        value.withUTF8 { utf8 in
            hasher.combine(bytes: UnsafeRawBufferPointer(utf8))
        }
    }
}

extension ByteString: Comparable {
    @inlinable
    @_effects(readonly)
    public static func < (lhs: ByteString, rhs: ByteString) -> Bool {
        var lhs = lhs.value
        var rhs = rhs.value
        return lhs.withUTF8 { lhsUTF8 in
            rhs.withUTF8 { rhsUTF8 in
                let lhsBaseAddress = lhsUTF8.baseAddress.unsafelyUnwrapped
                let rhsBaseAddress = rhsUTF8.baseAddress.unsafelyUnwrapped
                if lhsBaseAddress == rhsBaseAddress && lhsUTF8.count == rhsUTF8.count {
                    return false
                }
                let count = min(lhsUTF8.count, rhsUTF8.count)
                return memcmp(lhsBaseAddress, rhsBaseAddress, count) < 0
            }
        }
    }
}

Here, the native String.== is more than four times faster when testing in Release mode (ByteString.== 0.958s vs. String.== 0.223s):

Test Code

import XCTest
import ByteString

@inline(never)
func blackHole(_ x: Bool) {}

final class ByteStringTests: XCTestCase {
    func testPerformance() throws {
        let fileURL = Bundle.module.url(forResource: "Words", withExtension: "txt")!
        let words = try! String(contentsOf: fileURL)
            .split(separator: "\n")
            .map { ByteString(String($0)) }
        
        measure {
            for a in words {
                for b in words {
                    blackHole(a == b)
                    // blackHole(a.value == b.value)
                }
            }
        }
    }
}

The result is similar when profiling <: 1.301s vs. 0.231s. (Test code and demo data set are included in the demo package.)

I am no expert at profiling code, so take these results with a healthy amount of salt, but the performance difference is significant enough that it is probably not just related to the profiling methodology.

If anything, my code should be faster, as I only care about byte equivalence, not the more involved Unicode equivalence. The strings in my performance tests are mostly ASCII, as that matches most strings in the file format for which I intend to use this ByteString type. String has fast paths for at least ASCII and NFC-Unicode content, so I don’t expect my code to be much faster, but it certainly should not be slower by this degree.

Alternative approaches like

a.value.utf8.elementsEqual(b.value.utf8)

are slightly faster than the code shown above (0.802s vs. 0.958, vs. 0.223 native String), but are still far off and offer no room for improvement.

Is there anything I could do to get closer to the performance of String in my ByteString type?

Background: I am writing version control software based on Git, where I parse text files into a tree structure of arrays, dictionaries, and strings. These files can be multiple megabytes in size. ByteString uses of < and hashing, but particularly ==, are extremely frequently used. Git diffs files based on byte equivalence, not Unicode equivalence, so I have to do so as well. I also need to display the strings in the GUI and do other string-related things with them (log, string search based on user-provided queries, …) so I would like to avoid using Data or [UInt8] as that would involve conversions that I could sidestep since String also holds UTF-8 directly.

bbrk24 · May 27, 2024, 10:52pm

Isn’t string.unicodeScalars Equatable? Does that have the desired characteristics?

FlorianPircher · May 27, 2024, 11:06pm

No, "...".unicodeScalars is String.UnicodeScalarView which is not Equatable. However, you can write

a.unicodeScalars.elementsEqual(b.unicodeScalars)

It measures 0.399s in my test, which is getting closer to the 0.223s of a == b for String, but does not help with hashing or <. Also, I am not certain that unicodeScalars equality implies utf8 equality, but I cannot come up with a counterexample off the top of my head.

jrose · May 27, 2024, 11:17pm

Many of your “words” are small enough to fit inline in a String’s value, and most or all of them are ASCII (I’m guessing) and thus already normalized. So most likely the String-based comparisons are (1) early-exiting on the count equality check (which to be fair is something you might hope for in your implementation as well), and (2) comparing the strings directly in registers rather than calling out to memcmp. If you switch your corpus up for longer, non-ASCII strings, and measure hashValues (no early exit), you might see times more in line with what you expect, but if that’s not representative of your work it may not matter. A lot of effort has gone into String’s Unicode behavior not actually slowing things down much in fast paths, and its baseline is faster than a naive implementation.

FlorianPircher · May 27, 2024, 11:29pm

For my use case, most strings are less than 16 bytes and most are ASCII. Some, however, are enormous and filled with the most exotic Unicode content imaginable. So, I think I could store a separate “is ASCII” flag with ByteString and just use the String implementations in that case. I know String has something similar, but inaccessible, already, so that might be interesting to use as well if it ever becomes part of the public API.

Interesting, thanks for the hint.

bbrk24 · May 27, 2024, 11:30pm

memcmp?

jrose · May 27, 2024, 11:35pm

Oops, thanks, will edit.

FlorianPircher · May 28, 2024, 3:59pm

I looked into how String is handling things, and I don’t think I can (and maybe also don’t want to) work as low-level. Instead, on creation, I now check whether the string is ASCII-only, and if so, use the String implementations of ==, hash(into:), and < since the characters of an all-ASCII String fulfill my byte-based equality as well.

Checking for all-ASCII is done using code I copied from Swift’s internal standard library:

github.com

apple/swift/blob/ef79042ae15a608a6f106ec4ec3bfd699a021b6e/stdlib/public/core/StringCreate.swift#L15-L15


      
          internal func _allASCII(_ input: UnsafeBufferPointer<UInt8>) -> Bool {

This function has a comment pointing out possible optimizations via SIMD, but it is not trivial as-is either, reading and masking eight bytes at a time. That should be good enough for me, too.

With the is-ASCII check added, ByteString creation is twice as slow as before. However, creation was not a bottleneck for me before and still is not now. I could also offer a second initializer that skips the check (downgrading the “is ASCII” to a “is known to be ASCII” property) if I need more short-lived ByteString values in the future where creation time is more critical.

The new implementation matches or beats Swift’s String time for == and < in a mixed ASCII/Unicode corpus with a heavy ASCII-only bias. (Ignoring creation time.) Hashing is still a bit slower, but only marginally so.

Here is the updated ByteString:

import Darwin

public struct ByteString: Codable, Sendable {
    public let value: String
    public let isASCII: Bool
    
    /// Copied from the [internal Swift standard library](https://github.com/apple/swift/blob/main/stdlib/public/core/StringCreate.swift), licensed under the [Apache License 2.0](https://www.apache.org/licenses/)
    @usableFromInline
    static func isAllASCII(_ input: UnsafeBufferPointer<UInt8>) -> Bool {
        if input.isEmpty { return true }
        
        let count = input.count
        var pointer = UnsafeRawPointer(input.baseAddress.unsafelyUnwrapped)
        
        let asciiMask64 = 0x8080_8080_8080_8080 as UInt64
        let asciiMask32 = UInt32(truncatingIfNeeded: asciiMask64)
        let asciiMask16 = UInt16(truncatingIfNeeded: asciiMask64)
        let asciiMask8 = UInt8(truncatingIfNeeded: asciiMask64)
        
        let end128 = pointer + count & ~(MemoryLayout<(UInt64, UInt64)>.stride &- 1)
        let end64 = pointer + count & ~(MemoryLayout<UInt64>.stride &- 1)
        let end32 = pointer + count & ~(MemoryLayout<UInt32>.stride &- 1)
        let end16 = pointer + count & ~(MemoryLayout<UInt16>.stride &- 1)
        let end = pointer + count
        
        while pointer < end128 {
            let pair = pointer.loadUnaligned(as: (UInt64, UInt64).self)
            let result = (pair.0 | pair.1) & asciiMask64
            guard result == 0 else { return false }
            pointer = pointer + MemoryLayout<(UInt64, UInt64)>.stride
        }
        
        if pointer < end64 {
            let value = pointer.loadUnaligned(as: UInt64.self)
            guard value & asciiMask64 == 0 else { return false }
            pointer = pointer + MemoryLayout<UInt64>.stride
        }
        
        if pointer < end32 {
            let value = pointer.loadUnaligned(as: UInt32.self)
            guard value & asciiMask32 == 0 else { return false }
            pointer = pointer + MemoryLayout<UInt32>.stride
        }
        
        if pointer < end16 {
            let value = pointer.loadUnaligned(as: UInt16.self)
            guard value & asciiMask16 == 0 else { return false }
            pointer = pointer + MemoryLayout<UInt16>.stride
        }
        
        if pointer < end {
            let value = pointer.loadUnaligned(fromByteOffset: 0, as: UInt8.self)
            guard value & asciiMask8 == 0 else { return false }
        }
        
        return true
    }
    
    @inlinable
    public init(_ value: String) {
        var value = value
        self.isASCII = value.withUTF8(ByteString.isAllASCII(_:))
        self.value = value
    }
}

extension ByteString: Hashable {
    @inlinable
    @_effects(readonly)
    public static func == (lhs: Self, rhs: Self) -> Bool {
        if lhs.isASCII {
            guard rhs.isASCII else {
                return false
            }
            return lhs.value == rhs.value
        }
        else if rhs.isASCII {
            return false
        }
        
        var lhs = lhs.value
        var rhs = rhs.value
        return lhs.withUTF8 { lhsUTF8 in
            rhs.withUTF8 { rhsUTF8 in
                guard lhsUTF8.count == rhsUTF8.count else {
                    return false
                }
                let lhsBaseAddress = lhsUTF8.baseAddress.unsafelyUnwrapped
                let rhsBaseAddress = rhsUTF8.baseAddress.unsafelyUnwrapped
                if lhsBaseAddress == rhsBaseAddress {
                    return true
                }
                return memcmp(lhsBaseAddress, rhsBaseAddress, lhsUTF8.count) == 0
            }
        }
    }
    
    @inlinable
    public func hash(into hasher: inout Hasher) {
        if isASCII {
            hasher.combine(value)
        }
        else {
            var value = value
            value.withUTF8 { utf8 in
                hasher.combine(bytes: UnsafeRawBufferPointer(utf8))
            }
        }
    }
}

extension ByteString: Comparable {
    @inlinable
    @_effects(readonly)
    public static func < (lhs: ByteString, rhs: ByteString) -> Bool {
        if lhs.isASCII && rhs.isASCII {
            return lhs.value < rhs.value
        }
        
        var lhs = lhs.value
        var rhs = rhs.value
        return lhs.withUTF8 { lhsUTF8 in
            rhs.withUTF8 { rhsUTF8 in
                let lhsBaseAddress = lhsUTF8.baseAddress.unsafelyUnwrapped
                let rhsBaseAddress = rhsUTF8.baseAddress.unsafelyUnwrapped
                if lhsBaseAddress == rhsBaseAddress && lhsUTF8.count == rhsUTF8.count {
                    return false
                }
                let count = min(lhsUTF8.count, rhsUTF8.count)
                return memcmp(lhsBaseAddress, rhsBaseAddress, count) < 0
            }
        }
    }
}

Thanks for your feedback!