I am trying to implement a performant, immutable type ByteString
that uses String
as its backing store for UTF-8 data. ByteString
bases its equality on the UTF-8 bytes, not Unicode equivalence as String
does. My current attempt measures much less performant than the native String
implementations of ==
, <
, and maybe hash(into:)
(not sure how to best test hashing performance).
Here is my current attempt (demo package):
import Darwin
public struct ByteString: Codable, Sendable {
public let value: String
@inlinable
public init(_ value: String) {
var value = value
value.makeContiguousUTF8()
self.value = value
}
}
extension ByteString: Hashable {
@inlinable
@_effects(readonly)
public static func == (lhs: Self, rhs: Self) -> Bool {
var lhs = lhs.value
var rhs = rhs.value
return lhs.withUTF8 { lhsUTF8 in
rhs.withUTF8 { rhsUTF8 in
guard lhsUTF8.count == rhsUTF8.count else {
return false
}
let lhsBaseAddress = lhsUTF8.baseAddress.unsafelyUnwrapped
let rhsBaseAddress = rhsUTF8.baseAddress.unsafelyUnwrapped
if lhsBaseAddress == rhsBaseAddress {
return true
}
return memcmp(lhsBaseAddress, rhsBaseAddress, lhsUTF8.count) == 0
}
}
}
@inlinable
public func hash(into hasher: inout Hasher) {
var value = value
value.withUTF8 { utf8 in
hasher.combine(bytes: UnsafeRawBufferPointer(utf8))
}
}
}
extension ByteString: Comparable {
@inlinable
@_effects(readonly)
public static func < (lhs: ByteString, rhs: ByteString) -> Bool {
var lhs = lhs.value
var rhs = rhs.value
return lhs.withUTF8 { lhsUTF8 in
rhs.withUTF8 { rhsUTF8 in
let lhsBaseAddress = lhsUTF8.baseAddress.unsafelyUnwrapped
let rhsBaseAddress = rhsUTF8.baseAddress.unsafelyUnwrapped
if lhsBaseAddress == rhsBaseAddress && lhsUTF8.count == rhsUTF8.count {
return false
}
let count = min(lhsUTF8.count, rhsUTF8.count)
return memcmp(lhsBaseAddress, rhsBaseAddress, count) < 0
}
}
}
}
Here, the native String.==
is more than four times faster when testing in Release mode (ByteString.==
0.958s vs. String.==
0.223s):
Test Code
import XCTest
import ByteString
@inline(never)
func blackHole(_ x: Bool) {}
final class ByteStringTests: XCTestCase {
func testPerformance() throws {
let fileURL = Bundle.module.url(forResource: "Words", withExtension: "txt")!
let words = try! String(contentsOf: fileURL)
.split(separator: "\n")
.map { ByteString(String($0)) }
measure {
for a in words {
for b in words {
blackHole(a == b)
// blackHole(a.value == b.value)
}
}
}
}
}
The result is similar when profiling <
: 1.301s vs. 0.231s. (Test code and demo data set are included in the demo package.)
I am no expert at profiling code, so take these results with a healthy amount of salt, but the performance difference is significant enough that it is probably not just related to the profiling methodology.
If anything, my code should be faster, as I only care about byte equivalence, not the more involved Unicode equivalence. The strings in my performance tests are mostly ASCII, as that matches most strings in the file format for which I intend to use this ByteString
type. String
has fast paths for at least ASCII and NFC-Unicode content, so I don’t expect my code to be much faster, but it certainly should not be slower by this degree.
Alternative approaches like
a.value.utf8.elementsEqual(b.value.utf8)
are slightly faster than the code shown above (0.802s vs. 0.958, vs. 0.223 native String
), but are still far off and offer no room for improvement.
Is there anything I could do to get closer to the performance of String
in my ByteString
type?
Background: I am writing version control software based on Git, where I parse text files into a tree structure of arrays, dictionaries, and strings. These files can be multiple megabytes in size. ByteString
uses of <
and hashing, but particularly ==
, are extremely frequently used. Git diffs files based on byte equivalence, not Unicode equivalence, so I have to do so as well. I also need to display the strings in the GUI and do other string-related things with them (log, string search based on user-provided queries, …) so I would like to avoid using Data
or [UInt8]
as that would involve conversions that I could sidestep since String
also holds UTF-8 directly.