Fastest way to read UInt8 buffer pointer into UInt or UInt64 chunks

Patrick_Pijnappel · May 7, 2019, 5:14am

I'm working on some stdlib performance improvements for integer parsing and other string handling, and I'm running into some issues.

I have a UTF-8 pointer (from contiguous string storage) and would like to efficiently read it out 64 bits (or alternatively, the native UInt size) at a time. I'm ok with the remainder (count % 8) bytes being at either end. Two approaches I've been looking at:

1. Using copyMemory(from:byteCount:) on UnsafeMutableRawPointer. This works well with a constant 8-byte count argument, but seems to be relatively slow when the byte count is not known at compile time (i.e. for the remainder bytes).
2. Copying (as above) or directly loading (which requires handling alignment) UInt64 chunks, and then masking out the garbage bytes outside of the UTF-8 buffer. However, I'm not sure whether any amount of reading before or after the buffer is ever guaranteed to be safe. Or is this safe up to the word size or the maximum alignment size?
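A minimal sketch of the first approach, for illustration only. The helper name readChunk and its parameters are hypothetical, not stdlib API; it copies at most 8 in-bounds bytes into a zeroed UInt64 and never touches memory outside the buffer. The result is normalized so that the first byte always lands in the low-order position.

```swift
/// Hypothetical helper (not stdlib API): copies `byteCount` (0...8) bytes
/// starting at `offset` into a zero-initialized UInt64. Only in-bounds bytes
/// are read; unused high-order bytes of the result stay zero.
func readChunk(_ utf8: UnsafeBufferPointer<UInt8>, at offset: Int, byteCount: Int) -> UInt64 {
    precondition(byteCount >= 0 && byteCount <= 8)
    precondition(offset >= 0 && offset + byteCount <= utf8.count)
    var value: UInt64 = 0
    guard byteCount > 0, let base = utf8.baseAddress else { return value }
    withUnsafeMutableBytes(of: &value) { dest in
        // copyMemory has no alignment requirement on the source.
        dest.baseAddress!.copyMemory(from: UnsafeRawPointer(base + offset),
                                     byteCount: byteCount)
    }
    // Interpret the copied bytes as little-endian regardless of host byte order.
    return UInt64(littleEndian: value)
}
```

Calling it with a literal byteCount of 8 in the main loop and a computed count only for the final chunk matches the split described above.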

lukasa · May 7, 2019, 1:09pm

Out-of-bounds reads are never guaranteed to be safe. By definition you may be reading uninitialised data. Depending on whether you're being careful with alignment, you may read across a page boundary and cause a segmentation violation, crashing your program. You must never, ever, ever read outside the bounds of an object that has been passed to you.

If you are not managing alignment, copyMemory is your friend. Alternatively, you can use shifting and UInt8s to do this as well. If you are managing alignment, using load to read 64-bit chunks and then reading the memory at the end is fine too.
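The "shifting and UInt8s" option could look roughly like the sketch below. The function name loadLittleEndian is made up for illustration; it assembles up to 8 bytes into a UInt64 one byte at a time, so it needs no alignment and stays strictly within bounds.

```swift
/// Hypothetical helper (not stdlib API): builds a UInt64 from `count` (0...8)
/// bytes starting at `offset`, placing the first byte in the low-order position.
func loadLittleEndian(_ bytes: UnsafeBufferPointer<UInt8>, at offset: Int, count: Int) -> UInt64 {
    precondition(count >= 0 && count <= 8)
    precondition(offset >= 0 && offset + count <= bytes.count)
    var value: UInt64 = 0
    for i in 0..<count {
        // Byte i occupies bits 8*i ..< 8*i + 8; the shift count is at most 56.
        value |= UInt64(bytes[offset + i]) &<< UInt64(8 * i)
    }
    return value
}
```

Whether the optimizer turns the constant-count case into a single load is not something this sketch guarantees; that would need to be checked in the generated code.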

Patrick_Pijnappel · May 7, 2019, 10:06pm

Ok in particular, can this be safe (assuming a 64-bit system for now):

let utf8: UnsafeBufferPointer<UInt8> = …
// Assume here utf8 count >= 8
let address = UInt(bitPattern: utf8.baseAddress!)
let alignedAddress = address & ~0x07
let alignedRaw = UnsafeRawPointer(bitPattern: alignedAddress)!
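// Little-endian: the bytes before the buffer sit in the low-order bits, so shift them out (shift count is in bits)
let value = alignedRaw.load(as: UInt64.self) &>> (8 * (address - alignedAddress))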

Andrew_Trick · May 7, 2019, 10:27pm

Patrick_Pijnappel:

Ok in particular, can this be safe (assuming a 64-bit system for now):

let utf8: UnsafeBufferPointer<UInt8> = …
// Assume here utf8 count >= 8
let address = UInt(bitPattern: utf8.baseAddress!)
let alignedAddress = address & ~0x07
let alignedRaw = UnsafeRawPointer(bitPattern: alignedAddress)!
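// Little-endian: the bytes before the buffer sit in the low-order bits, so shift them out (shift count is in bits)
let value = alignedRaw.load(as: UInt64.self) &>> (8 * (address - alignedAddress))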

I think it's a valid thing to want to do. But Swift does not have a programming model that supports it under any circumstance today. To even begin reasoning about correctness, the provider of the buffer pointer would need to guarantee that it has safe access to the bytes within some alignment boundary.

Depending on where the buffer actually comes from, I suspect you'll get away with it. Maybe the address sanitizer will catch it?

To be a better citizen, I think you should check for alignment, handle any partial words at the head or tail byte-by-byte via a separate loop, and handle only the aligned full word data using 64-bit loads.
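The head/body/tail structure described above might be sketched as follows. The function countNonASCII is a made-up example workload (counting bytes with the high bit set), chosen because its per-word step does not depend on byte order; the loop structure is the point, not the workload.

```swift
/// Hypothetical example: counts bytes >= 0x80 without reading out of bounds.
/// Unaligned head and tail bytes are handled one at a time; only fully
/// in-bounds, 8-byte-aligned words are read with 64-bit loads.
func countNonASCII(_ utf8: UnsafeBufferPointer<UInt8>) -> Int {
    guard let base = utf8.baseAddress else { return 0 }
    let end = utf8.count
    var i = 0
    var total = 0
    // Head: single bytes until the current address is 8-byte aligned.
    while i < end && UInt(bitPattern: base + i) & 7 != 0 {
        total += Int(utf8[i] >> 7)
        i += 1
    }
    // Body: aligned 64-bit loads that lie entirely inside the buffer.
    let raw = UnsafeRawPointer(base)
    while i + 8 <= end {
        let word = raw.load(fromByteOffset: i, as: UInt64.self)
        // One high bit per byte; count how many are set.
        total += (word & 0x8080_8080_8080_8080).nonzeroBitCount
        i += 8
    }
    // Tail: the remaining 0...7 bytes, again one at a time.
    while i < end {
        total += Int(utf8[i] >> 7)
        i += 1
    }
    return total
}
```

Because every load stays inside the buffer, this shape should not trip the address sanitizer, at the cost of two short scalar loops at the ends.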

scanon · May 7, 2019, 10:46pm

Yes, asan will absolutely flag such accesses.

What we should do here is provide an idiom for the ragged ends that does not access out-of-range memory in the Swift layer, but for which the SIL or LLVM layer is guaranteed to produce aligned loads and masks that are safe on the targeted machine model, decaying to byte access if that's the only safe option on the targeted machine (an extremely limited form of predicated autovectorization). This will require some careful design, but the benefits are significant (for C and C++ as well as Swift if we expose this in LLVM).

Andrew_Trick · May 7, 2019, 10:54pm

Yeah, I could see providing an API for partial word loads. But if it's not done in LLVM (as a builtin) I'm afraid of running afoul of any tools or diagnostics that work at the LLVM level.

scanon · May 7, 2019, 10:55pm

Right. If you want to avoid tripping asan, LLVM needs to know that it's special. There's a question of whether the optimization from the blessed source idiom to that builtin belongs at the SIL level or LLVM level (I can see reasonable arguments each way).