Best practice for parsing heterogeneous types from Data (in Swift 5)

@Andrew_Trick, @SusanCheng, alignment is one of the things I've yet to fully grasp. Could you help me understand what my options are (today) if the received byte stream is "packed", i.e. potentially not aligned the way Swift would like the values to be? My intuition is that accessing a misaligned value in place is bad, but my understanding was that UnsafeRawPointer.load(as:) first makes a copy of the data, so it never associates the type with the original (potentially unaligned) bytes, which seems... less bad?

Thanks

Although load(as:) doesn't associate a type with the memory being loaded, it still unfortunately assumes that the pointer is aligned properly for a value of that type. Like in C, you could use memcpy or Swift's equivalent UnsafeMutableRawPointer.copyMemory method to do unaligned loads and stores. I usually add some helper methods to UnsafeRawPointer to facilitate this; as Andrew said, it'd be great to have these in the standard library:

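import Foundation // for memcpy

extension UnsafeRawPointer {
  /// Loads a T from this address without requiring the address to be aligned for T.
  func loadUnaligned<T>(as: T.Type) -> T {
    assert(_isPOD(T.self)) // relies on the type being POD (no refcounting or other management)
    // Copy the bytes into a temporary, properly aligned buffer, then read the value from it.
    let buffer = UnsafeMutablePointer<T>.allocate(capacity: 1)
    defer { buffer.deallocate() }
    memcpy(buffer, self, MemoryLayout<T>.size)
    return buffer.pointee
  }
}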
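
For comparison, here is a minimal sketch of the copyMemory route mentioned above. It is not from the thread: the name copyingLoad is made up, and it is restricted to FixedWidthInteger only so that a local value can be zero-initialized and used as the copy destination instead of a heap allocation.

```swift
import Foundation

extension UnsafeRawPointer {
    /// Hypothetical helper (not a standard library API): copies the bytes of a T
    /// into an aligned local variable, so the source address may be misaligned.
    func copyingLoad<T: FixedWidthInteger>(fromByteOffset offset: Int = 0, as type: T.Type) -> T {
        var value: T = 0
        Swift.withUnsafeMutableBytes(of: &value) { destination in
            // baseAddress is non-nil because every fixed-width integer has a non-zero size.
            destination.baseAddress!.copyMemory(from: self + offset,
                                                byteCount: MemoryLayout<T>.size)
        }
        return value
    }
}

// A packed buffer: one tag byte followed by a little-endian UInt32 at offset 1.
let packed: [UInt8] = [0xAA, 0x78, 0x56, 0x34, 0x12]
let raw = packed.withUnsafeBytes { $0.baseAddress!.copyingLoad(fromByteOffset: 1, as: UInt32.self) }
// The copy preserves memory order, so state the wire byte order explicitly.
let decoded = UInt32(littleEndian: raw)
```

The loaded value still carries whatever byte order the bytes had in memory, which is why the last line converts from little endian explicitly.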

Thanks for sharing that helper method. I have two questions:

  • Is _isPOD() meant to be called from outside the standard library?
  • Is allocating memory temporarily the only way to instantiate an “empty” instance of a trivial type?

Not officially, but there isn't an ordained way to do so.

Currently, yes. A withScopedAllocation { } API that gives you a pointer to temporary, uninitialized memory would also be a great addition. If you have a balanced malloc/free, however, note that it will get promoted away by LLVM.

That is good to know, thanks!

Unfortunately, memcpy is currently the way to load misaligned data (semantically one byte at a time). Or you can load the bytes yourself and piece them together with bitwise operators.
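
As an illustration of the "piece them together with bitwise operators" option, here is a small sketch. The function name readUInt32LE and the sample packet are made up for the example, and it assumes the field is little endian on the wire.

```swift
/// Hypothetical helper (name not from the thread): assembles a little-endian
/// UInt32 from four individual bytes, so no multi-byte load is ever performed.
func readUInt32LE(_ bytes: [UInt8], at offset: Int) -> UInt32? {
    // Bounds check first, so malformed input yields nil instead of a trap.
    guard offset >= 0, bytes.count - offset >= 4 else { return nil }
    let b0 = UInt32(bytes[offset])
    let b1 = UInt32(bytes[offset + 1]) << 8
    let b2 = UInt32(bytes[offset + 2]) << 16
    let b3 = UInt32(bytes[offset + 3]) << 24
    return b0 | b1 | b2 | b3
}

let packet: [UInt8] = [0x01, 0x78, 0x56, 0x34, 0x12]
let field = readUInt32LE(packet, at: 1)   // offset 1 is not 4-byte aligned
```

Because each byte is read individually, the result does not depend on the alignment of the buffer or on the endianness of the host.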

This is obviously an unsatisfactory way to handle packed data. I think there's an argument to be made for having the UnsafeRawPointer load default to loading unaligned data. That wasn't done initially because

  • compiler support didn't exist, but that's easy to add
  • ignoring alignment may indicate a programmer error; this way you're forced to think about alignment and make it explicit
  • we wanted higher-level APIs to be expressible in terms of raw pointers without changing semantics or losing performance
  • we can loosen this restriction over time but can't strengthen it

I filed [SR-10273] Add an UnsafeRaw[Buffer]Pointer API for loading unaligned/packed data

Yes, that would be very helpful, also because – as I was told here – one may not even know how the memory of a given Data value is aligned.

It'd be great if load and store just worked with unaligned data, I agree. That would require constraining them to POD types, or at least supporting unaligned loads only on POD types, since generic non-POD types don't have value witnesses for unaligned accesses. That's probably fine, and supporting unaligned loads only for POD types would be backward compatible with the current semantics.

@Andrew_Trick I'm a long-time Swift user but new to contributing: Is there anything I can do to help add weight to that [SR-10273]? I know with bugreporter.apple.com the WWDC recommendation is to file duplicates to express cumulative interest.

Jira lets you "vote" for an issue; there's no need or benefit to dup bugs in the public tracker.

@cconway Bringing it up on this forum was a good start. I added an explanation about how to proceed in the bug. Essentially, someone needs to follow through with a prototype. I could help with that. Then someone needs to drive the Swift Evolution proposal to decide whether to change default behavior or simply add a new public API flag.

Ok, I will plan to do something like the extension @Joe_Groff posted above for the time being.

I'm curious, however: The code snippet that @SusanCheng posted seems to suggest that calling UnsafeRawPointer.load(fromByteOffset: as:) in an unaligned case will trigger a stop in execution due to _debugPrecondition(), but I haven't noticed any issues with my non-alignment-aware parsing code prior to Xcode 10.2. Have I just gotten lucky that my binary data was aligned and didn't trigger the precondition? Say I do end up parsing misaligned data in a production build without a mitigation like memcpy: Would I expect a crash, poor performance, unpredictable behavior, or other?

It is an error in debug mode.

I work around it by reimplementing the load and store methods.
https://github.com/SusanDoggie/Doggie/blob/master/Sources/Doggie/Foundation/Data.swift#L58-L75

It should behave the same as the old version of Data.withUnsafeBytes, which also used bindMemory directly.

Doesn't binding (and dereferencing) memory also require that the memory is aligned for the type T?

My code does the same thing as UnsafeRawPointer.load but without the alignment check. That means it simply force-loads a value of type T from memory and ignores the alignment.
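
To find out whether a given address happens to satisfy the alignment that load(as:) assumes, a check along these lines could be used. This is only a sketch: isAligned is a made-up helper, not a standard library function, and the 16-byte allocation exists purely to produce one aligned and one misaligned address.

```swift
/// Hypothetical helper: true if `pointer` is a multiple of T's required alignment.
func isAligned<T>(_ pointer: UnsafeRawPointer, for type: T.Type) -> Bool {
    let alignment = UInt(MemoryLayout<T>.alignment)
    return UInt(bitPattern: pointer) % alignment == 0
}

// Allocate with a known 8-byte alignment so the results below are deterministic.
let storage = UnsafeMutableRawPointer.allocate(byteCount: 16, alignment: 8)
let base = UnsafeRawPointer(storage)
let atBase = isAligned(base, for: UInt32.self)           // aligned start of the block
let atOffsetOne = isAligned(base + 1, for: UInt32.self)  // one byte in, misaligned
storage.deallocate()
```

A parser could use such a check to choose between a direct load and a byte-wise copy, although always copying is simpler and avoids the question entirely.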

I think it’s easy to get dragged down in the maw of unsafe memory access here, because most of us have come from some C-based language where that’s the only option. Personally I think that’s a mistake in most cases. One of the main goals of Swift is safety, and if you parse incoming data using the same style as you’d use in C, you are likely to suffer from the same exploitable memory management bugs as C.

Hence my post on the thread you referenced, outlining a completely different, and much safer, way to approach this problem. It’s not applicable in all cases, but IMO you should default to this approach, leaving the unsafe, C-style code for situations where profiling has shown that you absolutely need it.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

@eskimo, thanks for re-linking to that post. I looked more closely at your alternative implementation and that does answer part of what I was originally asking about, which is best practices when parsing binary data from a source you don't control.

Does my extension below still capture what you were trying to demonstrate about moving up in abstraction? Instead of calling Data.withUnsafeBytes() I envision taking slices of my Data packet around the expected values, then parsing the expected integer type from each slice with this function. NOTE: I had to insert a .reversed() into your example code in order for it to correctly parse little endian integers.

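import Foundation

extension Data {
    /// Parses a little-endian integer of type T from the first bytes of this Data.
    func parse<T: FixedWidthInteger>(type: T.Type) -> T? {
        let typeSize = MemoryLayout<T>.size
        guard self.count >= typeSize else { return nil }
        // reversed() visits the most significant byte first; truncatingIfNeeded avoids
        // a trap when a byte >= 0x80 is converted to a small signed type such as Int8.
        return self.prefix(typeSize).reversed().reduce(0) { $0 << 8 | T(truncatingIfNeeded: $1) }
    }
}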

It's a shame Data is so complex and a lot of the safe solutions are so convoluted. I wish we had a more basic array-of-bytes data structure in standard lib with convenient low level methods like (de)serialization of ints and simpler interop with C.

Does my extension below still capture what you were trying to
demonstrate about moving up in abstraction?

Yep. I have written code like that myself (-:

I had to insert a .reversed() to your example code in order for it
to correctly parse little endian integers.

Be careful here. My code used big endian because that’s the documented order for 'icns' data. If you always reverse, you’re assuming little endian. That may be the right thing to do in your case, but if your goal is to support host endian — that is, big endian on big-endian architectures, little endian on little-endian ones — you’ll have to conditionalise that reverse.
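
One way to make the byte order explicit rather than baked in is sketched below. The ByteOrder enum and the parseInteger name are invented for this example and do not come from Foundation or from the code discussed above.

```swift
import Foundation

/// Hypothetical type for this sketch: the byte order of the data on the wire.
enum ByteOrder { case big, little }

extension Data {
    /// Hypothetical helper: parses an integer from the first bytes of this Data,
    /// with the caller stating the byte order of the encoded value.
    func parseInteger<T: FixedWidthInteger>(_ type: T.Type, byteOrder: ByteOrder) -> T? {
        let size = MemoryLayout<T>.size
        guard count >= size else { return nil }
        // Accumulate as if the bytes were big endian (most significant byte first).
        let bigEndianValue = prefix(size).reduce(T.zero) { ($0 << 8) | T(truncatingIfNeeded: $1) }
        switch byteOrder {
        case .big: return bigEndianValue
        case .little: return bigEndianValue.byteSwapped
        }
    }
}

let sample = Data([0x12, 0x34, 0x56])
let asBig = sample.parseInteger(UInt16.self, byteOrder: .big)
let asLittle = sample.parseInteger(UInt16.self, byteOrder: .little)
```

Since the value is built arithmetically from individual bytes, the result is the same on big-endian and little-endian hosts; only the declared wire order matters.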

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

It's a shame Data is so complex and a lot of the safe solutions are so
convoluted. I wish we had a more basic array-of-bytes data structure
in standard lib with convenient low level methods like
(de)serialization of ints and simpler interop with C.

It’s not easy to meet these requirements. On the one hand, you’re saying that Data is too complex, and on the other hand you’re asking to make it more complex by adding support for serialisation of common types. Also, from your other posts, I know that you’re very concerned about performance, and that’s often at odds with convenience.

However, I am sympathetic to your goals here, and I’m not the only one. If you search Swift Forums for data standard library, you’ll see multiple evolution threads about moving Data to the standard library. And the SwiftNIO folks created ByteBuffer because it offers specific advantages over Data.

If you want to help make this wish a reality, I recommend that you engage in Swift Evolution.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple