Best practice for parsing heterogeneous types from Data (in Swift 5)

cconway · April 2, 2019, 3:59pm

This post is partly motivated by the new warning messages in Xcode 10.2 / Swift 5 about deprecation of withUnsafeBytes that takes UnsafePointer<T> rather than the newer variant that takes UnsafeRawBufferPointer: e.g. withUnsafeBytes Data API Confusion

I have read through a number of strategies people use to get their existing code working, but I am interested in taking a fresh look at best practice for parsing serialized binary protocols from Data in Swift 5. For the sake of discussion, say I'm trying to parse a binary packet of data I received that looks like:

var receivedData: Data
// Internally structured like [ Length: UInt8 ] [ Type: UInt16 ] [ Payload (variable length) ] [ Checksum: UInt16 ]

Previously I might have done something like:

let parsedLength: UInt8 = receivedData.withUnsafeBytes { $0.pointee }
// Continue parsing typed values out individually...

Is this the recommended way to parse heterogeneous types out of a Data instance now?

receivedData.withUnsafeBytes { (rawPointer) -> Void in
    var offset = 0
    let parsedLength = rawPointer.load(as: UInt8.self)
    offset += MemoryLayout.size(ofValue: parsedLength)
    let parsedType = rawPointer.load(fromByteOffset: offset, as: UInt16.self)
    offset += MemoryLayout.size(ofValue: parsedType)
    ... 
    // Do something with the parsed values
}

SusanCheng · April 2, 2019, 4:10pm

The load method of UnsafeRawBufferPointer is a wrapper around UnsafeRawPointer.load.

github.com

apple/swift/blob/main/stdlib/public/core/UnsafeRawBufferPointer.swift.gyb#L331


      
            ///
            /// - Parameters:
            ///   - byteCount: The number of bytes to allocate. `byteCount` must not be
            ///     negative.
            ///   - alignment: The alignment of the new region of allocated memory, in
            ///     bytes. `alignment` must be a whole power of 2.
            /// - Returns: A buffer pointer to a newly allocated region of memory aligned 
            ///     to `alignment`.
            @inlinable
            public static func allocate(
              byteCount: Int, alignment: Int
            ) -> UnsafeMutableRawBufferPointer {
              let base = UnsafeMutableRawPointer.allocate(
                byteCount: byteCount, alignment: alignment)
              return UnsafeMutableRawBufferPointer(start: base, count: byteCount)
            }
            %  end # mutable

            /// Deallocates the memory block previously allocated at this buffer pointer’s
            /// base address.
            ///

And that method checks the pointer's alignment; a misaligned load would cause a fatalError.

github.com

apple/swift/blob/main/stdlib/public/core/UnsafeRawPointer.swift#L354


      
          ///     ) {
          ///         return $0.pointee < 0
          ///     }
          ///
          /// After executing `body`, this method rebinds memory back to its original
          /// binding state. This can be unbound memory, or bound to a different type.
          ///
          /// - Note: The region of memory starting at this pointer must match the
          ///   alignment of `T` (as reported by `MemoryLayout<T>.alignment`).
          ///   That is, `Int(bitPattern: self) % MemoryLayout<T>.alignment`
          ///   must equal zero.
          ///
          /// - Note: The region of memory starting at this pointer may have been
          ///   bound to a type. If that is the case, then `T` must be
          ///   layout compatible with the type to which the memory has been bound.
          ///   This requirement does not apply if the region of memory
          ///   has not been bound to any type.
          ///
          /// - Parameters:
          ///   - type: The type to temporarily bind the memory referenced by this
          ///     pointer. This pointer must be a multiple of this type's alignment.
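
The alignment rule quoted in the documentation excerpt above, `Int(bitPattern: self) % MemoryLayout<T>.alignment` equal to zero, can be checked explicitly before a load. This is only a sketch: `isAligned` is a hypothetical helper name, not a standard library API.

```swift
/// Hypothetical helper that mirrors the documented alignment requirement.
func isAligned<T>(_ pointer: UnsafeRawPointer, for type: T.Type) -> Bool {
    return Int(bitPattern: pointer) % MemoryLayout<T>.alignment == 0
}

// Allocate a small region whose base address is 8-byte aligned.
let storage = UnsafeMutableRawPointer.allocate(byteCount: 16, alignment: 8)

// The base address satisfies UInt16's 2-byte alignment.
let alignedAtBase = isAligned(UnsafeRawPointer(storage), for: UInt16.self)

// One byte past the base is an odd address, so it does not.
let alignedAtOne = isAligned(UnsafeRawPointer(storage + 1), for: UInt16.self)

storage.deallocate()
```

In a packed packet such as [ Length: UInt8 ] [ Type: UInt16 ], the second field sits at the "one byte past" position, which is why the check matters here.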

Andrew_Trick · April 2, 2019, 4:19pm

Yes, I think that's roughly best practice. You can also update a var raw pointer by byte offsets if that's your thing: rawPointer += offset, but I think what you've written is better.

I will point out that you're relying on type inference to generate the correct byte offset. That might be more error-prone than being explicit:

offset += MemoryLayout<UInt16>.size
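
As a minimal sketch of the explicit style, the field offsets for the packet layout described at the top of the thread could be spelled out as constants. The `PacketLayout` name and its members are hypothetical, introduced only for illustration.

```swift
// Hypothetical layout constants for a packet shaped like
// [ Length: UInt8 ] [ Type: UInt16 ] [ Payload ] [ Checksum: UInt16 ].
enum PacketLayout {
    static let lengthOffset = 0
    // Each offset is the previous offset plus the explicit size of the previous field.
    static let typeOffset = lengthOffset + MemoryLayout<UInt8>.size
    static let payloadOffset = typeOffset + MemoryLayout<UInt16>.size
}
```

Here `typeOffset` comes out as 1, which is not a multiple of `MemoryLayout<UInt16>.alignment`, so the Type field is not aligned relative to the start of the packet.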

[edit] @SusanCheng made an important point about unaligned memory access. Adding an unaligned load to the UnsafeRawPointer API is long overdue. I wasn't able to find a bug tracking it though, so please file a bug if you need it. It was alluded to here:

github.com

apple/swift-evolution/blob/master/proposals/0107-unsaferawpointer.md#future-improvements-and-planned-additive-api

# UnsafeRawPointer API

* Proposal: [SE-0107](0107-unsaferawpointer.md)
* Author: [Andrew Trick](https://github.com/atrick)
* Review Manager: [Chris Lattner](http://github.com/lattner)
* Status: **Implemented (Swift 3)**
* Decision Notes: [Rationale](https://lists.swift.org/pipermail/swift-evolution-announce/2016-July/000231.html)

For detailed instructions on how to migrate your code to this new
Swift 3 API refer to the
[UnsafeRawPointer Migration Guide](https://swift.org/migration-guide-swift3/se-0107-migrate.html). See
also: See `bindMemory(to:capacity:)`, `assumingMemoryBound(to:)`, and
`withMemoryRebound(to:capacity:)`.

For quick reference on the full API, jump to:
- [Full UnsafeRawPointer API](#full-unsaferawpointer-api)

Contents:
- [Introduction](#introduction)
- [Proposed Solution](#proposed-solution)


cconway · April 2, 2019, 5:26pm

@Andrew_Trick, @SusanCheng, alignment is one of the things I've yet to strongly grasp. Could you help me understand what my options are (today) if the received byte stream is "packed", i.e. potentially not aligned the way Swift would like the values to be? My intuition is that if you're accessing a value in place and it is misaligned that seems bad, but my understanding was that UnsafeRawPointer.load(as:) first makes a copy of the data so it is never associating the type with the original (potentially unaligned) bytes, which seems... less bad?

Thanks

Joe_Groff · April 2, 2019, 5:38pm

Although load(as:) doesn't associate a type with the memory being loaded, it still unfortunately assumes that the pointer is aligned properly for a value of that type. Like in C, you could use memcpy or Swift's equivalent UnsafeMutableRawPointer.copyMemory method to do unaligned loads and stores. I usually add some helper methods to UnsafeRawPointer to facilitate this; as Andrew said, it'd be great to have these in the standard library:

import Foundation // provides memcpy

extension UnsafeRawPointer {
  // Copies the bytes into aligned temporary storage, so `self` need not be aligned for T.
  func loadUnaligned<T>(as type: T.Type) -> T {
    assert(_isPOD(T.self)) // relies on the type being POD (no refcounting or other management)
    let buffer = UnsafeMutablePointer<T>.allocate(capacity: 1)
    defer { buffer.deallocate() }
    memcpy(buffer, self, MemoryLayout<T>.size)
    return buffer.pointee
  }
}
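
The copyMemory route mentioned above can also be sketched without a temporary heap allocation when the type is restricted to fixed-width integers, because a zero-initialized local can serve as the destination. This is a sketch under stated assumptions, not a standard library API: `copyInteger` is a hypothetical helper name, and the sample bytes assume a little-endian wire format.

```swift
import Foundation

/// Hypothetical helper: copies an integer's bytes out of `buffer`, so the
/// source address does not have to be aligned for `T`.
func copyInteger<T: FixedWidthInteger>(
    from buffer: UnsafeRawBufferPointer,
    atByteOffset offset: Int,
    as type: T.Type
) -> T? {
    let size = MemoryLayout<T>.size
    // Bounds check: the whole value must lie inside the buffer.
    guard offset >= 0, size <= buffer.count - offset,
          let base = buffer.baseAddress else { return nil }
    var value: T = 0
    withUnsafeMutableBytes(of: &value) { destination in
        destination.baseAddress!.copyMemory(from: base + offset, byteCount: size)
    }
    return value
}

// Assumed sample packet: Length = 2, Type = 0x1234 (little-endian), 2 payload bytes, checksum.
let receivedData = Data([0x02, 0x34, 0x12, 0xAA, 0xBB, 0xCD, 0xAB])

let rawType = receivedData.withUnsafeBytes { (raw: UnsafeRawBufferPointer) in
    copyInteger(from: raw, atByteOffset: 1, as: UInt16.self)
}
// Interpret the copied bytes as little-endian regardless of the host byte order.
let parsedType = rawType.map { UInt16(littleEndian: $0) }
```

Returning nil for out-of-range offsets keeps a truncated packet from reading past the end of the buffer.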

Martin · April 2, 2019, 6:22pm

Thanks for sharing that helper method. I have two questions:

1. Is _isPOD() meant to be called from outside the standard library?
2. Is allocating memory temporarily the only way to instantiate an “empty” instance of a trivial type?

Joe_Groff · April 2, 2019, 6:26pm

Not officially, but there isn't an ordained way to do so.

Currently, yes. A withScopedAllocation { } API to give you a pointer to temporary, uninitialized memory would also be a great addition. If you have a balanced malloc/free, however, note that it will get promoted away by LLVM:

Martin · April 2, 2019, 6:28pm

That is good to know, thanks!

Andrew_Trick · April 2, 2019, 6:37pm

Unfortunately, memcpy is currently the way to load misaligned data (semantically one byte at a time). Or you can load the bytes yourself and piece them together with bitwise operators.
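
A minimal sketch of the second option, loading single bytes and combining them with shifts. `readUInt16LE` is a hypothetical helper name, and it assumes the field is stored little-endian (low byte first).

```swift
/// Hypothetical helper: assembles a UInt16 from two consecutive bytes,
/// assuming little-endian order. Single-byte reads have no alignment requirement.
func readUInt16LE(_ bytes: [UInt8], at offset: Int) -> UInt16? {
    guard offset >= 0, bytes.count - offset >= 2 else { return nil }
    let low = UInt16(bytes[offset])
    let high = UInt16(bytes[offset + 1])
    return (high << 8) | low
}

// Assumed sample packet: Length = 2, Type = 0x1234, 2 payload bytes, checksum.
let packetBytes: [UInt8] = [0x02, 0x34, 0x12, 0xAA, 0xBB, 0xCD, 0xAB]
let packetType = readUInt16LE(packetBytes, at: 1) // 0x1234
```

Because the byte order is spelled out in the shifts, the result does not depend on the host's endianness.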

This is obviously an unsatisfactory way to handle packed data. I think there's an argument to be made for having the UnsafeRawPointer load default to loading unaligned data. That wasn't done initially because:

- compiler support didn't exist, but that's easy to add
- ignoring alignment may indicate a programmer error; this way you're forced to think about alignment and make it explicit
- we wanted higher-level APIs to be expressible in terms of raw pointers without changing semantics or losing performance
- we can loosen this restriction over time but can't strengthen it

I filed [SR-10273] Add an UnsafeRaw[Buffer]Pointer API for loading unaligned/packed data

Martin · April 2, 2019, 6:43pm

Yes, that would be very helpful, also because – as I was told here – one may not even know how the memory of a given Data value is aligned.

Joe_Groff · April 2, 2019, 7:01pm

It'd be great if load and store just worked with unaligned data, I agree. That would require a constraint that they only work with POD types, or maybe that they only support unaligned loads on POD types, since generic non-POD types don't have value witnesses for unaligned accesses. That's probably fine, and supporting unaligned loads only for POD types would be backward compatible with the current semantics.

cconway · April 2, 2019, 8:04pm

@Andrew_Trick I'm a long-time Swift user but new to contributing: Is there anything I can do to help add weight to that [SR-10273]? I know with bugreporter.apple.com the WWDC recommendation is to file duplicates to express cumulative interest.

Joe_Groff · April 2, 2019, 8:16pm

Jira lets you "vote" for an issue; there's no need or benefit to dup bugs in the public tracker.

Andrew_Trick · April 2, 2019, 8:53pm

@cconway Bringing it up on this forum was a good start. I added an explanation about how to proceed in the bug. Essentially, someone needs to follow through with a prototype. I could help with that. Then someone needs to drive the Swift Evolution proposal to decide whether to change default behavior or simply add a new public API flag.

cconway · April 2, 2019, 10:27pm

Ok, I will plan to do something like the extension @Joe_Groff posted above for the time being.

I'm curious, however: The code snippet that @SusanCheng posted seems to suggest that calling UnsafeRawPointer.load(fromByteOffset:as:) in an unaligned case will trigger a stop in execution due to _debugPrecondition(), but I haven't noticed any issues with my non-alignment-aware parsing code prior to Xcode 10.2. Have I just gotten lucky that my binary data was aligned and didn't trigger the precondition? Say I do end up parsing misaligned data in a production build without a mitigation like memcpy: Would I expect a crash, poor performance, unpredictable behavior, or something else?

SusanCheng · April 3, 2019, 12:18am

It is an error in debug mode.

I work around it by reimplementing the load and store methods.
https://github.com/SusanDoggie/Doggie/blob/master/Sources/Doggie/Foundation/Data.swift#L58-L75

It should be the same as the old version of Data.withUnsafeBytes, which also used bindMemory directly.

Martin · April 3, 2019, 8:52am

Doesn't binding (and dereferencing) memory also require that the memory is aligned for the type T?

SusanCheng · April 3, 2019, 9:00am

My code does the same thing as UnsafeRawPointer.load but without the alignment check. That means it just force-loads a value of type T from memory and ignores the alignment.

github.com

apple/swift/blob/main/stdlib/public/core/UnsafeRawPointer.swift#L358


      
          /// After executing `body`, this method rebinds memory back to its original
          /// binding state. This can be unbound memory, or bound to a different type.
          ///
          /// - Note: The region of memory starting at this pointer must match the
          ///   alignment of `T` (as reported by `MemoryLayout<T>.alignment`).
          ///   That is, `Int(bitPattern: self) % MemoryLayout<T>.alignment`
          ///   must equal zero.
          ///
          /// - Note: The region of memory starting at this pointer may have been
          ///   bound to a type. If that is the case, then `T` must be
          ///   layout compatible with the type to which the memory has been bound.
          ///   This requirement does not apply if the region of memory
          ///   has not been bound to any type.
          ///
          /// - Parameters:
          ///   - type: The type to temporarily bind the memory referenced by this
          ///     pointer. This pointer must be a multiple of this type's alignment.
          ///   - count: The number of instances of `T` in the re-bound region.
          ///   - body: A closure that takes a typed pointer to the
          ///     same memory as this pointer, only bound to type `T`. The closure's
          ///     pointer argument is valid only for the duration of the closure's

eskimo · April 3, 2019, 9:12am

I think it’s easy to get dragged down in the maw of unsafe memory access here, because most of us have come from some C-based language where that’s the only option. Personally I think that’s a mistake in most cases. One of the main goals of Swift is safety, and if you parse incoming data using the same style as you’d use in C, you are likely to suffer from the same exploitable memory management bugs as C.

Hence my post on the thread you referenced, outlining a completely different, and much safer, way to approach this problem. It’s not applicable in all cases, but IMO you should default to this approach, leaving the unsafe, C-style code for situations where profiling has shown that you absolutely need it.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

cconway · April 3, 2019, 3:57pm

@eskimo, thanks for re-linking to that post. I looked more closely at your alternative implementation and that does answer part of what I was originally asking about, which is best practices when parsing binary data from a source you don't control.

Does my extension below still capture what you were trying to demonstrate about moving up in abstraction? Instead of calling Data.withUnsafeBytes() I envision getting slices of my Data packet around expected values, then parsing the expected integer type from each slice with this function. NOTE: I had to insert a .reversed() into your example code in order for it to correctly parse little-endian integers.

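import Foundation

extension Data {
    
    // Parses a little-endian integer from the first MemoryLayout<T>.size bytes.
    func parse<T: FixedWidthInteger>(type: T.Type) -> T? {
        
        let typeSize = MemoryLayout<T>.size
        guard self.count >= typeSize else { return nil }
        
        // truncatingIfNeeded keeps each byte's bit pattern, so bytes above 0x7F don't trap when T is Int8.
        return self.prefix(typeSize).reversed().reduce(0) { $0 << 8 | T(truncatingIfNeeded: $1) }
    }
}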
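
A minimal sketch of how the slice-and-parse idea might extend to the whole packet from the start of the thread, with no unsafe pointers. Everything named here (`Packet`, `readLittleEndian`, `parsePacket`) is hypothetical. It also ASSUMES that the Length field counts only the payload bytes and that multi-byte fields are little-endian; the thread does not specify either, and the checksum is only extracted, not verified.

```swift
import Foundation

/// Hypothetical parsed form of [ Length ] [ Type ] [ Payload ] [ Checksum ].
struct Packet {
    let type: UInt16
    let payload: Data
    let checksum: UInt16
}

/// Reads a little-endian integer from the front of `bytes`, or nil if too short.
func readLittleEndian<T: FixedWidthInteger>(_ type: T.Type, from bytes: Data) -> T? {
    let size = MemoryLayout<T>.size
    guard bytes.count >= size else { return nil }
    var value: T = 0
    // Walk from the most significant byte (stored last) down to the least.
    for byte in bytes.prefix(size).reversed() {
        value = (value << 8) | T(truncatingIfNeeded: byte)
    }
    return value
}

/// ASSUMPTION: Length is the payload byte count. Trailing bytes are ignored.
func parsePacket(_ data: Data) -> Packet? {
    guard let length = readLittleEndian(UInt8.self, from: data) else { return nil }
    var rest = data.dropFirst(MemoryLayout<UInt8>.size)

    guard let type = readLittleEndian(UInt16.self, from: rest) else { return nil }
    rest = rest.dropFirst(MemoryLayout<UInt16>.size)

    // Payload plus the trailing checksum must still fit in what is left.
    let payloadCount = Int(length)
    guard rest.count >= payloadCount + MemoryLayout<UInt16>.size else { return nil }
    let payload = Data(rest.prefix(payloadCount)) // copy so indices start at 0
    rest = rest.dropFirst(payloadCount)

    guard let checksum = readLittleEndian(UInt16.self, from: rest) else { return nil }
    return Packet(type: type, payload: payload, checksum: checksum)
}

// Assumed sample packet: Length = 2, Type = 0x1234, payload AA BB, checksum 0xABCD.
let samplePacket = parsePacket(Data([0x02, 0x34, 0x12, 0xAA, 0xBB, 0xCD, 0xAB]))
```

Every read is bounds-checked through `count`, so a truncated or malformed packet yields nil instead of reading out of range.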