withUnsafeBytes Data API confusion

benjamin.g · March 26, 2019, 11:21am

Hi, I've been instructed by the latest Xcode (10.2) to migrate some code that used "withUnsafeMutableBytes" and "withUnsafeBytes" on the Data type, with this message: "use withUnsafeMutableBytes<R>(_: (UnsafeMutableRawBufferPointer) throws -> R) rethrows -> R".

After some digging on the net I started using a mix of UnsafeMutablePointer.allocate(capacity:) and "myData.withUnsafeBytes { (_ ptr: UnsafeRawBufferPointer) -> Void ... ", which now compiles without warnings.

However, I tried to look at the definitions of those variants in the Data documentation (Apple Developer Documentation) to make sure I wasn't doing anything incorrect with the memory, and found that pretty much every "withUnsafe" function is marked deprecated (yet Xcode no longer gives any warning), and not a single line explains what the officially recommended way of accessing the data is now (except for the subscript operators, which don't suit my case).

Autocompletion also suggested "withContiguousStorageIfAvailable", which seems to indicate that there could be issues regarding the contiguity of the underlying storage? But there too, I couldn't find anything on the subject in the documentation (is this a Sequence-only thing that's not really relevant in the case of Data?).

I must admit I really don't know where to look, nor which methods are now supposed to be the correct way of accessing a Data's byte buffer (for context, the initial goal of the function was to run CommonCrypto's CC_MD5 over the data buffer).

lukasa · March 26, 2019, 11:27am

Data conforms to Foundation’s ContiguousBytes protocol, which defines a single function you can use to get the underlying bytes. That’s the function you should call.

You may find it helpful to explicitly define the type of the argument your closure expects to an UnsafeMutableRawBufferPointer such that the compiler will select the correct function.
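As a minimal sketch of that advice (the `checksum(of:)` helper is hypothetical and not from the thread), annotating the closure parameter as UnsafeRawBufferPointer selects the raw-buffer overload, and whatever the closure returns becomes the result of the withUnsafeBytes call itself:

```swift
import Foundation

// Hypothetical helper: works for any ContiguousBytes conformer (Data, [UInt8], ...).
func checksum<Bytes: ContiguousBytes>(of bytes: Bytes) -> UInt8 {
    // The explicit parameter type picks the raw-buffer overload.
    // The closure's return value is handed back as the result of withUnsafeBytes.
    return bytes.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) -> UInt8 in
        // A raw buffer is itself a collection of UInt8, so no rebinding is needed here.
        buffer.reduce(0) { $0 &+ $1 }
    }
}
```

The generic return type is what lets a value computed from the bytes leave the closure; the pointer itself must not escape it.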

benjamin.g · March 26, 2019, 11:47am

Thanks for the info. I tried to look at the documentation for this method, and there's simply nothing. What is the return type for? Since we're talking about "bytes", why is the closure parameter a raw pointer instead of a Something<UInt8> (which would be convenient in my case, since CC_MD5 takes an UnsafeMutablePointer<UInt8> for the destination buffer)? Would it work fine with the memory-rebinding APIs in case I need it typed?
Trying to do "data.withUnsafeMutableBytes { (_ ptr: UnsafeMutableRawBufferPointer<UInt8>) -> Void in .. " results in a "Cannot specialize non-generic type 'UnsafeMutableRawBufferPointer'" error...

I had a look at the "Manual Memory Management" chapter of the documentation, which explains the "unsafe" types very well, but I'm starting to get the feeling there are still gaps in how those types are integrated into the stdlib (or at least into the stdlib documentation).

The memory access functions are an extremely sensitive part of the API that most average developers (like myself) don't use on a daily basis, so I was a bit surprised by the lack of documentation, especially when Xcode starts throwing new warnings...

Do you know if there is any effort to put the official Swift documentation in the hands of the community? I would gladly contribute.

itaiferber · March 26, 2019, 4:30pm

I can speak to this since this changed in the DataProtocol changes for Swift 5.

The change here primarily had to do with the possibility of creating Data with Data.init(bytesNoCopy:count:deallocator:), which allows someone to create a Data instance wrapping an already-existing buffer.

When someone does this, they can pass in a raw pointer to any buffer which they have created, which may or may not have been initialized with various types of data; specifically, the passed pointer could be bound to Typed Memory where the bound type is non-trivial.

Previously, Data presented an interface which returned an UnsafeBufferPointer<UInt8>, and did this by rebinding the memory on your behalf: this could implicitly trigger undefined behavior if the original buffer was one you didn't own, and you had no control over how it was allocated and initialized.

The change here keeps underlying Data access entirely untyped via Raw pointers. With a Raw pointer, you can read the bytes directly (via load(fromByteOffset:as:)/copyMemory(from:)), without running the risk of implicit undefined behavior. If you did have control over how the buffer was initialized (specifically, you know the original buffer was either untyped, or bound to a trivial type like UInt8), then it is also safe to rebind the raw buffer to the type you want with bindMemory(to:).
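To make those three access styles concrete, here is a small sketch. The `parseHeader` function and its 5-byte layout are invented for illustration, and it assumes the Data owns a plain byte buffer (so rebinding to UInt8 is safe):

```swift
import Foundation

// Hypothetical layout: a 4-byte big-endian length followed by a 1-byte tag.
func parseHeader(_ packet: Data) -> (length: UInt32, tag: UInt8, byteSum: Int) {
    precondition(packet.count >= 5)
    return packet.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) -> (length: UInt32, tag: UInt8, byteSum: Int) in
        // 1. copyMemory(from:): copy four raw bytes into a local integer.
        //    This sidesteps the alignment requirement that load(as: UInt32.self) would have.
        var rawLength: UInt32 = 0
        withUnsafeMutableBytes(of: &rawLength) { destination in
            destination.copyMemory(from: UnsafeRawBufferPointer(rebasing: buffer[0..<4]))
        }
        // 2. load(fromByteOffset:as:): read a single byte; UInt8 has no alignment constraint.
        let tag = buffer.load(fromByteOffset: 4, as: UInt8.self)
        // 3. bindMemory(to:): safe here only because the buffer is known to hold plain bytes.
        let typed = buffer.bindMemory(to: UInt8.self)
        let byteSum = typed.reduce(0) { $0 + Int($1) }
        return (length: UInt32(bigEndian: rawLength), tag: tag, byteSum: byteSum)
    }
}
```

For example, `parseHeader(Data([0x00, 0x00, 0x01, 0x02, 0x7F]))` yields a length of 258, a tag of 127, and a byte sum of 130.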

Indeed, raw buffers differ from typed buffers, but there are various ways of reading directly out of a raw buffer, and hopefully the specific documentation on UnsafeRawBufferPointer and continued reading of the Manual Memory Management guide can help. (Also happy to answer specific questions to help guide you!)

Unfortunately, the documentation on developer.apple.com is not part of the open-source effort, but please do file a Radar for any unclear/missing documentation you find — we really do want the documentation on this to be clear, understandable, and easy to find.

nick.keets · March 28, 2019, 7:21am

I think an example of producing the MD5 of a Data using the CC_MD5 functions would help here.

eskimo · March 28, 2019, 10:07am

How about this?

func md5DigestA(of data: Data) -> Data {
    precondition(!data.isEmpty)
    var result = [UInt8](repeating: 0, count: Int(CC_MD5_DIGEST_LENGTH))
    data.withUnsafeBytes { buffer in
        _ = CC_MD5(buffer.baseAddress!, CC_LONG(buffer.count), &result)
    }
    return Data(result)
}

If you want to handle the empty data case [1], remove the precondition check on line 2 and the force unwrap on line 5. This works because CC_MD5 will handle a NULL parameter if the count is 0.

If you want to handle the empty data case and you’re dealing with a C function that doesn’t allow a NULL pointer when the count is 0, things get more complex (-:
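One possible way to approach that harder case, offered only as a sketch: fall back to the address of a dummy local byte when the buffer has no base address, and report a count of 0. The `withNonNullBytes` helper is hypothetical, and whether a dummy pointer is acceptable depends on the contract of the C function being called.

```swift
import Foundation

// Hypothetical helper: always hands `body` a non-nil pointer, even for empty data.
func withNonNullBytes<R>(of data: Data, _ body: (UnsafeRawPointer, Int) -> R) -> R {
    return data.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) -> R in
        if let base = buffer.baseAddress {
            return body(base, buffer.count)
        }
        // No storage to point at: lend the address of a placeholder byte, with count 0,
        // so the callee receives a valid pointer it must not read through.
        var placeholder: UInt8 = 0
        return withUnsafeBytes(of: &placeholder) { body($0.baseAddress!, 0) }
    }
}
```

The force unwrap in the fallback is safe because a buffer over a one-byte value always has a base address.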

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

[1] A question that deserves serious consideration, at least in a crypto context.

benjamin.g · March 28, 2019, 11:08am

I ended up doing

let nbBytes = Int(CC_MD5_DIGEST_LENGTH)
let digestBytes = UnsafeMutablePointer<UInt8>.allocate(capacity: nbBytes)
defer { digestBytes.deallocate() }
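data.withUnsafeBytes { ptr in
    // Pass baseAddress straight through: CC_MD5 accepts NULL when the count is 0,
    // so empty data still produces a digest instead of leaving digestBytes uninitialized.
    _ = CC_MD5(ptr.baseAddress, CC_LONG(ptr.count), digestBytes)
}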
return Data(bytes: digestBytes, count: nbBytes)

I did it this way because I wasn't sure about memory contiguity for any of the Foundation/stdlib structures (whether arrays or Data), so using an explicitly allocated UnsafeMutablePointer seemed the safest bet.

nick.keets · March 28, 2019, 12:24pm

Thanks. So, Data(result) is using the memory allocated by result or is it allocating again?

itaiferber · March 28, 2019, 3:12pm

As of the aforementioned changes in Swift 5, Data is guaranteed to be contiguous, so allocating a separate copy should not be necessary.

itaiferber · March 28, 2019, 3:14pm

Data(result) here creates a copy, but it is possible to avoid this by creating a Data instead of an array (with the right count) and writing into its buffer directly:

import Foundation
import CommonCrypto

func digest(_ data: Data) -> Data {
    var md5 = Data(count: Int(CC_MD5_DIGEST_LENGTH))
    md5.withUnsafeMutableBytes { md5Buffer in
        data.withUnsafeBytes { buffer in
            let _ = CC_MD5(buffer.baseAddress!, CC_LONG(buffer.count), md5Buffer.bindMemory(to: UInt8.self).baseAddress)
        }
    }

    return md5
}

nick.keets · March 29, 2019, 8:16am

Thanks, this is helpful. I wanted to try a different approach, but it is not working (getting wrong result) and I can't figure out why. Do you mind having a look?

func digest2(_ data: Data) -> Data {
    let size = Int(CC_MD5_DIGEST_LENGTH)
    let md = UnsafeMutablePointer<UInt8>.allocate(capacity: size)
    data.withUnsafeBytes {
        CC_MD5($0.baseAddress!, UInt32(size), md)
    }
    return Data(bytesNoCopy: md, count: size, deallocator: .free)
}

eskimo · March 29, 2019, 9:36am

In the second parameter of your call to CC_MD5, you’re passing in the size of the digest not the size of the buffer. You want $0.count.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

eskimo · March 29, 2019, 9:43am

Data(result) is using the memory allocated by result or is it allocating again?

It’s allocating again. Should you be concerned about that? Only if you’re calling this a lot. Unless this code is very hot, that extra allocation just won’t matter.

Moreover, attempting to remove it can cause you grief. For example, the code you posted downthread has these lines:

let md = UnsafeMutablePointer<UInt8>.allocate(capacity: size)
…
return Data(bytesNoCopy: md, count: size, deallocator: .free)

which is not valid. Memory that you allocate with allocate(capacity:) must be freed by deallocate, but .free causes it to be freed by free. This happens to work on Apple platforms, but is not guaranteed by the API. For more details, see UnsafeMutablePointer allocation compatibility with C malloc/free.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

nick.keets · March 29, 2019, 10:18am

Oops, thanks for the size correction!

In this case allocations don't really matter, but I'm just using MD5 as an example, since this is what the OP used. Would this work then?

Data(bytesNoCopy: md, count: size, deallocator: .custom({ buf, _ in buf.deallocate() }))

Thanks again for your answers!

itaiferber · March 29, 2019, 4:11pm

Yes, this deallocator is correct.
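Put together, the allocation and the matching deallocator might look like the sketch below. The `makeOwnedData` name and its fill parameter are made up for illustration and are not from the thread:

```swift
import Foundation

// Hypothetical helper: builds a Data that takes ownership of memory from allocate(capacity:).
func makeOwnedData(count: Int, fill: UInt8) -> Data {
    let bytes = UnsafeMutablePointer<UInt8>.allocate(capacity: count)
    // Initialize every byte before handing the memory to Data.
    bytes.initialize(repeating: fill, count: count)
    // Memory from allocate(capacity:) must be released with deallocate(), not free(),
    // so use a custom deallocator instead of .free.
    return Data(bytesNoCopy: bytes, count: count, deallocator: .custom({ pointer, _ in
        pointer.deallocate()
    }))
}
```

Data calls the custom deallocator once it no longer needs the buffer, so the caller must not deallocate `bytes` itself.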

Andrew_Trick · March 31, 2019, 3:29am

It's a serious usability bug that Data and UnsafeRawPointer do not interoperate with C functions that take char * byte buffers. App programmers should never need to use any of the memory binding APIs just to call libraries.

This was a known issue, but I filed this anyway to make sure we're tracking it:
SR-10246 Support limited implicit pointer conversion when calling C functions.

JohnBrownie · April 1, 2019, 7:26am

I'm struggling with this whole memory management stuff, probably because I've never come across a good introduction to it. So I've basically relied on imitating other people's code. If someone can point me in the right direction for a good introduction (I've been programming for nearly 40 years in several languages, but seriously in Swift for only a few months), I'd be grateful.

Anyway, my issue comes up with the same recommendation from Xcode 10.2. I have a function to do a simple test of a block of data to confirm that it is likely to actually be icns data:

static func isIcns(data icnsData: Data) -> Bool {
    let header = icnsData.withUnsafeBytes {
        [UInt32](UnsafeBufferPointer(start: $0, count: 2))
    }
    let icnsHeader = header[0].byteSwapped
    let icnsLength = header[1].byteSwapped
    let expectedHeader = UnicodeScalar("i").value << 24 + UnicodeScalar("c").value << 16 + UnicodeScalar("n").value << 8 + UnicodeScalar("s").value
    if icnsData.count == icnsLength && icnsHeader == expectedHeader {
        return true
    }
    return false
}

How do I rewrite that first bit, to get the first eight bytes as two UInt32 references?

eskimo · April 1, 2019, 9:18am

I'm struggling with this whole memory management stuff …

That’s understandable. Swift makes this challenging because:

The API details have changed quite a lot over the years.
Recent versions have strict rules about aliasing (in this sense of the word). These will yield long-term benefits, but they do take some getting used to.

With regard to your specific issue, I’m a big fan of moving up a level of abstraction. In your case, I’d rethink this as a parsing problem rather than a structure access problem. The fact that Data exposes its contents as a collection of bytes means you can take advantage of lots of functionality that’s available on collections. For example:

func isIcns(data icnsData: Data) -> Bool {
    guard
        icnsData.count >= 8,
        icnsData.starts(with: "icns".utf8)
    else {
        return false
    }
    let embeddedCount32 = icnsData.dropFirst(4).prefix(4).reduce(0) { $0 << 8 | UInt32($1) }
    return Int(exactly: embeddedCount32) == icnsData.count
}

One thing to note about this code is that, on a 32-bit machine, it avoids the trap you might encounter converting the length bytes of a maliciously crafted icns to Int.
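The same collection-based technique can be factored into a small reusable reader. This is only a sketch; `bigEndianUInt32(in:at:)` is a hypothetical helper name, not an existing API:

```swift
import Foundation

// Hypothetical helper: reads a big-endian UInt32 starting `offset` bytes into `data`.
// Returns nil rather than trapping when fewer than four bytes are available.
func bigEndianUInt32(in data: Data, at offset: Int) -> UInt32? {
    guard offset >= 0, data.count - offset >= 4 else { return nil }
    // dropFirst/prefix work relative to the start of the collection,
    // so this is also correct for Data slices whose startIndex is not 0.
    let value: UInt32 = data.dropFirst(offset).prefix(4).reduce(UInt32(0)) { ($0 << 8) | UInt32($1) }
    return value
}
```

For instance, with `Data("icns".utf8) + Data([0, 0, 0, 12])` the call `bigEndianUInt32(in: sample, at: 4)` returns 12, and the result can then go through `Int(exactly:)` as in the code above.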

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

JohnBrownie · April 1, 2019, 10:01am

Thank you! The code is nice and avoids my byte-swapping with the neat trick of using reduce.

Any pointers to a good introduction to the memory management that is up to date for Swift 5?

John

ppamorim · May 16, 2019, 10:51am

My solution, based on the snippets from here, without force unwrapping and with all types spelled out:

func buildMD5(data: Data) -> String {
  var md5: Data = Data(count: Int(CC_MD5_DIGEST_LENGTH))
  md5.withUnsafeMutableBytes { (md5Buffer: UnsafeMutableRawBufferPointer) in
    data.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) in
      // CC_MD5 accepts a NULL pointer when the count is 0, so pass baseAddress through
      // rather than returning early, which would leave the digest all zeros for empty data.
      _ = CC_MD5(buffer.baseAddress, CC_LONG(buffer.count), md5Buffer.bindMemory(to: UInt8.self).baseAddress)
    }
  }
  return md5.base64EncodedString()
}