NSData.data vs Data.withUnsafeBytes

tera · July 6, 2021, 1:00am

speaking of ios/macOS are there any cons of "foo" compared to "bar" in this fragment? e.g. a speed difference?

func foo(data: Data) {
    let bytes = (data as NSData).bytes.assumingMemoryBound(to: UInt8.self)
    // work with bytes
}

func bar(data: Data) {
    data.withUnsafeBytes {
        let bytes = $0.baseAddress!.assumingMemoryBound(to: UInt8.self)
        // work with bytes
    }
}

Philippe_Hausler · July 6, 2021, 2:06am

The function foo is potentially unsafe since the NSData must outlive its access to bytes. Presuming that was fixed the performance in small data should be considerably faster, for larger datas the structure is likely only marginally faster (perhaps to a point of not significantly measurable)

lukasa · July 6, 2021, 9:05am

Making this explicit: the function foo as written is not just potentially unsafe, it is almost never safe. The odds of you writing an implementation of foo that does the right thing are very low. I recommend never calling NSData.bytes in Swift code. See also SR-14420.

tera · July 6, 2021, 10:35am

i tried this code and can't make it fail on iOS/macOS:

func prepareData(count: Int) -> Data {
    var data = Data(repeating: 0, count: count)
    data.withUnsafeMutableBytes {
        let bytes = $0.baseAddress!.assumingMemoryBound(to: UInt8.self)
        for i in 0 ..< count {
            bytes[i] = UInt8((i &* 12345) & 0xFF)
        }
    }
    return data
}

func foo(data: Data) {
    let count = data.count
    let bytes = (data as NSData).bytes.assumingMemoryBound(to: UInt8.self)
    for i in 0 ..< count {
        myassert(bytes[i] == UInt8((i &* 12345) & 0xFF))
    }
}

func myassert(_ expr: Bool) {
    if !expr {
        assert(false)
        fatalError("assert failed in release mode")
    }
}

func test() {
    for _ in 0 ..< 1000 {
        let count = Int.random(in: 1 ..< 10000)
        let data = prepareData(count: count)
        foo(data: data)
    }
}

i presume the failing UT would be only if i change or deallocate the data while iterating over it and so far i can't see what the "work with bytes" below could be (other lines of foo intact) for that to happen. can this fail on non apple platforms only?

func foo(data: Data) {
    let bytes = (data as NSData).bytes.assumingMemoryBound(to: UInt8.self)
    // work with bytes
}

lukasa · July 6, 2021, 2:47pm

I'm not telling you it fails, I'm telling you it's not safe. It's important to distinguish what the compiler does today from what the rules of the language are.

Concretely, bytes is an "interior pointer": it is a pointer to data whose lifetime is managed by the lifetime of the NSData object that owns it. If the NSData object is deinited, the data will be freed and the bytes pointer will now be dangling.

In Swift, objects do not have lexical lifetimes: that is, the swift_release call is not guaranteed to happen at the end of the block, but may happen any time after the point of last use of the object (see An unexpected deadlock with Combine only on Release build - #2 by lukasa for more). In general today the compiler is not good at taking advantage of this opportunity so it tends to be a bit lazy about when it actually issues a release call, but in principle this release can happen at any time after the point of last use.

So consider your function foo:

func foo(data: Data) {
    let count = data.count
    let bytes = (data as NSData).bytes.assumingMemoryBound(to: UInt8.self)
    for i in 0 ..< count {
        myassert(bytes[i] == UInt8((i &* 12345) & 0xFF))
    }
}

The moment the call to .bytes completes, data is never referenced again in this function. As a result, it would be entirely valid for the Swift compiler to transform your above function into:

func foo(data: Data) {
    let count = data.count
    let bytesTemporary = (data as NSData).bytes
    swift_release(data)  // deinit may occur here, making bytesTemporary immediately dangling
    let bytes = bytesTemporary.assumingMemoryBound(to: UInt8.self)
    for i in 0 ..< count {
        myassert(bytes[i] == UInt8((i &* 12345) & 0xFF))
    }
}

As you can see from the rewrite, bytes is allowed to immediately become dangling the moment it's used.

The fact that you can't make this happen today actually makes this more dangerous, not less: it's much easier for you to accidentally introduce this bug into your code and not trip over it until months or years later, when the Swift compiler gets smart enough to release data earlier and blow up your code.

Please please please never use NSData.bytes in Swift code.

tera · July 6, 2021, 5:28pm

"Please please please never use NSData.bytes in Swift code" shall be translated into a corresponding compilation error (or warning at the very least). otherwise it will surely be used by those who haven't seen this thread.

tera · July 6, 2021, 5:55pm

yes, (in some future compiler version) data might be released, but at the same time "as NSData" can return an autoreleased object that would be a new owner of bytes and valid until the current autorelease pool is drained.

tera · July 6, 2021, 6:14pm

this is very similar:

func baz(_ string: String) {
    let p = (string as NSString).utf8String!
    let len = strlen(p) // can string be already deallocated at this point?
    memmove(..., p, len + 1)
}

the doc for utf8String says:

"you should copy the C string if it needs to be stored outside of the memory context (?) in which you use this property."

what the heck is "memory context" here?

lukasa · July 6, 2021, 6:46pm

Yes, that’s what my bug report was for.

Yup, this is the same issue again.

Philippe_Hausler · July 6, 2021, 8:02pm

Just for clarity: there is a safe way to use .bytes

func foo(data: Data) {
    let count = data.count
    let dataObj = data as NSData
    withExtendedLifetime(dataObj) {
        let bytesTemporary = dataObj.bytes
        let bytes = bytesTemporary.assumingMemoryBound(to: UInt8.self)
        for i in 0 ..< count {
             myassert(bytes[i] == UInt8((i &* 12345) & 0xFF))
        }
    }
}

The lifetime of the bytes is guaranteed to be good as long as the lifetime of the NSData object exists. This is a feature of the NS_RETURNS_INNER_POINTER annotation on -[NSData bytes]

woolsweater · July 7, 2021, 12:56am

For posterity, there was a WWDC talk this year going over this point: ARC in Swift: Basics and beyond - WWDC21 - Videos - Apple Developer

Nobody1707 · July 8, 2021, 3:30pm

Wouldn't an easy correct version of foo be:

func foo(data: Data) {
    withExtendedLifetime(data) {
        let count = data.count
        let bytes = (data as NSData).bytes.assumingMemoryBound(to: UInt8.self)
        for i in 0 ..< count {
            myassert(bytes[i] == UInt8((i &* 12345) & 0xFF))
        }
    }
}

Edit: Nevermind, I think Philippe's is probably more correct.

lukasa · July 8, 2021, 3:32pm

Yes, but while this is correct it's also likely not to be what people use. It's also not any more obvious than:

func foo(data: Data) {
    data.withUnsafeMutableBytes { bytes in
        for i in 0 ..< bytes.count {
            myassert(bytes[i] == UInt8((i &* 12345) & 0xFF))
        }
    }
}

and the above has the advantage of being both a) clearer in intent, b) more obviously correct, and c) more generally applicable to Swift data structures.

(Not that it matters but the assumingMemoryBound call in all the examples above is also wrong, but not worth worrying about here.)

Nobody1707 · July 8, 2021, 3:38pm

I agree. I just wanted to make sure I understood how withExtendedLifetime works.

tera · July 8, 2021, 5:53pm

please clarify on that bit.

tera · July 8, 2021, 6:01pm

on that bug report:

or providing a diagnostic for using NSData.bytes as the last usage point

"last usage point" determination is trickier than it seems.

let bytes = (data as NSData).bytes.assumingMemoryBound(to: UInt8.self)
// work with bytes
print(data.count)

does the last data.count constitute data usage? it might, yet at the same time compiler can be smarter and understand that data is not changing, so the count can be calculated earlier, and so will allow releasing data earlier... even this can potentially be optimised away:

print(data[index])

glessard · July 9, 2021, 1:12am

In a buffer from a Data instance, you do not know whether the memory has been bound, and in general you have no control over that. Therefore you shouldn’t assume.

What you should use is the load(as:) function. That is safe.

tera · July 9, 2021, 12:10pm

you mean load individual UInt8 bytes from offsets? is not that slower?

lukasa · July 9, 2021, 1:23pm

Not meaningfully, no. The abstraction compiles away to nothing.

tera · July 9, 2021, 2:07pm

interesting.. is it as fast to use just data[i] subscript?