Move Foundation.Data into the standard library

This topic has come up a few times over the years, with apparently no objections.

The standard library currently exposes no public type for an owned byte buffer that is released when its last reference becomes unreachable.

Currently, the only facility for passing byte buffers around is one of the UnsafeRaw(Mutable)(Buffer)Pointer types. When an API returns one of these pointers, it doesn't necessarily transfer ownership of the memory it points to, so deallocation needs to be handled separately (perhaps by documenting "you need to deallocate this").
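To illustrate the ownership problem, here's a minimal sketch with a hypothetical `readPacket` function (not a real API): the return type says nothing about who owns the memory, so the caller must learn from documentation that they are responsible for deallocating it.

```swift
// Hypothetical API returning a raw buffer. Nothing in the return type
// records that the caller now owns the memory; that fact lives only in
// documentation.
func readPacket() -> UnsafeMutableRawBufferPointer {
    let buffer = UnsafeMutableRawBufferPointer.allocate(byteCount: 4, alignment: 1)
    buffer.copyBytes(from: [0xDE, 0xAD, 0xBE, 0xEF] as [UInt8])
    return buffer
}

let packet = readPacket()
defer { packet.deallocate() }  // forgetting this line is a silent leak
print(packet[0])  // 222 (0xDE)
```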

There are times when reading/writing a collection of bytes is not an "unsafe" operation - reading or writing a binary file or network packet, for example, is perfectly safe, and you should not need to drop down to the rather unwieldy pointer APIs to work with them. Data is a much more natural fit and should be what most APIs and libraries use to work with byte buffers - therefore, I believe it merits inclusion in the standard library.

As for bridging: we already have lots of bridging in the standard library - Array, Dictionary, String, etc. I don't see why Data is substantially different to those, although maybe the Apple Foundation team have something to say about that.

@Tony_Parker @Philippe_Hausler @itaiferber - thoughts?


Well, the problem is that it's not that easy to move it down (not that I disagree with your logic). It is something we have discussed, but there are a ton of ramifications in doing so. Part of it is that we need to disassociate the implementation from the reference-semantics backing storage and implement it with just the "Swift" backing storage, since on Linux Swift does not have the Objective-C version of Foundation/CoreFoundation below it (the layering is a tangled mess steeped in history).

To be quite honest, I find UnsafePointer and friends quite unwieldy as well. Even as someone who has used them a fair bit, there are quite a number of unintuitive sharp edges that you can easily cut yourself on. I would like to think Data is a bit softer and more suited to ergonomic code.

In short, this is not something we would take lightly, and it would involve a lot of work for questionable gains (even the memory argument for bringing in Foundation is not as big as some would like to make it). We have talked about it, but have not yet hashed out any concrete plans for how to address this quandary.

Perhaps Tony Parker or Ben Cohen might have some different perspectives on this than myself to lend to the story here.


What I don't understand is why bridging with NSData is so much more awkward or difficult than, say, NSArray or NSSet (we even bridge those to generic types and validate their contents!).


It's not necessarily significantly more difficult than the existing bridging we have, but it would still require significant changes. Data's backing store, for instance, is an enum that can represent multiple different forms of storage, some of which explicitly reference NSData and NSMutableData. Bringing this down into the stdlib requires a refactor of how this works, potentially at a relative loss of performance compared to what we have now. Not insurmountable, but still requiring a decent amount of change.

This also has the potential to be a source-breaking change as anyone who currently references Foundation.Data to disambiguate between the type and another Data type would break, unless we typealias Foundation.Data = Swift.Data, or similar.

One of the big questions here is: how much would we gain by going through this risky process that we don't already have? What would we get by lowering this over what you get with import Foundation at the moment?

Those are much more complicated than they appear. They have hooks into the compiler itself and are definitely non-trivial implementations.

(Aside: noticing that enum - couldn't those cases be conditionally defined with #if _runtime(_ObjC) for better performance on Linux? Of course, all switch statements would also need the checks. That's what Array does.)

My point is that, from a naïve perspective, there's nothing more basic than a buffer of bytes. One would expect that it's the easiest type to bridge. I know the bridging for Array and Dictionary and the other types are non-trivial (AnyHashable, ho!); that's why it's so confusing that Data should be such a headache. On the surface it seems like there would be far fewer impedance mismatches.

As for why we should do it:

  • Out of principle. I believe the line between Foundation and the Swift stdlib was not drawn correctly when it comes to Data. The standard library should have an answer when people ask how to pass around and manipulate byte buffers, and the current answer of using Unsafe(Blergh)Pointers is pretty poor (...and, potentially unsafe. See the name. Maybe a library client will forget to deallocate/over-release it).

  • To encourage use of Data as a common currency. See next point.

  • Foundation is monolithic. It contains a lot of great features, but also lots that you may not need. Let's say you have a library which doesn't do any networking or date/time calculations - just working with bytes (maybe a cryptography library, or some other kind of binary-data reader/manipulator). Currently they must either import the entire Foundation module just to access this single, very basic type, or traffic in Unsafe(GargleBargle)Pointers (being careful not to leak/over-release), or write their own RAII wrapper pretty-much just like Data. I think all of those options are not nearly as excellent as having Data in the standard library.

  • Potential compiler optimisations. I admit I don't know enough about this, but as a standard-library type we would be free to use private details like @_semantics attributes to let the compiler better understand the ownership and lifetime of the allocated buffer. I definitely don't understand why Data would be less performant inside the standard library than outside of it - or, if it would be, why that wouldn't also apply to Array and String.


I wonder whether NIO's ByteBuffer and friends could be a good alternative to Data in the stdlib.


I'm looking for more concrete data about this argument of Foundation being monolithic.

One data point that we have heard is that it brings in unwanted dependencies. We intend to address this via splitting the URLSession and XML pieces out (see Pitch: Move URLSession to new FoundationNetworking module).

With respect to compiler optimizations, we believe those can be applied to struct Data regardless of its position in either the stdlib or Foundation.

Is there something else that has a measurable impact that we can address?

Compared to other value types, our ability to optimize Data in a general context is limited by Data's ability to take ownership of mutable memory, and the expectation that mutations to that memory show up in the Data. Because of that, we'd have to assume that memory writes (or function calls that may write memory) may change the value of any Data. (That's independent of where Data lives, though.)
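The ownership behavior described above can be sketched with the real `Data(bytesNoCopy:count:deallocator:)` initializer, which adopts caller-provided memory rather than copying it. Because the caller keeps a live pointer, external writes remain visible through the Data value, which is exactly why the optimizer can't treat a Data as immutable across arbitrary memory writes. (A sketch; the inline-storage and CoW details vary by implementation.)

```swift
import Foundation

// Data adopts this memory without copying; the custom deallocator will
// free it when the Data's storage is released.
let raw = UnsafeMutableRawPointer.allocate(byteCount: 4, alignment: 1)
raw.initializeMemory(as: UInt8.self, repeating: 0, count: 4)

let data = Data(bytesNoCopy: raw, count: 4, deallocator: .custom { pointer, _ in
    pointer.deallocate()
})

// A write through the original pointer is visible through `data`,
// since no copy was made and no mutation has triggered CoW yet.
raw.storeBytes(of: 42, toByteOffset: 0, as: UInt8.self)
print(data[0])  // 42
```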


Actually, I have one objection regarding the current topic. I think Data is a very commonly used type, but it has some flaws that were made due to a limited time frame and resulted in a design that is not consistent with the rest of the standard library types. Here I'm speaking about Data's views/slices, which lead to unexpected crashes if you forget that any Data can be a view/slice of one original Data instance. I bumped into that myself and asked here for clarification. If Data continues to be inconsistent with the rest of the stdlib types, then I'd be against its migration from Foundation.


Data is only intended to support direct mutation of the underlying bytes within the context of the withUnsafeMutableBytes closure. Any other mutation of the memory would probably violate our ability to CoW.
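A minimal sketch of the supported mutation path mentioned above: direct byte mutation happens inside the withUnsafeMutableBytes closure, which gives Data the chance to enforce unique ownership (copy-on-write) before handing out a mutable pointer.

```swift
import Foundation

var data = Data([0x00, 0x01, 0x02, 0x03])
let copy = data  // a logically independent value

// The closure receives a pointer to storage that Data has ensured is
// uniquely owned, so mutating it cannot be observed through `copy`.
data.withUnsafeMutableBytes { (buffer: UnsafeMutableRawBufferPointer) in
    buffer[0] = 0xFF
}

print(Array(data))  // [255, 1, 2, 3]
print(Array(copy))  // [0, 1, 2, 3]; the copy was unaffected
```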


Data's slicing was designed to be consistent with Array and ArraySlice.

let a = ["a", "b", "c", "d"]
let aSlice = a[1..<2]
let r = aSlice[0] // fatal error

If you write an API that accepts a collection or sequence generically, you can't assume that (a) it's integer indexed or (b) that 0 is the first index. The correct way to use all of these is to either use a higher level API to get what you want, or use startIndex.
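Those two index-correct patterns (startIndex or a higher-level API) look like this, applied to both the ArraySlice example above and a Data slice:

```swift
import Foundation

let a = ["a", "b", "c", "d"]
let aSlice = a[1..<2]

print(aSlice[aSlice.startIndex])  // "b"; aSlice[0] would trap
print(aSlice.first!)              // "b" via a higher-level API, no index math

let d = Data([10, 20, 30])
let dSlice = d[1...]              // a Data whose startIndex is 1, not 0
print(dSlice[dSlice.startIndex])  // 20
```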

That is only partly true, because the inconsistency I referred to was in terms of the returned type. A slice from an Array returns an ArraySlice, which signals to the API user that it refers to a different storage with respect to CoW. In comparison, Data returns Data as its own slice. That makes it impossible to know from the type whether the current Data instance is zero-indexed or is a slice with different indices.

I'm not saying that my assumption is that every collection has integer indices, but an array does have integer indices, and it's safe to access the data if we know the bounds from the array's count. This should also apply to Data, while the returned slice should be a different type which can have offset indices to mimic exactly the same behavior as arrays.


"Correctly indexed" is a property of the collection in question. Data doesn't guarantee its startIndex is zero, so it is "correct" for it to have any startIndex.

That is what I tried to describe as an inconsistency from my point of view. Types from the stdlib follow the same design principles, and so should any new type that is added or moved to the stdlib.

I really dislike the following behavior:

func test(_ array: [UInt8]) {
  _ = array[0 ..< array.count]
}

func test(_ data: Data) {
  // We need to wrap the instance into itself to guarantee that it's
  // zero indexed:
  // let newData = Data(data)
  _ = data[0 ..< data.count]
}
let array: [UInt8] = [0, 1, 2, 3]

let arraySlice = array[1 ... 2] // ArraySlice<UInt8>
test(arraySlice) // expected to not even compile

let data: Data = Data(array)
test(data) // okay
let dataSlice = data[1 ... 2] // Data
test(dataSlice) // Compiles and crashes at runtime

You shouldn't be assuming zero indexing to begin with, and we don't recommend forcing copies of collections just to get zero indexing; that's going to create a bunch of overhead compared to correctly using startIndex. In principle, the Data(...) initializer could choose to produce a new value with a non-zero startIndex; I'm not sure whether the Foundation team considers zero a formal guarantee or incidental behavior. The purpose of ArraySlice is not to distinguish its slicing behavior, but its memory-management behavior. It is valid for a collection implementation to use itself as its SubSequence type.
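Following that advice, the earlier `test` function can be rewritten to be index-agnostic. This sketch uses a hypothetical `byteCount` helper generic over Collection, so the same code accepts arrays, ArraySlices, and offset Data slices without trapping:

```swift
import Foundation

// Index-agnostic: uses startIndex/endIndex instead of literal 0,
// so it never assumes where a collection's indices begin.
func byteCount<C: Collection>(_ bytes: C) -> Int where C.Element == UInt8 {
    let whole = bytes[bytes.startIndex ..< bytes.endIndex]  // never traps
    return whole.count
}

let array: [UInt8] = [0, 1, 2, 3]
let data = Data(array)

print(byteCount(array[1 ... 2]))  // 2: an ArraySlice with startIndex == 1
print(byteCount(data[1 ... 2]))   // 2: a Data slice; no crash this time
```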


I'm not arguing against these facts; all I'm saying is that I wish Data would return a different view as its own slice, because it would provide more compile-time guarantees. Otherwise you would have used Array as its own slice type and provided the same memory-management behavior behind the scenes, no?


The primary purpose of ArraySlice is so that Array can guarantee that it doesn't leak memory, not its slicing behavior. If not for the leak hazard, Array would likely have been its own slice type as well, and would not have guaranteed startIndex == 0 either. Collections are guaranteed to start indexing at startIndex; that holds independent of the type system.
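The leak hazard ArraySlice makes visible can be sketched as follows (the `firstBytes` function and sizes are hypothetical): a slice keeps its entire parent buffer alive, however small the slice itself is, and copying into a fresh Array is what actually releases the parent storage.

```swift
// Returning a slice retains the whole multi-megabyte buffer, not just
// the 16 bytes the slice covers.
func firstBytes(ofMegabytes count: Int) -> ArraySlice<UInt8> {
    let big = [UInt8](repeating: 7, count: count * 1_048_576)
    return big.prefix(16)
}

// Copying into an Array keeps only 16 bytes and lets the big buffer go.
let chunk = Array(firstBytes(ofMegabytes: 64))
print(chunk.count)  // 16
```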


Interesting, thank you Joe for pointing that out. But does this imply that Data can potentially leak memory for the same reasons you just noted? Data returns itself as a slice, so it can potentially retain memory if not used carefully. Wouldn't that be a point against including the type in the stdlib (regardless of any other technical or conceptual issues)?

A compilation mode where every Array's startIndex is a random number would be an amazing way to test for correct indexing.