Array Initializer with Access to Uninitialized Buffer

Yeah, that's exactly what I had in mind. It's only semi-related, in that it has similar performance benefits and reasons for existing, so a separate proposal would be fine.

This is a very important API, especially when interoperating with C libraries. I thought about this quite a bit in the past, but I'm glad to see a pitch.

My approach to this was a little less direct: Why do we even need these copies today? Why can't Array/String just wrap any old UnsafeMutableBufferPointer? I think the main reason is to provide value semantics and immutability, which requires the buffer itself to have value semantics (like Foundation.Data). Unfortunately we don't have something like that in the standard library, but if we did - theoretically Array and String could use it as backing storage, expose it directly to the user, and allow re-wrapping an existing buffer (e.g. you could create a String which shares the backing of an Array) while preserving value semantics and avoiding copying.

/// A COW memory buffer - like Foundation.Data, but guaranteed to be contiguous.
struct ContiguousData: RandomAccessCollection { /* ... */ }

struct Array<T> {
  init(rawStorage: ContiguousData)
  var rawStorage: ContiguousData
}

var string: String
do {
  // Create an Array of ASCII bytes ("blah")
  let myArray: [UInt8] = [98, 108, 97, 104]
  // View its memory as a String (no copying); this initializer is hypothetical
  string = String(rawStorage: myArray.rawStorage, encoding: .ascii)
}
// Array goes out of scope, string now holds unique reference to storage. No copy.
string.append("blah")

What do you think - would something like this work or not?

i don’t think this would work, Arrays have a header that’s allocated at the head of the buffer. so the elements actually come at an offset from the buffer start

This seems like a broader concern that may not be worth addressing in this specific case: things that need to be mutable for initialisation but are immutable after that.

I like it!

Thanks for this Nate. It's been a long time coming. Here's Karl's old bug:

I agree that we should introduce a single unsafe API that handles initialization and reinitialization. The functionality needs to exist in a primitive form first. Mutability concerns and convenience can be added later.

I'm glad someone else proposed this name so I don't take any flak for it, but it works for me:

I'm looping back around to this idea after some discussion of the intention and behavior of array storage allocation in this thread: Array capacity optimization. The primary challenge that I've been looking at is that someone calling this method may easily expect that the size of the buffer in the closure matches either the capacity of the array that they observe through the capacity property or the capacity that they reserve immediately before, through a call to reserveCapacity(_:), but neither one of those is guaranteed.
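
To make the mismatch concrete, here is a minimal sketch (not taken from the thread) of why the observed capacity is a poor predictor of buffer size. It assumes only the documented behavior that reserveCapacity(_:) guarantees a lower bound; the exact number printed depends on the allocator and is not something to rely on.

```swift
var numbers: [Int] = []
numbers.reserveCapacity(10)

// Only a lower bound is promised: the storage may be rounded up, so the
// observed capacity can exceed the requested value.
let observed = numbers.capacity
print("requested 10, observed \(observed)")
```

Code that hard-codes 10 as the buffer size here would be making exactly the kind of assumption described above.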

I think I'm back to the original initializer that I had in mind, with a revised name (to add "unsafe"), since I'm not satisfied with the designs I can come up with for a mutating method. Here's the thought process that leads to my kind-of-paradox:

  • If someone calls a hypothetical withFullCapacityUnsafeMutableBufferPointer method, they may make mistaken assumptions about what size buffer they receive inside the closure. Mistakes of this kind are worse than usual, since they can lead to memory leaks or memory accesses past the buffer's boundaries. We don't have a great way to provide diagnostics when things go awry, and the buffer's size will likely be nondeterministic, so issues may go unnoticed in development only to appear in production.
  • We could add a capacity parameter to the method, but that complicates the method a bit, and means that the confusion is still possible, since the method doesn't really make sense if the user passes a capacity smaller than the array's count. Would we include a precondition that the specified capacity has to be at least the count? It feels like this version trades one problem for another.
  • Also, what would we even call that method? It's no longer withFullCapacityUnsafeMutableBufferPointer, because the user is passing a specific capacity. Putting "uninitialized" in there doesn't really make sense either, since some portion of the buffer may be initialized already.

Agree? Disagree? Ideas for a mutating method that doesn't have these issues?

The initializer doesn't have these problems: the buffer can cover the exact number of elements that the user requests, even if the storage itself ends up being slightly larger, and since the whole buffer is uninitialized, we can use that word in the API, making its usage more clear.
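
A small sketch of that property, using the initializer in the form available in the standard library since Swift 5.1 (the closure receives the buffer and an inout initialized count). The signature debated in this thread may differ in detail, so treat this as an illustration rather than the pitched design.

```swift
let squares = [Int](unsafeUninitializedCapacity: 4) { buffer, initializedCount in
    // The buffer covers exactly the four requested slots, all uninitialized,
    // regardless of how much storage was actually allocated behind it.
    precondition(buffer.count == 4)
    guard let base = buffer.baseAddress else { return }
    for i in 0..<buffer.count {
        (base + i).initialize(to: i * i)
    }
    // Report how many leading elements are now initialized.
    initializedCount = buffer.count
}
```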

I have an underscored implementation of the initializer here, at Ben's suggestion: Add an array initializer with access to uninitialized storage by natecook1000 · Pull Request #17774 · apple/swift · GitHub

I probably don't understand the full complexity here, but why can't or shouldn't the available capacity just be passed as an argument to the provided closure? That would be my expectation, that you would get the buffer pointer, the available capacity, and the inout count.

A "buffer pointer" is a start/length pair, where the length is intended to be the capacity. Passing it separately would be redundant.
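
For illustration, a minimal sketch of that start/length pairing. The helper name describeBuffer is made up for this example; the UnsafeMutableBufferPointer calls are the standard ones.

```swift
/// Hypothetical helper: allocates a buffer and reports what it carries.
func describeBuffer(capacity: Int) -> (count: Int, hasBaseAddress: Bool) {
    let buffer = UnsafeMutableBufferPointer<Int>.allocate(capacity: capacity)
    defer { buffer.deallocate() }
    // The buffer pointer knows both where it starts and how many elements
    // it spans, so a closure receiving it needs no separate capacity value.
    return (buffer.count, buffer.baseAddress != nil)
}

let description = describeBuffer(capacity: 8)
```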

Is the inout count for the actual used count? If so could we instead return an Int over the inout?

Right, thanks, I can't keep the unsafe APIs straight. I guess I'm still not sure what the problem is then. If someone makes an assumption about the capacity instead of checking the buffer's count, then won't any such reasonable assumption presume that there is less capacity than is actually available, rather than more, which isn't a big problem from a memory safety perspective?

I would have named it init(reserveCapacity:unsafeInitializingWith:). I don't feel strongly about the exact name, but I do think we should make a distinction between the reserved capacity vs. the actual capacity.

"someone calling this method may easily expect that the size of the buffer in the closure matches either the capacity of the array that they observe through the capacity property"

I didn't realize that. Why would the Array's capacity property differ from the UnsafeBufferPointer's capacity?

If the array's storage is shared, it has to allocate new, unique storage before calling the closure. The exact capacity can change during that reallocation.
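
The observable side of that uniquing step can be sketched with the existing withUnsafeMutableBufferPointer method. This only demonstrates that a shared array gets its own storage before the closure runs; whether the capacity changes along the way is an implementation detail the example does not try to show.

```swift
let original = [1, 2, 3]
var copy = original   // shares storage with `original` for now

copy.withUnsafeMutableBufferPointer { buffer in
    // By the time this closure runs, `copy` owns unique storage, so writing
    // through the buffer cannot affect `original`.
    buffer[0] = 10
}
```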

Thanks for everyone's input! PR for the proposal is here: Add a proposal for `Array.init(unsafeUninitializedCapacity:initializingWith:)` by natecook1000 · Pull Request #882 · apple/swift-evolution · GitHub

The proposal is excellent.

Quick question on the partition example:

var high = low + buffer.count

I'm curious. In practice, wouldn't you expect users to typically capture and use the same count that they passed in instead?

var high = low + count

Can you include some thoughts on why initializedCount is an out-parameter rather than just a return value? I get that naming it makes it slightly easier to talk about in the docs, but if it's a return value the client is required to deal with it. (Unlike the method version, you're not using the return value for anything.)


I'm still sad about not getting the mutating method version, since there's no recourse there other than to wrap your elements in Optional or drop down to UnsafeMutableBufferPointer allocation. I think my attempt to combat the capacity confusion would be to add the parameter, like you said:

mutating func withUnsafeMutableBufferPointer<Result>(
  reservingCapacity minimumCapacity: Int,
  do action: (
    _ buffer: inout UnsafeMutableBufferPointer<Element>,
    _ initializedCount: inout Int
  ) -> Result
) -> Result

i agree, the count should be a return value. i feel like i’m gonna forget to set the inout parameter
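
A rough sketch of what the return-value shape could look like, layered on top of the inout-count initializer that exists in the standard library since Swift 5.1. The argument label unsafeUninitializedCapacityReturningCount is invented for this example and is not part of any proposal.

```swift
extension Array {
    /// Hypothetical variant: the closure returns the initialized count
    /// instead of writing it to an inout parameter.
    init(
        unsafeUninitializedCapacityReturningCount capacity: Int,
        initializingWith initializer: (inout UnsafeMutableBufferPointer<Element>) -> Int
    ) {
        self.init(unsafeUninitializedCapacity: capacity) { buffer, initializedCount in
            initializedCount = initializer(&buffer)
        }
    }
}

let firstThree = [Int](unsafeUninitializedCapacityReturningCount: 8) { buffer in
    guard let base = buffer.baseAddress else { return 0 }
    for i in 0..<3 { (base + i).initialize(to: i + 1) }
    return 3   // a closure that omits the return does not compile
}
```

The trade-off is the one raised above: a return value cannot be forgotten, while the inout form defaults to zero if the closure never sets it.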

Either way should work, since this initializer guarantees that the number they specify and the size of the buffer are the same. In the Array.withUnsafe____ methods we require users to use the buffer's count rather than the array's (for exclusivity reasons), so they may be accustomed to doing that already.
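
For readers without the proposal at hand, the partition example in question looks roughly like the sketch below. This is a simplified, non-throwing reconstruction with a made-up method name (stablyPartitionedSketch), written against the initializer as it exists in the standard library since Swift 5.1; it uses buffer.count, though the captured count would give the same result here.

```swift
extension Array {
    /// Hypothetical helper: elements matching the predicate first, the rest
    /// after them, with relative order preserved in both groups.
    func stablyPartitionedSketch(by belongsInFirstPartition: (Element) -> Bool) -> [Element] {
        return Array(unsafeUninitializedCapacity: count) { buffer, initializedCount in
            guard let base = buffer.baseAddress else { return }
            var low = base
            var high = base + buffer.count
            for element in self {
                if belongsInFirstPartition(element) {
                    // Fill the first partition from the front.
                    low.initialize(to: element)
                    low += 1
                } else {
                    // Fill the second partition from the back.
                    high -= 1
                    high.initialize(to: element)
                }
            }
            // The second partition was written back to front, so reverse it
            // to restore the original relative order.
            let highIndex = high - base
            buffer[highIndex...].reverse()
            initializedCount = buffer.count
        }
    }
}

let partitioned = [1, 2, 3, 4, 5, 6].stablyPartitionedSketch { $0 % 2 == 0 }
```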


To be perfectly honest, it's to be parallel with the method that I removed from the proposal, in case we add it in the future. Now that I write that out, it doesn't seem very compelling, especially given your point about requiring the client to give a value rather than forgetting (although that's a bug that would be apparent pretty quickly). I think I'll revert the closure's signature.

Re: the name for the mutating method, withUnsafeMutableBufferPointer(reservingCapacity:do:) still looks too close to the existing withUnsafeMutableBufferPointer method to me — I think it would need to make clear that the buffer includes uninitialized memory.

i don't want to derail this with naming drama but i’m not really convinced by this paragraph

This proposal leaves out wording that would reference two other relevant concepts:

reserving capacity: Arrays currently have a reserveCapacity(_:) method, which is somewhat akin to the first step of the initializer. However, that method is used for the sake of optimizing performance when adding to an array, rather than providing direct access to the array's capacity. In fact, as part of the RangeReplaceableCollection protocol, that method doesn't even require any action to be taken by the targeted type. For those reasons, the idea of "reserving" capacity doesn't seem as appropriate as providing a specific capacity that will be used.

for me i conceptually think of this method as a variant of the reserve-and-push idiom where the pushes are replaced with random accesses and the count is returned at the end rather than incremented with each element. i understand it’s problematic that the reserveCapacity(_:) method on the protocol doesn’t actually guarantee the capacity is there but that sounds more like a problem with the protocol. either way unsafeUninitializedCapacity: is a horrible argument label.

maybe instead we can use allocatingCapacity: which would evoke UnsafeMutablePointer<T>.allocate(capacity:) which is really the pattern we’re trying to replace here. people who are going to be using this API are going to be people who have already been using the unsafe pointer APIs (since until now, that’s been the only way to do this), and i feel like we’ve already drilled into everyone’s head that the word allocate means unsafe and uninitialized.
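
for reference, the allocate-then-copy pattern being replaced looks roughly like this. it's a sketch with a made-up function name (makeArrayByCopying), not code from the proposal; the point is the extra copy out of the temporary allocation.

```swift
/// Hypothetical helper showing the manual pattern: allocate, initialize,
/// copy into an array, then clean up the temporary memory.
func makeArrayByCopying(count: Int) -> [Int] {
    let pointer = UnsafeMutablePointer<Int>.allocate(capacity: count)
    defer { pointer.deallocate() }
    for i in 0..<count {
        (pointer + i).initialize(to: i * 2)
    }
    // The array copies the elements into its own storage here.
    let result = Array(UnsafeBufferPointer(start: pointer, count: count))
    pointer.deinitialize(count: count)
    return result
}

let doubled = makeArrayByCopying(count: 4)
```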

The initializer can be used for this purpose, though I doubt there's much of an optimization win, since appending is already super fast if there aren't re-allocations. The goals here are more to support noncontiguous access (like the partitioning example) and mostly C interoperability cases, when you need to start with a buffer of uninitialized memory and have a function write into it. I should add an example of that second usage.
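
A sketch of what that second usage might look like. fillSamples is a hypothetical stand-in, written in Swift, for an imported C function that writes into a caller-provided buffer and returns how many elements it wrote; the initializer call uses the form available in the standard library since Swift 5.1.

```swift
/// Hypothetical stand-in for a C function along the lines of
/// `int fill_samples(int32_t *out, int capacity)`.
func fillSamples(_ out: UnsafeMutablePointer<Int32>?, _ capacity: Int32) -> Int32 {
    guard let out = out else { return 0 }
    let written = min(capacity, 3)
    for i in 0..<Int(written) {
        (out + i).initialize(to: Int32(i) * 100)
    }
    return written
}

let samples = [Int32](unsafeUninitializedCapacity: 16) { buffer, initializedCount in
    // Hand the uninitialized storage straight to the function, then record
    // how much of it the function actually filled.
    let written = fillSamples(buffer.baseAddress, Int32(buffer.count))
    initializedCount = Int(written)
}
```

The array ends up with only the elements the function reported, even though sixteen slots were requested.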

allocatingCapacity: is good, but I still think we need unsafe to be visible at the use site.

no, i know what you mean. i meant that, at the highest level, it’s the act of making an array of size n and populating it with stuff. with pushes it would only work if you populate the array from beginning to end. this API generalizes it so it can be populated in any order, but at the high level, it’s still doing the same thing. you know how big the array should be at the end; the method of populating it is just slightly different.
