Array Initializer with Access to Uninitialized Buffer

Good point!

No—dictionaries and sets can't have arbitrarily sized storage buffers, so they choose the next size larger than what you pass. For Array, we would want to guarantee that the array has storage for exactly the specified number of elements.

But the Array.reserveCapacity(_:) API doesn't guarantee an exact capacity either:

  /// For performance reasons, the size of the newly allocated storage might be
  /// greater than the requested capacity. Use the array's `capacity` property
  /// to determine the size of the new storage.

This is a feature I've wanted for a good while now and haven't had the time to put an initial pitch together. It would be very helpful when working with C APIs.

You're right about that — the parameter name is even minimumCapacity for that method. With the current APIs that's probably fine, but if we're exposing the full capacity of the array in this way, it might be confusing to not have as precise control over the size as we do when allocating a buffer directly. These can be quite different amounts:

var a: [UInt8] = []
a.reserveCapacity(5)
// a.capacity == 16
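
To make the difference concrete, here is a minimal sketch contrasting the two allocation paths. The exact capacity an array ends up with after `reserveCapacity(_:)` is an implementation detail, so the sketch only relies on it being at least the requested amount, while a directly allocated buffer reports exactly the requested element count.

```swift
// Reserving capacity on an array only guarantees a minimum.
var reserved: [UInt8] = []
reserved.reserveCapacity(5)
let arrayCapacity = reserved.capacity   // at least 5; the exact value may vary

// Allocating a buffer directly gives exactly the requested element count.
let rawBuffer = UnsafeMutableBufferPointer<UInt8>.allocate(capacity: 5)
let bufferCount = rawBuffer.count       // exactly 5
rawBuffer.deallocate()
```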

Yeah, that's exactly what I had in mind. It's only semi-related, in that it has similar performance benefits and reasons for existing, so a separate proposal would be fine.

This is a very important API, especially when interoperating with C libraries. I thought about this quite a bit in the past, but I'm glad to see a pitch.

My approach to this was a little less direct: Why do we even need these copies today? Why can't Array/String just wrap any old UnsafeMutableBufferPointer? I think the main reason is to provide value semantics and immutability, which requires the buffer itself to have value semantics (like Foundation.Data). Unfortunately we don't have something like that in the standard library. If we did, Array and String could in theory use it as backing storage, expose it directly to the user, and allow re-wrapping an existing buffer (e.g. you could create a String which shares the backing of an Array) while preserving value semantics and avoiding copying.

/// A COW memory buffer - like Foundation.Data, but guaranteed to be contiguous.
struct ContiguousData: RandomAccessCollection { /* ... */ }

struct Array<T> {
  init(rawStorage: ContiguousData)
  var rawStorage: ContiguousData
}

var string: String
do {
  // Create an Array
  let myArray = [1, 2, 3, 4]
  // View its memory as a String (no copying)
  string = String(rawStorage: myArray.rawStorage, encoding: .ascii)
}
// Array goes out of scope, string now holds unique reference to storage. No copy.
string.append("blah")

What do you think - would something like this work or not?

I don't think this would work: arrays have a header that's allocated at the head of the buffer, so the elements actually start at an offset from the buffer's start.

This seems like a broader concern that may not be worth addressing in this specific case: things that need to be mutable for initialisation but are immutable after that.

I like it!

Thanks for this Nate. It's been a long time coming. Here's Karl's old bug:

I agree that we should introduce a single unsafe API that handles initialization and reinitialization. The functionality needs to exist in a primitive form first. Mutability concerns and convenience can be added later.

I'm glad someone else proposed this name so I don't take any flak for it, but it works for me:

I'm looping back around to this idea after some discussion of the intention and behavior of array storage allocation in this thread: Array capacity optimization. The primary challenge that I've been looking at is that someone calling this method may easily expect that the size of the buffer in the closure matches either the capacity of the array that they observe through the capacity property or the capacity that they reserve immediately before, through a call to reserveCapacity(_:), but neither one of those is guaranteed.

I think I'm back to the original initializer that I had in mind, with a revised name (to add "unsafe"), since I'm not satisfied with the designs I can come up with for a mutating method. Here's the thought process that leads to my kind-of-paradox:

  • If someone calls a hypothetical withFullCapacityUnsafeMutableBufferPointer method, they may make mistaken assumptions about what size buffer they receive inside the closure. Mistakes of this kind are worse than usual, since they can lead to memory leaks or memory accesses past the buffer's boundaries. We don't have a great way to provide diagnostics when things go awry, and the buffer's size will likely be nondeterministic, so issues may go unnoticed in development only to appear in production.
  • We could add a capacity parameter to the method, but that complicates the method a bit, and means that the confusion is still possible, since the method doesn't really make sense if the user passes a capacity smaller than the array's count. Would we include a precondition that the specified capacity has to be at least the count? It feels like this version trades one problem for another.
  • Also, what would we even call that method? It's no longer withFullCapacityUnsafeMutableBufferPointer, because the user is passing a specific capacity. Putting "uninitialized" in there doesn't really make sense either, since some portion of the buffer may be initialized already.

Agree? Disagree? Ideas for a mutating method that doesn't have these issues?

The initializer doesn't have these problems: the buffer can cover the exact number of elements that the user requests, even if the storage itself ends up being slightly larger, and since the whole buffer is uninitialized, we can use that word in the API, making its usage more clear.
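
As a rough illustration of that shape, here is a minimal sketch using the initializer under the name `Array.init(unsafeUninitializedCapacity:initializingWith:)`, which is how it appears in the standard library as of Swift 5.1. The `fillSquares` function is a hypothetical stand-in for a C-style API that writes into caller-provided memory; it is not a real library call.

```swift
// Hypothetical stand-in for a C-style API that writes into caller-provided
// memory and reports how many elements it produced.
func fillSquares(_ out: UnsafeMutablePointer<Int32>, _ capacity: Int) -> Int {
  let produced = min(capacity, 4)
  for i in 0..<produced {
    (out + i).initialize(to: Int32(i * i))
  }
  return produced
}

// Request room for 8 elements. The closure receives an uninitialized buffer
// and reports, through the inout count, how many elements it initialized.
let squares = [Int32](unsafeUninitializedCapacity: 8, initializingWith: { buffer, initializedCount in
  guard let base = buffer.baseAddress else { return }
  initializedCount = fillSquares(base, buffer.count)
})
// squares == [0, 1, 4, 9]
```

The resulting array's count is whatever the closure stored in `initializedCount`, so only the four elements that were actually written are visible afterwards.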

I have an underscored implementation of the initializer here, at Ben's suggestion: "Add an array initializer with access to uninitialized storage" by natecook1000, pull request #17774 in apple/swift.

I probably don't understand the full complexity here, but why can't or shouldn't the available capacity just be passed as an argument to the provided closure? That would be my expectation, that you would get the buffer pointer, the available capacity, and the inout count.

A "buffer pointer" is a start/length pair, where the length is intended to be the capacity. Passing it separately would be redundant.
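
A small sketch of that point, with a hypothetical helper named `fillAll` that is handed nothing but the buffer pointer and still knows how many elements it may write:

```swift
// Hypothetical helper: it receives only the buffer pointer and uses
// `target.count` as the number of elements it is allowed to write.
func fillAll(_ target: UnsafeMutableBufferPointer<Int>, with value: Int) -> Int {
  guard let start = target.baseAddress else { return 0 }
  start.initialize(repeating: value, count: target.count)
  return target.count
}

let storage = UnsafeMutableBufferPointer<Int>.allocate(capacity: 3)
let written = fillAll(storage, with: 7)
let contents = Array(storage)   // copies the three initialized elements
storage.deallocate()
// written == 3, contents == [7, 7, 7]
```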

Is the inout count for the actual used count? If so could we instead return an Int over the inout?

Right, thanks, I can't keep the unsafe APIs straight. I guess I'm still not sure what the problem is then. If someone makes an assumption about the capacity instead of checking the buffer's count, then won't any such reasonable assumption presume that there is less capacity than is actually available, rather than more, which isn't a big problem from a memory safety perspective?

I would have named it init(reserveCapacity:unsafeInitializingWith:). I don't feel strongly about the exact name, but I do think we should make a distinction between the reserved capacity vs. the actual capacity.

"... someone calling this method may easily expect that the size of the buffer in the closure matches either the capacity of the array that they observe through the capacity property ..."

I didn't realize that. Why would the Array's capacity property differ from the UnsafeBufferPointer's capacity?

If the array's storage is shared, it has to allocate new, unique storage before calling the closure. The exact capacity can change during that reallocation.
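
A minimal sketch of the sharing case, using the existing `withUnsafeMutableBufferPointer(_:)` method. It only demonstrates that the mutated array ends up with its own storage; the capacity of that new allocation is an implementation detail, so nothing here depends on it.

```swift
var original = [1, 2, 3]
let sharedCopy = original   // both values now share one storage buffer

// Mutable buffer access requires unique storage, so `original` gets a fresh
// allocation before the closure runs. The capacity of that allocation need
// not match the capacity observed beforehand.
original.withUnsafeMutableBufferPointer { uniqueBuffer in
  uniqueBuffer[0] = 99
}
// original == [99, 2, 3], sharedCopy == [1, 2, 3]
```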

Thanks for everyone's input! The PR for the proposal is here: "Add a proposal for `Array.init(unsafeUninitializedCapacity:initializingWith:)`" by natecook1000, pull request #882 in apple/swift-evolution.

The proposal is excellent.

Quick question on the partition example:

var high = low + buffer.count

I'm curious: in practice, wouldn't you expect users to typically capture and use the same count that they passed in instead?

var high = low + count

Can you include some thoughts on why initializedCount is an out-parameter rather than just a return value? I get that naming it makes it slightly easier to talk about in the docs, but if it's a return value the client is required to deal with it. (Unlike the method version, you're not using the return value for anything.)
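
For comparison, here is a sketch of what the return-value shape could look like, layered on top of the inout version. The `returningInitializedCount` label and this extension are hypothetical, written only to illustrate the question; they are not part of the proposal or the standard library.

```swift
extension Array {
  // Hypothetical convenience, not a standard library API: the closure
  // returns the initialized count instead of writing it to an inout parameter.
  init(
    unsafeUninitializedCapacity capacity: Int,
    returningInitializedCount body: (inout UnsafeMutableBufferPointer<Element>) throws -> Int
  ) rethrows {
    try self.init(unsafeUninitializedCapacity: capacity, initializingWith: { buffer, initializedCount in
      initializedCount = try body(&buffer)
    })
  }
}

let evens = [Int](unsafeUninitializedCapacity: 4, returningInitializedCount: { buffer in
  for i in 0..<buffer.count {
    (buffer.baseAddress! + i).initialize(to: i * 2)
  }
  return buffer.count
})
// evens == [0, 2, 4, 6]
```

With this shape, leaving out the `return` is a compile-time error, whereas forgetting to assign the inout count silently produces an empty array.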


I'm still sad about not getting the mutating method version, since there's no recourse there other than to wrap your elements in Optional or drop down to UnsafeMutableBufferPointer allocation. I think my attempt to combat the capacity confusion would be to add the parameter, like you said:

mutating func withUnsafeMutableBufferPointer<Result>(
  reservingCapacity minimumCapacity: Int,
  do action: (
    _ buffer: inout UnsafeMutableBufferPointer<Element>,
    _ initializedCount: inout Int
  ) -> Result
) -> Result

I agree, the count should be a return value. I feel like I'm going to forget to set the inout parameter.