Array Initializer with Access to Uninitialized Buffer

nnnnnnnn · June 14, 2018, 9:32pm

Hello all,

I've been thinking about a way of creating an array that would let us access the buffer of uninitialized memory, so that we could implement algorithms that don't necessarily know the size in advance or need access to noncontiguous parts of an array. Here's a draft of my proposed solution:

There is currently no way to work with a buffer of uninitialized memory that then becomes a fully memory-managed Array. This limitation means that operations that don't know the final size of their data in advance, or that need to access noncontiguous parts of an array, are less efficient than necessary, often requiring an extra copy. This proposal suggests a new initializer for Array and ContiguousArray that would provide access to a newly created array's entire storage buffer.

Motivation

Some collection operations require working on a fixed-size buffer of uninitialized memory. For example, one O(n) algorithm for performing a stable partition of an array is as follows:

Create a new array the same size as the original array.
Iterate over the original array, copying matching elements to the beginning of the new array and non-matching elements to the end.
When finished iterating, reverse the slice of non-matching elements.

Unfortunately, the standard library provides no way to create an array of a particular size without initializing every element, or to copy elements to the end of an array's buffer without initializing every preceding element. Even if we manually allocate the memory using an UnsafeMutableBufferPointer, there's no way to convert that buffer into an array without copying the contents. There simply isn't a way to implement this particular algorithm with maximum efficiency in Swift.

Proposed solution

Adding a new Array initializer that lets a program manipulate an array's uninitialized buffer would fill in this missing functionality. The new initializer takes the capacity of the array as a parameter as well as a closure that operates on an UnsafeMutableBufferPointer. This closure has access to the uninitialized contents of the full capacity of the newly created array's storage, and returns the intialized count of the array.

let myArray = Array<Int>(uninitializedCapacity: 10) { buffer in
    for x in 1..<5 {
        buffer[x] = x
    }
    buffer[0] = 10
    return 5
}
// myArray == [10, 1, 2, 3, 4]
// myArray.capacity == 10

With this new initializer, it's possible to implement the stable partition as an extension to the Collection protocol, without any unnecessary copies:

func stablyPartitioned(by belongsInFirstPartition: (Element) -> Bool) -> [Element] {
    var _count = self.count
    return Array(uninitializedCapacity: _count) { buffer in
        var lowIndex = 0
        var highIndex = _count
        for element in self {
            if belongsInFirstPartition(element) {
                buffer[lowIndex] = element
                lowIndex += 1
            } else {
                highIndex -= 1
                buffer[highIndex] = element
            }
        }
        
        buffer[highIndex..<_count].reverse()
        return _count
    }
}

Detailed design

A single new initializer is added to both Array and ContiguousArray:

extension Array {
    /// Creates a new array with the specified capacity, then calls the given
    /// closure to initialize its contents.
    ///
    /// You use this initializer when you need to access noncontiguous positions
    /// of an array while initializing its elements. Pass the required capacity
    /// and a closure that can intialize the array's elements. The closure must
    /// return `c`, the number of initialized elements in the buffer, such that
    /// the elements in the range `0..<c` are initialized and the elements
    /// in the range `c..<capacity` are uninitialized. The resulting array
    /// has a `count` equal to `c`.
    /// 
    /// - Parameters:
    ///      - capacity: The capacity of the new array.
    ///   - body: A closure that can initialize the array's elements. This 
    ///     closure must return the count of the initialized elements, starting
    ///     at the beginning of the buffer.
    init(
        uninitializedCapacity capacity: Int, 
        initializingWith body: (UnsafeMutableBufferPointer<Element>) -> Int
    )
}

extension ContiguousArray {
    // same documentation as above
    init(
        uninitializedCapacity capacity: Int, 
        initializingWith body: (UnsafeMutableBufferPointer<Element>) -> Int
    )
}

Source compatibility

This is an additive change to the standard library, so there is no effect on source compatibility.

Effect on ABI stability

This addition has no effect on ABI stability.

Effect on API resilience

This addition will be a permanent part of the standard library, and will need to remain public API.

Alternatives considered

An Array initializer that simply converts an UnsafeMutableBufferPointer into an array's backing storage seems like it would be another solution. However, an array's storage includes information about the count and capacity at the beginning of its buffer, so an UnsafeMutableBufferPointer created from scratch isn't usable.

huon · June 14, 2018, 9:54pm

An alternative API that might compose better (since it doesn't require creating a new array each time, and so can be used for manipulations of existing ones) is an withUnsafeMutableBufferPointer-like API that exposes the full capacity, not just the initialised section, so the first example becomes:

let myArray = [Int]()
myArray.reserveCapacity(10)
myArray.withFullCapacityUnsafeMutableBufferPointer { buffer in 
    for x in 1..<5 {
        buffer[x] = x
    }
    buffer[0] = 10
    return 5
}

There's possible expansions like, maybe the closure should get the initialised length as well as the buffer; and maybe that length should be inout (meaning updates to the length need to get written back to it directly) so the closure can do the generic return thing like withUnsafeMutableBufferPointer.

Having an Array<Int>(withCapacity: 10) initialiser might be another dimension on which it can be abbreviated.

taylorswift · June 15, 2018, 1:05am

how does the compiler know you initialized everything up to count? this sounds like something that should be called unsafe at the minimum

taylorswift · June 15, 2018, 1:08am

big mood I write

var array:[Element] = []
    array.reserveCapacity(count)

way too often to not want a capacity: initializer

jawbroken · June 15, 2018, 3:07am

I agree with the other responses that it seems a little too attractive as an initialiser and is not clearly marked as unsafe. I would appreciate the feature though, as well as the semi-related feature of an Array initialiser that takes an Index -> Element closure to initialise its elements.

torquato · June 15, 2018, 3:42am

This is touching unsafe API. I don't like the idea of having unsafe initializers. Do we have precedence? For my taste an initializer should return a type or fail gracefully, but it shouldn't be unsafe.
Unsafe API is a very special case (thats why it's always called 'unsafe') that you shouldn't normaly call. But initializers are exactly the place you expect to normally call safely. Especially if you are new to the language.
I really like @huon's suggestion though, especially in combination with a capacity-initializer…

Chris_Lattner3 · June 15, 2018, 4:07am

+1 for something like this, but please put the word "unsafe" in the keyword argument, e.g. "unsafeUninitialized..."

-Chris

nnnnnnnn · June 15, 2018, 8:02pm

Thanks for the feedback!

That's a solid idea (and as a bonus, it would decisively bump withUnsafeMutablePointerToElements as the function with the longest name). Our recommendation for arrays is to drop down to the buffer if performance is a concern, but that solution has been limited to situations where you didn't need to change the length. This addition would allow for that solution in more use cases.

I'm conflicted about keeping the initializer — it's nice since you can end up with an immutable array, but it's probably better to have all the unsafe array APIs grouped together.

Agreed! Thanks, and congrats on your recent touring success.

Would that have a signature like this?

extension Array {
    init(count: Int, providingEachElement f: (Index) -> Element)
}

let myArray = Array<String>(count: 5) { String(repeating: "\($0)", count: $0) }
// ["", "1", "22", "333", "4444"]

I like that idea too, though it seems like a separate proposal. I think @Erica_Sadun has done some work in this area.

Revised version of APIs for this proposal:

extension Array {
    /// Creates an empty array with the specified capacity.
    init(capacity: Int)

    /// Calls the given closure with a pointer to the full capacity of the array's
    /// mutable contiguous storage.
    ///
    /// - Parameter body: A closure that can modify or deinitialize existing elements
    ///   of initialize new elements.
    ///	  - Parameters:
    ///     - buffer: An unsafe mutable buffer of the array's full storage, including
    ///       any uninitialized capacity after the initialized elements. Only the 
    ///       elements in `buffer[0..<initializedCount]` are initialized.
    ///     - initializedCount: The count of the array's initialized elements. If you
    ///       initialize or deinitialize any elements inside `body`, update 
    ///       `initializedCount` with the new count for the array.
    /// - Returns: The return value, if any, of the `body`` closure parameter.
    mutating func withFullCapacityUnsafeMutableBufferPointer<R>(
        _ body: (_ buffer: UnsafeMutableBufferPointer<Element>, _ initializedCount: inout Int) throws -> R
    ) rethrows -> R
}

Nevin · June 15, 2018, 8:44pm

Without commenting on anything else, I think this initializer should be on RangeReplaceableCollection. It could be a customization point if, eg, Array is able to do it more efficiently, or an extension method if the generic implementation is sufficient:

extension RangeReplaceableCollection {
  init(capacity: Int) {
    self.init()
    self.reserveCapacity(capacity)
  }
}

benrimmington · June 15, 2018, 8:48pm

Should this be init(minimumCapacity: Int) to match the existing Dictionary and Set APIs?

nnnnnnnn · June 15, 2018, 9:01pm

Good point!

No—dictionaries and sets can't have arbitrarily sized storage buffers, so they choose the next size larger than what you pass. For Array, we would want to guarantee that the array has storage for exactly the specified number of elements.

benrimmington · June 15, 2018, 9:15pm

But the Array.reserveCapacity(_:) API doesn't guarantee an exact capacity either:

  /// For performance reasons, the size of the newly allocated storage might be
  /// greater than the requested capacity. Use the array's `capacity` property
  /// to determine the size of the new storage.

Braden_Scothern · June 16, 2018, 3:37am

This is a feature I've wanted for a good while now and haven't had the time to put an initial pitch together. It would be very helpful when working with C APIs.

nnnnnnnn · June 16, 2018, 6:49am

You're right about that — the parameter name is even minimumCapacity for that method. With the current APIs that's probably fine, but if we're exposing the full capacity of the array in this way, it might be confusing to not have as precise control over the size as we do when allocating a buffer directly. These can be quite different amounts:

var a: [UInt8] = []
a.reserveCapacity(5)
// a.capacity == 16

jawbroken · June 17, 2018, 11:18am

Yeah, that's exactly what I had in mind. It's only semi-related, in that it has similar performance benefits and reasons for existing, so a separate proposal would be fine.

Karl · June 17, 2018, 1:23pm

This is a very important API, especially when interoperating with C libraries. I thought about this quite a bit in the past, but I'm glad to see a pitch.

My approach to this was a little less direct: Why do we even need these copies today? Why can't Array/String just wrap any old UnsafeMutableBufferPointer? I think the main reason is to provide value semantics and immutability, which requires the buffer itself to have value semantics (like Foundation.Data). Unfortunately we don't have something like that in the standard library, but if we did - theoretically Array and String could use it as backing storage, expose it directly to the user, and allow re-wrapping an existing buffer (e.g. you could create a String which shares the backing of an Array) while preserving value semantics and avoiding copying.

/// A COW memory buffer - like Foundation.Data, but guaranteed to be contiguous.
struct ContiguousData: RandomAccessCollection { /* ... */ }

struct Array<T> {
  init(rawStorage: ContiguousData)
  var rawStorage: ContiguousData
}

var string: String
do {
  // Create an Array
  let myArray = [1, 2, 3, 4]
  // View its memory as a String (no copying)
  string = String(rawStorage: myArray.rawStorage, encoding: .ascii)
}
// Array goes out of scope, string now holds unique reference to storage. No copy.
string.append("blah")

What do you think - would something like this work or not?

taylorswift · June 17, 2018, 10:02pm

i don’t think this would work, Arrays have a header that’s allocated at the head of the buffer. so the elements actually come at an offset from the buffer start

huon · June 20, 2018, 12:41am

This seems like a broader concern that may not be worth addressing in this specific case: things that need to be mutable for initialisation but are immutable after that.

I like it!

Andrew_Trick · June 21, 2018, 8:07pm

Thanks for this Nate. It's been a long time coming. Here's Karl's old bug:

github.com/apple/swift

[SR-3087] No way to arbitrarily initialise an Array's storage

opened 02:11AM - 31 Oct 16 UTC

closed 05:45AM - 24 Mar 19 UTC

karwa

improvement feature standard library swift evolution implemented

| | | |------------------|-----------------|… |Previous ID | SR-3087 | |Radar | rdar://problem/21880692 | |Original Reporter | @karwa | |Type | Improvement | |Status | Resolved | |Resolution | Done | <details> <summary>Additional Detail from JIRA</summary> | | | |------------------|-----------------| |Votes | 0 | |Component/s | Standard Library | |Labels | Improvement, LanguageFeatureRequest | |Assignee | @natecook1000 | |Priority | Medium | md5: 78d2891e46bb0f408db15202b1a1b736 </details> **relates to**: * [SR-3631](https://bugs.swift.org/browse/SR-3631) Introduce ContiguouslyStored protocol and kill ArraySlice type **Issue Description:** The standard library does not currently expose a way to allocate an Array's storage and (unsafely) initialise its elements, as we can with UnsafeMutablePointer. An example use-case is to implement a concurrent map. See: <https://lists.swift.org/pipermail/swift-users/Week-of-Mon-20161024/003769.html> Even though we expose a UMP from Array, it cannot be used to unsafely initialise the Array elements. The workaround at the moment is to allocate a UMP, initialise its elements unsafely, and then copy the contents in to an Array via the Array(Sequence) constructor, which is less than optimal.

I agree that we should introduce a single unsafe API that handles initialization and reinitialization. The functionality needs to exist in a primitive form first. Mutability concerns and convenience can be added later.

I'm glad someone else proposed this name so I don't take any flak for it, but it works for me:

nnnnnnnn · July 9, 2018, 9:23pm

I'm looping back around to this idea after some discussion of the intention and behavior of array storage allocation in this thread: Array capacity optimization. The primary challenge that I've been looking at is that someone calling this method may easily expect that the size of the buffer in the closure matches either the capacity of the array that they observe through the capacity property or the capacity that they reserve immediately before, through a call to reserveCapacity(_:), but neither one of those is guaranteed.

I think I'm back to the original initializer that I had in mind, with a revised name (to add "unsafe"), since I'm not satisfied with the designs I can come up with for a mutating method. Here's the thought process that leads to my kind-of-paradox:

If someone calls a hypothetical withFullCapacityUnsafeMutableBufferPointer method, they may make mistaken assumptions about what size buffer they receive inside the closure. Mistakes of this kind are worse than usual, since they can lead to memory leaks or memory accesses past the buffer's boundaries. We don't have a great way to provide diagnostics when things go awry, and the buffer's size will likely be nondeterministic, so issues may go unnoticed in development only to appear in production.
We could add a capacity parameter to the method, but that complicates the method a bit, and means that the confusion is still possible, since the method doesn't really make sense if the user passes a capacity smaller than the array's count. Would we include a precondition that the specified capacity has to be at least the count? It feels like this version trades one problem for another.
Also, would we even call that method? It's no longer withFullCapacityUnsafeMutableBufferPointer, because the user is passing a specific capacity. Putting "uninitialized" in there doesn't really make sense either, since some portion of the buffer may be initialized already.

Agree? Disagree? Ideas for a mutating method that doesn't have these issues?

The initializer doesn't have these problems: the buffer can cover the exact number of elements that the user requests, even if the storage itself ends up being slightly larger, and since the whole buffer is uninitialized, we can use that word in the API, making its usage more clear.

I have an underscored implementation of the initializer here, at Ben's suggestion: Add an array initializer with access to uninitialized storage by natecook1000 · Pull Request #17774 · apple/swift · GitHub