Relaxing the behaviour of UnsafeMutableBufferPointer.initialize(from: Sequence)

Currently, the documentation for UnsafeMutableBufferPointer.initialize(from: Sequence) says:

/// Initializes the buffer's memory with the given elements.
///
/// When calling the `initialize(from:)` method on a buffer `b`, the memory
/// referenced by `b` must be uninitialized or the `Element` type must be a
/// trivial type. After the call, the memory referenced by this buffer up
/// to, but not including, the returned index is initialized. The buffer
/// must contain sufficient memory to accommodate
/// `source.underestimatedCount`.
///
/// The returned index is the position of the element in the buffer one past
/// the last element written. If `source` contains no elements, the returned
/// index is equal to the buffer's `startIndex`. If `source` contains an
/// equal or greater number of elements than the buffer can hold, the
/// returned index is equal to the buffer's `endIndex`.
///
/// - Parameter source: A sequence of elements with which to initializer the
///   buffer.
/// - Returns: An iterator to any elements of `source` that didn't fit in the
///   buffer, and an index to the point in the buffer one past the last
///   element written.
public func initialize<S: Sequence>(from source: S) -> (S.Iterator, Index)
  where S.Element == Element

I'd like to discuss relaxing this condition:

The buffer must contain sufficient memory to accommodate source.underestimatedCount.

Because this is an unsafe type, the word "must" implies that violating this condition results in undefined behaviour. However, only standard library types are able to customise how this function is implemented (via the _copyContents(initializing:) requirement on Sequence), and none of them will write beyond the bounds of the buffer when this condition is not met: they either initialise min(source count, buffer count) elements, trigger a runtime error, or fall back to the default implementation, which fills the buffer from an iterator and ignores underestimatedCount.
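
To illustrate, a rough sketch of that iterator-based default strategy might look something like this (this is not the actual stdlib source); it initialises at most buffer.count elements, so it can never run past the end of the buffer:

// Rough sketch of the iterator-driven default strategy (not the actual
// stdlib implementation): initialise at most buffer.count elements, so the
// buffer's bounds are never exceeded regardless of underestimatedCount.
func iteratorStyleInitialize<S: Sequence>(
  _ buffer: UnsafeMutableBufferPointer<S.Element>,
  from source: S
) -> (S.Iterator, Int) {
  var iterator = source.makeIterator()
  guard var destination = buffer.baseAddress else { return (iterator, 0) }
  var initialized = 0
  while initialized < buffer.count, let element = iterator.next() {
    destination.initialize(to: element)
    destination += 1
    initialized += 1
  }
  return (iterator, initialized)
}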

I'd like to understand why this requirement is in place, and if there is no good reason for keeping it, I think it should be removed. This would mean changing the standard library's customisations to always initialise as much of the buffer as they can rather than trigger runtime errors, and removing this line from the documentation.

Thoughts?


FWIW, I loathe this precondition, too. It originates from Sequence._copyContents, in a rather unfortunate example of a half-finished, temporary interface polluting the stdlib’s public API. (I hate almost everything about _copyContents, too.)

The issue with relaxing this requirement is that code targeting the new implementation won’t be safe to back-deploy to earlier Stdlib releases, unless we can somehow ensure that newly built code will never call the preexisting symbol.

For example, we could do this with the moral equivalent of:

// Original entry point
@available(*, unavailable)
public func initialize<S: Sequence>(from source: S) -> (S.Iterator, Index)
  where S.Element == Element {
  initialize(from: source, _hiddenDistinguishingMarker: ())
}

// Back-deployable replacement
@_alwaysEmitIntoClient
public func initialize<S: Sequence>(
  from source: S,
  _hiddenDistinguishingMarker: Void = ()
) -> (S.Iterator, Index)
  where S.Element == Element

This feels too disgusting to consider. (Although perhaps we could implement some language enhancements to make it a little more palatable — or even more disgusting, depending on your viewpoint. :see_no_evil:)

Unfortunately we’d have to do the same for _copyContents (which is where the requirement originates from), which probably makes this approach impractical, even if we were otherwise okay with such API hacks. Changing the preconditions of a protocol requirement (even an undocumented one) is a Big Deal — it’d be a source-breaking change.

Another approach is to document the first versions of initialize(from:) and _copyContents that implemented the relaxed precondition, and hope that people will keep this information in mind. This would be extremely cheap to implement, but I think it would be too user-hostile to be acceptable: it would effectively require clients to implement runtime availability checks, without the benefit of static compiler diagnostics that would indicate this.

The third option is to ignore the stdlib and write your own extension method that works the way you want. Sadly this is only practical if the parameter is a concrete type whose storage layout you are able to directly access. This is often the case, but it doesn't make for a satisfying answer.
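
For instance, if the source happens to be an Array, a hypothetical helper along these lines (initializeClamped is a made-up name, not stdlib API) could clamp the copy to whatever fits:

extension UnsafeMutableBufferPointer {
  // Hypothetical roll-your-own variant for one concrete source type: copy
  // from an Array, never writing more elements than the buffer can hold.
  func initializeClamped(from source: [Element]) -> Index {
    let copyCount = Swift.min(count, source.count)
    guard copyCount > 0, let destination = baseAddress else { return startIndex }
    source.withUnsafeBufferPointer { src in
      destination.initialize(from: src.baseAddress!, count: copyCount)
    }
    return index(startIndex, offsetBy: copyCount)
  }
}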

Probably the best option is to replace this API with something better. This is best done at the same time as we introduce a public replacement for _copyContents, with a more flexible design. My current thinking is that the _copyContents replacement would work best as a mutating IteratorProtocol method that provides read-only access to piecewise contiguous storage chunks of a client-supplied maximum size. I.e., something like:

protocol IteratorProtocol {
  mutating func next() -> Element?

  @available(…)
  mutating func withNextChunk<R>(
    maximumSize: Int? = nil, 
    body: (UnsafeBufferPointer<Element>) throws -> R
  ) rethrows -> R
}

(Or, given the popularity of IndexingIterator, perhaps a static Sequence method that takes an inout iterator. There are many possible variants of this — another approach would be to drop the maximumSize parameter and have the closure indicate how many elements it was able to process from the chunk it was given, or to have the caller pass in an optional temp buffer, NSFastEnumeration-style.) This method can then be used to build higher-level APIs that efficiently copy data across (piecewise, or fully) contiguous collections, including a more flexible UMBP.initialize(from:).
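
To make the direction a little more concrete, here is a very rough sketch that uses a stand-in protocol for the hypothetical requirement above (none of this exists in the stdlib) to drive a relaxed UMBP.initialize(from:):

// Stand-in for the hypothetical IteratorProtocol requirement sketched above;
// nothing here exists in the stdlib today.
protocol ChunkedIterator: IteratorProtocol {
  mutating func withNextChunk<R>(
    maximumSize: Int?,
    body: (UnsafeBufferPointer<Element>) throws -> R
  ) rethrows -> R
}

extension UnsafeMutableBufferPointer {
  // Fills as much of the buffer as possible, one chunk at a time, and never
  // writes past endIndex (the relaxed behaviour discussed upthread).
  func initialize<I: ChunkedIterator>(fromChunked iterator: inout I) -> Index
    where I.Element == Element
  {
    var written = 0
    while written < count {
      let copied = iterator.withNextChunk(maximumSize: count - written) { chunk -> Int in
        guard !chunk.isEmpty, let source = chunk.baseAddress else { return 0 }
        baseAddress!.advanced(by: written).initialize(from: source, count: chunk.count)
        return chunk.count
      }
      if copied == 0 { break }  // source exhausted
      written += copied
    }
    return index(startIndex, offsetBy: written)
  }
}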

This is likely best done once we have several good examples of piecewise contiguous data structures to play with — currently we have none in the stdlib, and in Swift Collections right now we only have Deque. Once we land the B-tree based SortedSet, we might have enough guinea pigs to experiment on while designing a proper API.

The big drawback to this is that new Collection or IteratorProtocol requirements would need to come with availability, which complicates their use from existing stdlib APIs that currently call _copyContents. (I doubt making this particular niche little stdlib feature back deployable would be worth the effort required.)


There is also a second-order issue, if we manage to resolve this: the returned iterator cannot be used to efficiently copy the rest of the data into another buffer later — because next is all we are able to call on it.

The chunked iterator concept above would solve this problem nicely — indeed, this is its primary purpose.

Thank you very much, @lorentey!

Nooooooooo......!!!! :broken_heart::sob: That's an excellent point. Ugh.

I wonder, at least for now, if we could perhaps change the documentation to say something like:

If the buffer does not contain sufficient memory to accommodate source.underestimatedCount elements, a runtime error may be triggered.

At least as I read things, when an unsafe API tells me that something "must" be a certain way, I think "oh gosh, I'd better do that, otherwise undefined behaviour". But all the existing implementations are perfectly safe and will never write beyond the end of the buffer or introduce undefined behaviour. So it would just clarify the status quo, rather than changing anything.

Another thing that would be quite nice is if the standard library types could align their implementations (preferably to never trigger a runtime error), so all sequences at least behave consistently with this function. I mean, check this out - you can actually avoid the trap just by using a slice:

let ptr = UnsafeMutableBufferPointer<UInt8>.allocate(capacity: 5)
ptr.initialize(from: "hello, world".utf8) // ❌ Fatal error: Insufficient space allocated to copy string contents.
ptr.initialize(from: "hello, world".utf8[...]) // đź‘Ś Ok.

This inconsistency just doesn't feel good. As you point out, there isn't really a good option if we want to promise developers that they can expect the relaxed behaviour as an actual, documented guarantee, but consistency is at least better than having some sequences crash and others not. It may cause developers to rely on it, but I'd argue that the current behaviour is so inconsistent that they probably already do.

If we figure out a (not too horrible) way to backport that behaviour later, we could promote it to a guarantee at that time.

+1! This would be really great!

I know it's just a sketch, but IMO a prerequisite for this would be a safe type for contiguous storage in the standard library (e.g. a pointer-owner pair, with bounds-checking). IteratorProtocol is too important, and this function would be too important, to use unsafe buffers.
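
Something along these lines is the sort of shape I'm imagining, purely as an illustration (none of these names are real):

// Illustrative only: one possible shape for a safe, bounds-checked view over
// contiguous storage. The owner reference keeps the storage alive for as
// long as the view does.
struct SafeBufferView<Element> {
  private let buffer: UnsafeBufferPointer<Element>
  private let owner: AnyObject?

  init(unsafeBuffer: UnsafeBufferPointer<Element>, owner: AnyObject?) {
    self.buffer = unsafeBuffer
    self.owner = owner
  }

  var count: Int { buffer.count }

  subscript(position: Int) -> Element {
    precondition(position >= 0 && position < buffer.count, "Index out of range")
    return buffer[position]
  }
}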

That seems like a good clarification -- PR is welcome. I recently updated the _copyContents docs with a similar clarification:

  /// - Parameter ptr: An unsafe buffer addressing uninitialized memory. The
  ///    buffer must be of sufficient size to accommodate
  ///    `source.underestimatedCount` elements. (Some implementations trap
  ///    if given a buffer that's smaller than this.)

We don't know what _copyContents implementations outside the stdlib do; hopefully nobody is clobbering memory when given too small a buffer. (_copyContents technically isn't public, but it's a rather desirable customization target.)

That's not ideal. I didn't realize anything but Array actually did this.

It gets worse: copying an empty UTF-8 view into an empty buffer traps spuriously:

1> var buffer = UnsafeMutableBufferPointer<UInt8>(start: nil, count: 0)
buffer: UnsafeMutableBufferPointer<UInt8> = none
2> buffer.initialize(from: "".utf8)
Swift/StringUTF8View.swift:389: Fatal error: Attempt to copy string contents into nil buffer pointer
2021-09-14 18:42:20.917276-0700 repl_swift[4852:73925] Swift/StringUTF8View.swift:389: Fatal error: Attempt to copy string contents into nil buffer pointer

(This is a bug we'll definitely want to fix, and soon.)

I'm not sure why the behavior would be different between "foo".utf8 and "foo".utf8[...] though -- the two expressions have literally the same value, because the UTF-8 view is self-slicing. :thinking:

This is a good point.

There is a bit of a caveat: exposing internal storage as unsafe buffers that are only valid for the duration of a method call is pretty easy to implement, and it doesn't assume much from the underlying data structure. That is not entirely the case if the implementation also needs to safely expose an owner reference -- AIUI, it strongly encourages (forces?) the data structure to use the copy-on-write scheme that is implemented by most stdlib collection types (using isKnownUniquelyReferenced, or just unconditionally copying stuff on each mutation).
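
For clarity, the scheme I'm referring to is the familiar pattern below (with illustrative names only; this is not stdlib code):

// The usual copy-on-write shape: a struct wrapping a storage class, which is
// copied before mutation unless the reference is known to be uniquely held.
final class ExampleStorage<Element> {
  var elements: [Element]
  init(_ elements: [Element]) { self.elements = elements }
}

struct ExampleCOWCollection<Element> {
  private var storage = ExampleStorage<Element>([])

  mutating func append(_ element: Element) {
    if !isKnownUniquelyReferenced(&storage) {
      // Another value shares this storage; make a private copy first.
      storage = ExampleStorage(storage.elements)
    }
    storage.elements.append(element)
  }
}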

This is not a huge burden, but technically Collection (not to mention Sequence) does not otherwise require conforming types to use this scheme -- indeed, these protocols do not even require conforming types to implement value semantics. So going with this design would somewhat limit the range of collection types that can implement the new API.

(It gets worse if this potential future safe buffer type directly exposes the owner reference -- naked references to internal storage classes are extremely dangerous, as they can (accidentally) defeat uniqueness checks. But I assume that won't be the case.)

I guess the question is: what is the primary use case for a _copyContents replacement? The original is mostly intended to speed up copying data into contiguous buffers. (I.e., the Array.init<Sequence>(_:) case.) The idea I described mostly facilitates copying data between piecewise contiguous data structures (Array.init, Deque.init, SortedSet.init), with perhaps some use cases in processing sequences.

Do you have a use case in mind for holding on to the chunks exposed through this chunked iteration concept?