SE-0456: Add Span-providing Properties to Standard Library Types

As @itaiferber mentioned, a Swift Data is always contiguous.

Huh, I must have missed that. Well, it used to be possible for a Data to be discontiguous! 😛 It used to be that it could wrap an NSData and do any of its DispatchData or Objective-C shenanigans under the hood. It looks like they got rid of the whole backing enum in that PR, though. Huh.

I might not have been avoiding Data in favor of ContiguousArray<UInt8> quite so much if I'd caught that one earlier...

Yeah, as part of the work, we got rid of the old Backing enum which was necessary for tracking non-Foundation subclasses of NSData which had potentially-unknowable and -unpredictable behavior in favor of making all buffers contiguous. (From much real-world testing, those were a vanishingly-small subset of all Data instances, and this allowed us to introduce a different representation and layout with significant performance benefits.)

In any case, circling back to the proposal itself: there may be opportunity in the future to introduce the Span equivalent of DataProtocol, which does represent a collection of contiguous Regions.

(It might also be interesting to see some crossover between Span and ContiguousBytes, to make the APIs easier to interoperate; e.g., should Span<UInt8> and RawSpan conform to ContiguousBytes? Theoretically, too, a future protocol which offers a collection of Spans could gain automatic conformance to DataProtocol... But, that's neither here nor there at the moment, and out of scope for this proposal.)
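For context, here is a minimal sketch of how generic code consumes a DataProtocol value today: one contiguous region at a time, through ContiguousBytes. It uses only existing Foundation API; the helper name byteSum is made up for illustration. A Span-based counterpart would presumably hand out one span per region instead of a closure-scoped raw buffer, but that shape is speculation on my part.

```swift
import Foundation

// Hypothetical helper: sums every byte by visiting each contiguous region.
func byteSum<D: DataProtocol>(_ data: D) -> Int {
  var total = 0
  for region in data.regions {
    // Each region conforms to ContiguousBytes, so it can expose its bytes
    // as a single UnsafeRawBufferPointer for the duration of the closure.
    region.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) in
      for byte in buffer { total += Int(byte) }
    }
  }
  return total
}

let sumFromData = byteSum(Data([1, 2, 3, 4]))
let sumFromArray = byteSum([UInt8](repeating: 2, count: 5))
```

Both Data and [UInt8] report a single region here, which is the "always contiguous" case discussed above.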

I believe ContiguousArray is only helpful when dealing with types that could be the elements of an NSArray. Otherwise, like with UInt8, the standard Array type ([UInt8]) should work just as well.

I'm sure this is true after the compiler does its optimization magic; however, Array is still backed by _ArrayBridgeStorage and contains all the logic needed to make it possible for it to be backed by NSArray (I made sure to check that this is still the case this time 😅). ContiguousArray is simpler and just feels cleaner, and when what I want to do is chug through a whole bunch of bytes as quickly as possible, it feels good to use it. 🤷 If the optimizer fails to remove a step here or there, it might be just a smidge faster.
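To make the trade-off concrete: the two types expose the same collection API, so switching between them is a drop-in change. The sketch below uses a made-up checksum helper and makes no claim about relative speed; it only shows that generic byte-processing code does not care which one it receives.

```swift
// Hypothetical checksum helper, generic over any collection of bytes.
func checksum<Bytes: Collection>(_ bytes: Bytes) -> UInt8 where Bytes.Element == UInt8 {
  // Wrapping addition so the sum cannot trap on overflow.
  bytes.reduce(0) { $0 &+ $1 }
}

let plainBytes: [UInt8] = [0x10, 0x20, 0xF0]
let contiguousBytes = ContiguousArray(plainBytes)

let plainSum = checksum(plainBytes)
let contiguousSum = checksum(contiguousBytes)
```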

Huge +1. These are very natural additions to the standard library.

I have only two questions. The first is about the "Implications on adoption" part:

The additions described in this proposal require a new version of the Swift standard library and runtime.

If I understand correctly, this proposal contains two parts: the first is the language rule for determining whether a certain property can represent a borrowing relation to the callee binding, and the second is the set of extensions to the standard library. Can you clarify whether the first part requires a new runtime?

The second one is about the text following the introduction example:

Note that the copyability of B means that it cannot represent a mutation of A; it therefore represents a non-exclusive borrowing relationship.

Can you please elaborate on why copyability is crucial here? I think there should be situations where a ~Copyable & ~Escapable property can also be useful for representing a borrowing relationship.

Thanks!

Whether a property can represent a borrowing relation is a compile-time issue. I realize that I wrote this wrong! Rewording what I meant to say: we need a version of the standard library that supports Span and RawSpan, which has deployment implications for ABI-stable platforms. I will amend the proposal to clarify. Thanks for the question! [Added: the proposal is updated.]

You are right that the copyability is not crucial, but it illustrates the non-exclusive aspect of this borrow. We could model a non-exclusive borrow with a non-copyable type which supports an explicit copy() operation. [Added: ] Note that I believe it would be correct to infer an exclusive borrow from a non-copyable return value; the goal is for all of these aspects to be controllable by annotations (to be proposed later.)
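As an illustration of "a non-copyable type which supports an explicit copy() operation", here is a minimal sketch. The type and its members are hypothetical, and it leaves out ~Escapable entirely (that would need lifetime annotations that are not settled yet), so it only demonstrates the explicit-copy half of the idea.

```swift
// Hypothetical noncopyable view type; `copy()` is the only way to duplicate it.
struct ReadOnlyView: ~Copyable {
  let base: [Int]

  // Borrows `self` and returns an independent value, so the original
  // remains usable after the call.
  borrowing func copy() -> ReadOnlyView {
    ReadOnlyView(base: base)
  }
}

func demo() -> (Int, Int) {
  let first = ReadOnlyView(base: [1, 2, 3])
  let second = first.copy()   // explicit duplication, never implicit
  return (first.base.count, second.base.count)
}

let counts = demo()
```

Both values can be read at the same time, which is the non-exclusive aspect; what changes relative to a copyable type is only that duplication has to be spelled out.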

As noted in the proposal, I find span to be a description of the how rather than a description of the what of the property. This makes me dislike just using the type name as the property name. This being said, as I'm working on the pitch for OutputSpan I'm finding that storage is a bit neutral with respect to initialization state. In another channel, @Tony_Parker called it contents instead, and that is resonating. "Contents" must exist, and therefore are initialized. Just a thought.

That's true. But I still agree with Steve that .span will be the most obvious name with the least cognitive burden. I find it very helpful for a family of APIs around a single concept to share a short name. It's also helpful for the name to reflect how the property is exposed given that it impacts how and why you use it. When we don't put the what in the name, it unambiguously refers to the entire contents. Explicitly stating that only adds noise.

For the naming:

General naming patterns

We generally prefer to name properties primarily after their purpose. As a general rule, we do not simply repeat the name of the type, except for cases where the type is specifically created to back that particular property.

As an extreme example, consider the property Dictionary.keys. Its type is Dictionary.Keys, and that feels right: the only purpose for the Dictionary.Keys type is to be the result type for the keys property. Indeed, that property is the only way to produce dictionary key views. I think it would be fair to say in this case, the type name is in fact largely irrelevant, as we rarely (if ever) need to spell it out -- the type is legitimately named after the property.

On the other end of the spectrum, consider the description property of protocol CustomStringConvertible. The name tells us nothing about its result type; it focuses entirely on its purpose. Indeed, calling it string would be somewhat confusing, as it would tell us nothing about what sort of data we'd expect to find in the result. We would certainly cope if that property was named string, but I find description far better.

Our Span types are somewhere in between these two extremes; but I think they're closer to String than they're to Dictionary.Keys.

In my view, Span is a universal interface type that safely provides direct access to regions of memory. The proposed properties are going to be just one of many ways to generate span instances.

The proposed properties will obviously be very handy for contiguous containers like Array or String; but not all containers are contiguous, and not all memory regions are tied to one.

I think it makes sense for the name to clarify what the returned span is supposed to be, as it will make code that uses it easier to understand.

Appeal to direct precedent

These new properties are intended to be the spiritual successor to the Sequence protocol requirement withContiguousStorageIfAvailable that was added in SE-0237, back in 2018. Note how that name does not include the name of the type it yields, either: it is providing direct access to contiguous storage -- the fact that it happens to be represented by an UnsafeBufferPointer is beside the point.

Notably, SE-0237 ended up breaking with the previous precedent of simply calling such methods after the types they're providing. (For example, Array.withUnsafeBufferPointer and ManagedBuffer.withUnsafeMutablePointers both predate that proposal.)

Remember that the initial versions of that proposal used the names withUnsafe[Mutable]BufferPointer (like the legacy Array or ManagedBuffer methods still do) -- the change to "contiguous storage" came as part of the Core Team's acceptance notes after the second review.

The notes do not provide a rationale for this change, and I do not remember our engineering discussions at the time. But my (fairly informed) hypothesis is that it came from the desire to clarify the purpose of these APIs. (Although I may be remembering my reaction after the change got announced, rather than the actual reasons that led to it.)

withContiguousStorageIfAvailable does not have much better "mouthfeel" than withUnsafeBufferPointerIfSupported -- both of these feel quite bureaucratic and verbose. But the accepted name is significantly better at explaining what the provided unsafe buffer pointer is supposed to be; I think it leaves far less room for confusion.

So it goes in our case. A property named span would be infuriatingly noncommittal about what its result is supposed to be -- calling it storage directly (and succinctly) documents that it is intended to expose direct access to the container's storage representation.
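For readers who haven't used the SE-0237 API discussed above, here is a minimal sketch of how it behaves. The helper name sum is made up; withContiguousStorageIfAvailable itself is the existing standard library requirement. Note how the result is optional: types without contiguous storage simply return nil, and the caller falls back to ordinary iteration.

```swift
// Hypothetical helper: sums via the contiguous fast path when the sequence
// offers one, falling back to element-by-element iteration otherwise.
func sum<S: Sequence>(_ values: S) -> (total: Int, usedFastPath: Bool)
where S.Element == Int {
  if let total = values.withContiguousStorageIfAvailable({ buffer in
    buffer.reduce(0, +)   // `buffer` is an UnsafeBufferPointer<Int>
  }) {
    return (total, true)
  }
  return (values.reduce(0, +), false)
}

let arrayResult = sum([1, 2, 3])   // Array has contiguous storage
let rangeResult = sum(1...3)       // ClosedRange stores only its bounds
```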

On the intuitive meaning of "storage"

The purpose of these properties is to expose direct access to the native storage representation behind these container types. We've made significant effort to make sure they do exactly that. We aren't pussyfooting around this: the API signatures we're proposing intentionally disallow the returned Span to be over memory that's allocated on the fly. (Unlike most existing APIs like Sequence.withContiguousStorageIfAvailable, Array.withUnsafeBufferPointer or String.withCString.)

Obviously, this limits the number of types that can provide these properties; but in exchange, we vastly increase their value, by providing predictable performance. Unlike the old with* APIs (or coroutine-based alternatives), we want systems programmers to feel comfortable repeatedly calling these new properties in a loop. (Layering concerns often make it unfeasible to avoid doing that.)

The proposed name directly reflects this. Beyond clearly communicating what data the returned Span is expected to hold, the word "storage" is also strongly implying permanence -- if a memory region is only materialized for the duration of each access, then it shouldn't be called "storage"!

On clashes with internal names

I believe that good public API names should have clear precedence over internal/private names. We should not avoid choosing public names that clash with non-public names that can be trivially renamed. I believe authors will be able to swallow a one-time, minor inconvenience, in exchange for their clients getting better interface names as a result.

Not many container types can provide direct access to their storage representation as a single contiguous span. Even if we end up defining a protocol that requires this property, I do not expect many types would want to conform to it. But if the name storage feels like it interferes too much with existing implementation conventions, then by all means we can go with something else.

Naming alternatives

As I said above, I think the proposed name storage implies two separate things, both quite valuable:

  1. It sets a clear expectation that the returned Span contains precisely the same elements (and in the same order) as the container on which the property is invoked.
  2. It strongly hints that the Span directly accesses the actual, native storage representation, rather than something that's materialized on demand.

The "obvious" name span captures neither of these points. I think it's far too noncommittal to be a good choice -- it would not be the end of the world if we chose it, but it sets a mediocre precedent. We can and should aim higher.

@glessard's contents keeps the first point, but it mostly loses the second. It isn't bad, but perhaps it might be a bit too prominent -- I do not think we'll want to suggest that Swift programmers should look at this property as the preferred way to express accessing the contents of containers. (I look at Span types as mostly under-the-hood plumbing -- we'd heavily rely on them, but typically they'd be hidden behind higher-level operations.)

How about something like storageSpan? It preserves both of the points above, it avoids clashing with anyone's internal names, and it is obscure enough not to attract too much unnecessary attention. (I often complain about unwieldy type names, but this is not that.)

If storage won’t do, how about contiguousStorage to emphasize that this is the contiguous underlying storage of the value? That would also be unlikely to clash with internal names in a type because it is not the logical name for a property you already know contains contiguous storage.

Sure; contiguousStorage is an even stronger match with SE-0237's withContiguousStorageIfAvailable. I think it'd be a good choice -- I avoided mentioning it only to avoid triggering the span tribe.

For whatever reason, people seem to be incredibly attached to naming the property after its type. As explained, I find the idea of naming the property span neither "obvious" nor natural -- I honestly don't get why it keeps getting framed as such. (Are we considering the name Span so great now that we want to plaster it everywhere? I am still smarting from some of the pushback against it; and I am quite annoyed to suddenly find I have to argue against its overuse.)

Hence the compromise suggestion storageSpan. It puts this beloved type name into the property name (in primary position, no less), directly acknowledging this widespread desire. But it also adds a qualifier to precisely explain the actual role/purpose of the result (following the pattern clearly established by startIndex/endIndex), while also still referencing SE-0237 -- acknowledging the intention of the proposal's author. That the result is a span over the container's actual storage is crucial information -- it explains the specific shape of the API, and the intent behind it.

(contiguousStorageSpan would be redundant, as Span is already strongly implying contiguousness.)

If we're renaming the Span-providing properties, how will that affect the naming of other APIs with MutableSpan, OutputSpan, RawSpan, and UTF8Span?


SIMD types don't seem to require contiguous storage. Only SIMDScalar inherits from BitwiseCopyable. The SIMDStorage and SIMD protocols were excluded in swiftlang/swift#73890.

Can you give examples of some other ways you'd envision? I think that would go far in sharing a sense of where on the Dictionary.keys-to-description spectrum we are.

I believe the mutable variant's property name would add mutable as a prefix. As a corollary, a property providing a MutableRawSpan would be mutableBytes. UTF8Span is likely to be provided only as read-only, and either a utf8 prefix or just utf8 would work. An OutputSpan would probably be provided as a parameter to a closure, since initializing storage generally requires post-processing that is internal to the container's implementation.

For sure! I have two other ways to generate spans in mind: (1) generic iteration over container/collection types, and (2) direct Span initialization.

(I consider both of these to be more important than the properties we're proposing in SE-0456. We're starting with the properties because their specific shape requires the least amount of unresolved language work.)

Sadly I didn't have time to edit this post down to a reasonable length, so TL;DR:

  • The properties we are proposing now are shortcuts for a very narrow sub-case of high-performance borrowing iteration over in-memory container types. I believe that borrowing iteration will become the most frequent source of spans in most Swift code, and I'd like to make sure we'll be able to match API names between the properties we're now proposing and the planned full-blown iteration protocol. These operations all serve similar purposes, and I believe this should be reflected in their names. If we go with var span here, then that makes me worry what name we will choose for the iteration operation.
  • I wish to clearly distinguish the "direct" operations from less performant / less useful coroutine (or closure-based) alternatives. (We are also considering those as future add-ons to the existing Sequence/Collection protocols, separate from the high-perf borrowing iteration constructs.)
  • Finally, I argue that the primitive way to create a Span is to invoke one of its UnsafeBufferPointer-based initializers, to be proposed later. I believe that it is best to think of these unsafe initializers as the actual way to create Span instances; after all, every other form of span generation will ultimately end up calling one of them.

(Note: I am also working on a document specifically detailing the design of the new container protocols and the new data structure variants that they enable. This post necessarily has to preview some of that.)

Most prominent span source: borrowing iteration

I hope that client code will generally not need to drop down to working on direct span instances. But there will always be cases where that becomes necessary, and I expect the most obvious way to produce spans will be to use new iterator constructs, in code that is generic over a general-purpose container protocol.

The idea is to fully embrace the fact that container types have piecewise contiguous storage, and thus define borrowing iteration to use contiguous storage chunks as its primary unit. (This is in contrast to IteratorProtocol, which currently requires one full call to next() per each subsequent item.)

// All names and precise API signatures are subject to change
protocol Container {
  associatedtype Element: ~Copyable, ~Escapable
  associatedtype BorrowingIterationState: ~Copyable, ~Escapable

  @lifetime(self)
  borrowing func startBorrowingIteration() -> BorrowingIterationState

  @lifetime(self)
  borrowing func nextStorageChunk(
    from state: inout BorrowingIterationState, 
    maximumCount: Int
  ) -> Span<Element>

  // ... plus Collection-like interfaces for interacting with indices
}

This can be considered to be a generalization of the existing collection protocol hierarchy -- but it also adds notable limitations: in particular, it requires all elements of the container to be physically stored in memory, so that spans can be produced that survive as long as the container instance does. (This is to achieve rough performance parity with C/C++ code that operates on unsafe inner pointers. It also allows client code to collect working sets of multiple container chunks: the spans vended all have matching lifetimes, which makes it feasible to insert them into, say, a nonescapable dictionary variant holding items that are yet to be processed.)

I expect that the nextStorageChunk method (whose name is very much a throwaway placeholder) will become the primary way for Swift code to produce Span instances for general processing.

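var it = container.startBorrowingIteration()
while true {
  // Pass an explicit upper bound; `.max` places no practical limit on the chunk size.
  let span = container.nextStorageChunk(from: &it, maximumCount: .max)
  if span.count == 0 { break } // End of iteration
  // ... process `span`'s items ...
}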

This can be explicitly spelt out (like we sometimes manually call makeIterator()/next() today); but borrowing for-in loops would hide these details, expanding to code like the snippet above.

(The benefits are that (1) we don't need to copy elements to borrow them and (2) while the container type and its borrowing iterator may both be opaque types with non-specializable operations, Span itself is very amenable to optimizations: we go from having to dispatch through a witness table once per each item to once per each piece of contiguous storage. Both of these can be significant performance limitations in current sequence/collection types.)
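To get a feel for the chunk-at-a-time loop with today's toolchain, here is a runnable sketch. It is an approximation under loud assumptions: ArraySlice stands in for Span (so nothing here enforces lifetimes or avoids copies), the SegmentedBuffer type is invented for the example, and the method names merely echo the placeholder names in the draft.

```swift
// Sketch only: `ArraySlice` stands in for `Span`; all names are hypothetical.
struct SegmentedBuffer<Element> {
  // Piecewise contiguous storage: each inner array is one contiguous segment.
  var segments: [[Element]]

  struct IterationState {
    var segment = 0   // index of the segment being visited
    var offset = 0    // position within that segment
  }

  func startBorrowingIteration() -> IterationState { IterationState() }

  // Returns the next contiguous chunk, at most `maximumCount` elements long.
  // An empty result signals the end of iteration.
  func nextStorageChunk(
    from state: inout IterationState,
    maximumCount: Int
  ) -> ArraySlice<Element> {
    precondition(maximumCount > 0)
    // Skip segments that are exhausted (or empty to begin with).
    while state.segment < segments.count,
          state.offset == segments[state.segment].count {
      state.segment += 1
      state.offset = 0
    }
    guard state.segment < segments.count else { return [] }
    let segment = segments[state.segment]
    let end = min(segment.count, state.offset + maximumCount)
    let chunk = segment[state.offset..<end]
    state.offset = end
    return chunk
  }
}

let buffer = SegmentedBuffer(segments: [[1, 2, 3], [], [4, 5]])
var state = buffer.startBorrowingIteration()
var chunkSizes: [Int] = []
var total = 0
while true {
  let chunk = buffer.nextStorageChunk(from: &state, maximumCount: 2)
  if chunk.isEmpty { break }
  chunkSizes.append(chunk.count)
  total += chunk.reduce(0, +)   // tight inner loop over one contiguous chunk
}
```

The container is consulted once per chunk rather than once per element, and a chunk never crosses a segment boundary, which is the property the draft relies on.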

The properties proposed in SE-0456 are a shortcut form that enables this kind of iteration, restricted to only working on fully contiguous container types.

Of course, code that only wants/needs to work on a specific, contiguous container type will sometimes prefer to invoke the proposed properties rather than going through the full start/next dance. But I do expect developers will often continue to prefer to use the generic container algorithms -- similar to how (I hope) people generally prefer to call Sequence.filter or Sequence.map on Array values today, rather than rolling their own loops inside a withUnsafeBufferPointer invocation.

As such, I think it would be useful if the naming pattern we establish in SE-0456 would be directly applied to the names of the future iterator methods:

  • var span suggests func nextSpan,
  • var storage suggests func nextStorageChunk (or similar)
  • var contents suggests func nextContentsChunk (or similar)
  • var storageSpan suggests func nextStorageSpan.

Of course, any of these would be technically viable names. That said, personally I find func nextSpan way too generic/noncommittal (even more so than I did var span), and I hope we'd rather end up with something more specific.

We can also choose to hide the relationship between the contiguous properties and the upcoming piecewise contiguous iteration methods, say, by going with var span but func nextStorageSpan(). I think that would be a missed opportunity, though.

Potential Sequence amendments

Later on, we also intend to explore the idea of extending Sequence to allow some multipass sequences to provide an optional borrowing iteration fast path, along the same lines as above:

protocol IteratorProtocol /*: Copyable, Escapable */ {
  associatedtype Element /*: Copyable, Escapable */
  mutating func next() -> Element?

  @lifetime(self)
  mutating func nextChunk(maximumCount: Int) -> Span<Element>?
  // (Straw-person name; default impl returns nil)
}

(It is also possible we'd decide to turn the nextChunk method into a single-yield coroutine, to allow collection types with frozen iterators to implement it. That would require language extensions even beyond what we foresee we need for the high-perf container protocols.)

Notably, the Span returned here would be borrowing the iterator instance, not the container itself -- so this form of borrowing iteration would allow the vended memory region to be temporarily materialized on demand (and owned by the iterator), to match the semantics of our existing, copyable Collection -- e.g., think cases like Range<Int>. (This is a major performance limitation: for example, it means that clients generic over Sequence would only be able to look at a single span at a time -- the previously returned span cannot survive the next call to nextChunk.)
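A rough sketch of what "materialized on demand and owned by the iterator" could look like for something like Range<Int>. Everything here is an assumption made for illustration: the iterator type is invented, an Array owned by the iterator stands in for the memory a Span would point into, and the method returns an empty chunk at the end instead of modelling the optional result from the draft. Unlike a real Span, the ArraySlice returned here is an independent value, so the "previous chunk dies on the next call" rule is described in comments but not enforced.

```swift
// Sketch only: all names are hypothetical, and ArraySlice stands in for Span.
struct ChunkedRangeIterator {
  private var remaining: Range<Int>
  // Scratch buffer owned by the iterator; overwritten on every call.
  private var scratch: [Int] = []

  init(_ range: Range<Int>) { remaining = range }

  // Materializes up to `maximumCount` values on demand (linear work per chunk).
  // With a real Span, the result would borrow `scratch` and become invalid
  // as soon as this method is called again.
  mutating func nextChunk(maximumCount: Int) -> ArraySlice<Int> {
    precondition(maximumCount > 0)
    guard !remaining.isEmpty else { return [] }
    let count = min(maximumCount, remaining.count)
    let end = remaining.lowerBound + count
    scratch = Array(remaining.lowerBound..<end)
    remaining = end..<remaining.upperBound
    return scratch[...]
  }
}

var iterator = ChunkedRangeIterator(0..<5)
var materializedSizes: [Int] = []
var materializedSum = 0
while true {
  let chunk = iterator.nextChunk(maximumCount: 2)
  if chunk.isEmpty { break }
  materializedSizes.append(chunk.count)
  materializedSum += chunk.reduce(0, +)
}
```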

Therefore, spans returned/yielded by this form of iteration would not necessarily be "storage" chunks in the sense SE-0456 (or the Container protocol draft above) uses that term. I think it would be useful if this difference would be reflected/advertised in their API names -- e.g. the drafts above use nextStorageChunk vs nextChunk.

I think it would be especially lovely if the API names would be harmonized across all of these cases, so that it becomes straightforward to see at a glance whether a piece of Swift code is at risk of accidentally regressing to quadratic runtime performance. (Span-returning operations with storage in their names would be guaranteed to have sublinear costs that make them friendlier building blocks; ones that don't may materialize contents on demand with linear complexity, considerably limiting practical use.)

(All this is to enable borrowing iteration of multipass sequences. Mutating iteration and (especially) consuming iteration are also quite important, but they would be based on other span flavors, not Span, and they will not necessarily follow the patterns of the Container protocol draft above.)

Direct unsafe initializers

There is also the matter of how all these span-generating APIs actually end up getting implemented.

The most primitive/fundamental way to create a Span instance is to explicitly invoke a primitive Span initializer, passing it an UnsafeBufferPointer along with a clear set of semantic promises about the initialization state and guaranteed lifetime of the items it references. These initializers are inherently unsafe, but they are also generally unavoidable -- all other ways of generating spans (incl. the properties in SE-0456) ultimately end up calling them to actually create the spans they return.

As soon as we become confident in the shape of these initializers (and we have the syntax to confidently express their lifetime semantics), I expect we will propose them as public API, to allow custom data structure implementations and manually implemented interoperability wrappers to invoke them as needed.

I think of these direct initializers as the primary form of Span construction, and I believe this to be a useful view in general. All span-typed properties and span-returning functions are "merely" safe wrappers around these core initializers.

It’s fascinating to see the initial draft of how borrowing iteration will be structured. I’m curious about the role of the maximumCount parameter in the nextChunk function. How does the caller determine or decide the length of the contiguous data chunk stored in memory?

My reading of Karoy's draft API is that the caller doesn't decide. It's just an upper bound on how many elements are to be in the resulting span. If the contiguous data chunk has more than that many elements, it gets split into multiple spans of that size (or less).
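Under that reading, the arithmetic for a single contiguous run works out as in the small sketch below. The helper is made up purely to show the splitting; it is not part of any proposed API.

```swift
// Hypothetical helper: the span sizes a caller would see for one contiguous
// run of `count` elements when asking for at most `maximumCount` at a time.
func spanSizes(count: Int, maximumCount: Int) -> [Int] {
  precondition(count >= 0 && maximumCount > 0)
  return stride(from: 0, to: count, by: maximumCount).map { start in
    min(maximumCount, count - start)
  }
}

let splitRun = spanSizes(count: 10, maximumCount: 4)   // run longer than the bound
let wholeRun = spanSizes(count: 3, maximumCount: 8)    // run shorter than the bound
```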

IIUC, any kind of MutableSpan would need to enforce exclusivity during mutations, so any operations which modify a value would need to be mutating and the span itself would need to be stored in a mutable binding (var/inout).

So what would be the difference between a MutableSpan stored in an immutable (let) binding and a regular Span?

Rather than having separate mutable/immutable span types and properties, couldn't mutability be dictated by the kinds of accessors offered by the type and used by the client?

After all, we don't have separate Array/MutableArray types - whether you can modify an array that is part of another variable depends on the accessors it makes available to you.
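For reference, this is the Array behavior being described, as a minimal sketch with an invented Inventory type: the same [String] type is mutable or not depending on the binding and on which accessors the owner exposes.

```swift
struct Inventory {
  // Read-only from outside: clients only get a getter for this array.
  private(set) var archived: [String] = ["old"]
  // Read-write: the setter lets clients mutate the array in place.
  var active: [String] = ["new"]
}

var inventory = Inventory()
inventory.active.append("newer")      // OK: mutable binding, setter available
// inventory.archived.append("x")     // error: the setter is inaccessible

let frozen = inventory
// frozen.active.append("x")          // error: `frozen` is a `let` constant
```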

We went through this with Atomic, and the solution was “you can’t use var”.
