Vector, a fixed-size array

scanon · October 11, 2024, 2:51pm

A fixed size deque, or a fixed capacity deque? (A fixed-size deque is not very interesting; actually a fixed-size any-container-other-than-array/vector/tuple/whatever-you-want-to-call-it is not very interesting, which is part of why it makes sense to give it a special name: it is in some sense the unique fixed-size container.)

FranzBusch · October 11, 2024, 2:52pm

Yes mixed them up. I meant a fixed capacity deque.

Nobody1707 · October 11, 2024, 3:20pm

For what it's worth, I believe Hylo uses Buffer as the name for its fixed sized arrays. Although they seem to be using the C/Java style T[N] shorthand, which requires an extra set of brackets for each extra dimension. Whereas [T; N] naturally lets you add new dimensions with just a comma, and the [T * N] & [T x N] forms would just need another multiplication sign.

I don't think we should even consider the shorthands that put T after N, since it would cause the placement of T to move when extra dimensions are added. For this reason, I also think that having the Count parameter before Element in the definition of Vector is wrong because it'll either be inconsistent if we add a fixed sized multi-dimensional array (or when users inevitably write one as a wrapper around a Vector) or it'll constrain said multidimensional array to have the element type in the wrong spot.

Joe_Groff · October 11, 2024, 3:29pm

Consider what happens when you index a nested construct. If you have let matrix: [N * [M * T]] then subscripting matrix[n] will index the outer array first, and matrix[n][m] will index the scalar element of type T from one of the inner arrays. Putting the count before the element type allows the order of nested dimensions to follow the order of successive indices. By a naive grammatical interpretation, syntaxes that put the dimension after the element type would compose the dimensions in the opposite order from how they're indexed; T[N][M] naively parses as (T[N])[M] and would be indexed by matrix[m][n] (same with [[T * N] * M] and similar constructs). C, Java, and other such languages special case the parsing of array types to apply the dimensions in backwards order. We could do that, but I would rather have a syntax that "just works".

Nobody1707 · October 11, 2024, 3:34pm

But why would we ever index like that when we actually support multi-parameter subscripts? An actual multidimensional array index with the T first syntax would look like:

let matrix: [T; M, N] = ...
matrix[n, m] = ...

Joe_Groff · October 11, 2024, 3:39pm

Even if there is eventually a standard multidimensional tensor type in the standard library, developers can and will nest this single-dimension vector type to build up ad-hoc multidimensional structures. And that's fine IMO; the only real motivation to have a separate type that's natively multidimensional would be if that type had a different layout policy from what you get by composing the fixed-size one-dimensional type (maybe it's row-major instead of column-major, or does cache-aware tiling of the elements, etc.)

ksluder · October 11, 2024, 4:03pm

I also strongly prefer that the type comes first. Putting the type last is harder to read due to the pileup of closing delimiters, especially when the types are nontrivial:

typealias Header = [(tag: Int, values: [(tag: Int, value: String) * 4]) * 3]

// versus

typealias Header = [3 * [(tag: Int, values: [4 * (tag: Int, value: String)])]

allevato · October 11, 2024, 4:12pm

IMO, neither of those is an improvement over the other and they would benefit from intermediate types/typealiases. I think they're both equally difficult to grok—the location of the dimension isn't the problem here.

fclout · October 11, 2024, 4:14pm

Another motivation to have a native multi-dimensional fixed-size array is that the strictly memory-safe Swift compiler could look at you straight in the eye and give you a flattened Span<T> representing the entire collection and you could slice it any way that you liked (including in ways that don't respect dimension boundaries), whereas this would probably have to break out of the safe model if we only had nested fixed-size arrays.

Joe_Groff · October 11, 2024, 4:18pm

That's an interesting point. In previous discussions we've discussed having a protocol to indicate whether a type is a homogeneous aggregate of one element type. That could be used to provide a safe API for turning a pointer/span over one or more instances of the aggregate into a pointer/span of the element type.

ksluder · October 11, 2024, 5:20pm

If that’s in the cards, then surely it makes sense to model these as tuples and retroactively synthesize the HomogeneousAggregate conformance in a future runtime version?

Endowing tuples with the ability to describe repeated elements would also give the language a way to describe the memory layout of any nominal type, including classes:

class C {
  var elts: (Int * 8)
}

print(MemoryLayout<C>.StorageLayout)
// (elts: (Int * 8))

lorentey · October 12, 2024, 1:37am

"Vector" is an extremely well-established name for an ordered sequence of precisely n items, all of the same type. There is a particularly strong expectation that a type called "vector" would provide an array-like subscript operation. There is an equally remarkable lack of any expectations of resizability: when I see a type called vector, I would not expect it to come with any insert, append or remove operations. The name also rightly discourages operations like concatenation.

The phrase "fixed-size array" is one way to explain the meaning of this term in documentation to someone who has never heard of vectors. Dictionary definitions do not make good names though -- certainly not for something as fundamental as Vector.

(I expect we'll soon start discussing a shockingly large zoo of potential array variants besides the current Array and ContiguousArray, and I fully expect we'll use qualifier prefixes to name them: the prototypes have names like InlineArray, RigidArray, SmallArray, or DynamicArray. Vector isn't like any of these -- it feels like it is its own thing, far more fundamental than any member of the array family. The Array variants all provide roughly equivalent sets of operations, and the naming prefixes directly translate to other container types: it makes perfect sense to define a RigidDeque or an InlineSet. There appears to be no reason to define a fixed-size Deque, or a fixed-size Set though -- Vector is a unique thing, it isn't establishing a pattern.)

"Homogeneous tuple" is another, equally valid, way to describe/define what a vector is -- indeed, Swift has been importing C's stunted "array" types as homogeneous tuples, and we are hoping Vector will substitute for this use. "Tuple" is a great name, but it's very much taken to mean the general type product: types like (Int, String, Bool) are what Swift calls a "tuple". Unlike "vector", the term "tuple" strongly implies heterogeneity, and it lacks an expectation of an indexing subscript. (Swift's tuples can be composed of different parts, and these parts are accessed with member syntax, not subscripts.) HomogeneousTuple would be a terrible name, for the same reason FixedSizeArray would be -- we should not be naming fundamental concepts by their dictionary definitions.

The Standard Template Library in C++ has swapped the names of std::vector and std::array. As quoted in the proposal, Stepanov himself considers this a mistake. Swift exists to fix the mistakes of C and C++, not to embrace and propagate them.

Luckily, we are already calling our array type Array, thanks to our Objective-C legacy. We therefore have a clear opportunity to call our vector type Vector. We should take it.

lorentey · October 12, 2024, 1:48am

"Buffer" is a great name. Unfortunately, the Swift Standard Library is already using it to mean a very different thing:

SE-0447: Span: Safe Access to Contiguous Storage

I don't think I've seen this spelt out anywhere, but the stdlib appears to consistently use the term "buffer" to mean a region of memory with unspecified (i.e., client-managed) initialization state. UnsafeBufferPointer and ManagedBuffer/ManagedBufferPointer do not (in general*) make type-level assumptions on which of their slots contain initialized values -- managing that is entirely outside their mandate, and they expect code that uses them to do custom bookkeeping to prevent misuse. This makes these constructs crucial for building efficient data structures (each of which come with well defined, but incompatible ways of performing that bookkeeping), but it also covers them in a thick layer of ambiguity and danger that makes it wildly inappropriate to use them as a way to exchange or transfer data. I look at the existing types as primarily useful for representing "storage", or data in stasis.

Vector is like a fixed-size array; it's like a homogeneous tuple; it's like a fully initialized buffer; it's like an escapable span that has its own inline storage. These are all valid ways to explain what it is; none of these phrases would work well as the name of the type.

Nothing prevents us from adding a conditional AdditiveArithmetic conformance to Vector. We did consider adding this; we did not include it in the proposal, but there is nothing specifically preventing this later.

However, note that the stdlib's preexisting "actual" SIMD vector types (SIMD2, SIMD3, etc) also do not conform to that protocol. This strongly hints at an underlying issue.

I'd be particularly interested in seeing packages that experiment with implementing operators like + or * (for scaling with a scalar) to Vector. There is a real risk that Swift's expression checker is not strong enough to bear such load, and that the compiler will not emit well-optimized code for complex vector expressions -- it would be lovely if we figured out how to resolve these. This experimentation does not need to (and I believe it really should not) start with an evolution proposal.

This is a valid reason not to add a standard + operator. (However, note that we do not expect to ever offer an operation to concatenate two vectors -- that is not what vectors are for.)

(Reasonable people may also disagree on the meaning of *.)

However, this should not prevent packages from experimenting with adding + and * operators. It also does not prevent Vector from providing named methods for forming linear combinations.

These are also great names. They come with a great disadvantage against Vector, though: they are brand new names, freshly invented with zero preexisting precedent (that I know of -- although "row" is being used in relational databases, in a context that has nothing to do with Vector.). I'd like us to keep them in mind for some actually novel use case in the future.

Even the most vehement critics of the name Vector must concede that it hints strongly at the nature of this type: indeed, while C++ mistakenly calls its array type std::vector, no actual person is ever seriously confused that we call our (much better) equivalent Array. Likewise, no actual person will ever be seriously confused if we called our (much better) std::array type Vector.

The STL's naming mistake simply reinforces the use of the term "vector" to mean a vaguely array-shaped thing.

If this argument was valid, it would apply to every new type we ever add to the stdlib. Luckily, its premise isn't true -- names defined locally (or explicitly imported) shadow all names implicitly imported from the stdlib. Any preexisting (or future) local use of the Vector type name will continue to resolve to its local definition.

(We've gone through all this with Result.)

This is true, but I don't see how this is a negative thing -- part of the job of the Standard Library (of any language) is to establish great names for its fundamental primitives. The stdlib's names are huge part of what defines a language -- it defines the vocabulary of all Swift code.

From what I can tell, the type we're introducing is the best possible use of the name "vector". If someone finds another vector-shaped type, they are more than welcome to name it as such -- they just need to figure out a qualifier that distinguishes it from the stdlib's type.

The stdlib ships with a generic type called Set that is based on a very specific variant of hash tables. This is just one possible way to define a useful set, even if we assume that sets are based on hashing. The stdlib naming its type Set has indeed discouraged other people from using the name Set as is; but it has not prevented us from adding different (sometimes wildly different) set-like types -- we just tend to add qualifiers, like OrderedSet, TreeSet, or SortedSet.

The Standard Library shipping with a Vector type will likewise establish a clear direction for naming similar types. (If there are any; Vector is certainly going to be difficult to compete with on merits.)

List is a great name. All previous usages of it that I've seen (and I've seen quite a lot of them) were resizable, like an array.

(There is also the very prominent usage of the name List in SwiftUI. Like it or not, it practically prevents us from ever using this name for a standard container type.)

That would imply there was something broken with Array. There isn't.

In some early drafts, that was indeed the working name for this type. We've moved on from it because we realized Vector isn't really an array variant. Early feedback also indicated that the adjective "rigid" has some negative connotations that we do not want to associate with Vector.

We now use the Rigid prefix as the provisional naming convention for the fixed-capacity, but heap allocated array, deque, set variants: RigidArray, RigidDeque, RigidSet, etc. (The negative connotations of rigidity are apt there, as these variants are rather annoying to use in practice.)

simonomi · October 12, 2024, 3:39am

What's the reasoning behind this? As far as I can tell, that's exactly what it is (the name of this thread calls it "a fixed-size array"!). It serves the same roll as Array, but with some additional restrictions, and thus better performance. For every use case I can think of, Array could be used instead, at the expense of performance.

To me, it seems best to treat all these various fixed-count/fixed-capacity/etc types in the same way—as variations of their respective 'base' type.

Array, FixedCountArray, FixedCapacityArray
Dictionary, FixedCapacityDictionary
Set, FixedCapacitySet
Span, FixedCountSpan
etc.

lorentey · October 12, 2024, 3:47am

Vector isn't an array variant, because it doesn't come with the expected set of array operations. You cannot insert new elements into a vector; you cannot remove any existing items. Array types are range-replaceable containers -- Vector is not.

Span is already a fixed-count container. There is no such thing as a FixedCountSpan. (Just like there is no such thing as a FixedCountSet, or a FixedCountDeque, or a FixedCountString. Adding "static-countedness" does not seem to be a useful way to generate new variants of existing data structures. We do have StaticString -- its instances have constant size, but this size isn't part of the type. Vector is a brand new thing.)

simonomi · October 12, 2024, 3:58am

Oh right, of course . I guess I was imagining having the count be stored in a generic parameter, but there's not really a big difference there.

For some reason, despite Vector not having all the same array operations, it still 'feels' like the same kind of thing to me. Every other different data structure adds something new: Dictionary has key-value pairs, Set is unordered and without duplicates, and so on. But Vector has a subset of Array's capabilities, with the same kind of subscript, indices, and iteration. The only significant difference I can think of is being stored inline, which I don't think merits being its own separate kind of data structure.

lorentey · October 12, 2024, 4:33am

I think this is a useful feeling, but consider that Array is actually quite far away from Vector -- we have types in the stdlib that are far closer!

The existing type I most associate with Vector is Unsafe[Mutable]BufferPointer. The new Span type that is currently under review is even closer; the MutableSpan type we're drafting will be even closer still. None of these types carry the word "array" anywhere in their names.

Vector is part of a family of fundamental low-level types, all forming vaguely similar shapes: they are all fixed-size containers; they all have integer indices; they share much of their core API, and they are all backed by a single contiguous memory region. Their contents are mutable (as long as we have mutating access), but their structure is not: they do not support resizing. There are very strong commonalities between them, but they are all special in some unique way. They all feel like fundamental constructs that will play extremely prominent roles.

These types are all modeling different aspects of the concept of a "memory region" -- this is such an important concept, it appears it cannot be captured by just a single type!

We do not intend to replace the great name Span with an absurdly bad name like NonescapableFixedSizeArray. UnsafeBufferPointer isn't a great name, but UnsafeFixedSizeArrayWithPotentiallyUninitializedElements would be so much worse!

The similarity of Vector/Span/UnsafeBufferPointer to Array is helpful when explaining these types, but it’s entirely superficial, and we should not over-emphasize it. Vectors, spans, buffers aren't arrays -- we should not call them that.

I promise: the name Vector will grow on you. All great names do!

ibex10 · October 12, 2024, 5:21am

Not only that, it so smoothly rolls off the tongue: saying Vector.

lorentey · October 12, 2024, 7:20am

(re: RigidArray) In some early drafts, that was indeed the working name for this type.

(Fun anecdote: the moment that turned our draft RigidArray type into a bona fide vector was particularly noteworthy. In the early days, we planned our precursor type to be unconditionally noncopyable — until one particularly memorable & productive engineering discussion, where we decided that it needs to become conditionally copyable (despite the immense extra complexity this adds to its implementation).

That very same day, two separate groups of people independently realized that copyability has now turned this type into a true vector, and that Vector is the obviously right name for this construct. This made for a comical scene when members of the two groups met up later in the day, fully expecting that they’d need to convince the other.

This was the only time I ever experienced such a naming miracle; I do not expect it to ever see it happen again. It has left me with a rock solid conviction that Vector is the right name for this shape — I feel it deep in my bones, and it would take quite some effort to convince me otherwise.

In contrast, the name Span was a very difficult birth — it took months of diligent effort to convince people about it. The type’s mediocre (at best) working titles have even made it into some formal pitch documents!)

jberry · October 12, 2024, 7:46pm

As people seem interested in how names of related types fit together and describe those types, and since it sounds like you have a strong inclination of what those future related types should be, can you spitball the names of the members of this family to show how you anticipate the naming of the family tree will fallout? So far, we’ve got Vector.