[Second review] SE-0453: Vector, a fixed size array

Geordie_J · December 10, 2024, 5:36pm

Thank you for the explanation.

I don't see why I wouldn't use Vector for DSP work though, e.g. taking N input values, processing them, ending up with N/2 output values (in a separate Vector), etc.

While I can see the benefit of not having to initialize every value in some circumstances, I don't (yet?) share @John_McCall's concerns about Vector not being suitable for these tasks.

On the other hand, I am slightly concerned by the implication that Vector could be implemented in terms of this other hypothetical type. i.e. it seems Vector is a specialization of the other type, and not vice-versa. Why are we starting with Vector then?

lorentey · December 10, 2024, 8:20pm

Our task is to give this type one name. Suggesting to give it two names instead seems incredibly unhelpful to me. I think syntactic sugar would solve nothing in this case; it would only add headaches, without giving us anything in return.

Yes, we can invent some opaque notation that avoids having to spell out a type name. But gobbledygook like [8 * Int] does not come with a pronunciation guide! How will we talk about it? How will developers search for what it means? What benefit would such a notation provide that cannot be achieved better by a regular type name?

This idea merely adds another (and far more pointless) naming axis to argue about, with the additional complication of being heavily limited by the need to keep the language parseable. This leads to the design of the notation getting largely dictated by (real or imagined) technical constraints, rather than any relevant preexisting tradition or aesthetic considerations. The resulting notation will be entirely unfamiliar to everyone.

(Tuples implement type products, and Vector implements type exponentiation. The obvious shorthand syntax for Vector<10, Int> is therefore Int¹⁰. Obviously, this does not translate well into Swift.

If for some reason we were forced to come up with a shorthand notation anyway, then I'd be partial to [42 * String], because (1) that was the least bad of the few dozen variants we considered, and (2) I'm childishly amused by the idea of wildcard _ substitutions getting rendered as [_ * _]. Of course, this [n * T] notation is suffering from the same discoverability/pronunciation issues as every other option; I do not think it has merit.)

grynspan · December 10, 2024, 8:51pm

It had occurred to me that this type offers a "safer" parallel to UnsafeBufferPointer, but I worry it may make it harder to developers to reason about when to use it vs. Span.

ksluder · December 10, 2024, 10:18pm

I wish the Span discussion hadn’t been split from Vector, because for a long time I have had a gut feeling that there’s some informative relationship between Vector and Span that would lead to a better name for Vector.

xwu · December 10, 2024, 11:04pm

Well, if pow(n, 10) computes the 10th power of n, then should we spell this type Pow<Int, 10>...

(We could then also have a nice ditty to remind folks that Pow's not CoW.)

brendon · December 10, 2024, 11:37pm

I support the proposal as written, including the name Vector. But if we must look for alternatives, Multiset or Bag are the most technically fitting names I could find. As I understand it, the math term Multiset/Bag refers to a set-like entity that allows multiple items of a specific kind. Multiset might be confusing given what a Set is and how different this type is from a Set. Bag feels somewhat meaningless, but then, it does avoid the confusion that other terms may cause.

I'd say Box would be good for this if it didn't have another common meaning in programming already.

lorentey · December 10, 2024, 11:39pm

These would establish a precedent for a particular style of naming that personally I find distasteful. I'd much prefer not to end up being forced to say that this-or-that function should return an "inline" of four integers, or a "repeat" of six floats, or a "storage-of" of ten doubles.

These would be good names, if they weren't already taken to mean things different than the proposed Vector. Unfortunately Swift already uses these terms; hijacking them for the new type would lead to endless confusion.

(I've previously written about Tuple, Buffer. As James has noted, Block means a group of statements.)

To reiterate once again, the label "array" in Swift refers to a particular family of data structures. In the fullness of time, I expect the members of this family to be as follows:

Array<Element> - This is the preexisting, copyable, self-resizing, dynamically allocated array type that implements the copy-on-write optimization. It only supports copyable elements. It has built-in interoperability with NSArray from Objective-C, leading to unpredictable performance.
ContiguousArray<Element> - Array without the Objective-C bits.
InlineArray<count, Element> - Future conditionally copyable, fixed-capacity, nonresizable array type with inline storage.
RigidArray<Element> - Future noncopyable, fixed-capacity, manually resizable, dynamically allocated array supporting noncopyable elements.
DynamicArray<Element> - Future noncopyable, self-resizing, dynamically allocated array supporting noncopyable elements.
SmallArray<inlineCapacity, Element> - Future noncopyable, self-resizing array type with built-in inline storage for a predetermined number of items, dynamically switching to heap allocations when necessary. (Effectively an enum of InlineArray and RigidArray cases, with extra logic to switch between them.)

(Of the four new types listed here, RigidArray and DynamicArray are currently within reach -- the primary blocker is that we cannot define subscript accessors with correct semantics, but that objection also applies to Vector and Span, so I suppose we can simply continue to pretend the problem does not exist, and instead keep having endlessly looping arguments about their names. InlineArray has more technical issues, as we haven't reached an agreement on how it (and types like it) will coordinate with the Swift runtime to tell it where their initialized elements are. Each of these potential new array types have important use cases; but which of these four (if any) we'll actually end up proposing to the stdlib is getting less clear to me with every post I read on this thread.)

All members of this family of types will provide the same operations; they all share largely similar performance characteristics; they have similar use cases.

The proposed Vector is not a member of this type family.

Vector belongs in the same loose type family as Span, UnsafeBufferPointer and UnsafeMutableBufferPointer. The upcoming MutableSpan will join the same category. None of these are array types, and none of them should be named as if they were.

These are great names. If we somehow end up choosing a name other than Vector,
I would be fine with either one of these. I'll throw in my own plan B, too, which is to call the type Run or Spread. (As in, "this function should return a run of 4 integers" or "that struct should store its contents as a spread of 8 floats".)

All of these suggestions suffer from the same fundamental issue -- we are pulling them out of our collective behinds, with zero regard to precedent and tradition.

The type we're proposing is not a novel invention; it isn't a radical new idea. It just reformulates a preexisting shape to fit Swift. The proposal simply chooses the most obvious name for it, adopting centuries of use and embracing decades of practice.

For a newborn programmer, every one of these names will be equally opaque: whether we name the thing Vector, Batch, or Run, they will have to look up its definition to learn what it means.

But what makes Vector so much better is that it immediately gets people with a little more experience into the right frame of mind, whether they are a recovering C++ programmer or were just force-fed a bit of linear algebra at college. It is really difficult to misunderstand the meaning of a "vector of 5 integers" -- the name already puts you in the correct ballpark.

For people with some maths background, the name Vector induces the automatic expectation of copyability, homogeneity, inline storage, and subscripting with integer indices. It strongly hints at the lack of insertion/removal operations, and the lack of concatenation. It fits the type miraculously well; far better than most names.

For C folks, it seems pretty straightforward to explain/understand that C arrays turn into Vector in Swift -- it's certainly easier to grok than the mess that is homogeneous tuples!

This is a good idea! It is a bit of a copout, but truncating the name still preserves the intuitive associations that I value so much about Vector. I don't hate it!

lorentey · December 10, 2024, 11:54pm

Interesting! I think I'd prefer if we reserved these terms for actual multisets/bags, i.e., set-like data structures that provide fast lookup operations.

Nobody1707 · December 11, 2024, 1:43am

I still think Vector is the best name. I don't buy the it'll confuse the C++ programmers argument, as it's well known in the C++ community that std::vector and std::array were named backwards.

I also don't buy the "it's not common enough for a concise name argument" either, as the name Vector<n, T> isn't even concise enough for the average graphics/linear algebra program in the first place. Those will almost certainly be using types (or aliases) named like Vec4, Vec3, Mat4, etc.

Anything heavily using Vector directly (as opposed to being a thin wrapper over one) is almost certainly going to be happening in extensions or constrained generic functions, which will have enough boilerplate that any difference between Vector, StorageOf, and HomogeneousTuple will be largely academic.

Gavin_Eadie · December 11, 2024, 4:38am

Perhaps I'm not sufficiently versed in the details of this long, fascinating debate but isn't another distinction between Array and our proposed new type that the former isn't necessarily homogeneous and the latter is? So the (oft-mentioned, disambiguating) Fixed* while illustrating one difference makes no reference to another?

And, I am not recommending FixedHomogeneousArray .. for what it's worth, for me:

Fixed doesn't say what is "fixed"
the new thing isn't an Array
Clump, Lot, Span .. don't convey it's indexed, composite nature

Nothing is harder than naming things! I dipped into An Exaltation of Larks, by James Lipton for inspiration wherein there's a "scream of swifts" but excuse my levity, I know this is a difficult and serious issue.

For another off the wall suggestion: Series .. multiple things, usually ordered, homogenous but, unfortunately, extendable ..

lorentey · December 11, 2024, 5:27am

I think this is an interesting note, and I am curious to learn more about your reasoning here. For what it's worth, I consider Vector<4, Float> to be a perfectly viable generalization of what e.g. Metal calls its "Vector Data Types". Citing the language specification:

2.2 Vector Data Types

Metal supports a subset of the vector data types implemented by the system vector math library. Metal supported these vector type names, where n is 2, 3, or 4, representing a 2-, 3-, or 4- component vector type, respectively:

• booln
• charn
[...]
• floatn

Metal also supports vec<T,n> where T is a valid scalar type and n is 2, 3, or 4, representing a 2-, 3-, or 4- component vector type.

I find the vec<T,n> shorthand notation to be particularly noteworthy; in my reading, it reinforces that we're on exactly the right track when proposing to name our generalized construct Vector. (And it feels great to be able to propose putting the type arguments in a more sensible order.)

Just like Swift's proposed Vector, Metal's vector types are designed to represent things other than classic "vectors": points, sizes, colors, sequences of booleans, etc. Are you saying that Metal (as well as CUDA, OpenCL, etc.) are also misusing the term "vector"? Or is there something specific about Swift's proposed usage that puts you off? Can you describe what feels wrong about it? To me, they all look the same; our Vector does away with the artificial size limitations, and it uses simple, general-purpose rules for its memory layout, but these do not feel particularly crucial.

For what it's worth, I do find it noteworthy how these specialized languages seem so eager to disregard the linear algebra roots of the term "vector", instead treating it as a generic label for homogeneous tuples holding arbitrary things. Again, in my view, this reinforces that Vector is the correct name for our construct -- if a shader language has no qualms about calling bool3 (or even a thread index!) a "vector", why should we? Then again, I have never been an expert on these languages; I can count the number of times I used them in a professional capacity on two hands.

(E.g. Is there widespread consensus that it's been a mistake to call bool4 a "vector"? Why then are the myriad competing shader languages all continue to do so?)

Why not add those operations directly on Vector?

To reiterate from the discussion pitch, we aren't willing to add these as standard operations to the stdlib, because we haven't figured out how to do that in a way that's optimizable enough, and we don't believe that a set of CGAffineTransform-style discrete operations will suffice. (One specific expectation we have trouble guaranteeing is that expressions like 4 * (a - b) * (B + C) should produce a minimal number of explicit loops.) This is not specific to this particular formulation of Vector; the same issue applies to the standard SIMD vector types, CGPoint, etc. etc.

My personal opinion is that this should obviously work, and it is going to be a (non-critical, but real) library defect that it doesn't:

let a: Vector = [1, 2, 3]
let b: Vector = [3, 2, 1]
print(3 * a + b) // ⟹ [6, 8, 10]

However, we simply do not have a clear path to implementing such expressions in a satisfying enough way in Swift, no matter what type you substitute for Vector.

Whether or not you need to care about deeply optimizing such expressions depends on your use case. Evidently, keeping these as discrete operations is acceptable in some contexts -- Swift has been importing CGPoint.applying(CGAffineTransform) as an opaque, discrete function, and the sky somehow hasn't fallen yet.

I do believe that it makes sense to propose Vector without these traditional operations -- it provides a core primitive that isn't possible to define outside the stdlib, and is independently useful even without such operations. The operations can be defined outside of the Standard Library, and indeed I think they should be!

Accordingly, I highly encourage experimenting with defining such operations on the proposed Vector, in packages outside the stdlib. (It should be possible to do this without engaging in inadvisable things, like retroactively conforming Vector to any standard protocols. If it isn't, then that's an actionable defect!)

For extra fun, Swift developers should also feel empowered to play with nonstandard convenience extensions like this one:

extension Vector where count == 3 {
  var x: Element { 
    get { self[0] }
    set { self[0] = newValue }
  }
  // y, z, r, g, b defined along the same lines

  var xy: Vector<2, Element> { 
     get { [x, y] }
     set { x = newValue.x; y = newValue.y }
  }
  // etc
}

Please go wild!

Yes! I believe signal processing is going to be a viable use case for Vector, just like it has been for C arrays.

At the same time, I also agree with John that functions that take or return such buffers will often be better expressed in terms of taking/returning non-owning Span values.

Vector will work well at holding data at rest -- serving as its storage representation. I expect it will be generally less useful for representing data in transit, because it is so specific: a function taking a vector can only ever be called with one, potentially forcing the caller to copy its data into a new temporary vector just to be able to invoke it.

But Vector being "less useful" for this purpose isn't the same as "not useful at all" -- there are legitimate reasons for people to want to pass around packs of items of a particular size (especially small ones) by value, and I expect Vector to be a useful tool for doing that, whether or not it gets wrapped in a custom type.

Clarification: we aren't planning to introduce a type called FixedCapacityArray, because that name would be ambiguous. There are two variants of "fixed-capacity array" types under consideration: their draft names are InlineArray<capacity, Element> and RigidArray<Element>. They differ in how they represent their storage -- InlineArray has inline storage, while RigidArray uses a heap allocation. (Additionally, InlineArray is expected to be conditionally copyable, while we're planning to keep RigidArray noncopyable.)

As for "RingBuffer", the swift-collections package has been shipping that data structure under the name Deque. (We can of course revisit the name when we propose that type as a stdlib addition. Oh what joy!) InlineDeque and RigidDeque are the two fixed-capacity deque variants that correspond to InlineArray and RigidArray: the draft naming prefixes establish a general naming scheme for these type families.

I do agree with this -- it seems unlikely that functions will generally want to directly take or return InlineArray or InlineDeque values.

But I think this argument doesn't really apply to Vector -- there are legitimate reasons for functions to take precisely n items, and so there will be cases when reaching for Vector will be the most appropriate choice.

Vector provides a viable new option in cases where a Swift function currently has take or return homogeneous tuples. Span is not helping with that at all.

Are these cases rare? It appears this depends on who I'm talking to; it feels a little bit like asking how often do shader functions take int3 values. (Would it make sense to suggest rewriting them to take a pointer/count instead? I would not generally expect so.)

Vector is not a special case of InlineArray; nor is InlineArray a special case of Vector. The implementation of Vector will not be based on InlineArray, nor vice versa. These are independent primitive types, with distinct representations, written from scratch.

(If there is going to be any shared code, it will be across members of the same type family, not across different ones -- e.g., there is a good chance that InlineDeque will share large chunks of its implementation with RigidDeque and Deque; but there is no such opportunity between, say, InlineArray and Vector.)

We are starting with Vector because it has the simplest possible internal layout -- it consists of nothing more than a predetermined number of slots, and all of these slots are always initialized. This makes it straightforward to copy, move or destroy vector values.

InlineArray, InlineDeque, InlineSet, InlineDictionary etc. will require nontrivial additional coordination to figure out which of their slots hold initialized values. Ideally this would be implemented as part of their Swift definition; but we haven't reached a consensus on precisely how to express this. Accordingly, we aren't ready to propose any of these types yet.

ksluder · December 11, 2024, 5:37am

It is not. MSL’s vector data types must be layout-identical to the CPU-side types vended by <simd/simd.h>. On Apple Silicon, simd::float4 and metal::float4 have 16-byte alignment. Vector<4, Float> has 4-byte alignment, the same alignment as Float. This is analogous to simd::packed_float4, but it is a clear proof that Vector was not designed to be a generalization of Metal vector types.

Also, the SIMD and Metal vector types implement swizzling (elf. myFloat4.zxyw) at the language level.

lorentey · December 11, 2024, 5:58am

Indeed. However, as I did explicitly note, I consider these differences to be necessary generalizations to allow Vector to carry more than a handful of element types, and to support more than a strictly limited number of items.

Do you believe that calling bool4 a "vector" is somehow fine because it has an alignment of 4? I find that a deeply curious assertion.

Also, the SIMD and Metal vector types implement swizzling (elf. myFloat4.zxyw ) at the language level.

Do you believe that Vector needs to provide such members to deserve its name? Do they need to be hardwired into the Swift compiler, rather than implemented in Swift? My answer is an emphatical no, to both questions.

sveinhal · December 11, 2024, 8:41am

Given that this type should be discouraged from being used as a currency type, and we should discourage libraries from adding operators on it, and that its main use will be as a in implementation detail for internal storage of other types — I think StorageOf<n, T> is the better name!

No-one would conflate StorageOf<2, Double> as a point or coordinate, and no-one would be tempted to add addition on a typed name "storage".

It would be very convenient to implement both Point<Double>, Size<Int>, PolarCoodinate<Double>, Matrix<Real> and all kinds of types using a StorageOf as backing, and for each of those cases, the author would be free to implement a vector space, when appropriate.

hassila · December 11, 2024, 9:03am

I think this is probably the crux of different views on naming.

There is clearly different views on whether this type will be used as an internal implementation detail of currency types, or be a currency type itself.

This greatly influences the naming. If one is typically never returning this type, then you are not forced to say "this-or-that function should return..." as those functions won't be written.

So perhaps the prudent thing would be to really come to a conclusion there first - will this be a typical user facing currency type, or will it be an internal implementation detail of currency types?

It seems various contributors have fairly different views on this, so maybe try to nail that down first?

Sajjon · December 11, 2024, 9:23am

Many suggestions so far have been limited to "current literature" / naming standards. Someone - or a group of people around the same time - chose to use the word "Array" (a good choice obviously) to express an ordered collection of values. Perhaps it was Konrad Zuse with Plankalkül.

We might come up with a new word, not used by another programming language yet, to name this type - if one feels strongly that Vector would be more confusing than suitable (personally I don't think that).

What about: Row - a row of values, think spreadsheet row. we often visualize arrays as [a, b, c], arranged in a row rather than a column. This makes Row better than something pertaining to list - since when you write a list for e.g. groceries, you typically but them in a "column". I'm not discriminating against columns :) but I think a naming which is close to the visual of let x = [1, 2, 3] as Row<UInt8>, see see the values as row, rather than a column,

Not-that-terrible-suggestions
So here is my (probably not that good) attempt to try to be a little bit more... creative if you will!

Rack, think "spice rack", storage often "linear" of uniformly sized "slots"
Cabinet - a storage unit of fixed size with many drawers to hold stuff (values). Bad: not all drawers must be same size in a cabinet. Bad: more of a matrix than a list
CurioCabinet - like above, but the size of each slot seems to be fixed in size! Bad: more of a matrix than a list
Cubby - fixed size grid of slots you put stuff in. Less known (I'm not a native English speaker and I did not know about it, but that is really not a problem.) Bad: more of a matrix than a list

Worse suggestions

Hive - a structure of cells of same size. Bad: Bees can make it larger, so the structure itself is not fixed size.
Train - a linear structure of "train cars", fixed in size and each car often has same size (you typically do not append cars to the train while moving - i.e. once initialized).
CabinetBox - fixed size, with fixed side number of slots. Bad: very strong medical theme and maybe one does not want to think about pills while coding... also its not really generic, Item == Pill
PeaPod - not really same sized slots and also it grows (thank you mother nature!), also not generic, since Item == Pea....

"but but train / peas / honeycombs has nothing to do with computer science and programming languages". Well, before George Boole we did not have a name for "boolean algebra" either. Before Carl von Linné named all plats, we did not have a canonical name for them them. And YES I understand that the situation here IS different since this thing is called Array in some other languages. So yes someone has already named it. But that is in another language. This is Swift. Surely other people had named plants in other languages before Carl von Linné named them in latin! Same rule applies here... my point is I don' think we should be afraid of being creative and naming it something which creates a good visual representation.

At some point in time we all learned that a key-value structure is called a Dictionary, it is a good analogy! Is the structure in fact a book? No, but it is a good analogy and we learned it and now it comes natural to us all. We should not be afraid to have to learn some new term again.

And then in 10-20 years time when another set of smart people - possibly with a non-empty-intersection with us in here - discuss what to name this type in the new programming language Foobar then they might say: "Swift chose to call it Train and I think we should do that to".

Andropov · December 11, 2024, 11:05am

What would be the encouraged currency type when returning a collection of precisely N elements? I understand (and agree) that for many use cases (Color, PolarCoordinates...) you will want a wrapper type to clarify what the values mean semantically and how to operate with them. But sometimes, one may just want an indexable collection of precisely N elements, with no "extra" semantic context to clarify.

Imagine I want to write some code to train a ML model on batches of images of precisely 32 elements. Would it be ill-advised to bundle the elements in a plain Vector<32, ImageID>? Is is really necessary to create a TrainingBlock type to wrap around Vector<32, ImageID>? Maybe when writing the public API surface of a library, and I'm not even convinced of that. For module-internal code, I'm sure there'd be a lot of code happy to pass Vector around with no wrapper. Is that a bad thing?

For this example, it would feel particularly weird to use StoreageOf<32, ImageID>. Where is it stored? Must it always be stored?

I have been thinking of Dictionary a lot while reading this thread too, but I was on the fence about bringing it up.

Dictionary is a fairly uncontroversial name (as far as I know!) yet it's possible to raise more or less the same issues for the name:

It's not really a dictionary.
While Dictionary can be trivially used as a building block of a "true" dictionary type, it isn't necessarily one (depends on how you use it, like Vector).
It's often an internal type and not as commonly used in API surface as other types, and very often you want to wrap a Dictionary with another type.

Subjectively, I feel like, if Dictionary were to be introduced today, it would be controversial, with at least some people in favor of a more descriptive name (ie KeyValueCollection, HashMap...).

I'm curious how (or if) people against Vector think Dictionary is different. Like others have commented before, I believe this is ultimately going to boil down to vastly different estimations of how widespread use of Vector will be.

ksluder · December 11, 2024, 3:57pm

That makes Vector<4, Float> unfit for the use case SIMD4<Float> and metal::float4 exist to serve: storing values in memory that are valid operands to SIMD instructions.

JuneBash · December 11, 2024, 3:59pm

I think I had a fundamental misunderstanding of what this type is. It sounds like it's actually closer to a tuple of homogenous values than a collection, no? Given that, perhaps Vector really is a better name. I don't like Tuple as a name given the confusion it could cause, but I think it actually gets closer. Maybe HomogenousTuple, but that still leaves plenty of room for confusion. I don't know, I think this is where my participation in the bikeshedding will have to end alongside my understanding of things.

technogen · December 11, 2024, 4:18pm

I'll reiterate that we have a more general naming problem on our hands: Any type with fully run-time functionality may get a companion type with very similar functionality, but with more compile-time behavior. Array is merely the first example of this.

I think it would be very helpful to solve this problem first and then see how it impacts the naming of the proposed type.