Vector, a fixed-size array

scanon · October 17, 2024, 3:01pm

Sorry, that was the result of a sloppy edit while writing in the browser; it should say "dot(a, a) == 0 with a non-zero" (Consider a = [1, i]).

(Originally I had something like "dot(a, b) == 0 with a and b non-orthogonal" but it's hard to make that objection formal because people commonly define "orthogonal" as "the dot product is zero")

GreatApe · October 17, 2024, 6:41pm

Yeah afterwards it hit me that you probably meant exactly the ||(1,i)|| case

DevAndArtist · October 18, 2024, 1:05am

While it might never happen but it would be amazing if through some compiler magic the type was simply an evolution of Array itself.

- Array<T> or Array<T, 0…> dynamic array
- Array<T, 1…> non empty array
- Array<T, 4> or Array<T, 4 … 4> fixed size array
- Array<T, 0 ..< N> an array with min and max amount of values

Basically the secondary parameter becomes a range that covers all cases.

RandomHashTags · October 18, 2024, 7:23am

+1. If it were up to me this is the way I'd implement it, at the compiler level.

intswift · October 18, 2024, 10:29am

I really like this idea. It makes me think, though, whether instead of giving a size of the array it could be expressed as a range of indices. So:

Array<T, 0..<5> // Array with 5 elements from [0] to [4].
Array<T, 7...10> // Array with 4 elements from [7] to [10].

which then brings us onto the question of slices. If a fixed array were expressed in terms of its indices, then would we need a separate fixed-array slice type, or could multiple fixed arrays with differing but overlapping indices not (as an option) share a single underlying array?

As for the bike-shedding of names, "RigidArray" as mentioned earlier by Diggory gets +1 from me. I think it gives a good mental impression of what it is, and the term "rigid" could be a useful term of art to extend to other types too (e.g. "RigidSet", "RigidDictionary").

ibex10 · October 18, 2024, 12:33pm

That will become harder to read in real life.

Compare this:

// 1 D
Array<T, 0..<5>

// 2 D
Array<Array<T, 0..<5>, 0..<5>

// 3 D
Array<Array<Array<T, 0..<5>, 0..<5>, 0..<5>

and this:

// 1 D
Array<5, T>

// 2 D
Array<5, Array<5, T>>

// 3 D
Array<5, Array<5, Array<5, T>>>

Which one is easier to grasp?

duan · October 18, 2024, 6:27pm

Since I didn't find this on my reading: as a property of a class on the heap, would Vector incur reference counting?

Karl · October 18, 2024, 6:33pm

Vector itself would not, since its elements are stored in-line. The class may incur reference counting.

Vector<5, T> is like if you wrote 5 Ts inline as stored properties of the class.

duan · October 18, 2024, 6:36pm

Since this came up in the thread, imagining Vector as the backing storage for tensors, a few more questions:

Is there any way, other than making Vector a property of a class, to force Vector to exist on the heap? Heap is usually the better place to load in 1s-100s gigabytes of weights.
It's common to use mmap to load large weights from disk, and use pointer arithmetic to assign the interesting stored locations in the weights to the buffer. This can all be done with the UnsafePointer & friends pointing at any sort of contagious memory today. Seems like it's not possible to use Vector to load weights from disk this way?

Karl · October 18, 2024, 6:41pm

It sounds like you probably want a Span view over mapped memory, or even just a standard Array, rather than a Vector.

For very large vectors, you’ll have to be careful to manage copying.

For a matrix type you could still use generic integer parameters to enforce shape compatibility at compile-time, even if they use one of these other storage types.

Alejandro · October 18, 2024, 6:44pm

Yeah, just manually allocate it on the heap.

let ptr = UnsafeMutablePointer<Vector<2, Int>>.allocate(capacity: 1)

ptr.initialize(to: Vector(repeating: 0))

This is also equivalent to just allocating a pointer of Int where the capacity is 2.

It would be nice to have some Box/Unique type that does this unsafety for you and exposes a safer interface.

Well mmap returns a pointer to raw memory anyway. Your best bet would be to convert it to a Span which provides the safe interface of interacting with this memory. If you need to mutate this memory, then a hypothetical MutableSpan would be your go to. Vector is an instance of contiguous Elements, not a view into them.

John_McCall · October 18, 2024, 6:58pm

Fixed-size arrays are really far, far less generally useful than I think some people here are imagining. They're an important tool for certain specific kinds of low-level programming, and they'll give us a much more natural solution for importing array fields from C, and that's about it. If you have even a few thousand elements in an array — and it could be much fewer depending on the element — the heap allocation costs of Array are going to be dominated by the cost of initializing the elements.

If you're concerned about heap-management costs of Array, like retains and releases, you need to understand that those are triggered by copying or destroying the array value. So the retains and releases will go away with a fixed-size array, but they'll be replaced by doing separate copies and destructions of all the array elements, which is almost strictly more expensive than a single retain or release. Avoiding array copies is important sometimes, but the critical tool we're working on for that is Span, not fixed-size arrays.

Slava_Pestov · October 18, 2024, 8:01pm

If we also had a reference counted Box for copyable payloads, we could re-imagine indirect enum cases as just syntax sugar for the appropriate kind of wrapper, depending on whether the payload is copyable or not.

duan · October 18, 2024, 8:07pm

To put some of these previous conversations about matrices, dot products etc in perspective: Vector as backing storage for these things are fine as supporting evidences for the naming/shape discussions. In real world, fixed-sized array serves, at best, as a partial solution to this problems, in the sense that it can be part of something like llvm:SmallVector for a CPU backend of a tensor library.
(I'm not interested in the naming discussion at the slightest, but I suppose this amounts to pouring some cold water on all the matrix talk )

Overall I'm very excited to no longer have to deal with C buffers imported as tuples!

Alejandro · October 18, 2024, 10:33pm

@lorentey has been prototyping a Shared variant of this that does just that. Then we'd have std::shared_ptr and std::unique_ptr

Karl · October 18, 2024, 10:38pm

Just to add to this - you can already avoid array copies by using borrow parameters to suppress implicit copying.

Unfortunately, lots of operations require copies when they really shouldn't (like for loops):

// Why do I have to copy an Array to iterate over its contents?

func foo(_ input: borrowing [Float]) {
           `- error: 'input' is borrowed and cannot be consumed

  for _ in input {
           `- note: consumed here
  }
}

For a Vector, this implicit copy (made visible by the explicit borrowing modifier) would create an entirely new buffer and initialise it from the original Vector. So if the vector is very large, the very act of writing a for loop could have a significant implicit performance cost, regardless of what you do in the loop. This is an artefact of the Collection design that will hopefully be remedied by a new protocol. (Actually, this is also a good argument against having Vector conform to Sequence/Collection at all)

The ability to have borrow var variables which we can store and write abstractions around (using lifetime dependencies) will also help avoid copies.

Vector and Array can both benefit from these improvements.

EDIT: To phrase the point on copying another way - currently for loops use consuming iteration, via Sequence's consuming func makeIterator() method.

That works for single-pass sequences, and until now we've been using it for multi-pass collections because we can just copy them and consume the copy. For copy-on-write collections, this is cheap (just a retain/release).

We already know this is insufficient for non-copyable collections, and so we will need to introduce borrowing iteration for them (all collections can already do this via their indexes). The part that is interesting with regards to Vector is that even though it is copyable, because its elements are stored inline the implicit copies necessary for consuming iteration may be surprisingly expensive.

It means every for loop involving a vector will have linear complexity with respect to the length of the vector from that initial copy, even if they break out of the loop body early.

Jumhyn · October 19, 2024, 6:14pm

The Language Steering Group discussed this pitch briefly and wanted to provide some guidance for the pitch discussion. This has been a very active thread with a lot of good feedback provided. Much of the back and forth has centered on naming, and, while the name that we ultimately pick for this type is important, we do want to make sure that traffic about the name does not overwhelm other topics. The name can be decided on largely independently of the fundamental functionality this type provides.

Regarding the naming discussion, the LSG strongly encourages community members to strive to make their commentary amount to more than just a +1 or a -1 for particular names (as many are doing already). Most important are the reasons for selecting a particular name rather than the number of people speaking for or against any of the options. Additionally, for those who argue that Vector is not the appropriate name, it remains very helpful to suggest or argue for concrete alternatives rather than simply speak against Vector.

saagarjha · October 20, 2024, 3:17am

The specific concern I have is that you can iterate over one of these in a for loop and easily run into unitialized elements. Which is the case for, say, UnsafeBufferPointer but that is pretty clearly annotated as such.

lorentey · October 21, 2024, 10:49pm

I have the same objection as before: we are already using the term "homogeneous tuple" to mean, well, a homogeneous tuple. Vector is like that, but it is distinct from it.

This is in addition to the (to me, quite obvious) problem that HomogeneousTuple is not a name -- it's a dictionary definition masquerading as a name. Vector is a core construct, along the likes of Array, Bool, String, Slice or Set; it deserves to have a name that matches its importance.

Can you provide a concrete example of this kind of unsafety to illustrate this concern?

(In general, I personally expect C interop to be inherently unsafe, especially any code that touches C pointer types. To me, it seems self evident that Swift code that wants restricts itself to safe interfaces should not be allowed to directly call C functions. Importing C's vector types as Vector instead of homogeneous tuples does not seem to move the needle on this.)

phoneyDev · October 21, 2024, 11:34pm

I would also like to see this become a fixed size array:
let fixedArray = [1, 2, 3, 4]
with whatever performance benefits can accrue to a stack-allocated or ivar Array.

For me this is a common idiom where a constant array contains the single source of truth for something. In fact when I read fixed-size array this is what I was expecting to see.