I'm posting this in response to @scanon's initiative on *Swift Numerics*, in the hope that this can help shape it.

Over the last year, we had a need for numerical computation in one of our apps, and we decided to implement it in Swift. Swift is very well suited to math & algorithms, and we've mostly moved to it anyway. I reviewed existing options for linear algebra Swift libraries, but ended up making a new one (named *CeedNumerics*, imports as *Numerics*) to better address our specific needs. I took inspiration from both NumPy and Swift conventions (which sometimes conflicted). Here are a number of things I went through and the design choices I made along the way.

Keep in mind that this is a work-in-progress that wasn't meant to be shared so soon; it certainly lacks polish in both API and implementation.

Here is the library that I just made available on GitHub:

- CeedNumerics: https://github.com/creaceed/CeedNumerics

## Memory Model & Ownership

Various projects propose different abstractions for storage. Should matrices and vectors be expressed as classes, or as structs using copy-on-write (CoW) just like Swift arrays? Or be backed by Swift arrays? Or should they be an extension of the Array type?

Should we simply extend the Swift Array type to handle vectors and matrices? Although this can be useful and practical at times, it also pollutes the already crowded Array namespace. As such, it is not very future-proof, is hard to maintain, and hard to read and write. Moreover, although arrays could make some sense as 1D vector containers, 2D and more generally N-D containers need additional metadata, even though all of these share basic behaviors (like per-element addition). I ended up not following that approach, although some form of extension on Array could be implemented (and optionally imported when needed) as a layer on top of the proposed API.

Then maybe a value type with CoW could be a good abstraction, just like Swift's arrays? Considering actual use cases, where tensors or matrices can represent tens or hundreds of megabytes of data (or more), it appeared that having control over allocation was a requirement. I did not want passing variables around to allocate hundreds of megabytes behind my back the moment one of the owners mutated them.

But would matrices/vectors/tensors be best represented as classes? This would give full control over allocation. One thing that is not well expressed by this approach is matrices or vectors that reference shared data. The identity of the data has to be separate from the vector/matrix instances, and working on a single piece of data through a number of vector/matrix "views" makes a lot of sense. Finally, forcing dynamic allocation for the views themselves doesn't bring anything useful. Lightweight structs that are *views* on some *shared data* appear to offer the best approach.

This is how it converged: matrices, vectors, and more generally dimensional types are best represented as an abstraction that is conceptually similar to Swift's UnsafeMutablePointer, with additional information on how to traverse memory. Like pointers, they refer to existing memory contents, their types and parameters determine how data is interpreted, and multiple instances can refer to the same underlying data. This is similar to how NumPy does it, except that NumPy treats them all (vectors, matrices, etc.) as generic n-dimensional arrays. Swift can do a lot better here by defining concrete types for each; see the next point.

You create a matrix from existing memory, or from none, in which case it allocates memory itself. Then you can work on it directly, iterate on its rows or columns by using vector *views*, or access it through another matrix *view* for a subrange (called a slice). All of these refer to the same memory contents as the initial matrix. They are just views, or sampling grids into some allocated data. You can derive a matrix that won't reference the original one by using a `copy()` operation, which is how memory allocation comes under manual control.

This shares some similarity with Swift's ArraySlice & Array (with the former, actually). I considered for some time whether we needed both MatrixSlice (the view on data) and Matrix (the data allocator), but the answer was mostly no, as this would make the API more complex for little benefit in my opinion. Indeed, the risk of CoW allocating a lot of data, plus the need to duplicate the API on both Matrix & MatrixSlice, were cumbersome drawbacks.
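To make that ownership model concrete, here is a minimal sketch of the idea (with hypothetical names; the actual CeedNumerics types differ): a reference-type storage owns the memory, lightweight value-type views describe how to traverse it, and allocation only happens explicitly.

```swift
// Reference type: owns the memory, shared by any number of views.
final class Storage {
    var data: [Double]
    init(_ data: [Double]) { self.data = data }
}

// Value type: a cheap view describing how to sample the shared storage.
struct VectorView {
    let storage: Storage   // shared, reference-counted data
    let offset: Int        // index of the first element
    let count: Int         // number of elements in the view
    let stride: Int        // distance between successive elements

    subscript(i: Int) -> Double {
        get { storage.data[offset + i * stride] }
        nonmutating set { storage.data[offset + i * stride] = newValue }
    }

    // An explicit copy() is the only place a new allocation happens.
    func copy() -> VectorView {
        let values = (0..<count).map { self[$0] }
        return VectorView(storage: Storage(values), offset: 0, count: count, stride: 1)
    }
}

let shared = Storage([0, 1, 2, 3, 4, 5])
let evens = VectorView(storage: shared, offset: 0, count: 3, stride: 2)
evens[1] = 20               // writes through to the shared storage (index 2)
let detached = evens.copy() // no longer aliases `shared`
```

Copies are explicit, so simply passing views around can never trigger a hidden multi-megabyte allocation.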

Here's an example of what it looks like:

```swift
let mat = NMatrixd([[ 1.0,  2.0, 3.0],
                    [ 1.0, -2.0, 3.0],
                    [ 1.0,  2.0, 1.0],
                    [-1.0,  2.0, 1.0]])
let smat = mat[0~4, 0~3~2].transposed()
let vec = Numerics.range(stop: 12.0)
let rowmat = vec.asMatrix()
let mat1 = rowmat.reshaping(rows: 3, columns: 4)
let mat2 = rowmat.reshaping(rows: 2, columns: -1) // -1: inferred from the total size, as in NumPy
```

## Dimensional Generics & Type Tree

Having specific types for vector, matrix, and tensor is natural for Swift. NumPy treats them all as `ndarray`, which can be convenient at times, but that also prevents specific APIs from being written. In plain English, accessing the fifth row of a matrix is clearer than accessing a tensor of dimension 1 at offset (4,0) of a tensor of dimension 2. The latter should be possible, but cases where only matrix/vector are needed shouldn't require the full semantics of tensors. Other APIs, like determinant or eigenvalues/eigenvectors, only make sense for matrices, and having matrix-specific APIs absolutely makes sense. It is also more efficient, because indices and other data don't have to be generic; they become specific (2-int tuple vs. int array). Note that casting one type to another is very easy with the foundation expressed above, as they are just views on the same data: a vector of size 5 can very easily be reinterpreted as a matrix of size (1, 5) (or (5, 1)).

```swift
public protocol NDimensionalArray: NStorageAccessible, CustomStringConvertible {
    var shape: [Int] { get } // defined in extension below
    var dimension: Int { get }
    var size: NativeIndex { get }
    var indices: NativeIndexRange { get }

    init(storage: Storage, slice: NativeResolvedSlice)
}

public struct NVector<Element: NValue>: NStorageAccessible, NDimensionalArray {
    public typealias NativeResolvedSlice = NResolvedSlice
    public typealias NativeIndex = NativeResolvedSlice.NativeIndex // Int
}

public struct NMatrix<Element: NValue>: NStorageAccessible, NDimensionalArray {
    public typealias NativeResolvedSlice = NResolvedQuadraticSlice
    public typealias NativeIndex = NativeResolvedSlice.NativeIndex // NQuadraticIndex
}
```

About the types of represented data: generics can be used to represent Float and Double. But other types also make sense, like Int, Bool, and possibly compound types like Complex and RGBA. Extensions can be provided depending on the data type's capabilities. Sin/cos make sense for floating-point types, but not for Int or Bool. Addition makes sense for Float, Double, and Int, but not Bool. Type segmentation through protocol conformance should allow Swift to provide the right APIs on just the right types, with no compromise on performance.
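As a sketch of that segmentation (using standard-library protocols and hypothetical names rather than the library's own `NValue` hierarchy), constrained extensions expose each operation only where the element type supports it:

```swift
// Simplified stand-in for a dimensional type.
protocol SimpleVector {
    associatedtype Element
    var elements: [Element] { get set }
}

struct Vec<Element>: SimpleVector {
    var elements: [Element]
}

// Addition is available wherever Element is additive (Int, Float, Double, ...).
extension SimpleVector where Element: AdditiveArithmetic {
    func adding(_ c: Element) -> [Element] { elements.map { $0 + c } }
}

// Floating-point-only operations appear only on floating-point element types.
extension SimpleVector where Element: FloatingPoint {
    func squareRoots() -> [Element] { elements.map { $0.squareRoot() } }
}

let ints = Vec(elements: [1, 2, 3])
let doubles = Vec(elements: [4.0, 9.0])
let sums = ints.adding(10)          // [11, 12, 13]
let roots = doubles.squareRoots()   // [2.0, 3.0]
// Vec(elements: [true, false]).adding(true) would not compile:
// Bool is not AdditiveArithmetic, so the API simply doesn't exist there.
```

The compiler enforces the segmentation statically, so there is no runtime cost to offering only the meaningful operations.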

Moreover, vectors, matrices, and more generally tensors share a lot of common traits. Adding a constant value to a vector is conceptually the same as adding a constant to every element of a matrix or tensor. Adding two vectors together (of matching sizes) applies to matrices and tensors as well. They should actually share the implementation too (code should not be duplicated). The next point even shows how, with appropriate conventions, an accelerated implementation (that is, one using Accelerate) can be realized.

Of the specific APIs these types should have, I found that subscripts were needed. A matrix exposes its data through a subscript with 2 integers, which is best for us humans; through what I defined as `NativeIndex`, efficient and known to the protocol (a tuple or a struct); and through a third one using an array of integers (`[Int]`), which is common to all dimensional types, just like NumPy's, but less efficient.

I ended up having the `NDimensionalArray` protocol that all dimensional types (vector, matrix, tensor) conform to. This protocol efficiently defines the common traits of those types as extensions (with implementations), like addition and similar operations that are conceptually the same across types.
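A minimal sketch of that sharing, assuming a simplified protocol with flat element access (the real `NDimensionalArray` is strided and storage-backed, not flat):

```swift
// Each conforming type exposes flat element access; element-wise
// operations are then written once against the protocol.
protocol DimensionalArray {
    associatedtype Element: Numeric
    var flatElements: [Element] { get set }
}

extension DimensionalArray {
    // Written once, shared by every dimensional type.
    mutating func add(_ constant: Element) {
        for i in flatElements.indices { flatElements[i] += constant }
    }
}

struct SimpleVec: DimensionalArray { var flatElements: [Int] }
struct SimpleMat: DimensionalArray {
    var flatElements: [Int]
    var rows: Int, columns: Int
}

var v = SimpleVec(flatElements: [1, 2, 3])
var m = SimpleMat(flatElements: [1, 2, 3, 4], rows: 2, columns: 2)
v.add(10)  // [11, 12, 13]
m.add(10)  // [11, 12, 13, 14] — same implementation, different type
```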

## Linear Acceleration Model

One of the key aspects of a library for numerical computation is that it can fully use the hardware it's running on. Although it shouldn't be limited to Apple devices, I took a look at what I know best first, the Accelerate framework. Specific work can be done for other platforms, of course.

I defined an abstraction layer so that Float and Double had the same accelerated functions available to generic implementations. This makes it possible to have a single implementation for the high-level API (thus avoiding duplicating implementation code).

```swift
// The NAccelerateFloatingPoint protocol defines a number of SIMD-accelerated APIs that are
// implemented on both Double and Float (using Accelerate).
// This makes it easier to use those features from generic code.
public protocol NAccelerateFloatingPoint: NValue, NFloatingPoint {
    static func mx_vsub(_ A: PointerType, _ IA: vDSP_Stride, _ B: PointerType, _ IB: vDSP_Stride, _ C: MutablePointerType, _ IC: vDSP_Stride, _ N: vDSP_Length)
}

extension Double: NAccelerateFloatingPoint {
    public static func mx_vsub(_ A: PointerType, _ IA: vDSP_Stride, _ B: PointerType, _ IB: vDSP_Stride, _ C: MutablePointerType, _ IC: vDSP_Stride, _ N: vDSP_Length) {
        _performanceCheckStride(IA, IB, IC)
        vDSP_vsubD(B, IB, A, IA, C, IC, N)
    }
}

extension Float: NAccelerateFloatingPoint {
    public static func mx_vsub(_ A: PointerType, _ IA: vDSP_Stride, _ B: PointerType, _ IB: vDSP_Stride, _ C: MutablePointerType, _ IC: vDSP_Stride, _ N: vDSP_Length) {
        _performanceCheckStride(IA, IB, IC)
        vDSP_vsub(B, IB, A, IA, C, IC, N)
    }
}
```

Then, as mentioned above, many operations are conceptually the same for vectors, matrices, and other dimensional types, like adding a constant to every element. Studying the Accelerate API more closely, I noticed that most functions can be invoked on memory-based value arrays with a stride parameter (the offset between successive values). My understanding is that SIMD may not be available with a stride larger than one unit, yet even in that case the Accelerate implementation could still be faster than a naive one. I decided to use this "element array with a constant stride" as the foundation for acceleration.

Noticing that tensors of any dimension can be expressed as a collection of "linear, strided segments" that meet the requirement above, I modeled the acceleration API around that.

```swift
// This is an example of how the acceleration API is invoked to implement a particular feature
// with Accelerate. It will traverse a, b, and result as "linear segments". If possible, dimensional
// coalescing will be applied to have fewer segments and maximize data parallelization /
// minimize loop overhead.
extension Numerics where Element: NAccelerateFloatingPoint {
    public static func subtract<DT: NDimensionalArray>(_ a: DT, _ b: DT, _ result: DT) where DT.Element == Element {
        precondition(a.size == b.size && a.size == result.size)
        withLinearizedAccesses(a, b, result) { aacc, bacc, racc in
            Element.mx_vsub(aacc.base, aacc.stride, bacc.base, bacc.stride, racc.base, racc.stride, numericCast(racc.count))
        }
    }
}
```

Adding a constant to a matrix (2D) is then realized with accelerated row-by-row processing (1D). This generalizes easily to higher dimensions. It also appeared that when the stride is uniform across dimensions (the stride between elements in a row equals the stride between the last element of a row and the first element of the next), the accelerated API can be invoked in one shot on the entire data. This is the case in particular for compact data (stride is always one). I defined the `coalesceable` property accordingly.
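The coalescing condition can be sketched like this (hypothetical representation; the library's actual slice types differ):

```swift
// A 2D strided view: if the step from a row's last element to the next row's
// first element equals the in-row step, the whole matrix is a single linear
// strided segment and one accelerated call covers all the data.
struct Slice2D {
    var rows: Int, columns: Int
    var rowStride: Int      // elements between consecutive rows
    var columnStride: Int   // elements between consecutive columns in a row

    var isCoalesceable: Bool {
        // next-row start − current-row last element == in-row step
        rowStride - (columns - 1) * columnStride == columnStride
    }
}

let compact = Slice2D(rows: 3, columns: 4, rowStride: 4, columnStride: 1)
let padded  = Slice2D(rows: 3, columns: 4, rowStride: 5, columnStride: 1)
// compact.isCoalesceable == true, padded.isCoalesceable == false
```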

## Masking and Indexing

Masking, defined as in NumPy, produces a Bool-based type with the same shape as a given dimensional variable, storing per-element test results. A mask can also be used for subscripting (not all cases are implemented yet) or as an argument to a setter, for instance. Indexing is the process of accessing elements through a collection of indices; I have only experimented a bit with that.
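A sketch of the masking idea with hypothetical names (in the library, the mask is itself a Bool-element dimensional type, not a plain array):

```swift
// Element-wise comparison produces a same-shape Bool mask, which can then
// be fed back into a masked setter.
struct MaskableVec {
    var elements: [Double]

    // Per-element test: same shape as the receiver.
    func greater(than threshold: Double) -> [Bool] {
        elements.map { $0 > threshold }
    }

    // Setter that only touches positions where the mask is true.
    mutating func set(_ value: Double, where mask: [Bool]) {
        precondition(mask.count == elements.count)
        for i in elements.indices where mask[i] {
            elements[i] = value
        }
    }
}

var mv = MaskableVec(elements: [1.0, 5.0, 2.0, 8.0])
let mask = mv.greater(than: 3.0)   // [false, true, false, true]
mv.set(0.0, where: mask)           // [1.0, 0.0, 2.0, 0.0]
```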

## Slicing Notation

Slicing notation (start:stop:step) in NumPy and a number of other numerical packages is very practical, and I tried to reproduce it in Swift. A slice can be specified partially (i.e., start and step without an end), in which case it relies on the array to resolve its sampling points. I introduced the `NSliceExpression` protocol, with the `NSlice` and `NResolvedSlice` structs as types to represent a 1D slice, and made Swift ranges conform to `NSliceExpression`, as they are a subset of a slice's capabilities.

A slice expression is used to access dimensional types through subscripting, in which case it is resolved against the receiver to provide the requested contents, typically as the same type. Note that `NResolvedSlice` conveniently also serves as the basis of the sampling grid over the contents' memory, and subscripting with a slice expression is just a matter of resolving it against the receiver's own resolved slice (that is, composing resolved slices).
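Slice resolution and composition can be sketched as follows (hypothetical simplified types; the count formula assumes positive steps):

```swift
// A fully resolved 1D slice: concrete start, count, and step.
struct ResolvedSlice {
    var start: Int, count: Int, step: Int

    // Memory position of the i-th element of this slice.
    func position(_ i: Int) -> Int { start + i * step }

    // Composition: applying `inner` to a view already described by `self`.
    func composed(with inner: ResolvedSlice) -> ResolvedSlice {
        ResolvedSlice(start: position(inner.start),
                      count: inner.count,
                      step: step * inner.step)
    }
}

// A partial expression (any component may be omitted), resolved against a size.
struct SliceExpression {
    var start: Int?, stop: Int?, step: Int?

    func resolved(size: Int) -> ResolvedSlice {
        let step = self.step ?? 1
        let start = self.start ?? 0
        let stop = self.stop ?? size
        return ResolvedSlice(start: start,
                             count: (stop - start + step - 1) / step, // ceil division
                             step: step)
    }
}

let base = SliceExpression(start: nil, stop: nil, step: nil).resolved(size: 10)
let every2 = SliceExpression(start: 2, stop: nil, step: 2).resolved(size: 10)
let composed = base.composed(with: every2)
// composed.position(1) == 4: element 1 of the sub-slice, located in the base storage
```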

About the slice syntax: I first tried to replicate the colon notation because it is familiar from math packages. Swift wouldn't let that happen, as it already attaches too much semantics to the colon. So I looked for a single character that is easily available and not too charged with meaning, and finally went with the tilde (`start~stop~step`). A number of combinations have been implemented, like `~`, `0~10`, `3~~2`, `~~2`, `3~~`, etc. I found it surprisingly hard to pick a character under those constraints, and a colon would still be better in my opinion, but that would require a change in the language (is that even possible?).
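For illustration (these are not the library's exact declarations), a `~` infix operator building a slice value can be declared like this:

```swift
// Hypothetical simplified slice value.
struct Slice1D {
    var start: Int?, stop: Int?, step: Int?
}

// A left-associative precedence group so that 0~10~2 parses as (0~10)~2.
precedencegroup SlicePrecedence {
    associativity: left
    lowerThan: AdditionPrecedence
}

infix operator ~ : SlicePrecedence

// start~stop
func ~ (start: Int, stop: Int) -> Slice1D {
    Slice1D(start: start, stop: stop, step: nil)
}

// (start~stop)~step
func ~ (slice: Slice1D, step: Int) -> Slice1D {
    Slice1D(start: slice.start, stop: slice.stop, step: step)
}

let s = 0~10~2   // start 0, stop 10, step 2
```

The prefix and postfix forms (`~stop`, `start~`) require additional operator declarations along the same lines.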

```swift
// Examples
let mat = NMatrixd(rows: 3, columns: 4)
let newRow = NVectord(size: 4)
mat[row: 1] = newRow
mat[0~4~2, 0~4~2].set(1.5)
mat[0~~2, 0~~2].set(1.5)
mat[~~2, 1~~].set(3.0)
let tsubmat = mat[1...2, 1...2].transposed()
mat[~, 0~~]
mat[~, 0~~] = 2.0*mat
let i = 2
mat[~i, (i-1)~].set(7.0)
```

## API Design

For functions operating on various dimensional types, I chose to group them under a common namespace (Numerics), for 3 reasons:

- to easily find them, without having to determine which specific type they belong to;
- to express them symmetrically, as I don't like having a variable pretend to have a specific role: I find `add(a, b)` superior to `a.add(b)` when the result is a newly allocated instance;
- to centralize them: some methods require 0 arguments, or 1, or multiple arguments of different types. This avoids having to locate APIs based on their argument types or argument order.

For functions that need to store output results, the *reference* (designated) function writes its output to a preallocated result instance (possibly the same as an input, if in-place operation is supported; this could be a precondition), and *secondary* functions that allocate their output are implemented in terms of the reference one.
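A sketch of that pattern with hypothetical names:

```swift
// Minimal stand-in for a dimensional type.
struct Buf {
    var elements: [Double]
}

enum Num {
    // Reference (designated) function: the caller controls allocation.
    static func add(_ a: Buf, _ b: Buf, result: inout Buf) {
        precondition(a.elements.count == b.elements.count
                     && a.elements.count == result.elements.count)
        for i in a.elements.indices {
            result.elements[i] = a.elements[i] + b.elements[i]
        }
    }

    // Secondary function: allocates, then delegates to the reference one.
    static func add(_ a: Buf, _ b: Buf) -> Buf {
        var result = Buf(elements: Array(repeating: 0, count: a.elements.count))
        add(a, b, result: &result)
        return result
    }
}

let x = Buf(elements: [1, 2]), y = Buf(elements: [3, 4])
let sum = Num.add(x, y)   // [4.0, 6.0]
```

Only the reference function contains the actual logic, so in-place and allocating call sites cannot diverge.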

For operators (like `+`, `-`, `/`, `.+`), I went with static funcs under each type (although I don't remember exactly why, as these could probably be moved to the NDimensionalArray level if possible).

## Bridges with other CPU/GPU APIs

I'm a great fan of layered APIs, that is, implementing features as modular/separate libraries that can be linked depending on needs (and available hardware or resources). For instance, reading and writing matrix contents from/to a Metal texture can be achieved when the `Numerics+Metal` library is imported, but would otherwise not be provided as part of the general Numerics API.

It scales well with the numerous CPU or GPU APIs out there like Metal, GL, Vulkan, OpenCV, etc. for which bridges can be implemented. I started thinking about some general concepts and semantics that could serve as a basis for that, but it is probably too early to share.

## Unaddressed Topics

- Fixed-size types that can be allocated on the stack, using generics with integer parameters as C++ templates allow (`Vector<16>`). I believe these should be addressed, but possibly separately from the arbitrarily sized types described here.
- More fixed-dimension tensors: I only implemented vector, matrix, and generic tensors (any dimension). But common tensors with specific dimensions, like 5D BSHWC (Batch, Sequence, Height, Width, Channel), could easily be implemented with higher-performance subscripting/indexing (compared to generic tensors), again with much of the computation code already available in the NDimensionalArray extensions.
- Richer element types: RGBA, Complex, etc.
- "Row bytes" that are not an integral multiple of the element size (the element is 16 bytes, like `Complex<Double>`, but the row is a multiple of 8 bytes).
- Permutation: I very recently noticed that newer Metal APIs express dimension permutation at the container level, which allows lazy evaluation (a transpose is only evaluated when actual data is needed). This is an interesting route to investigate.