[ANN] CeedNumerics released (SN's ShapedArray API discussion)

rsebbe · December 2, 2019, 6:34pm

I'm posting this in response to @scanon's initiative on Swift Numerics, in the hope that this can help shape it.

Over the last year, we had a need for numerical computation for one of our apps, and we decided to implement it in Swift. Swift is very well suited to math & algorithms, and also we've mostly moved to it anyway. I reviewed existing options for linear algebra Swift libraries, but ended up making a new one (named CeedNumerics, imports as Numerics) to better address our specific needs. I took inspiration in both NumPy and Swift conventions (which sometimes conflicted). Here are a number of things I went through and the design choice I made along the way.

Keep in mind that this is a work-in-progress that wasn't made to be shared so soon, it certainly lacks polish in API and implementation.

Here is the library that I just made available on github:

CeedNumerics: https://github.com/creaceed/CeedNumerics

Memory Model & Ownership

Various projects propose different abstractions about storage. Should matrices and vectors be expressed as classes, or structs, using copy-on-write (CoW) just like Swift arrays? Or be backed by Swift arrays? Or should they be an extension of the Array type?

Should we just overload Swift Array type to handle vectors and matrices? Well although this can be useful and practical at times, it also pollutes the already crowded Array namespace. As such, it is not very future-proof, is hard to maintain and use, and hard to read/write. Moreover, although it could make some sense to use arrays for 1D vector containers, it needs additional metadata for 2D or more generally N-D containers, even though all these do share basic behaviors (like per-element addition, etc.). I ended up not following that approach, although some form of extensions on Array could be implemented (and optionally imported, when needed) as a layer on top of the proposed API.

Then maybe a value type with CoW could be a good abstraction, just like Swift's arrays? Considering actual use cases, where tensors or matrices can represent data of 10s or 100s of MB (or more), it appeared that having control over the allocation was a requirement. I did not want that passing around variables could allocate hundreds of megabytes behind my back when one of the owners would modify it.

But would matrices/vectors/tensors be best represented as classes? This would give full control on the allocation. One thing that is not too well expressed with this approach is when we want to have matrices or vectors that reference some shared data. Identity of the data has to be separate from the vector / matrix instances, and working on a single data through a number of vector / matrix "views" makes a lot of sense. Finally, forcing dynamic allocation for the "views" themselves don't bring anything useful. Light-weight structs that are views on some shared data appears to offer the best approach.

This is how it converged to this, that matrices, vectors, and more generally dimensional types are best represented as an abstraction that is conceptually similar to Swift's MutablePointer with some additional information on how to traverse memory. Like pointers, they would refer to existing memory contents, their types & parameters would determine how data is interpreted, and multiple instances can refer to the same underlying data. This is similar to how NumPy does it, except that NumPy treat them all, vector, matrices, etc. as generic n-dimensional arrays. Swift can do a lot better here by defining concrete types for each, see next point.

You create a matrix from existing memory, or from none in which case it will allocate memory itself. Then you can work on it directly, or iterate on its rows or columns by using vector views, or access it through another matrix view for a subrange (called a slice). All these refer to the same memory contents of the initial matrix. They are just views, or sampling grid into some allocated data. You can derive a matrix that won't reference the original one by using a copy() operation, which is how memory allocation gets manual control.

This shares some similarity to Swift ArraySlice & Array (with the former, actually). I considered for some time whether we needed both MatrixSlice (the view on data) and Matrix (the data allocator), but the answer was mostly no, as this would make the API more complex for little benefit in my opinion. Indeed, the risk of CoW allocating a lot of data, plus the need to duplicate the APIs on both Matrix & MatrixSlice for instance were cumbersome drawbacks.

Here's an example of what it looks like:

let mat = NMatrixd([[1.0,2.0,3.0],
                   [1.0,-2.0,3.0],
                   [1.0,2.0,1.0],
                   [-1.0,2.0,1.0]])

let smat = mat[0~4, 0~3~2].transposed()
let vec = Numerics.range(stop: 12.0)
let rowmat = vec.asMatrix()
let mat = rowmat.reshaping(rows: 3, columns: 4)
let mat2 = rowmat.reshaping(rows: 2, columns: -1)

Dimensional Generics & Type Tree

Having specific types for vector, matrix, and tensor is natural for Swift. NumPy treats them all as ndarray, which can be convenient at times, but that also prevents specific APIs from being written. In plain English, accessing the fifth row of a matrix is clearer than accessing a tensor of dimension 1 at offset (4,0) of a tensor of dimension 2. The latter should be possible, but cases where only matrix/vector are needed shouldn't require the full semantics of tensors. Other API like determinant, or eigenvalues/vectors only make sense for matrices and having matrix-specific APIs absolutely make sense. And it is more efficient because indices and other data don't have to be generic, they become specific (2-int tuple vs. int array). Note that casting one type to another is very easy with the foundation expressed above, as they are just views on the same data: a vector of size 5 can very easily be reinterpreted as a matrix of size (1, 5) (or (5,1)).

public protocol NDimensionalArray: NStorageAccessible, CustomStringConvertible {
    var shape: [Int] { get } // defined in extension below
    var dimension: Int { get }
    var size: NativeIndex { get }
    var indices: NativeIndexRange { get }
    init(storage: Storage, slice: NativeResolvedSlice)
}

public struct NVector<Element: NValue> : NStorageAccessible, NDimensionalArray {
    public typealias NativeResolvedSlice = NResolvedSlice
    public typealias NativeIndex = NativeResolvedSlice.NativeIndex // Int
}

public struct NMatrix<Element: NValue> : NStorageAccessible, NDimensionalArray {
    public typealias NativeResolvedSlice = NResolvedQuadraticSlice
    public typealias NativeIndex = NativeResolvedSlice.NativeIndex // NQuadraticIndex
}

About the types of represented data, generics can be used to represent Float and Double. But other types also make sense, like Int, Bool and possible compound types like Complex, and RGBA. Extensions can be provided depending on data type capabilities. Sin/cos make sense with floating point types, but not with Int or Bool. Addition make sense with Float, Double, Int, but not Bool. Type segmentation through protocol conformance should allow Swift to provide the right APIs on just the right types with no compromise on performance.

Moreover, vector, matrix and more generally tensors do share common traits. A lot of them. Like adding a constant value to a vector is conceptually the same as adding a constant to every element of a matrix or tensor. Or adding two vectors together (of matching sizes) applies to matrices and tensors as well. They should actually share the implementation as well (code should not be duplicated). Next point even shows how, with appropriate conventions, an accelerated implementation (that is, using Accelerate) can be realized.

Of the specific APIs these type should have, I found that subscripts were needed. A matrix exposes its data through a subscript with 2 integers, which are best for us humans. But also through what I defined as NativeIndex, efficient and known to the protocol (a tuple or a struct), and also a third one using an array of integers [Int], which is common to all dimensional types, just like NumPy's, but is less efficient.

I ended up having the NDimensionalArray protocol that all dimensional types (vector, matrix, tensor) would conform to. This protocol efficiently defines the common traits of those types as extensions (with implementations), like addition and similar operations which are conceptually the same across types.

Linear Acceleration Model

One of the key aspects of a library for numerical computation is that it can fully use the hardware it's running on. Although it shouldn't be limited to Apple devices, I took a look at what I know best first, the Accelerate framework. Specific work can be done for other platforms, of course.

I defined an abstraction layer so that Float and Double had the same accelerated functions available to generic implementations. This makes it possible to have a single implementation for the high-level API (thus avoiding duplicating implementation code).

// The NAccelerateFloatingPoint protocol defines a number of SIMD-accelerated API that are implemented both on Double and Float (using Accelerate).
// This makes it easier to use those features
public protocol NAccelerateFloatingPoint: NValue, NFloatingPoint {
    static func mx_vsub(_ A: PointerType, _ IA: vDSP_Stride, _ B: PointerType, _ IB: vDSP_Stride, _ C: MutablePointerType, _ IC: vDSP_Stride, _ N: vDSP_Length)
}
extension Double: NAccelerateFloatingPoint {
    public static func mx_vsub(_ A: PointerType, _ IA: vDSP_Stride, _ B: PointerType, _ IB: vDSP_Stride, _ C: MutablePointerType, _ IC: vDSP_Stride, _ N: vDSP_Length) {
        _performanceCheckStride(IA, IB, IC)
        vDSP_vsubD(B, IB, A, IA, C, IC, N)
    }
}
extension Double: NAccelerateFloatingPoint {
    public static func mx_vsub(_ A: PointerType, _ IA: vDSP_Stride, _ B: PointerType, _ IB: vDSP_Stride, _ C: MutablePointerType, _ IC: vDSP_Stride, _ N: vDSP_Length) {
        _performanceCheckStride(IA, IB, IC)
        vDSP_vsub(B, IB, A, IA, C, IC, N)
    }
}

Then, as mentioned above, many operations are conceptually the same for vectors, matrices or other dimensional types. Like adding some constant to every element. Studying more closely the Accelerate API, I noticed that most functions could be invoked on memory-based value arrays with a stride parameter (offset between successive values). My understanding was that SIMD would possibly not be available with stride larger than one unit, yet the Accelerate implementation in that case could still possibly be faster than a naive one. I decided to use this "element array with a constant stride" as the foundation for acceleration.

Noticing that tensors of any dimensions could be expressed as a collection of "linear, strided segments" that meet the requirement above, I modeled the acceleration API with that.

// This is an example of how the acceleration API is invoked to implement a particular feature 
// with Accelerate. It will traverse a, b, and result as "linear segments". If possible, dimensional
// coalescing will be applied to have fewer segments and maximize data parallelization / 
// minimize loop overhead.
extension Numerics where Element: NAccelerateFloatingPoint {
    public static func subtract<DT: NDimensionalArray>(_ a: DT, _ b: DT, _ result: DT) where DT.Element == Element {
        precondition(a.size == b.size && a.size == result.size)
        withLinearizedAccesses(a, b, result) { aacc, bacc, racc in
            Element.mx_vsub(aacc.base, aacc.stride, bacc.base, bacc.stride, racc.base, racc.stride, numericCast(racc.count))
        }
    }
}

Adding a constant to a matrix (2D) would then be realized with accelerated row by row processing (1D). This generalizes easily to higher dimensions. It also appeared that when the stride is the same across dimensions (stride between elements in a row, and between the last element of a row and the first element of the next row), the accelerated API could be invoked in one shot on the entire data. This is the case in particular for compact data. (stride is always one). I defined the coalesceable property accordingly.

Masking and Indexing

Similarly defined as in NumPy, masking is getting a bool-based type of the same shape as a given dimensional type variable that stores per element test result. Mask can also be used for subscripting (did not yet implement all cases) or as argument to setter for instance. Indexing is the process of accessing elements through a collection of indices, I only experimented a bit with that.

Slicing Notation

Slicing notation (start:stop:step) in NumPy and a number of other numerical packages is very practical and I tried to reproduce it in Swift. A slice can be specified partly (ie, start and step without an end), in which case it relies on the array to resolve its sampling points. I introduced the NSliceExpression protocol, the NSlice and NResolvedSlice structs as types to represent a 1D slice, and made Swift ranges implement the NSliceExpression protocol as they are a subset of the capability of a slice.

A slice expression is used to access dimensional types through subscripting, in which case it is resolved against the receiver to provide the requested contents, typically as the same type. Note that the NResolvedSlice conveniently also serves as the basis of the sampling grid on the contents memory, and subscripting with a slice expression is just a matter of resolving it against the receiver's resolved slice (aka. composition of resolved slices).

About the slice syntax, I first tried to replicate the colon notation because it is familiar in math packages. Swift wouldn't let that happen, as it already has too much semantics associated to it. So I looked for a single character that is easily available and not too much charged with meaning, and finally went with a tilde (start~stop~step). A number of combinations have been implemented, like ~, 0~10, 3~~2, ~~2, 3~~, etc. I found it surprisingly hard to pick one with those constraints, and a colon still would be better in my opinion, but that would require a change in the language (is that even possible?).

// Examples
let mat = NMatrixd(rows: 3, columns: 4)
let newRow = NVectord(size: 4)

mat[row: 1] = newRow
mat[0~4~2, 0~4~2].set(1.5)
mat[0~~2, 0~~2].set(1.5)
mat[~~2, 1~~].set(3.0)

let tsubmat = mat[1...2, 1...2].transposed()
mat[~, 0~~]
mat[~, 0~~] = 2.0*mat
let i=2
mat[~i, (i-1)~].set(7.0)

API Design

For functions operating on various dimensional types, I chose to have them under a common namespace (Numerics) for 3 reasons:

to easily find them, not having to determine if they are part of a specific type
to express them symmetrically, as I don't like having a variable pretend to have a specific role, ie I find add(a, b) superior to a.add(b) when the result is a newly allocated instance.
to centralize them: some method require 0 argument, or 1 argument, or multiple arguments of different types. This allows to not locate APIs based on their argument types or argument order.

About function that need to store output results, the reference (designated) function would write output to a preallocated result instance (possibly the same as an input, if in-place is supported - could be a precondition), and secondary functions that allocate output are implemented from the reference one.

For operators (like +, -, /, .+), I went with static funcs under each types (although I don't remember exactly why, as this could probably be moved to the NDimensionalArray level if possible).

Bridges with other CPU/GPU APIs

I'm a great fan of layered APIs, that is, implementing features as modular / separate libraries that can be linked depending on needs (and available hardware or resources). For instance, reading and writing matrix contents from/to a Metal texture can be achieved when the Numerics+Metal library is imported, but would otherwise not be provided as part of the general Numerics API.

It scales well with the numerous CPU or GPU APIs out there like Metal, GL, Vulkan, OpenCV, etc. for which bridges can be implemented. I started thinking about some general concepts and semantics that could serve as a basis for that, but it is probably too early to share.

Unaddressed Topics

Fixed-size types that can be allocated on the stack. Generics with integer parameters, like C++ templates allow (Vector<16>). I believe these should be addressed, but possibly separately from the arbitrarily sized types described here.
More fixed dimension tensors: I only implemented vector, matrix, and generic tensors (any dimensions). But common tensors with specific dimensions, like 5D BSHWC (Batch, Sequence, Height, Width, Channel) could easily be implemented with higher performance (than generic tensors) subscripting / indexing, again with much of the computation code already available in the NDimensionalArray extensions.
Richer element types: RGBA, Complex, etc.
"Row Bytes" that are not an integral multiple of element size (element is 16 bytes like Complex<Double>, but row is a multiple of 8 bytes).
Permutation: I very recently noticed in newer Metal APIs that dimension permutation was expressed at the container level, which allowed lazy evaluation (transpose is only lazily evaluated, when actual data is needed). This is an interesting route to investigate.

scanon · December 2, 2019, 7:02pm

This is really exciting, @rsebbe! Thanks very much for sharing.

taylorswift · December 2, 2019, 8:14pm

great work! i think this problem can be made a lot more tractable if we made a strong distinction between the compute-oriented data structures like size-N matrices, and currency types like Vector4<T>, which don’t have to worry about concepts like slicing and reference/CoW semantics, but do need to worry about issues like conciseness of syntax. for the sake of the type checker, i think it’s fine if adding a constant to a 4096-element matrix is spelled matrix.add(constant: d) and not matrix + d.

with respect to indices, i’ve found tuple-based representations and special index types to be a huge pain and generally not worth it in terms of the amount of packing and unpacking you need to do to use them. from my own experience, Vector2/3/4<Int> is by far the most pleasant representation, and subscripts and interfaces should be based on it.

Concepts like RGBA should really be left up to the domain of imaging libraries which have more of a stake in this concept, and i would be wary of a one-size-fits-all approach. for example, Swift.PNG defines a unified 3–4-element pixel type RGBA<T> and a two-element pixel type VA<T> for its own purposes. in other domains, the interleaved color types are less useful than MCU types which can accommodate different sampling factors (for example, more than one green sample for every red and blue component).

For type segmentation, I have been refining my own Vectors implementation here for a few years now, and it might help to see what worked for me.

rsebbe · December 2, 2019, 9:35pm

being able to write math just as math is a strong selling point for Swift for me, even a core feature. But that could be a separate layer if that makes sense for some reason.

yeah I sometimes felt limited by tuple as well. I started with that, then moved to a struct. VectorN<Int> could make sense for the same reason as we use Int everywhere and not custom types. Hope it generalizes well to Vector5<Int> and above, as 5D tensors are useful.

I'm OK with a 4D-value (or ND-value) instead of RGBA. Note that such cases though can also be represented as a tensor with an additional dimension. I think it makes sense to have both, but we should probably discuss that too.

will read it.

Lantua · December 2, 2019, 10:08pm

I think CoW + forced copying would also be an interesting design.
Since in most cases it won't reduce the number of copies either way (if you gotta mutate, you gotta mutate), the only thing needed would be to copy ahead of time to avoid it in the more critical period.

Using ExpressibleByArrayLiteral could also be useful for NMatrix & NVector (Unless I missed it).

Also, I think Swift kinda moved away from Prefixing classes/structs (it's moved to module instead, heh).

Edit:
Oh, you use value type with class-backed storage. isKnownUniquelyReferenced could be useful.

Karl · December 2, 2019, 10:28pm

I would actually go the other way - data scientists do this kind of thing in Python quite a lot (e.g. normalising values by dividing the whole dataset by a constant). It's a bit jarring at first, but having seen a lot of it recently, I'm warming to it as a nice, concise syntax.

I'd actually like to see those operators in the standard library for all MutableCollection (whose elements support the given operators).

rsebbe · December 3, 2019, 8:18am

CoW is not a good idea in my opinion as you lose control on allocations that can be very large (>100s of MB). It is not just when that allocation occurs, but whether it occurs or not. This is different with strings and more standard arrays, which typically have <1MB of data. I tend to see tensors more like a malloc'ed blocks, and just as CoW isn't used for that (for a good reason), I think it shouldn't be use here. If you can describe what you have in mind, we can have a look at how that plays with some large data (data example: 12-megapixel RGBA image in floating point, ~200MB).

Sure. For matrices though, we'd probably need a way to express its shape (or it's just Mx1 or 1xM).

Sure, this is not meant to serve as reference. Yet I found it practical to remove ambiguities on names that could have a broader meaning than just this package (like Storage, Value, etc.).

You mean to implement CoW?

rsebbe · December 3, 2019, 8:27am

Definitely. Clarity is the goal here. (that does not mean that a function-based classical syntax can't exist)

Not sure that would scale well (crowded namespace, hard to optimize, no handling of higher dimensions). I think specialized types are preferable.

lukasa · December 3, 2019, 11:06am

One of either you or @scanon needs to decide to change your module name, or it will be very risky for library developers to depend on either of your libraries. Module names are (at least today) a global namespace, and if a dependency tree ends up depending on both of your packages then import Numerics can become a source of nasty compile errors.

I disagree. Whether a CoW occurs is strictly dependent on the usage pattern of your program, with some wrinkles around the optimiser's ability to see your intent. More importantly, if a program would CoW with your data type then instead what it will do is mutate global state: this can have surprising and tricky-to-debug issues when people don't expect to see it.

I would recommend that any object that doesn't have identity (as most matrices do not) should be a CoW data type.

Sure it is: SwiftNIO uses CoW for all data sent to and from the network, which frequently involves very large buffers. We have not seen this as a problem but instead as a massive asset. Similarly, Data is a CoW data type, and it frequently represents an abstraction of entire files.

Karl · December 3, 2019, 11:21am

The operator namespace isn't crowded at all as far as Collections go.
I'm not sure it's hard to optimise; it is literally the very definition of SIMD.
Higher dimensions can certainly be handled by wrapper types.

It's a minor point; it was just a coincidence that I was thinking about how nice it would be to have in Swift at around the same you posted this.

Lantua · December 3, 2019, 12:56pm

Could you provide some example where allocation can be avoided altogether? Because my mindset is still stuck on if you gotta mutate, you gotta mutate.

The only time you can avoid allocation, is if you are the only one holding the storage, which can be detected via isKnownUniquelyReferenced (yes, it’s for CoW)

This is why I suggest that (just to provide more context):

but the type is still in module namespace, so you can disambiguate them via Module.Type And you can also control which module to import in each file. So if you’re in the same module you’d never have a problem, unless the name is common and irrelevant to your module, in which case it could be internal, or underscored. And also why you can generally get away with
UniqueModuleName.NotSoUniqueTypeName.

Lantua · December 3, 2019, 1:01pm

Or use Array of Array syntax (and crash if each row/column doesn’t match).

rsebbe · December 3, 2019, 1:16pm

That's not a problem to change it. Actually, we don't even use it, as we have another package internally that reference the same source files.

I would like to see how you handle the main concern I have then: avoiding massive data allocation when CoW occurs. And how would that be part of the whole picture. Tensor data is typically memory-based, as opposed to stream-based or file-based with memory mapping which often can avoid memory allocation altogether. Again typical sizes here are multiple order of magnitudes larger than what is typically handled by CoW so far, which is why I think it is a good time to think whether it should apply as is, or if something else is needed.

rsebbe · December 3, 2019, 1:17pm

sure, absolutely makes sense (did not think of it).

Lantua · December 3, 2019, 1:21pm

I still don’t quite get how your approach could avoid allocation in places where CoW couldn’t. And frankly enough, from what little amount of digging I did, it still looks very similar to CoW to me.

Karl · December 3, 2019, 1:38pm

COW by itself does not cause any allocations. The real question is whether you want reference or value semantics.

It's worth remembering that we used to use reference semantics for things like strings and arrays, too - for example in Obj-C. But then we encountered the problem that the data might change out underneath you if somebody else held a reference, so developers started doing defensive copying.

So with Swift, the lesson was that these should all be value-types and use COW to share storage and avoid copying. If you have a simple architecture like a script, where you load some data and process it, you won't see a difference because you're not sharing strong references between components. If you have a more complex architecture, value semantics allow you to reason about your data without having to perform defensive copies.

And besides, if there really is something about your model/data which means it should be an independent entity with its own lifetime and shared mutations, you can just wrap it in a class. In cases where it comes up, that can really make a lot of conceptual sense, as well as solving the technical problem.

In general, though, I don't see that you've given enough reason not to use value semantics/COW.

rsebbe · December 3, 2019, 1:46pm

Simply put, allocation do not occur automatically under this approach. They are sacred. They are a manual process that you can trigger when creating a new matrix (from scratch) or by using copy(). All other operations by default reference existing memory. This is the same abstraction as a pointer to malloc'd memory, as said previously.

Now if my main concern can be solved under CoW, I'd like to see how. Or else we'll have to document all types with a "do not put it in a var unless you know what you're doing".

The question I'm really asking is whether value semantics does make sense for variable holding 100s of MB of data. I don't think so, but please let's share views on this.

lukasa · December 3, 2019, 1:56pm

I am talking about module names, not type names, which in this case are Numerics and Numerics. This makes import Numerics immediately ambiguous, as there are two possible candidates for which Numerics you mean. It also causes issues with symbol mangling for public symbols, which risks clash.

Indeed. I was pointing out that this package does not have a unique module name.

Lantua · December 3, 2019, 1:56pm

This is what I missed, since I saw that your data is a class-backed struct, I kinda assumed that’s not the case.

I suppose the real question is if it’d make sense to have 2 copies of the 100s MB of data. Because that’s simply what you barely managed to avoid by using class-semantic.

And I think CoW + benchmarking would be a good way to approach it. If you’re careful you can force allocation only as needed by minimizing the amount of references pointing to the data, but will always have the “intuitive” semantic for these data structures.

lukasa · December 3, 2019, 2:02pm

I genuinely don't understand the question.

Value semantics are semantics, they are a discussion only about how you want to interact with the object. The question boils down to: does an instance of this object only meaningfully have a value, or does it meaningfully have an identity. If it only meaningfully has a value then it is a value type, if it meaningfully has identity then it is a reference type.

The performance of these abstractions is not relevant to this discussion because you can transform one into the other. A reference type can be turned into a value type by wrapping it into a struct that copies the reference before all operations, with a bonus performance enhancement if you happen to know which methods actually do mutate the reference object. A value type can be turned into a reference type by wrapping it into a class that mutates a private var and exposing all the same methods.

My recommendation is not to get distracted about the performance of the type, but instead to understand how it should feel to use. Should it feel like an integer, or should it feel like a view controller? Once you have the answer to that question you know what abstraction you should use, and the performance question is secondary and for the actual user of the data type to address.