[ANN] CeedNumerics released (SN's ShapedArray API discussion)

Again, nothing wrong with making it COW by default and wrapping in a class under specific circumstances:

class Shared<T> { var value: T }

That also makes it blindingly obvious to everybody who uses it that they're dealing with a shared value that may mutate under their feet. They could always pull out a value-semantics version by reading value.
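For illustration, here is a fleshed-out sketch of that wrapper; the initializer and the names `alias`/`snapshot` are mine, added just to show both the shared-mutation and the value-extraction behaviour:

```swift
// Hypothetical reference wrapper: every holder of a Shared sees all mutations.
final class Shared<T> {
    var value: T
    init(_ value: T) { self.value = value }
}

let shared = Shared([1, 2, 3])
let alias = shared          // same object: mutations are visible through both
alias.value.append(4)
// shared.value is now [1, 2, 3, 4]

// Reading `value` out gives an independent, value-semantics copy.
let snapshot = shared.value
alias.value.append(5)
// snapshot stays [1, 2, 3, 4]; shared.value is [1, 2, 3, 4, 5]
```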

Indeed. Couldn't agree more - the question isn't really about COW but about the semantics of the type.

I like the malloc'd memory abstraction better than the view controller. When doing such numerical computations, I often pre-allocate a number of intermediary buffers and then work from that. So yes, they have an identity. Allocating an additional one would make the app crash (memory limit). This is how it goes with machine learning as well: load data, preprocess it, work on it, no ambiguity on the memory requirement.

I think I've answered your question in the initial post: under this design, data has an identity, and vector/matrix/tensor are views on it (not for Vec4, etc.). That choice is motivated by the fact that types associated with massive resource allocations, in my opinion so far, can't be well represented as value types on devices with finite resources (I hear you say this should be handled by type users). This approach is similar to how NumPy does it, which I've found to be efficient and successful at handling real-world scenarios.

I agree a CoW-based type could do it too. But allocations become less visible, and care must be taken when using or designing APIs around those types to avoid copies; but yes, data integrity is guaranteed. This is the dual of the "shared data, guaranteed copy-free" approach. Which leads to: why do you think vectors/matrices/tensors would be better expressed as value types? Do we have examples of CoW being used successfully in similar cases?

Interesting discussion.

Yea, I used your comment to aid in my explanation, not to argue with it. If it's not obvious (which I just realized how much it isn't)

At this stage, this remains an assertion not backed by evidence. In any case, this is a point where reasonable people absolutely may disagree, and the great thing about OSS is you're allowed to do whatever you want!

I will say: if you are interested in having views on shared data, I would strongly consider having the views be non-mutable. Spooky action at a distance through mutation is a really tricky thing to debug.

Oh gosh I'm sorry, my mistake!

It's worth noting that mutable views on shared data are almost¹ necessary for efficient implementation of a lot of numerical computations. Consider a multi-threaded algorithm where each worker is updating a tile of a matrix; no two threads write to the same elements (and ideally not even to the same cachelines), but they are writing to the same allocation.

However, I think that it's appropriate to put operations like this behind API that follows the basic .withUnsafeMutable... etc patterns, as it should mostly be implementation details, rather than the primary way users interact with such a module.

¹ There's some exciting progress on ideas like fractional ownership that might give a way out in the long term, but it's still pretty early-stage.
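As a rough sketch of the tiled pattern described above (the matrix size, tile assignment, and variable names are made up for illustration), assuming each worker writes only to its own disjoint rows of the shared buffer:

```swift
import Dispatch

// 8x8 matrix flattened row-major; four workers each own two rows.
let cols = 8
var data = [Double](repeating: 0, count: 8 * cols)

data.withUnsafeMutableBufferPointer { buffer in
    let base = buffer.baseAddress!
    DispatchQueue.concurrentPerform(iterations: 4) { worker in
        // Worker `worker` writes rows 2*worker ..< 2*worker + 2, which are
        // disjoint from every other worker's rows, so there is no data race.
        for i in (worker * 2 * cols)..<((worker + 1) * 2 * cols) {
            base[i] += 1
        }
    }
}
// Every element was incremented exactly once, by exactly one worker.
```

Note that this deliberately goes through the `withUnsafeMutable...` pattern, matching the suggestion that such shared-mutation machinery stay an implementation detail rather than the primary user-facing API.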

Strongly agreed @scanon, but it makes me nervous to have those be the primary interface by which one accesses the data. Sharing data is fine, but it's usually good to be able to express the difference between "I need to compute on shared data" and "I don't want people to change this data underneath me".

Yes, RO/RW views would make sense.

To stay grounded in reality, I'd be interested if you could walk through this simple example and show how it could be implemented with a value type. It's a very common use case, well handled in NumPy: easy to read, easy to write. The matrix slicing here returns another matrix over the same shared data, and the work is performed on it. No allocation, simple lightweight types. I'd be interested to determine whether this is possible with value semantics, which types would be involved (the subscript's return type, in particular), and whether allocations can be avoided.

let mat = Matrix<Double>(5000, 5000)

mat[1000~4000, 1000~4000] += 1.0

Unfortunately you picked one of the easier cases to achieve with CoW types. Here's some code that demonstrates the behaviour. Please ignore the use of ManagedBuffer, which I used solely to avoid futzing around with allocations (though it does give us a nice tail-allocation property you may want to investigate in your own code), and the fact that these matrix types are deeply useless. This isn't intended as a fully-fledged replacement, just a demonstration of the principle.

internal class MatrixStorage<Type>: ManagedBuffer<MatrixHeader, Type> where Type: Numeric {
    /// Builds an x by y matrix.
    class func buildZeroing(xSize: Int, ySize: Int) -> MatrixStorage<Type> {
        let header = MatrixHeader(xSize: xSize, ySize: ySize)
        let newObject = self.create(minimumCapacity: header.totalBufferSize) { _ in
            return header
        } as! MatrixStorage<Type>

        newObject.withUnsafeMutableBufferPointerToElements { elements in
            elements.initialize(repeating: .zero)
        }
        return newObject
    }

    class func buildCopying(original: MatrixStorage<Type>) -> MatrixStorage<Type> {
        let newHeader = original.withUnsafeMutablePointerToHeader { $0.pointee }
        let newObject = self.create(minimumCapacity: newHeader.totalBufferSize) { _ in
            return newHeader
        } as! MatrixStorage<Type>

        newObject.withUnsafeMutableBufferPointerToElements { newElementBuffer in
            original.withUnsafeMutableBufferPointerToElements { originalElementBuffer in
                _ = newElementBuffer.initialize(from: originalElementBuffer)
            }
        }

        return newObject
    }

    var xSize: Int {
        return self.withUnsafeMutablePointerToHeader { $0.pointee.xSize }
    }

    var ySize: Int {
        return self.withUnsafeMutablePointerToHeader { $0.pointee.ySize }
    }

    var totalBufferSize: Int {
        return self.withUnsafeMutablePointerToHeader { $0.pointee.totalBufferSize }
    }

    func withUnsafeMutableBufferPointerToElements<T>(_ body: (UnsafeMutableBufferPointer<Type>) throws -> T) rethrows -> T {
        return try self.withUnsafeMutablePointers { (header, elements) in
            let size = header.pointee.totalBufferSize
            let elementBuffer = UnsafeMutableBufferPointer(start: elements, count: size)
            return try body(elementBuffer)
        }
    }
}


internal struct MatrixHeader {
    var xSize: Int
    var ySize: Int

    var totalBufferSize: Int {
        return xSize * ySize
    }
}

public struct Matrix<Element> where Element: Numeric {
    private(set) var xRange: Range<Int>

    private(set) var yRange: Range<Int>

    private var storage: MatrixStorage<Element>

    public init(_ xDimension: Int, _ yDimension: Int) {
        self.xRange = 0..<xDimension
        self.yRange = 0..<yDimension
        self.storage = .buildZeroing(xSize: xDimension, ySize: yDimension)
    }

    private init(xRange: Range<Int>, yRange: Range<Int>, storage: MatrixStorage<Element>) {
        precondition(xRange.lowerBound >= 0)
        precondition(xRange.upperBound <= storage.xSize)
        precondition(yRange.lowerBound >= 0)
        precondition(yRange.upperBound <= storage.ySize)

        self.xRange = xRange
        self.yRange = yRange
        self.storage = storage
    }
}

extension Matrix {
    // Note that we are not zero indexed here.
    public subscript(_ xRange: Range<Int>, _ yRange: Range<Int>) -> Matrix<Element> {
        get {
            precondition(xRange.lowerBound >= self.xRange.lowerBound)
            precondition(xRange.upperBound <= self.xRange.upperBound)
            precondition(yRange.lowerBound >= self.yRange.lowerBound)
            precondition(yRange.upperBound <= self.yRange.upperBound)

            return Matrix(xRange: xRange, yRange: yRange, storage: self.storage)
        }

        _modify {
            precondition(xRange.lowerBound >= self.xRange.lowerBound)
            precondition(xRange.upperBound <= self.xRange.upperBound)
            precondition(yRange.lowerBound >= self.yRange.lowerBound)
            precondition(yRange.upperBound <= self.yRange.upperBound)

            // We (the struct) are uniquely owned here, so we can temporarily modify our range.
            let originalXRange = self.xRange
            let originalYRange = self.yRange
            defer {
                self.xRange = originalXRange
                self.yRange = originalYRange
            }

            self.xRange = xRange
            self.yRange = yRange

            yield &self
        }
    }
}


extension Matrix {
    public static func +=(lhs: inout Matrix, rhs: Element) {
        if !isKnownUniquelyReferenced(&lhs.storage) {
            print("copying")
            lhs.storage = .buildCopying(original: lhs.storage)
        }

        // We stride over by the y index.
        let yIndexStride = lhs.storage.xSize
        let (xRange, yRange) = (lhs.xRange, lhs.yRange)

        lhs.storage.withUnsafeMutableBufferPointerToElements { elements in
            for y in yRange {
                for x in xRange {
                    elements[x + (yIndexStride * y)] += rhs
                }
            }
        }
    }
}

If you have a main.swift that imports this code and looks like this:

var mat = Matrix<Double>(5000, 5000)

mat[1000..<4000, 1000..<4000] += 1.0

Then this code will never print "copying": we don't have to copy the matrix in order to achieve the goal.

The magic here is the _modify accessor on the subscript, which allows us to tell the compiler that direct modifications on the subscript operation do not need to leave the original object intact. Normally a subscript operation like mat[1000..<4000, 1000..<4000] += 1.0 will call get, then modify the object returned from get, then set it back. With _modify we can yield out an object with temporary lifetime, that exists just long enough for the modification to occur.

This allows us to avoid a CoW operation here. We can nest these modifications arbitrarily, as well.
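For readers who haven't seen the accessor before, here is a minimal sketch of `_modify` on a simpler type, separate from the matrix code above (the type and names are mine; `_modify` is an underscored, not-yet-official language feature):

```swift
struct Counter {
    private var storage = [0, 0, 0]

    subscript(i: Int) -> Int {
        get { storage[i] }
        // Yields the element in place: `counter[i] += 10` mutates storage
        // directly, with no separate get-then-set round trip.
        _modify { yield &storage[i] }
    }
}

var counter = Counter()
counter[1] += 10
// counter[1] is now 10, written in place through the yielded reference.
```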

I read about _modify some time ago, I think it was before it was introduced, but I've never used it so far (and I'm not too familiar with what that allows). Thank you for demoing it.

What I was interested in seeing is how you could work on part of a type, and whether something similar to ArraySlice is needed. Like iterating (R/O or R/W) over the columns of a matrix and using/modifying the matrix data itself. For instance:

var mat = Matrix<Double>(5000, 5000)
for column in 0 ..< mat.size.column {
    var col = mat[~,column]
    col += Double(column)
}

There's some simplicity in knowing that you always access the same, single data, unlike with a value type, where duplication might occur (which, to a beginner, might look like non-determinism). I have this example in mind: in a notebook (think Jupyter), some variable gets assigned to a second let constant, then the original variable is modified, which triggers a CoW allocation, which crashes the notebook kernel because that variable referenced 1 GB of data. Such variable manipulations are very common in practice, yet a crash would never occur under NumPy, which makes it well suited to experimentation. This might be hard to replicate with a value type.
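For what it's worth, that notebook scenario can be sketched with Swift's own CoW Array (the size is shrunk here; in the notebook case the buffer would be on the order of 1 GB, and the copy is what exhausts memory):

```swift
var data = [Double](repeating: 0, count: 1_000)
let checkpoint = data   // no allocation yet: both variables share one buffer
data[0] = 1.0           // buffer is no longer uniquely referenced, so this
                        // write triggers a full CoW copy of the storage,
                        // roughly doubling memory use at that moment
// checkpoint[0] is still 0.0; data[0] is 1.0
```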

Is there a consensus that a value type is the best approach here?

At a certain point we get stuck in a conversation about preferred style, and then we enter territory without defined true-or-false answers. Considering your program above: that program naturally does not do what you want when mat is a value type, because col is a separate value from mat, and you never assign it back. This means it must have a different result from the one where col is a view on a reference-typed mat object.

The equivalent value-typed program is:

var mat = Matrix<Double>(5000, 5000)
for column in 0 ..< mat.size.column {
    mat[~,column] += Double(column)
}

And yes, in that program we will not allocate or CoW if you are careful to use the _modify accessor. The risks begin if you create too many intermediaries; the program below may CoW:

var mat = Matrix<Double>(5000, 5000)
for column in 0 ..< mat.size.column {
    var col = mat[~,column]
    col += Double(column)
    mat[~,column] = col
}

Whether this triggers a CoW comes down to whether the optimiser is capable of observing that this program could use the _modify accessor. I'm honestly not sure whether it can today. The only way I know to guarantee that the _modify accessor will be used is to write everything in a single logical load/store statement. I view this as a non-permanent optimisation limitation, but you may reasonably view it differently.

One caveat, though, is that an inout that replaces the storage could cause a problem.

func foo(_ value: inout Matrix<Int>) {
  value = newMatrix
}

var a = ...
foo(&a[...]) // Now `a` has lost its storage after `yield`.

If that happens, we will lose the original storage, since the yielded value is the only one holding it.

And I couldn't figure out a way to both avoid copying and have it work with this. Maybe _modify could be changed to address this during the proposal. :thinking:

Yes, good catch, that would be problematic. We'd have to update the _modify code above to check whether that happened and, if it did, to copy the bytes back into the original storage.

At that point we'd have no strong reference to the original storage, right? Since yield &self needs self to be the only owner of the storage to avoid copying on any mutation during the yield.

Yeah, that's right. I think that implementation simply doesn't work: it's not really possible to pass the slice out that way without either retaining the original storage (which will force mutation to CoW) or accepting that you could lose data.

I think we could still get this to work, but it requires more caution in the implementation.

I tried to do it yesterday; the closest I can think of is an UnownedMatrix that uses an unowned reference, but that risks exposing implementation details and could produce scenarios with a dangling reference to the UnownedMatrix.

The conclusion I came to yesterday is that ARC needs to get smarter, WAY smarter, or we need to push further on ownership design or something similar.

Yeah, this is definitely tricky.

Definitely. Value semantics are pretty fundamental to Swift - in fact, off the top of my head, I can't think of any classes in the standard library :thinking:. If they don't work, all of Swift falls down. And if you don't like value semantics, can't accept them, find them confusing or whatever: you're not going to have much fun with Swift in general. This stuff has to work.

That said, I find it pretty troubling that the example above needs to use non-public features like _modify and yield (I wouldn't call it "one of the easier cases to achieve with CoW types" on that basis alone). Points like the ones you raised about the trickiness of the design are also worrying; we should have figured this out by now, so we'd have a simple story to tell when somebody (like the OP) comes by, sceptical that value semantics will give them the performance they need.

Let me be clear, I absolutely love value semantics. It's smart, it's clean. I've been using it for > 4 years, and it works very well for all kinds of use cases I've been through so far.

Here, though, I fear it could be a problem to use value semantics for vectors/matrices/tensors, based on my experience with NumPy/DNNs and in-app buffer allocations: it could lead to memory crashes, user strategies to avoid CoW, and dependence on compiler optimizations.

There's probably no point in discussing it further, though. If value types are picked, I'd definitely be curious to see how that plays out and how the problems are solved.

Just want to add that it is indeed an interesting case.

In other places, read-modify drops a whole level, from collection -> element, so it does help avoid the copy. This case is peculiar in that it drops only half a level, from collection -> sequence (or no level at all: collection -> collection), which is a legit scenario.
Still, it does seem the interplay between

  • get-set (read-modify) semantics,
  • the difference between in-place modification (+=) and replacement (=), and
  • ARC

does force the copy. Even Array -> ArraySlice simply uses get-set.

Definitely something Swift could improve upon.

Hi all!

I just wanted to chime in that Swift for TensorFlow's Tensor type today has value semantics. We've been exploring this point in design space for a little while and have some experience. A couple high level notes:

  1. When working with automatic differentiation, value semantics compose quite nicely. From my perspective, value semantics is independent of whether the underlying data is "large" (or lives on an accelerator with limited memory capacity) or not. (That said, we're still working on gathering more use cases (both internal and external to Alphabet) to validate this mental model.)
  2. We have encountered instances where mutation of tensors within a composite data structure (e.g. a DNN model) is important, and being able to mutate an underlying buffer is a convenient way to think about things. For now, we use key paths to simulate the multiple pointers to an underlying buffer. Unfortunately, Swift's current implementation of key paths is not the most friendly to work with, and is a bit of a sore spot.
  3. I'd encourage folks to check out the recent open design review [deep link] on SwiftRT by Ed Connell, which covers important aspects including views, shared mutable references (to disjoint subsets of a Tensor) for multithreaded computation, and multiple devices (e.g. accelerators like GPUs and TPUs). (Alas, I believe the video for the meeting has been eaten by cyberspace, but the design docs & code are available. Please do consider joining swift@tensorflow.org to get the calendar invitation for all future S4TF open design meetings.)
  4. Just FYI: We're in the process of reworking significant aspects of Swift for TensorFlow (S4TF) to improve performance. (S4TF's performance today is not representative of where it will be soon, nor value semantics for Tensors.)
  5. We're excited for a more powerful ownership model, more reliable semantics (e.g. across different optimization levels), and more! :-)

Happy to chat more!

All the best,
-Brennan
