Avoid copying large matrix data within a function

wigging · May 17, 2024, 12:55am

I have some functions that are basically wrappers around Accelerate functions (see below). These functions accept a Matrix type that provides a 2D representation of an Array. If the input matrix is very large I want to avoid making a copy of it within the function. I just want to reference the matrix values within the function. I do not want to modify the input matrix from inside the function.

import Accelerate

struct Matrix<T> {
    let rows: Int
    let columns: Int
    var values: [T]

    init(_ content: [[T]]) {
        self.rows = content.count
        self.columns = content[0].count
        self.values = content.flatMap { $0 }
    }

    init(rows: Int, columns: Int, values: [T]) {
        self.rows = rows
        self.columns = columns
        self.values = values
    }

    init(rows: Int, columns: Int, fill: T) {
        self.rows = rows
        self.columns = columns
        self.values = [T](repeating: fill, count: rows * columns)
    }

    subscript(row: Int, column: Int) -> T {
        get { return values[(row * columns) + column] }
        set { values[(row * columns) + column] = newValue }
    }
}

Here I have a function that divides each element in the matrix by a scalar value. In the function, I create a result array to store the quotient. Then I create a matrix mat for the return value. So I basically have two copies of the quotient inside the function: one as the result array and one as the mat matrix values. Is there a way to avoid creating the result array and just create the matrix that is to be returned?

func divide(_ matrix: Matrix<Double>, _ scalar: Double) -> Matrix<Double> {
    let result = vDSP.divide(matrix.values, scalar)
    let mat = Matrix(rows: matrix.rows, columns: matrix.columns, values: result)
    return mat
}

Here is an example where I have a function that multiplies two matrices. In the function, I create two arrays a and b and a matrix c which is the return value. Is there a way to avoid creating these extra arrays which contain the values of the input matrices?

func multiply(_ matrixA: Matrix<Double>, _ matrixB: Matrix<Double>) -> Matrix<Double> {
    let a = matrixA.values
    let b = matrixB.values
    var c = Matrix(rows: matrixA.rows, columns: matrixB.columns, fill: 0.0)
    
    let m = matrixA.rows    // rows in matrices A and C
    let n = matrixB.columns // columns in matrices B and C
    let k = matrixA.columns // columns in matrix A, rows in matrix B
    let alpha = 1.0
    let beta = 0.0
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, k, b, n, beta, &c.values, n)
    
    return c
}

These functions get the job done but is there a better way to define them to reduce memory usage; especially when dealing with large matrices? Or do I need to refactor my Matrix type to work better with Accelerate functions?

vanvoorden · May 17, 2024, 1:25am

It looks like you are creating two let constant local variables which are copies of the values properties of your Matrix type. These are copy-on-write arrays. The copy that takes place is a constant time operation. It becomes a linear time operation if that copy is then mutated. If you are not mutating the copy you should not see memory growth as if you were copying all n elements. Have you tried defining benchmarks and taking measurements?

wigging · May 17, 2024, 1:33am

Can you elaborate on how I would go about performing benchmarks and measurements? Are you referring to Instruments?

vanvoorden · May 17, 2024, 1:37am

My preference for measuring memory and CPU together would be the Ordo One package.

vns · May 17, 2024, 5:49am

@vanvoorden have made a good suggestion to worry less about copies with arrays in Swift since they has CoW semantic. Additionally to that, if you still find this expensive for some reason, I see in docs that there are versions in vDSP that accept inout parameter for result, which means you can create an empty Matrix and pass it for output. Finally, there are consuming and borrowing, plus ~Copyable, in latest Swift that will help avoid copying structs. But as for arrays, it should be quite efficient on itself.

wigging · May 17, 2024, 1:18pm

The copy-on-write feature came up in another post too. I like the explanation in the Array docs where it says "Multiple copies of an array share the same storage until you modify one of the copies". Got it

Regarding the vDSP.divide function, looks like vDSP.divide(::result:) is what I should use.

In my example, is the input matrix for the function being copied into the function?

func divide(_ matrix: Matrix<Double>, _ scalar: Double) -> Matrix<Double> {
    let result = vDSP.divide(matrix.values, scalar)
    let mat = Matrix(rows: matrix.rows, columns: matrix.columns, values: result)
    return mat
}

vns · May 17, 2024, 2:18pm

Yes, it is relatively popular technique used in Swift to avoid unnecessary copies and keep value semantic of structs instead of using classes. The copy happens only at the time of mutation, before that underlying storage is shared. Array is only one yet most common example of this behaviour.

Yes, it is being copied, but again, since most heavy part there is values property (array), I would assume it costs almost nothing.

I also assume that in terms of optimisation, making this function @inlinable (I may be wrong what’s the best option on inlining, this one needs to be carefully used in interfaces of libraries, more in proposal) can provide more benefits, and this use case seems perfect candidate for that.

After all, benchmark if there are any need for optimisation at first place.

tera · May 17, 2024, 6:15pm

Interestingly vDSP division is implemented via multiplication by the reciprocal:

var numbers = [0.5810522237920217]
var divider = 123.0
var factor = 1/divider
let r1 = vDSP.divide(numbers, divider)[0]
let r2 = vDSP.multiply(factor, numbers)[0]
let r3 = numbers[0] * factor
let r4 = numbers[0] / divider
print(r1)               // 0.004724001819447331
print(r2)               // 0.004724001819447331
print(r3)               // 0.004724001819447331
print(r4)               // 0.0047240018194473305
precondition(r1 == r2)  // ✅
precondition(r1 == r3)  // ✅
precondition(r1 == r4)  // 💣 Precondition failed

Aren't floating * and / operations take the same amount of time these days?

wigging · May 17, 2024, 6:41pm

You could probably use BLAS and/or LAPACK functions to get the inverse of a matrix and then multiply it by another matrix to do division. I image this could be accomplished for scalar-vector division, scalar-matrix division, vector-vector division, and vector-matrix division too. For now I'm going to stick with the vDSP.divide function. I think in general multiplication operations are cheaper than division operations. I would also guess that multiplication is easier to parallelize than division. But I'm no expert on this so maybe someone else will provide a better response.

tera · May 17, 2024, 7:05pm

They could be... However Swift doesn't suddenly "help" me changing:

var number = 0.5810522237920217
print(number / 123.0) // 0.0047240018194473305

into a faster (if that's faster):

let factor = 1/123.0
print(number * factor) // 0.004724001819447331

and IMHO the same should happen with the API called "divide". If I wanted to multiply by a reciprocal I could do just that myself.

Besides, vDSP.divide must be memory-bound operation anyway (meaning that even if the individual FP operation is faster it doesn't necessarily mean the overall operation that is memory-bound is faster).

vanvoorden · May 17, 2024, 7:18pm

wigging:

In my example, is the input matrix for the function being copied into the function?

func divide(_ matrix: Matrix<Double>, _ scalar: Double) -> Matrix<Double> {
    let result = vDSP.divide(matrix.values, scalar)
    let mat = Matrix(rows: matrix.rows, columns: matrix.columns, values: result)
    return mat
}

Maybe… it's possible the compiler could optimize that for you ("pass by reference") in some situations. But assume that this will need to copy the matrix by value in the worst case.

If you want to preserve value semantics and also optimize the copy? You do have the option to write your own copy-on-write data structure. You can write it by hand or use some kind of codegen solution like swift-cowbox (which I ship and maintain).

scanon · May 17, 2024, 7:45pm

vDSP (and vForce, which is a similar library in Accelerate providing transcendental functions) explicitly carve out a policy of "we relax error bounds slightly in order to go faster".¹ Floating-point division has gotten significantly faster in the last decade (especially on Apple silicon, where it is partially pipelined), but multiplication by a reciprocal still has lower latency (once the cost of computing the reciprocal is amortized) and higher throughput, and satisfies a relaxed error bound. There's a tradeoff here, but vDSP has always been very consistent about this. If you want a "real" division, doing division in a for loop gets you that with no fuss.

Not so; for data resident in cache (as is generally the case when a vDSP operation is used as a component of a larger signal-processing algorithm--it's quite rare to want to do a vector divide in isolation), the throughput of streaming loads and stores is higher than the throughput of division (but slower than the throughput of floating-point multiplication), even on newer cores with pipelined division.

¹ This policy, like everything else about vDSP, is extensively documented in the vDSP.h header

vDSP routines are not expected to produce results identical to the
pseudo-code in the descriptions, because vDSP routines are free to
rearrange calculations for better performance.  These rearrangements
are mathematical identities, so they would produce identical results
if exact arithmetic were used.  However, floating-point arithmetic
is approximate, and the rounding errors will often be different when
operations are rearranged.

Generally, vDSP routines are not expected to conform to IEEE 754.
Notably, results may be not correctly rounded to the last bit even for
elementary operations, and operations involving infinities and NaNs may
be handled differently than IEEE 754 specifies.

tera · May 17, 2024, 9:50pm

Good point, I thought of RAM forgetting about the important case of cache. Could you recommend a not too deep into details source that outlines the speed of various operations, memory accesses, caches, etc?

I see, and in this particular case it's easy indeed. IMHO a better strategy would be having an option of whether to prefer the speed or the accuracy, with the current behaviour by default:

vDSP.divide(numbers, divider, options: [....])

And do this consistently

scanon · May 17, 2024, 11:44pm

MKL used to (maybe still does?) do something sort of like that; they had multiple versions of routines like this with a suffix to indicate fastest/least-accurate, middle, and slowest/most-accurate. I’m not sure how many of the routines really had three versions instead of “we reserve the freedom to bind these to three different things in the future,” though.