Support a complex number type for Matrix values

The result matrix could be created once (outside) and used many times. Its internal dimensions could be larger than needed, so it could be reused with differently sized matrices:

struct Matrix<Element: Numeric> {
    var contents: [Element]
    var rows: Int = 0
    var columns: Int = 0
    let maxElements: Int

    init(maxElements: Int) {
        self.maxElements = maxElements
        contents = .init(repeating: 0, count: maxElements)
    }

    mutating func setDimensions(rows: Int, columns: Int) throws {
        guard rows * columns <= maxElements else { throw someError }
        self.rows = rows
        self.columns = columns
    }
    ...
}
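
Usage might look like this (a hedged sketch; the gemm-style fill is left as a comment since the routine itself isn't shown):

var c = Matrix<Double>(maxElements: 1024)
try c.setDimensions(rows: 8, columns: 16)    // 128 <= 1024, fits
// ... compute an 8x16 product into c's storage ...
try c.setDimensions(rows: 32, columns: 32)   // 1024 <= 1024, reuse the same storage
// ... compute a 32x32 product into the same storage ...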

BTW, you could keep the a*b syntax and store the result in a global or a task-local variable to avoid re-allocating it; a task local is better than a global if you plan to use * across tasks/threads.
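
A minimal sketch of the task-local variant, assuming a reference-type box (ScratchBox and MultiplyScratch are hypothetical names; the @unchecked Sendable is only justified because each task binds its own instance):

// Each task binds its own box, so the instance is never shared across tasks.
final class ScratchBox: @unchecked Sendable {
    var matrix: Matrix<Double>
    init(maxElements: Int) { matrix = Matrix(maxElements: maxElements) }
}

enum MultiplyScratch {
    @TaskLocal static var box: ScratchBox?
}

// Bind a scratch buffer for the duration of some work; an operator
// implementation could check MultiplyScratch.box and write into it
// instead of allocating a fresh result.
MultiplyScratch.$box.withValue(ScratchBox(maxElements: 1 << 20)) {
    // let c = a * b
}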

Ah, I see. I'll keep this in mind for the package I'm working on. I can always add more features to support things like this once I have the basics implemented.

Usually matrix multiplications are happening in the context of some broader algorithm, so that you actually have something like:

for iteration in whatever {
  // do something that modifies a or b or both
  let c = a*b
  // use c for something
}

It is often desirable to hoist the allocation of C out of this loop, so that only computation is required internally¹:

var c = Matrix( ... )
for iteration in whatever {
  // do something that modifies a or b or both
  gemm(a, b, &c)
  // use c for something
}

¹ actually, many BLAS implementations will allocate internal to GEMM; a better BLAS API would have allowed the caller to provide workspace to avoid that.


This makes a lot of sense, but also feels like something a compiler could optimize away. Is that usually the case?

This is rather hard to optimize away in most languages, because a*b is usually defined as an operation that allocates its result. Hoisting therefore requires a transformation that moves the allocation out of the function implementing a*b, which typically lives in some other library/module and is not inlined into the containing loop. In some languages you solve this by inlining a wrapper that does the allocation and then calls a routine that actually looks like gemm(a, b, &c), so that the compiler has the information necessary to replace the allocation and assignment with "just take a reference to the existing c". (Depending on the language and type semantics, you might also need some mechanism to tell the compiler that the c being replaced is not referenced by anything else; this is where Swift's newer noncopyable features become relevant.)
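
In Swift terms, the wrapper pattern might look like this (a sketch, not a definitive implementation; it assumes a Matrix(rows:columns:) initializer and the out-parameter gemm(_:_:_:) shown earlier):

extension Matrix where T == Double {
    // The shim is @inlinable, so the caller's module sees both the
    // allocation and the call to gemm, giving the optimizer a chance
    // to hoist or reuse the result storage.
    @inlinable
    public static func * (lhs: Matrix, rhs: Matrix) -> Matrix {
        var result = Matrix(rows: lhs.rows, columns: rhs.columns) // allocation, visible at the call site
        gemm(lhs, rhs, &result)                                   // all the actual work, no allocation
        return result
    }
}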


Do you know why?

Because you can generally go faster if you "repack" pieces of one or more of A, B, or C into tiles that are laid out optimally for the inner compute kernel. In particular, you repack so that all of the values the inner kernel needs to touch are contiguous in memory, which maximizes your cache and TLB reach; this gains more than enough performance to pay for the cost of allocating space to repack into and doing the repacking.
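
As a toy illustration of what "repacking" means (a sketch only; real BLAS kernels choose kc, nc, and nr from the cache and register geometry):

// Copy a kc x nc panel of row-major B into strips that are nr columns wide,
// so the micro-kernel reads nr consecutive values at each step of the k loop.
func packB(_ b: [Double], ldb: Int, kc: Int, nc: Int, nr: Int) -> [Double] {
    var packed = [Double](repeating: 0, count: kc * nc)
    var out = 0
    var j = 0
    while j < nc {
        let width = min(nr, nc - j)           // the last strip may be narrower
        for k in 0..<kc {
            for jj in 0..<width {
                packed[out] = b[k * ldb + j + jj]
                out += 1
            }
        }
        j += nr
    }
    return packed
}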

The canonical reference for all this is Goto and van de Geijn's “Anatomy of high-performance matrix multiplication”, which is a big paper but pretty approachable, highly recommended.

Intel has experimented in MKL with an explicit pack API and a gemm variant that takes the packed storage, so that it can be reused across calls; this is another solution to the problem, with pretty impressive uplift when one of the matrix dimensions is small. (The real moral of the story is that GEMMs of reasonable size are much too complex an operation to shoehorn into an operator; they deserve to be function calls, otherwise you'll doom most of your use sites to limited performance.)


I wrote a quick test to compare the speed of gemm vs MPS (GPU) vs a manual naïve matrix multiplication. The results: gemm was the fastest, and the naïve CPU implementation outperformed MPS (GPU) up until around 120×120 matrix dimensions.


A closeup of the graph for small N shows that the naïve manual implementation outperforms gemm up until around 20×20 matrix dimensions.

There's some room for improvement in the naïve manual implementation (e.g. one of the matrices could be transposed for better cache locality).
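
For instance, the inner loop could walk a pre-transposed copy of the right-hand matrix so both operands are read sequentially (a sketch, assuming row-major [Float] storage; not the code used in the benchmark):

// Naive multiply, but with B transposed up front so the inner loop
// reads both the A row and the B column contiguously.
func multiplyTransposed(_ a: [Float], _ b: [Float], m: Int, n: Int, k: Int) -> [Float] {
    var bT = [Float](repeating: 0, count: k * n)      // bT is n x k
    for r in 0..<k {
        for c in 0..<n { bT[c * k + r] = b[r * n + c] }
    }
    var result = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            var sum: Float = 0
            for p in 0..<k { sum += a[i * k + p] * bT[j * k + p] }
            result[i * n + j] = sum
        }
    }
    return result
}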


For large N the situation changes: the switchover point where the GPU starts being faster than gemm is around 2500×2500 matrix dimensions.


The tests were carried out on a 2021 MacBook Pro (M1 Pro CPU), using single-precision floating-point numbers for compatibility with the MPS (GPU) API.
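
For reference, the gemm baseline in such a test might be a thin wrapper over Accelerate's cblas_sgemm, along these lines (a sketch; row-major layout, no transposes, and hypothetical parameter names):

import Accelerate

// Multiply an m x k matrix A by a k x n matrix B into an m x n matrix C,
// all stored row-major as flat [Float] arrays.
func gemm(_ a: [Float], _ b: [Float], m: Int, n: Int, k: Int, into c: inout [Float]) {
    precondition(a.count == m * k && b.count == k * n && c.count == m * n)
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                Int32(m), Int32(n), Int32(k),
                1.0,            // alpha
                a, Int32(k),    // A and its leading dimension
                b, Int32(n),    // B and its leading dimension
                0.0,            // beta: overwrite C
                &c, Int32(n))   // C and its leading dimension
}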


FloatingPoint refines Numeric which refines ExpressibleByIntegerLiteral, so you can initialize any floating point type with an integer literal, but no actual type conversion is taking place.
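
For example, this compiles for any FloatingPoint type, because the 0 literal is built directly as a T (a small illustration, not from the original post):

func zeros<T: FloatingPoint>(count: Int) -> [T] {
    // 0 here is an integer literal; T accepts it via ExpressibleByIntegerLiteral,
    // which FloatingPoint inherits through Numeric. No runtime conversion occurs.
    return Array(repeating: 0, count: count)
}

let a: [Double] = zeros(count: 3)  // [0.0, 0.0, 0.0]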


What are the specs of the Mac that you ran these tests on?

I edited my post above to include this information (see the bottom of the post).

Regarding the definition of the matrix type I demonstrated in the original post: I noticed that I defined the matrix operator functions in extensions, as shown below:

import Accelerate

struct Matrix<T> {
    let rows: Int
    let columns: Int
    var values: [T]

    init(_ content: [[T]]) {
        precondition(!content.isEmpty, "matrix needs at least one row")
        self.rows = content.count
        self.columns = content[0].count
        self.values = content.flatMap { $0 }
    }

    subscript(row: Int, column: Int) -> T {
        get { return values[(row * columns) + column] }
        set { values[(row * columns) + column] = newValue }
    }
}

extension Matrix {

    static func * (lhs: Matrix, rhs: Matrix) -> Matrix where T == Double {
        // ...
    }
    
    static func + (lhs: Matrix, rhs: Matrix) -> Matrix where T == Float {
        // ...
    }
}

I found that it's not necessary to define these functions in an extension. I can just define them as global functions, as shown here:

import Accelerate

struct Matrix<T> {
    // ...
}

func * (lhs: Matrix<Double>, rhs: Matrix<Double>) -> Matrix<Double> {
    // ...
}

func + (lhs: Matrix<Float>, rhs: Matrix<Float>) -> Matrix<Float> {
    // ...
}

These functions don't actually extend the matrix type, so I see no reason to define them in an extension. Is it fine not to define these in an extension, or am I missing something that requires them to go in one?

Both will work. Defining them as static methods in an extension is definitely the prevailing style, however.

(It wasn't possible to define operators as static methods until Swift 3; the global function style was the only option. It had some significant downsides, so we added the static method form. The Swift Evolution proposal for that change provides context and highlights reasons why you should generally prefer the static method style.)


E.g. the ability to call private static and instance methods.
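
For instance (a hypothetical illustration; requireSameShape is not from the original posts):

extension Matrix where T == Double {
    // Private: visible only within the type and its extensions in this file.
    private static func requireSameShape(_ lhs: Matrix, _ rhs: Matrix) {
        precondition(lhs.rows == rhs.rows && lhs.columns == rhs.columns)
    }

    static func + (lhs: Matrix, rhs: Matrix) -> Matrix {
        requireSameShape(lhs, rhs)  // fine: we're inside the type
        var result = lhs
        for i in result.values.indices { result.values[i] += rhs.values[i] }
        return result
    }
}

// A global operator function could not call requireSameShape,
// even from the same file, because it sits outside the type.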
