Functions that accept arrays or array slices

I am working on a Big Integer package with an emphasis on performance. My current design follows what most others have done: the Big Integer is a UInt64 array, with functions contained in a UInt64 array extension. All was well until I needed to implement various multiplication algorithms, in particular Karatsuba and Toom-Cook. Both of these algorithms are divide-and-conquer based, with Karatsuba breaking the array into two slices and Toom-Cook into three or more. In Swift, the obvious choice for slices is `ArraySlice`. But my design is based on an array extension with array arguments, so I am forced to convert slices to arrays, which eats up memory and nullifies the performance that can be gained by recursive divide and conquer. What I now need is a way to process both arrays and array slices in the same function. Functions will have two array arguments, where one or both can be arrays or slices. Below is how I am considering constructing the functions:

```swift
func test<T, V>(_ x: T, _ xPtr: UnsafeMutablePointer<UInt>, _ y: V, _ yPtr: UnsafeMutablePointer<UInt>)
where T: RandomAccessCollection, T.Element == UInt, T.Index == Int,
      V: RandomAccessCollection, V.Element == UInt, V.Index == Int {
    print("x count", x.count)
    print("x", x[x.startIndex...])
    print("x elements")
    for indx in 0..<x.count { print(xPtr[indx]) }
    print(" ")
    print("y count", y.count)
    print("y", y[y.startIndex...])
    print("y elements")
    for indx in 0..<y.count { print(yPtr[indx]) }
}

var u1: [UInt] = [1, 2, 3, 4]
var u2 = u1[1...]
test(u1, &u1[u1.startIndex], u2, &u2[u2.startIndex])
```

With the function construct above, I can pass arrays or slices for either argument. I also pass a pointer to each so I can have zero-based pointers for the slices. My question is this: is there a better way to merge arrays and array slices into functions? What do you think of my construct (based on advice gained on Stack Overflow)?


So, you have 2 options:

1. Do everything with slices. A full array is, after all, a full-size slice of itself.
2. Generics

I prefer the latter, and that seems to be what you're going for too, so great.
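For what it's worth, option 1 is a one-liner at the call site: a full-range subscript hands you the array as a slice of itself, so a slice-only API can accept full arrays too.

```swift
let array: [UInt] = [1, 2, 3, 4]

// A full-range subscript gives the whole array as an ArraySlice,
// sharing storage with the original array (no copy on creation).
let whole: ArraySlice<UInt> = array[...]
whole.startIndex  // 0 for a full-array slice
whole.count       // 4
```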

No matter which way you go, though, when you work with slices, one thing you'll notice right away is that, unlike full arrays, their indexes do not start at 0. They share indexes with their base array:

```swift
let fullArray = [0, 1, 2, 3, 4]

fullArray.startIndex // 0
fullArray.endIndex   // 5
fullArray.count      // 5

let slice = fullArray.dropFirst(1).prefix(2) // ArraySlice<Int>
slice.startIndex  // 1 <-- !!!
slice.endIndex    // 3 <-- !!!
slice.count       // 2 (end - start)
```

At this point, it becomes unhelpful to think about indexes as integers, and better to think of them more abstractly. For that reason, I would advise dropping the `T.Index == Int` and `V.Index == Int` constraints, which will make indexes a bit more awkward to work with, but it basically forces you along the "right way" of using them.

This would also help clean up your generic signature. Dropping those constraints, you're just looking for `RandomAccessCollection`s of `UInt` - which is nice and flexible and better expresses what you need from these data types, I think?
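As a sketch of what that looks like in practice (`limbSum` is a hypothetical name), iterating over `indices` works identically whether or not they start at 0, so no `Index == Int` constraint is needed:

```swift
// A minimal sketch: no Index == Int constraint, so it works the same
// for Array, ArraySlice, or any other RandomAccessCollection of UInt.
func limbSum<C: RandomAccessCollection>(_ x: C) -> UInt where C.Element == UInt {
    var total: UInt = 0
    for index in x.indices {  // opaque indices, never assumed to start at 0
        total &+= x[index]    // wrapping add, as big-integer limb code usually wants
    }
    return total
}

let full: [UInt] = [1, 2, 3, 4]
let slice = full.dropFirst(1)  // ArraySlice with startIndex == 1
limbSum(full)   // 10
limbSum(slice)  // 9
```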


Here's an example of handling this with generics:

```swift
func test<C: Collection>(label: String, _ x: C)
where C.Element == UInt { // Not technically necessary, but I'll add it anyway

    print("\(label).count", x.count)
    print("\(label)[\(label).startIndex...]", x[x.startIndex...])
    print("\(label)'s elements:")

    for indx in x.indices {
        print(x[indx])
    }
}

let u1: [UInt] = [1, 2, 3, 4]
let u2 = u1[1...]
test(label: "u1", u1)
test(label: "u2", u2)
```

The whole point of generics is to allow you to operate on multiple different types that share at least part of their interfaces. So, yes, making the function generic over arrays and array slices is the best way to write a function that works with both arrays and array slices.

I agree that generics is the solution. I refactored one function to test it and the result is below:

```swift
// Returns a tuple containing the high and low components of UInt addition
@inlinable
func addingFullWidth(_ x: UInt, _ y: UInt) -> (high: UInt, low: UInt) {
    let sum = x.addingReportingOverflow(y)
    return (sum.overflow ? 1 : 0, sum.partialValue)
}

// add two [UInt] arrays
func array64Add<C1: RandomAccessCollection, C2: RandomAccessCollection>(
    _ x: C1, _ xPtr: UnsafeMutablePointer<UInt>,
    _ y: C2, _ yPtr: UnsafeMutablePointer<UInt>,
    _ uselastcarry: Bool = true) -> [UInt]
where C1.Element == UInt, C1.Index == Int, C2.Element == UInt, C2.Index == Int {
    if x.count == 0 || (x.count == 1 && x.startIndex == 0) { return Array(y) }
    if y.count == 0 || (y.count == 1 && y.startIndex == 0) { return Array(x) }
    let (xc, yc) = (x.count, y.count)
    let minc = Swift.min(xc, yc)
    var rIndex = 0
    var result: [UInt]
    var carry: UInt = 0
    if xc >= yc {
        result = Array(x)
        while rIndex < minc {
            (carry, result[rIndex]) = addingFullWidth(result[rIndex], yPtr[rIndex], carry)
            rIndex += 1
        }
        if xc == yc {
            if carry > 0 && uselastcarry { result = result + [1] }
            return result
        }
        while carry > 0 && rIndex < xc {
            (carry, result[rIndex]) = addingFullWidth(result[rIndex], carry)
            rIndex += 1
        }
        if carry > 0 && uselastcarry { result = result + [1] }
    } else {
        result = Array(y)
        while rIndex < minc {
            (carry, result[rIndex]) = addingFullWidth(result[rIndex], xPtr[rIndex], carry)
            rIndex += 1
        }
        while carry > 0 && rIndex < yc {
            (carry, result[rIndex]) = addingFullWidth(result[rIndex], carry)
            rIndex += 1
        }
        if carry > 0 && uselastcarry { result = result + [1] }
    }
    return result
}

var u1: [UInt] = [1, 2, 3, 4]
var u2 = u1[1...]
var u3 = array64Add(u1, &u1[u1.startIndex], u2, &u2[u2.startIndex])
print(u3)
```

I was happy with the results. In the Add function it is fully transparent whether an argument is an array or an array slice. I believe it will be performant, but it will be a few weeks of refactoring before I can benchmark. Most of the functions that deal with [UInt] arrays will look similar to the above. If anyone has recommendations on how to make it more performant, they would be welcome and appreciated.

Also, I had to include the `.Index` and `.Element` constraints in the where clause because of compile errors related to a conformance issue (`BinaryInteger`) and a complaint about a conversion. I think the conversion is no longer in the code, but I left the where clause as is.

Are you sure you don't want `@inline(__always)` instead of `@inlinable`? See When should both @inlinable and @inline(__always) be used? - #2 by scanon
If `addingFullWidth` is not referenced from outside the module it's defined in, then adding `@inlinable` has no effect.

There is much for me to learn about Swift. I made the change to `@inline(__always)`. Thanks for the recommendation.

I am pretty sure `@inlinable` is the one you want (sort of).

1. @scanon's comment is in answer to a question about contributing to the standard library, where `@inline(__always)` is a relevant implementation detail. But for general code, `@inline(__always)` is only a hidden and experimental feature, hence the underscore.

2. `@inline(__always)` is actually asking the entire function to be inlined separately into each individual call site. That is fine for a simple function like `&&` where the implementation is simpler than the function call. But applying it to your much larger function might lead to massive bloating of the binary.

3. `@inlinable` simply grants visibility of the implementation to other modules, allowing the compiler to heuristically decide what to do. Most of the time it results in a single, specialized function being generated for each type, and then that function is used by all call sites in the module. The compiler already does this sort of optimization within the module no matter what. In the future, it should also do it automatically between modules when the two modules are compiled together from source. But until then we can "misuse" `@inlinable` to let the compiler see through, as long as we remember that we will need to remove or rethink it if we try to vend the same code as a binary, since it takes on additional implications in that context. (This is why I wrote "sort of" at the very beginning; in reality you do not want to use either annotation and you want the compiler to just do it for you, but it is not actually capable of that yet.)

It does if any of the call sites are themselves in an `@inlinable` context.


Note, this code's performance is potentially going to be significantly impacted by the way it builds its result array. `result = result + [1]` is an O(n) operation, which builds a whole new array, replacing the old one. Ideally you'd do `result.append(1)` to do an in-place append.

But this leads to the other thing, which is this function is allocating storage. Sometimes that's what you want, but usually for this kind of thing you want to implement this as the equivalent of `+=`, and then implement `+` in terms of that for the times when the user really does want a fresh array allocated.
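That pattern might be sketched as follows; `addInPlace` and `adding` are hypothetical stand-ins for the real limb-addition kernel, shown on plain `[UInt]` for brevity:

```swift
// Sketch: the in-place addition is the primitive; the allocating
// version copies once and then reuses the in-place code.
func addInPlace(_ lhs: inout [UInt], _ rhs: [UInt]) {
    var carry: UInt = 0
    for i in 0..<rhs.count {
        if i < lhs.count {
            let (sum, o1) = lhs[i].addingReportingOverflow(rhs[i])
            let (sum2, o2) = sum.addingReportingOverflow(carry)
            lhs[i] = sum2
            carry = (o1 || o2) ? 1 : 0
        } else {
            // lhs exhausted: append remaining rhs limbs plus carry
            let (sum, o) = rhs[i].addingReportingOverflow(carry)
            lhs.append(sum)
            carry = o ? 1 : 0
        }
    }
    // propagate any remaining carry through the rest of lhs
    var i = rhs.count
    while carry > 0, i < lhs.count {
        let (sum, o) = lhs[i].addingReportingOverflow(carry)
        lhs[i] = sum
        carry = o ? 1 : 0
        i += 1
    }
    if carry > 0 { lhs.append(1) }  // in-place append, not result + [1]
}

// The allocating version: one copy, then the in-place primitive.
func adding(_ lhs: [UInt], _ rhs: [UInt]) -> [UInt] {
    var result = lhs
    addInPlace(&result, rhs)
    return result
}
```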

This suggests what you want to do is implement this function as a mutating function on `self` that takes a right-hand side of another array:

```swift
extension RandomAccessCollection
where Self: MutableCollection,
      Self: RangeReplaceableCollection,
      Element == UInt {
    // in-place add rhs to self
    mutating func array64Add<
        RHS: RandomAccessCollection
    >(_ rhs: RHS, _ uselastcarry: Bool = true) where RHS.Element == UInt {

    }
}
```

I'd also suggest moving off using the pointer. Write the code generically, get it as clean as possible with the right algorithmic complexity. Then if it's slow because the optimizer isn't turning generic array subscript ops into loads during specialization, start to incorporate use of `withUnsafeBufferPointer` or `withContiguousStorageIfAvailable` to speed it back up. But start with clean code first, get the algorithm right in principle, then optimize for performance.
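As a rough sketch of that escalation path (`limbTotal` is a hypothetical example), keep the generic code as the fallback and opportunistically take the contiguous fast path:

```swift
// Sketch: try the contiguous fast path, fall back to the generic
// algorithm. The same shape applies to an add or multiply kernel.
func limbTotal<C: RandomAccessCollection>(_ x: C) -> UInt where C.Element == UInt {
    if let total = x.withContiguousStorageIfAvailable({ buffer -> UInt in
        // Raw buffer loop: no index abstraction for the optimizer
        // to see through; Array and ArraySlice both take this path.
        var total: UInt = 0
        for element in buffer { total &+= element }
        return total
    }) {
        return total
    }
    // Generic fallback for collections without contiguous storage.
    var total: UInt = 0
    for index in x.indices { total &+= x[index] }
    return total
}

let values: [UInt] = [1, 2, 3]
limbTotal(values)              // 6
limbTotal(values.dropFirst())  // 5
```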


btw this uses `RangeReplaceableCollection` because that can grow when needed, which you do if the right-hand side needs more digits. It's unfortunate though, because you can't then use this algorithm on `UnsafeBufferPointer<UInt>` (because they can't grow... they're just a pointer and a length) which is a big benefit if you've gone to the trouble of writing it generically. So it's worth trying to externalize the requirement somehow so the caller needs to ensure enough space. At least in internal methods.

Is it possible to teach swift that `result = result + [1]` and `result.append(1)` is the same thing, so it does either in O(1)?

I created the function below to test the above. It was harder to satisfy the compiler, and I ran into a case that still fails when it executes (see the comment in the code, `self = result ...`). This appears to be a case where the arguments being an array slice and an array is an issue.

I cannot see how the compiler can optimize away all the `.startIndex` indexing. For a Big Integer a few words long, maybe not an issue. But if you are computing a billion digits of pi, that's a lot of additional operations.

```swift
// Returns a tuple containing the high and low components of UInt addition
@inline(__always)
func addingFullWidth(_ x: UInt, _ y: UInt) -> (high: UInt, low: UInt) {
    let sum = x.addingReportingOverflow(y)
    return (sum.overflow ? 1 : 0, sum.partialValue)
}

extension RandomAccessCollection
where Self: MutableCollection,
      Self: RangeReplaceableCollection,
      Element == UInt,
      Index == Int {

    // in-place add rhs to self
    mutating func Add<R: RandomAccessCollection>(_ rhs: R, _ uselastcarry: Bool = true)
    where R.Element == UInt, R.Index == Int {
        if self.count == 0 || (self.count == 1 && self.startIndex == 0) {
            self = Array(rhs) as! Self
            return
        }
        if rhs.count == 0 || (rhs.count == 1 && rhs.startIndex == 0) {
            self = Array(self) as! Self
            return
        }
        let (sc, rc) = (self.count, rhs.count)
        let (si, ri) = (self.startIndex, rhs.startIndex)
        let minc = Swift.min(sc, rc)
        var rIndex = 0
        var carry: UInt = 0
        if sc >= rc {
            while rIndex < minc {
                (carry, self[rIndex + si]) = addingFullWidth(self[rIndex + si], rhs[rIndex + ri], carry)
                rIndex += 1
            }
            if sc == rc {
                if carry > 0 && uselastcarry { self.append(1) }
                return
            }
            while carry > 0 && rIndex < sc {
                (carry, self[rIndex + si]) = addingFullWidth(self[rIndex + si], carry)
                rIndex += 1
            }
            if carry > 0 && uselastcarry { self.append(1) }
        } else {
            var result = Array(rhs)
            while rIndex < minc {
                (carry, result[rIndex]) = addingFullWidth(result[rIndex], self[rIndex + si], carry)
                rIndex += 1
            }
            while carry > 0 && rIndex < rc {
                (carry, result[rIndex]) = addingFullWidth(result[rIndex], carry)
                rIndex += 1
            }
            if carry > 0 && uselastcarry { result.append(1) }
            self = result as! Self // Could not cast value of type 'Swift.Array<Swift.UInt>' to 'Swift.ArraySlice<Swift.UInt>'.
        }
    }
}
```
1. I really doubt that performant code is possible without involving pointers. Libraries like GMP get their performance from a number of things (good algorithms, bare-metal C with lots of pointer usage, vectorization using SIMD and other processor instructions). I do not think that we will get a Swift package that will compete with GMP, because those developers fight down to the last processor cycle, but we can and will get closer. My current code is often in the 3-5x slower range, but can be improved. I can also hope for future Apple Silicon features (I know, just a dream). I could not find any functionality in Accelerate that was of any use.

2. I have been unable to create a pointer to a generic argument in the function. The compiler has refused all attempts. I don't know if it is even possible to get a pointer to a generic argument. It may require the compiler to know the specific type at compile time, which it can't.

3. Just a rant, but when Swift was designed with different behaviors for array and array slice, that was a bad day.

Generic code is quite nice for performance tuning, because it allows you to focus on writing your algorithm at an abstract level against an opaque interface (say, "some `RandomAccessCollection` of `UInt`s"), and you can plug in different implementations.

There is nothing about pointers that magically makes them fast. Pointers are just addresses in memory, and of course everything in your process' memory has an address. For example, `Array` uses pointers behind-the-scenes, but it layers validation on top to ensure memory safety, with the goal being that the optimiser will remove any redundant validation and it ends up being as fast as an unchecked pointer. There are two things it adds:

• Temporal/lifetime safety through ARC. This can be an overhead, but the compiler is becoming much, much smarter about optimising it through "ownership" and other performance predictability features.

• Spatial safety through bounds-checking. Ultimately this relies on the compiler's ability to reason about index values and how they evolve. Personally, I've found that Swift's use of signed integers can be very limiting here, and I've had great success addressing memory buffers using unsigned integers instead.

My own results show that the compiler is so much better at reasoning about unsigned integers, we can get much more predictable bounds-checking elimination this way. Moreover, the compiler can better understand the access pattern, allowing optimisations like automatic vectorisation.

This shouldn't be very surprising - proper collection algorithms obviously include their own bounds-checking (they don't rely on trapping at runtime!); we just need to find a way for the compiler to understand it better. I don't know if that means adding unsigned-indexed buffers to the standard library, or improving the optimiser (maybe both?).

Other than those two things, it is literally a pointer. We rely on the optimiser to remove them (and it is improving) but there are also things you can experiment with. It's an ongoing process to make those optimisations more accessible and predictable without jumping through special hoops; obviously we want idiomatic code to also be fast. So as far as portable code goes, I do actually think you can get at/close to C levels of performance - and you also do better, getting that kind of performance while also getting checked accesses to memory, which is a huge deal!

I should note that all of this is enabled by generics. The way I've been able to experiment with things like unsigned indexes at all is because my algorithms were all generic and didn't care what the actual index type was, so I could try them with an `Array`, and then with a wrapper which gives the Array unsigned indexes. That's why it's good to select your generic constraints carefully. Do you really need integer indexes? Remember that `RandomAccessCollection` gives you `O(1)` index construction from an integer offset anyway.
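For example, a Karatsuba-style split can be written with purely abstract indices; `splitHalves` is a hypothetical helper, and `index(_:offsetBy:)` is guaranteed O(1) on a `RandomAccessCollection`:

```swift
// Sketch: split any RandomAccessCollection in half without ever
// treating the index as an integer. The SubSequence results share
// storage with the original, so no copies are made.
func splitHalves<C: RandomAccessCollection>(_ x: C) -> (low: C.SubSequence, high: C.SubSequence) {
    let mid = x.index(x.startIndex, offsetBy: x.count / 2)  // O(1) here
    return (x[x.startIndex..<mid], x[mid...])
}

let limbs: [UInt] = [1, 2, 3, 4, 5]
let (low, high) = splitHalves(limbs)
// low is the first two limbs, high the remaining three; both are
// ArraySlices, and splitHalves accepts those slices recursively too.
```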

IMO, this is a really pleasing way to write code and optimise algorithms. It cleanly separates improvements to the generic algorithm itself from improvements to the data structure and how accesses are communicated to the compiler (and which validation we need it to perform for us).

As for non-portable code (processor-specific instructions), IIRC that is also something we want to add in some shape eventually. I don't think there are imminent plans for it, though. @scanon would know more.


Certainly. In fact we already have a form of this: `a += [b]` is rewritten as `a.append(b)`. But I have a lot of misgivings about that optimization.

Relying on these optimizer heroics to let people write the "wrong" thing performance-wise, instead of encouraging them to understand what the code is actually doing in principle, is very dangerous. These kinds of optimizations are great until they're not, at which point people fall off a performance cliff. You can hard-code `a = a + b` to be rewritten efficiently when `a` is an `Array` and `b` is a literal, but not necessarily when `a` is another type (like swift-collections' `Deque`) or when `b` is not a literal but another dynamically-generated `Array` of unknown size.

So if by having the simple case optimized, you let the community dismiss the concerns of why their code is inefficient, because they prefer how the inefficient version looks and the compiler sweeps the problem under the rug, you create a worse problem where the attitude "just write the code that you like, and the compiler will fix it" becomes ingrained.


The problem stems from things like this:

```swift
self = Array(self) as! Self
```

When writing the code generically, you cannot do this. The compiler will let you write it, because `as!` means "don't question me, this is guaranteed to work", but it is absolutely not guaranteed to work: this conversion will crash at runtime if `Self` is anything other than an `Array`.

In this particular case, you can rewrite this line as `self = Self(rhs)`. This is possible because `RangeReplaceableCollection` provides this functionality (the ability to initialize the type from another `Sequence` of any type with the same `Element`).
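To illustrate (with a hypothetical `replaceContents` method), the `RangeReplaceableCollection` initializer works for any conforming `Self`, with no forced cast:

```swift
extension RangeReplaceableCollection where Element == UInt {
    // Sketch: replace `self = Array(rhs) as! Self` with the
    // RangeReplaceableCollection initializer, which builds a Self
    // from any Sequence of the same Element - no runtime cast.
    mutating func replaceContents<R: Sequence>(with rhs: R) where R.Element == UInt {
        self = Self(rhs)
    }
}

var a: [UInt] = [1, 2]
a.replaceContents(with: [9, 9, 9] as [UInt])
// Works identically when `a` is an ArraySlice<UInt>, where the
// forced cast to Self would have crashed.
```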

In this particular case, that operation is fine. But later on you do this:

```swift
var result = Array(rhs)
// [...]
self = result as! Self
```

The same fix can be applied (`var result = Self(rhs)`). This will compile and work, but I don't recommend taking that fix. The reason is that you've again fallen back to solving the problem by allocating memory and replacing the left-hand side with it, something that will kill performance if this function is itself being called in a tight loop. Instead, the code needs to be rewritten so that it operates only on the left-hand side.

It may be that the suggestion of generics was a bit premature here. I agree it's the desired end-state, but it's probably worth taking a different path to get there.

What I would do is go back to implementing this purely on a concrete `Array` type. But take a rigorous approach to what you are trying to achieve: write a function that you are confident does not allocate any memory so long as there is space in `self`. Any use of `Array(...)` should be eliminated. At the same time, try to avoid using literals like `0` for the index, instead using only relative things like `startIndex`.

Once you've done that, the next challenge is to write it so that overflows are detected and returned separately from the in-place addition. That way you can factor out the "growing the Array" part. Maybe some kind of precondition that the caller must include sufficient zero-initialized space in `self` to flow into, with some helper methods that do this for you before calling the `add` method.
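One possible shape for that factoring, sketched with a hypothetical `addReportingCarry` function on plain arrays: the kernel never grows storage, and instead reports the final carry so the caller decides how to extend:

```swift
// Sketch: add rhs into lhs limb-by-limb without ever growing lhs,
// returning the final carry. Precondition: the caller has ensured
// lhs.count >= rhs.count. The same body also works inside
// withUnsafeMutableBufferPointer, since it never appends.
func addReportingCarry(_ lhs: inout [UInt], _ rhs: [UInt]) -> UInt {
    precondition(lhs.count >= rhs.count, "caller must provide enough space")
    var carry: UInt = 0
    for i in 0..<rhs.count {
        let (sum, o1) = lhs[i].addingReportingOverflow(rhs[i])
        let (sum2, o2) = sum.addingReportingOverflow(carry)
        lhs[i] = sum2
        carry = (o1 || o2) ? 1 : 0
    }
    // propagate the carry through the remaining limbs of lhs
    var i = rhs.count
    while carry > 0, i < lhs.count {
        let (sum, o) = lhs[i].addingReportingOverflow(carry)
        lhs[i] = sum
        carry = o ? 1 : 0
        i += 1
    }
    return carry  // nonzero means the caller needs one more limb
}

var acc: [UInt] = [UInt.max, 0]
let carry = addReportingCarry(&acc, [1])
// acc absorbed the overflow into its second limb; carry is 0
```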

Once you've done that, you should be able to use the same code with both `Array` and `UnsafeMutableBufferPointer`. You can wrap the body of the implementation in `self.withUnsafeMutableBufferPointer { /* implementation goes here */ }`. This should allow you to use a microbenchmark to determine whether the implementation benefits from using pointers instead, but without having to bake the pointers into the function signature itself.

Finally, once everything is working nicely, you can reintroduce the generics, and this code can be made to work for various inputs including arrays, slices, unsafe buffers, deques, etc.

p.s. It looks like you're missing a helper function when posting this code, as there's a version of `addingFullWidth` that takes a third argument, which means your snippets don't compile stand-alone.


Sigh. I see the compiler suggests doing this as a fixit. That's not good.


As I understand it, `+=` takes `a` as `inout`, so it is equivalent to calling a mutating member like `append(_:)` on it.

For what it's worth, I used to cargo-cult `append(_:)` everywhere and never use `+=`, but I'm starting to question this. `+=` is certainly easier to refactor, just because it's easier to manipulate in a text editor.

Along the same lines, is there any benefit to using `CollectionOfOne` instead of `[b]`? A while ago I was told this was faster, and now `CollectionOfOne` is everywhere in my code, to questionable benefit.


This is diverging from the OP so might be better as a separate thread, but...

There's really two things here:

• avoiding allocating an unnecessary array for `[b]`
• loop unrolling for appending a small fixed number of elements

In both cases, there are other paths than the bespoke `a += [b]` optimization. `[b]` ought to be caught by stack allocation, and once all the array code is inlined, it ought to unroll by itself. But both optimizations are quite brittle, so an earlier bespoke optimization just for appending is more likely to succeed and also possibly compiles faster.
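For comparison, the three single-element append spellings side by side:

```swift
var a: [UInt] = [1, 2, 3]

// Builds a temporary single-element Array unless the optimizer
// intervenes (stack promotion, or the bespoke += rewrite).
a += [4]

// No temporary Array: CollectionOfOne is a value type holding
// exactly one element, so nothing needs heap allocation.
a.append(contentsOf: CollectionOfOne(5))

// Simplest and clearest for a single element.
a.append(6)
```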
