Introducing `ParallelSequence` and `ParallelCollection` to automatically execute `map`, `filter` and other functions on multiple threads

alessioburatti · May 30, 2019, 1:40pm

I had to use Java for a couple of things during my Master's thesis and I discovered a very nice API that Swift does not provide yet: parallelStream().

Now, for those of you who are not familiar with higher-order functions in Java, basically if you need to map or filter a sequence, you first have to invoke .stream() on the sequence itself. The cool part is that if you invoke .parallelStream(), your sequence gets automatically divided into chunks and the whole execution gets parallelized across all the cores of your machine.

I wanted to replicate this in Swift and so I studied the implementation of LazySequence and LazyCollection to have a starting point to work on.

I came up with this draft, heavily inspired by the implementation of Lazy; for now it supports map and filter: https://github.com/Buratti/Parallel

Please note that I posted this code just to let you see what the main idea is, as there are a couple of problems with the current implementation that I will discuss later.

The usage is exactly the same as lazy:

 let someCollection = 0..<30_000_000
 let doubles = someCollection.parallel.map { $0 * 2 }
 print(doubles[1]) // 2

The above code will split the Range<Int> into n parts, apply the transform function to each of the n parts on a different thread in parallel, and then flatten the results into a new ParallelCollection.
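
The split, transform and flatten steps described above can be sketched roughly as follows. This is only a minimal illustration, not the code from the linked repository: parallelMap and its chunks parameter are hypothetical names, the helper works on an Array rather than a ParallelCollection wrapper, and it uses DispatchQueue.concurrentPerform instead of Foundation.Thread.

```swift
import Dispatch

/// Hypothetical helper for illustration; not the API of the linked repository.
/// Splits `input` into up to `chunks` contiguous parts, maps each part
/// concurrently, then flattens the partial results in their original order.
func parallelMap<T, U>(_ input: [T], chunks: Int = 4, _ transform: (T) -> U) -> [U] {
    guard !input.isEmpty else { return [] }
    let chunkCount = max(1, min(chunks, input.count))
    let chunkSize = (input.count + chunkCount - 1) / chunkCount // rounded up

    // One slot per chunk; each iteration writes only to its own slot.
    var partials = [[U]](repeating: [], count: chunkCount)
    partials.withUnsafeMutableBufferPointer { buffer in
        let slots = buffer
        DispatchQueue.concurrentPerform(iterations: chunkCount) { index in
            // Clamp both bounds so trailing chunks may be empty but never out of range.
            let start = min(index * chunkSize, input.count)
            let end = min(start + chunkSize, input.count)
            slots[index] = input[start..<end].map(transform)
        }
    }
    // Flatten the per-chunk results back into a single array.
    return partials.flatMap { $0 }
}

let doubles = parallelMap(Array(0..<1_000)) { $0 * 2 }
print(doubles[1]) // 2
```

Writing each chunk into its own pre-allocated slot avoids any locking around the result storage, and the order of the output matches the order of the input.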

Current problems:

Since my current implementation is just a draft, I used Foundation.Thread, but if we wanted to add parallel to the Standard Library we would need to work with pthreads and, as far as I know, it is not possible to use them to execute code that at some point needs to work with generic types.
I also tried to use SwiftPrivateThreadExtras with no luck.
As far as I know, neither Foundation.Thread nor DispatchQueue allows errors to be rethrown.
There might be confusion on the combination of Parallel and Lazy and their behaviour should be deeply analyzed in order to decide what happens in cases like myArray.lazy.parallel.map, myArray.parallel.lazy.map or myArray.parallel.lazy.parallel.lazy.map.
It would be up to the user to synchronize the access to shared states inside of the given closure, so for example code like the following

var globalState: MutatingState = ...
var result = someCollection.parallel.map { val in
    globalState.change()
    return val.someOperation()
}

would need to be written as

var globalState: MutatingState = ...
let synchronized = Synchronized()
var result = someCollection.parallel.map { val in
    synchronized {
        globalState.change()
    }
    return val.someOperation()
}
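
For readers without access to the linked example, a Synchronized type that supports the call syntax above could look roughly like this. It is a sketch under assumptions, not the author's implementation: it wraps an NSLock and relies on callAsFunction, which requires Swift 5.2 or later. The countConcurrently function is a hypothetical demo.

```swift
import Foundation
import Dispatch

/// Hypothetical stand-in for the `Synchronized` type used above.
final class Synchronized {
    private let lock = NSLock()

    /// Runs `body` while holding the lock.
    /// `callAsFunction` is what makes `synchronized { ... }` valid syntax.
    func callAsFunction<T>(_ body: () throws -> T) rethrows -> T {
        lock.lock()
        defer { lock.unlock() }
        return try body()
    }
}

/// Hypothetical demo: increments a shared counter from many concurrent iterations.
func countConcurrently(iterations: Int) -> Int {
    let synchronized = Synchronized()
    var counter = 0
    DispatchQueue.concurrentPerform(iterations: iterations) { _ in
        // Without the lock this increment would be a data race.
        synchronized { counter += 1 }
    }
    return counter
}

print(countConcurrently(iterations: 1_000)) // 1000
```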

(Off topic: you can find my example of Synchronized here.)
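
On the error-propagation problem listed above, one possible workaround is to capture each outcome in a Result inside the concurrent work and throw only after all work has finished. The sketch below assumes Dispatch is available; tryParallelMap is a hypothetical name. Note that the function has to be marked throws rather than rethrows, because the error no longer comes directly from a call to the closure parameter.

```swift
import Dispatch

/// Hypothetical helper for illustration; not part of the linked repository.
/// Runs `transform` concurrently, one iteration per element for simplicity,
/// and surfaces the first failure (in input order) once all work is done.
func tryParallelMap<T, U>(_ input: [T], _ transform: (T) throws -> U) throws -> [U] {
    // One optional slot per element; each iteration fills only its own slot.
    var results = [Result<U, Error>?](repeating: nil, count: input.count)
    results.withUnsafeMutableBufferPointer { buffer in
        let slots = buffer
        DispatchQueue.concurrentPerform(iterations: input.count) { index in
            // Capture success or failure instead of throwing across threads.
            slots[index] = Result<U, Error>(catching: { try transform(input[index]) })
        }
    }
    // Every slot has been filled by now; `get()` throws the stored error, if any.
    return try results.map { try $0!.get() }
}
```

A real implementation would chunk the input instead of scheduling one iteration per element, and might want to stop early after the first failure; both are left out here to keep the idea visible.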

Conclusion

As @hartbit suggested to me, this idea might make more sense to implement once we have first-class concurrency features in the language, but I'd still like to discuss it with the community and hear your opinions about it.

GarthSnyder · May 30, 2019, 8:14pm

This looks nice. And the current alternative that fills this niche (DispatchQueue.concurrentPerform) is somewhat clunky.

My initial thought was "nice, but highly likely to get trapped behind the 'general concurrency solution' elephant in the Swift pipeline." But on reflection I wonder if that's necessarily true. This feature seems relatively simple, separable, and orthogonal.

Is there an efficiency reason to prefer raw threads over GCD? GCD seems like the most natural underlying implementation, and in particular, it would be nice to be able to specify, e.g., .parallel(myQueue).

Jon_Shier · May 30, 2019, 8:33pm

Even if that's true (which I don't necessarily think it is), this feature will necessarily intersect with Swift's parallelism design story, just like any other multithreaded code. This is especially true if it's integrated into the standard library, given the requirement for ABI stability of new features. While this is a nice convenience, I don't think it's important enough to paint Swift into a corner just to get it out earlier.

alessioburatti · May 30, 2019, 9:11pm

Excuse me for my ignorance, I am new to this part of Swift.
I proposed raw threads because it was my understanding that inside of the Standard Library only Swift's expressions and other Standard Library's types could be used.
If it is fine to import Foundation or Dispatch then I don't see any reason why we shouldn't use GCD.

Again, since I am new to the development of Swift itself: wouldn't it be possible to implement this feature while complying with the API stability requirements? If not, would it be possible to do so once first-class concurrency features are implemented in the language?

Jon_Shier · May 30, 2019, 9:14pm

My main concern is that, once implemented and shipped as part of the language, the ABI impact will be locked, meaning we may not be able to switch to, say, a native Swift async implementation when it becomes available. Now, if Swift had a proper Experimental package like other languages, we could ship this feature whenever and then only lock it down when we're sure things won't need to change. But it doesn't, and it doesn't seem likely that it will, so we need to ensure the implementation is done in a way that is as future-proof as possible.