When you talk about GPU acceleration, will it just be CUDA (NVIDIA Jetson), or also OpenCL? I'm not a big fan of NVIDIA for making GPGPU synonymous with "NVIDIA-only" pretty much everywhere. It took a while to open up TensorFlow and PyTorch to Apple GPUs, but then it's still restricted to a small subset of platforms for the average consumer. Are you planning to use OpenCL anywhere for kernels? If so, the SwiftOpenCL repository I'm currently planning/prototyping might be insightful.
I have a diagram of how this fits into hardware-accelerated Swift numerical computing frameworks, and OpenCL is a key component of it. I plan to present this to the recently established Swift numerics working group; the "Swift Matrix Library" is a framework we are planning to provide the linear algebra capabilities that S4TF lacks.
@Chris_Lattner3 I share the feeling that TensorFlow's C++ code base is not that flexible, but the work Google put into X10 made it easier to generate graphs from eager-like code. Rather, I see CTensorFlow as problematic because it can't compile for iOS (except for inference-only with TFLite) and PluggableDevice only works when you're going to build the Python bindings. Making a more cooperative backend from scratch allows it to be compiled much faster, and through SwiftPM instead of Bazel. Also, since the new backends are written in Swift, they should be more maintainable and have a lower barrier to entry for anyone wanting to contribute. What are your thoughts on this - does that "lower barrier to entry for contributors" align with the vision you had for Swift when initially creating the language?
To add my 2 cents to the discussion: While developing SwiftFusion, one of the biggest problems we had was with the performance of the Tensor interface - it is super slow. I remember Marc and Dave writing a few fixed-size Matrix types, even without any BLAS backend, which boosted performance tenfold. In my view, this is actually due to a suboptimal division of labor: the basic tensor type should just be a plain data type, like a NumPy array, and the tensor fusion (XLA) machinery should operate on a different type that can interact with plain tensors.
Other problems include the lack of const generics, which makes it hard to build fixed-size matrices of arbitrary dimensions (mitigated by using gyb). We also encountered some problems involving associated types, which should be much better now that Swift 5.7 has primary associated types.
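To make the fixed-size-matrix point concrete, here is a minimal sketch of the kind of type being described (hypothetical code, not SwiftFusion's actual implementation; the real types are generated with gyb across multiple dimensions). With inline SIMD rows there is no heap allocation and no reference counting in the hot path:

```swift
/// A 3x3 matrix with rows stored inline as SIMD vectors: no heap
/// allocation, no ARC traffic, and all sizes known at compile time.
struct Matrix3 {
    var r0, r1, r2: SIMD3<Double>

    static let identity = Matrix3(
        r0: SIMD3(1, 0, 0), r1: SIMD3(0, 1, 0), r2: SIMD3(0, 0, 1))

    static func * (a: Matrix3, b: Matrix3) -> Matrix3 {
        // Row i of the product is a linear combination of b's rows,
        // weighted by the entries of a's row i.
        func row(_ r: SIMD3<Double>) -> SIMD3<Double> {
            r.x * b.r0 + r.y * b.r1 + r.z * b.r2
        }
        return Matrix3(r0: row(a.r0), r1: row(a.r1), r2: row(a.r2))
    }
}
```

Writing one of these per dimension is exactly the boilerplate that const generics would eliminate, hence the gyb workaround.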
Swift is currently also not able to generate bare-metal code, despite the fact that LLVM can emit PTX and others. This is not too big of an issue though, as all the other ML frameworks rely on manually tuned kernels. Also, if we can do JIT, emitting kernels at runtime with LLVM is definitely a great way to improve performance.
Finally, I still see Swift as a very potent competitor in the market of scientific computing - while SwiftFusion cannot be called optimized in any way, it already has performance figures largely on par with optimization frameworks written in C++ using Eigen. I am looking forward to the next release :)
@fan a lot of the performance concerns you outlined have been discussed by the Swift Numerics Working Group on Slack. We are planning to create a Swift linear algebra library that compiles down to native code and has LLVM optimizations, eliminating the multiple layers of indirection present in S4TF’s Tensor. Would you mind hopping on the Slack channel (posted on the Numerical/ML working group thread started by @Troy_Harvey) so we can discuss making SwiftFusion a client library? Hearing about your experience with CPU-side performance would also help us figure out our priorities.
Also, the discussion on this thread is veering off from its intended purpose of discussing AutoDiff progress. How about future participants try to move discussions to the SNWG’s Slack (when appropriate)? We have a thread titled “autodiff” which could host discussions related to automatic differentiation.
Very cool, makes a ton of sense. Given full control over the stack, I can see how this would be a very nice design! Just to be clear, I wasn't trying to suggest that you "should" use GPUs or accelerators (I don't know enough about your domain etc), I was just saying that the forced tie-in to TensorFlow and XLA was a huge problem, and it seems like one you've deftly solved by ignoring it and building your own thing.
I mean, that "easier to generate graphs" is really the problem - you shouldn't want that. There is nothing inherently better about "graphs" for a differentiable-Swift-like approach. Once you have a fast host language (unlike Python), there is no reason a "graph interpreter" should be faster than the dynamic host program. You might as well be eager mode all the way and get the benefits of dynamism etc. This allows Swift to be a great CPU language and compose with the offload approach of your choice (or not bother).
Good question. Our toolchain can be thought of as a superset of what Modelica tries to accomplish. There are several pieces in PassiveLogic's frameworks that work together to solve physics simulations while offering a deep-learning-like approach to physics solving. Many of these are slated for open source, and Marin on our team will be leading that effort. These parts are:
Quantum - A digital twin graph language. This is open sourced. We have a dozen large industrial, technology, and building partners (Nvidia, Belimo, Brookfield, the Department of Energy, PNNL, etc.), and the roster is growing. You can think of this as encoding physics similarly to Modelica, but in a graph structure, organized as an existential ontology ("Who am I?", "What do I do?", "Why do I do it?", etc.).
Quantum Solver - the Swift compute engine that solves graphs, physics, simulations, and AI problems with respect to graph networks & GNNs.
IntrospectionKit - the Swift library that builds on keypaths to provide general object queries, metadata introspection, generators & caching, broadcasting, Lenses, Lens pipes (telescopes), and metagraph support.
Differentiable Swift Compiler - provides industry leading differentiation support for arbitrary code.
Apropos of your comment, I just had an inbound collaboration request from Berkeley National Labs to work with them on Quantum & Differentiable Swift interop with Modelica. This has been on our long-term roadmap, but this could make it more of a near-term initiative. If this is a topic area you'd be interested in, let me know. Berkeley drives much of the Department of Energy's work on model predictive control, building & energy simulation tools, and the like.
Bringing the nicely designed S4TF NN APIs back to life is on our roadmap, but it hasn't been a high priority, since we are focused on graph compute. For the same reason, general-purpose Matrix and BLAS support hasn't been our current focus. Algebraic graph compute is more flexible and generic, but not as optimized for the tensor use cases.
One interesting data set we have is a specialized thermodynamic simulator we've written. We had a fast hand-tuned C version. Our Swift version as of three years ago ran at 98% of the speed of the C version's fixed arrays, using Swift's contiguous array types. Then we optimized it further using Swift SIMD types, for a 3x speed gain over the C version. Recently we tested a standard Swift array version, and it equalled the performance of the hand-tuned Swift SIMD version. Compiler improvements have replaced months of custom optimization. Not that this eliminates the need for fixed array types, but for CPU dispatch the compiler is now doing much of the heavy lifting.
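As a hedged illustration of the two styles being compared (not PassiveLogic's actual simulator code), here is the same reduction written over a plain contiguous Swift array and over explicit SIMD4 lanes; recent Swift compilers will often auto-vectorize the first form to match the second:

```swift
// Scalar version: a straightforward loop over a contiguous Swift array.
func dotScalar(_ a: [Double], _ b: [Double]) -> Double {
    var sum = 0.0
    for i in 0..<a.count { sum += a[i] * b[i] }
    return sum
}

// Hand-vectorized version: four lanes per iteration via SIMD4, with a
// single horizontal reduction at the end.
func dotSIMD(_ a: [SIMD4<Double>], _ b: [SIMD4<Double>]) -> Double {
    var acc = SIMD4<Double>(repeating: 0)
    for i in 0..<a.count { acc += a[i] * b[i] }  // elementwise multiply-add
    return acc.sum()
}
```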
You'll need to clarify this more. What I'm doing is bringing S4TF back to life, hosting active repositories in the s4tf GitHub organization. That statement sounded like you are doing the same thing that I am. Do you mean to bring the TensorFlow Swift package itself back to life, or just a neural network API that's extremely similar to S4TF? If there's any overlap in our goals, perhaps my work on S4TF could contribute to the new NN framework that you are creating. It would be nice for my work to get out of the status of being "unofficial" or "a lost cause" not endorsed by Google, and have it be part of something official that's truly going to be used by a lot of people.
Thanks for the feedback, Chris. Accelerators are definitely going to be important; it was just first things first for getting all of the infrastructure in place. Simultaneous compiler, framework, and application development was an interesting juggle.
I'm interested in any input you have on the accelerator front, and tooling for it. We will be announcing a fun partnership in the next few weeks with one of the big accelerator companies, but things are just in the planning phase.
I was surprised when you said that, given that Google invested a lot of time in making X10 and collaborating with PyTorch on the LazyTensor implementation. The MLIR graph compiler (which evolved from XLA) speeds up machine learning measurably, enough for Apple to use it inside Metal Performance Shaders Graph. Even though you're very experienced with compilers, I strongly doubted the quoted statement. It turns out that the S4TF graph interpreter only decreases computation time by 30-40% on average. That's just one ML example, but PyTorch has thrived for a long time with eager mode as the default.
This explanation about graph optimizations is fairly long, so I'm condensing it into a drop-down.
It seems that LazyTensor (the graph interpreter) is tied to TPUs more than anything else. For those accelerators, you cannot dispatch operations eagerly. The XLA instruction set (which LazyTensor revolves around) was tailor-made for TPUs, not for CPUs and GPUs. In an old S4TF Colab notebook, they reduced execution time on CPU/GPU by 75% (i.e., a 4x speedup), which is impressive. But that's a best case, and the table above represents the more general case.
The improvement is modest, but why does the program run faster at all? My explanation pertains to ML, but I'm interested in whether the same rules apply to PassiveLogic's work with graph construction. When you have a chain of pointwise operators, like the Mish activation function, you dispatch multiple unique primitive operators to a backend (tanh, multiply, exponent, etc.). For each operator, you incur overhead from calling into the GPU driver and from storing tensors in RAM between operators. In graph mode, many of those inefficiencies are optimized away: sequences of unary operators can be "fused", or lowered into GPU shader code. The big benefit doesn't come from creating native GPU shaders, but from the fact that data stays in GPU registers between operators.
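To make the operator chain concrete, here is Mish written out as scalar Swift (a stand-in for the tensor computation; on an eager GPU backend, each of these elementwise steps would be a separate primitive dispatch):

```swift
import Foundation

// mish(x) = x * tanh(softplus(x)). Written over tensors, this chain
// dispatches roughly five primitive operators (exp, add, log, tanh,
// multiply), each paying driver-call overhead and a round trip through
// memory unless the backend fuses them into one kernel.
func mish(_ x: Double) -> Double {
    let softplus = log(1 + exp(x))
    return x * tanh(softplus)
}
```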
This graph optimization is possible because you know whether a tensor is temporary. For example, the output of the "tanh" function is consumed by the "multiply" function. In Swift, ARC deallocates the output of "tanh" during the call into "multiply". I am working on a GPU backend for S4TF that utilizes this property of ARC, applying graph optimizations on the fly. It exploits the delay between encoding and execution on the GPU to minimize driver overhead, maximizing sequential throughput*. Instead of encoding operators immediately, it stores them in a large pipeline. The backend can scan up to 100 eagerly dispatched, enqueued operations and apply graph optimizations before sending them to the GPU. A good analogy is out-of-order execution on a CPU, which scans a stream of instructions (the reorder buffer) well before they execute and applies optimizations of its own. All of this happens under the hood, without requiring things like LazyTensorBarrier() in the frontend.
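A toy sketch of the fusion step being described (hypothetical; the actual backend is far more involved): enqueued operators accumulate in a pending list, and runs of adjacent unary operators are grouped into single kernels before anything is encoded for the GPU:

```swift
// Simplified operator description; real ops would carry shapes, buffers, etc.
enum Op: Equatable {
    case unary(String)   // e.g. tanh, exp: fusable into a preceding unary run
    case binary(String)  // e.g. multiply: starts a new group here
}

/// Groups runs of adjacent unary operators so each group can be encoded
/// as one fused kernel instead of one dispatch per operator.
func fuse(_ pending: [Op]) -> [[Op]] {
    var groups: [[Op]] = []
    for op in pending {
        switch (op, groups.last?.last) {
        case (.unary, .some(.unary)):
            groups[groups.count - 1].append(op)  // extend the fusion run
        default:
            groups.append([op])                  // start a new group
        }
    }
    return groups
}
```

A backend built this way would call `fuse(_:)` when the queue backlogs (say, at 100 pending operations) and then encode one GPU command per group rather than one per operator.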
*To clear up possible confusion, these are two separate optimizations. The first reduces driver overhead by two orders of magnitude, letting you run more primitive operations per second. The second is the opportunity to apply graph optimizations, which exists because of how the first is implemented.
This also solves the problem of compiling control flow in X10. Because the graph optimizations take negligible time to apply, you can unroll a massive loop and re-apply the optimization on every iteration. You also don't have to wait an extremely long time for the XLA compiler to process your graph. The first few operations execute almost immediately, and graph optimizations start happening once the operator queue gets backlogged.
Earlier in this thread, I tried steering discussion away from tangential topics and back to its main purpose: AutoDiff. This comment goes very far down a tangent, but here's my reasoning: Swift's AutoDiff meant you no longer had to construct a graph to apply automatic differentiation. Now, you don't even need to construct a graph to perform optimizations. Graphs have become basically obsolete for ML, unless you're using a TPU or a highly optimized setup like multi-GPU, which the average person doesn't have access to.
TL;DR - Because of Swift's unique characteristics as a host language, you can apply the optimizations that make graph mode fast without having an actual graph interpreter.
There isn't yet a formal working group for Differentiable Swift, or for ML-in-Swift efforts in general, but I think this thread is a reasonable place to ask this question. We're putting together a project-wide "roadmap" blog post looking forward to the next year of Swift development, and as part of that we're soliciting 2–4 paragraphs from a bunch of different corners of the project about the major things we're hoping to achieve in this timeframe. If the people here are willing to contribute that (say, by September 6th), I'd be happy to include it in the post.
It should be focused mostly on deliverables in official Swift projects, although briefly covering / linking to related projects may be fine.
Hey, John! Thanks for the opportunity, glad to hear about the interest. I can only speak for the group of people I work with, but we believe the next year will be an important one for differentiable Swift and related language features. We plan on investing in upstreaming improvements to the Swift language in these core areas:
Robustness: We will focus on fixing issues in differentiable Swift that impact production applications as we encounter them, but we are observing fewer of those over time. There are several other known issues (many with simple reproducers) present in Swift GitHub issues that can also be addressed.
Automatic differentiation performance: We'd like to significantly improve the host-side performance of compiled code using differentiable Swift. Nominally, the compiler-generated backwards pass through arbitrary Swift functions should be as fast (or only slightly slower) than the original version (forward pass) of those functions. At present, the compiler-generated backwards pass is orders of magnitude slower in many cases, so we have optimizations planned over the next year to make the backwards pass much faster (one example). We'd like to see if we can make Swift automatic differentiation as fast as, or even faster than, alternative implementations like Enzyme.
Swift key path performance: While not strictly a part of differentiable Swift, when optimizing strongly typed models in Swift, key paths become extremely important for introspection of these models. Fast key path traversal is vital for many uses of strongly-typed models, so we are hoping to upstream performance improvements in this area. As a first step, we've been working to add a robust set of key path benchmarks to the compiler suite.
Outside of this roadmap, the company I work for is gearing up for a beta launch of a large suite of Swift-heavy products built on the concept of generalized autonomy. These products depend on differentiable Swift to enable gradient-descent-based physics models for use in control and optimization of automation systems. Over the next year, we plan on a staged release into general availability where Swift will be powering the automation of many large and small buildings.
The current text is as follows; the edits are mostly to remove uses of "we" (a little vague/ambiguous in context) and to start with a short imperative sentence, which is the style we're using for the other bullets. Let me know if you want any changes.
Improve robustness by fixing issues in differentiable Swift that impact production applications as they are encountered. Fewer and fewer of these issues are being observed over time, but there are still some known issues (many with simple reproducers) in the issue tracker.
Significantly improve the host-side performance of compiled code using differentiable Swift. Nominally, the compiler-generated backwards pass through Swift functions should be as fast (or only slightly slower) than the original version (forward pass) of those functions. At present, the compiler-generated backwards pass is orders of magnitude slower in many cases; there are some planned optimizations over the next year that should make the backwards pass much faster (one example).
Implement performance improvements to key paths. While this is not strictly a part of differentiable Swift, when optimizing strongly typed models in Swift, key paths become extremely important for introspection of these models. Fast key path traversal is vital for many uses of strongly-typed models, so the hope is to upstream performance improvements in this area. As a first step, there’s been an effort to add a robust set of key path benchmarks to the compiler suite.
As a technical update, work has continued on some of the performance and robustness objectives laid out above. For those interested, I can summarize some of the pull requests that have gone in recently:
On the performance front, @asl polished and landed a pull request originally developed by @dan-zheng to enable peephole optimizations that unblock optimization of simple calls to gradient(at:of:). For simple differentiable functions (calculations involving scalar types, no control flow, etc.) in optimized builds, this can lead to a backwards pass that's as fast as the forward pass. Prior to these optimizations, the backwards pass for these functions could be up to 130 times slower, so this is a huge improvement for these simple cases.
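For reference, the shape of code these peephole optimizations target looks like the following (requires a toolchain with differentiable Swift support):

```swift
import _Differentiation

// A simple scalar function: no control flow, trivial types. This is the
// case where the optimized backwards pass can approach the cost of the
// forward pass.
@differentiable(reverse)
func f(_ x: Double) -> Double {
    x * x + 3 * x
}

// d/dx (x^2 + 3x) = 2x + 3, so the gradient at x = 2 is 7.
let dfdx = gradient(at: 2.0, of: f)
```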
My coworker Martin Cwikla upstreamed the rigorous keypath benchmarks I mentioned before. As an initial optimization, he has a pull request open now that precomputes the offset from the root to the value referenced by the keypath for structs with trivially-typed internals. This yields significant improvements in several keypath benchmarks, ranging from 13x to 64x. More work remains, but early results are encouraging.
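As a hedged illustration of why keypath traversal speed matters for strongly typed models: an optimizer step that walks a model's parameters through WritableKeyPaths performs one traversal per parameter per iteration, so the per-traversal cost multiplies quickly (hypothetical example, not the actual benchmark code):

```swift
struct Model {
    var weight: Double
    var bias: Double
}

// The parameter list as key paths; for a large model this would be
// hundreds or thousands of entries, visited on every optimizer step.
let parameters: [WritableKeyPath<Model, Double>] = [\.weight, \.bias]

func descend(_ model: inout Model, along gradient: Model, rate: Double) {
    for kp in parameters {
        // Each subscript access here is a key path traversal.
        model[keyPath: kp] -= rate * gradient[keyPath: kp]
    }
}
```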
In terms of making differentiable Swift more robust, @asl improved the system for custom derivative lookups, which allows custom derivatives for functions defined in other modules to be defined and then used in the same module (as well as fixing other cases). He also identified and fixed a source of segmentation faults around differentiable functions, which allowed a series of tests to be re-enabled.
Work is ongoing; I just thought it was worth providing an update for anyone following along.
Hi @Brad_Larson thanks so much for your work here and for the updates you've been posting. I read this thread earlier with a lot of enthusiasm and have been wondering how to sink my teeth into it properly (noting that this would be the first ML training project I'd be undertaking myself).
In particular, I'm wondering whether there are any plans to implement the Layer and optimiser APIs from the TensorFlow package? Or if you have alternative suggestions? I'm aware that differentiable Swift is supposed to be bigger than "just" neural nets, but I'm finding it very difficult to get my head around the implications of that without any higher-level APIs to play around with first.
What I'd really like to do is recreate a CNN we originally trained using TF 'proper', and deploy it to devices (iOS, Wasm, Android) with the help of import _Differentiation. With that, I'm hoping we could:
Remove TFLite from our stack (provided the performance is same or better)
More easily allow on-device training
Further improve our cross-platform story
Does that seem realistic to you, given that our models do run on the CPU today? Is the language feature at a stage where we can expect to match performance of the likes of TFLite and its highly-optimised XNNPack CPU delegate? Is it even a good use case?
edit: I should add that I'm not at all averse to getting my hands dirty, going off the beaten track, etc. Mostly curious whether it's even in the ballpark of being worth it at this point. And whether there are standardised solutions to, e.g. dense layers, activation functions, convolutions, etc., that I should be aware of (or whether I should just try to copy/paste some TensorFlow code initially).
Sorry for the slow reply, was on the road for a bit. Glad to hear about the interest.
There are maybe different concerns when talking about the performance and robustness of differentiable Swift as a language feature itself, versus higher-level frameworks built upon it. A higher-level framework oriented around a Tensor-like type that dispatches to an accelerator or has optimized vector operations might have different performance bottlenecks and may only use a subset of the language features differentiable Swift provides. The overhead that we're trying to eliminate at the language level in the backwards pass of compiled differentiable Swift functions might be trivial compared to that involved in the preparation and dispatch of big parallel operations.
For traditional CNNs, your performance will probably be determined by how well data can be batched for parallel operations, how well available hardware can be utilized, whether operations can be fused, whether memory can be reused, and so on. Established frameworks have put a lot of time and effort into those parts of the process, and they can be hard to beat in their sweet spot of traditional neural networks.
Differentiable Swift shines in the situations that need a fast host language, tremendous flexibility, and / or the ability to blur the lines between application logic and machine learning code. If a custom operation would let you perform a calculation much faster, you can simply write one in Swift instead of waiting for a framework's authors to build and deploy one. We've seen instances on mobile where that was a huge win over existing solutions. At PassiveLogic, differentiable Swift is absolutely vital to getting great performance at the edge for our physics-based simulations, where it is ~117X faster in one internal comparison vs. PyTorch for a representative simulation, and 189X faster than TensorFlow for the same. Our code seamlessly mixes application logic in Swift with optimization processes.
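A sketch of that "write your own op" workflow (hypothetical example; requires a toolchain with differentiable Swift): a custom scalar operation gets a hand-written derivative registered via @derivative, after which it composes with the rest of the autodiff system:

```swift
import Foundation
import _Differentiation

// A custom operation with a hand-written reverse-mode derivative.
func fastSigmoid(_ x: Double) -> Double {
    1 / (1 + exp(-x))
}

@derivative(of: fastSigmoid)
func fastSigmoidVJP(_ x: Double)
    -> (value: Double, pullback: (Double) -> Double) {
    let y = fastSigmoid(x)
    // d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    return (y, { v in v * y * (1 - y) })
}

// sigmoid'(0) = 0.5 * (1 - 0.5) = 0.25
let g = gradient(at: 0.0, of: fastSigmoid)
```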
Also, while differentiable Swift currently can be used on iOS, Android, and SwiftWASM (I've deployed to all three), support is what I would call "highly experimental" on those platforms and may not pass your comfort threshold for everyday production use. It's certainly fun to play with there, though.
That's a bit of a roundabout way of saying that I'm not sure whether replacing your existing TensorFlow or TFLite CNN model with something built in a framework layered on differentiable Swift would justify the rewrite work today. There are certainly cases where we've seen large performance advantages by building something in Swift that is outside of the comfort zone of existing ML frameworks, but that may not be your situation from what you describe.
Thanks a lot for your reply and for the great practical-level info @Brad_Larson, much appreciated! And I've taken your disclaimer to heart.
Nevertheless, I believe it's worth a shot – above all for our Wasm platform – in order to reduce overheads and dependency complexity, and because a successful deployment there could mean further unifying our codepaths across all of our platforms.