Ongoing work on differentiable Swift

Chris_Lattner3 · July 12, 2022, 4:29pm

Hi @philipturner, I assure you I'm quite familiar with all the technology you mention. :)

John_McCall · August 17, 2022, 11:06pm

There isn't yet a formal working group for Differentiable Swift, or for ML-in-Swift efforts in general, but I think this thread is a reasonable place to ask this question. We're putting together a project-wide "roadmap" blog post looking forward to the next year of Swift development, and as part of that we're soliciting 2–4 paragraphs from a bunch of different corners of the project about the major things we're hoping to achieve in this timeframe. If the people here are willing to contribute that (say, by September 6th), I'd be happy to include it in the post.

It should be focused mostly on deliverables in official Swift projects, although briefly covering / linking to related projects may be fine.

Brad_Larson · August 19, 2022, 9:52pm

Hey, John! Thanks for the opportunity, glad to hear about the interest. I can only speak for the group of people I work with, but we believe the next year will be an important one for differentiable Swift and related language features. We plan on investing in upstreaming improvements to the Swift language in these core areas:

Robustness: We will focus on fixing issues in differentiable Swift that impact production applications as we encounter them, but we are observing fewer of those over time. There are several other known issues (many with simple reproducers) present in Swift GitHub issues that can also be addressed.

Automatic differentiation performance: We'd like to significantly improve the host-side performance of compiled code using differentiable Swift. Nominally, the compiler-generated backwards pass through arbitrary Swift functions should be as fast as (or only slightly slower than) the original version (forward pass) of those functions. At present, the compiler-generated backwards pass is orders of magnitude slower in many cases, so we have optimizations planned over the next year to make the backwards pass much faster (one example). We'd like to see if we can make Swift automatic differentiation as fast as, or even faster than, alternative implementations like Enzyme.

Swift key path performance: While not strictly a part of differentiable Swift, when optimizing strongly typed models in Swift, key paths become extremely important for introspection of these models. Fast key path traversal is vital for many uses of strongly-typed models, so we are hoping to upstream performance improvements in this area. As a first step, we've been working to add a robust set of key path benchmarks to the compiler suite.
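
To make the key path point concrete, here is a minimal sketch of walking a strongly typed model's parameters through key paths. The `TinyModel` type, the hand-written `parameterPaths` list, and the `applyStep` helper are all hypothetical names chosen for illustration; they are not taken from any framework mentioned in this thread.

```swift
// Hypothetical model type, for illustration only.
struct TinyModel {
    var weight: Double = 1.0
    var bias: Double = 0.5
    var scale: Double = 2.0
}

// Key paths to every stored parameter, written out by hand in this sketch.
let parameterPaths: [WritableKeyPath<TinyModel, Double>] = [\.weight, \.bias, \.scale]

// Apply one plain gradient-descent step by traversing the key paths.
// Each iteration performs two key path reads and one key path write,
// which is why traversal speed matters for models with many parameters.
func applyStep(to model: inout TinyModel, gradient: TinyModel, learningRate: Double) {
    for path in parameterPaths {
        model[keyPath: path] -= learningRate * gradient[keyPath: path]
    }
}

var model = TinyModel()
// A made-up gradient, stored in the same struct shape as the model.
let gradient = TinyModel(weight: 0.5, bias: 1.0, scale: -2.0)
applyStep(to: &model, gradient: gradient, learningRate: 0.5)
print(model.weight, model.bias, model.scale)
```

The update loop never names a property directly, so the same code would work for any model that can list its parameter key paths.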

Outside of this roadmap, the company I work for is gearing up for a beta launch of a large suite of Swift-heavy products built on the concept of generalized autonomy. These products depend on differentiable Swift to enable gradient-descent-based physics models for use in control and optimization of automation systems. Over the next year, we plan on a staged release into general availability where Swift will be powering the automation of many large and small buildings.

John_McCall · August 20, 2022, 12:24am

Thanks! We may need to distill this down, but at the very least I can link to it, and this is very useful to know.

John_McCall · September 30, 2022, 9:45pm

The current text is as follows; the edits are mostly to remove uses of "we" (a little vague/ambiguous in context) and to start with a short imperative sentence, which is the style we're using for the other bullets. Let me know if you want any changes.

Differentiable Swift

Improve robustness by fixing issues in differentiable Swift that impact production applications as they are encountered. Fewer and fewer of these issues are being observed over time, but there are still some known issues (many with simple reproducers) in the issue tracker.
Significantly improve the host-side performance of compiled code using differentiable Swift. Nominally, the compiler-generated backwards pass through Swift functions should be as fast as (or only slightly slower than) the original version (forward pass) of those functions. At present, the compiler-generated backwards pass is orders of magnitude slower in many cases; there are some planned optimizations over the next year that should make the backwards pass much faster (one example).
Implement performance improvements to key paths. While this is not strictly a part of differentiable Swift, when optimizing strongly typed models in Swift, key paths become extremely important for introspection of these models. Fast key path traversal is vital for many uses of strongly-typed models, so the hope is to upstream performance improvements in this area. As a first step, there’s been an effort to add a robust set of key path benchmarks to the compiler suite.

Brad_Larson · October 3, 2022, 3:22pm

That looks great to me, thanks!

Brad_Larson · October 3, 2022, 11:56pm

As a technical update, work has continued on some of the performance and robustness objectives laid out above. For those interested, I can summarize some of the pull requests that have gone in recently:

On the performance front, @asl polished and landed a pull request originally developed by @dan-zheng to enable peephole optimizations that unblock optimizations of simple calls to gradient(at:of:). For simple differentiable functions (calculations involving scalar types, no control flow, etc.) in optimized builds, this can lead to a backwards pass that's as fast as the forward pass. Prior to these optimizations, the backwards pass for these functions could be up to 130 times slower, so this is a huge improvement for these simple cases.
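
For readers who haven't used the API, this is roughly the kind of "simple call" being described: a scalar function with no control flow, differentiated with `gradient(at:of:)`. It is only a sketch; the function `f` is a made-up example, and it assumes a toolchain that ships the experimental `_Differentiation` module.

```swift
import _Differentiation

// A simple differentiable function: scalar types, no control flow.
@differentiable(reverse)
func f(_ x: Double) -> Double {
    x * x + 3 * x
}

// The forward pass evaluates f; the gradient call runs the
// compiler-generated backwards pass. Analytically, f'(x) = 2x + 3.
let value = f(2.0)
let slope = gradient(at: 2.0, of: f)
print(value, slope)
```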

My coworker Martin Cwikla upstreamed the rigorous keypath benchmarks I'd mentioned before. As an initial optimization, he has a pull request open now that precomputes the offset from the root to the value referenced by the keypath for structs with trivially-typed internals. This yields significant improvements in several keypath benchmarks, ranging from 13 - 64X. More work remains, but early results are encouraging.
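
The idea behind a precomputed offset can be illustrated from user code with the standard library's `MemoryLayout.offset(of:)`. This is only a sketch of the concept, not the implementation in the pull request; the `Pose` struct is a hypothetical example of a struct with trivially-typed stored properties.

```swift
// Hypothetical struct whose stored properties are all trivial types.
struct Pose {
    var x: Double
    var y: Double
    var heading: Double
}

let pose = Pose(x: 1.0, y: 2.0, heading: 0.25)

// For a key path to a directly stored property, the byte offset from
// the start of the struct is fixed, so it can be computed once.
let headingOffset = MemoryLayout<Pose>.offset(of: \Pose.heading)

// Reading through the offset gives the same value as the key path read.
let viaKeyPath = pose[keyPath: \Pose.heading]
let viaOffset = withUnsafeBytes(of: pose) { raw in
    raw.load(fromByteOffset: headingOffset ?? 0, as: Double.self)
}
print(headingOffset as Any, viaKeyPath, viaOffset)
```

`offset(of:)` returns `nil` when a key path has no fixed stored offset (for example, computed properties), which is why the general key path machinery cannot always take this shortcut.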

In terms of making differentiable Swift more robust, @asl improved the system for custom derivative lookups, which allowed custom derivatives for functions previously defined in other modules to be defined and then used in the same module (as well as fixing other cases). He also identified and fixed a source of segmentation faults around differentiable functions, which allowed a series of tests to be re-enabled.
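
As a rough illustration of what registering a custom derivative looks like, the sketch below uses `@derivative(of:)` on a function defined in the same file. The `cube` and `cubeVJP` names are made up for this example; the cross-module case described above uses the same attribute, with the original function coming from another module. It assumes a toolchain with the experimental `_Differentiation` module.

```swift
import _Differentiation

// An ordinary function with no differentiability annotation.
func cube(_ x: Double) -> Double {
    x * x * x
}

// Register a hand-written derivative for `cube`. The function returns the
// original value plus a pullback that maps an upstream gradient `v`
// to v * d(cube)/dx = v * 3x^2.
@derivative(of: cube)
func cubeVJP(_ x: Double) -> (value: Double, pullback: (Double) -> Double) {
    (value: cube(x), pullback: { v in v * 3 * x * x })
}

// Differentiating through `cube` now looks up the registered derivative.
let dCube = gradient(at: 2.0) { x in cube(x) }
print(dCube)
```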

Work is ongoing; I just thought it was worth providing an update for anyone following along.

Geordie_J · December 6, 2022, 3:33pm

Hi @Brad_Larson thanks so much for your work here and for the updates you've been posting. I read this thread earlier with a lot of enthusiasm and have been wondering how to sink my teeth into it properly (noting that this would be the first ML training project I'd be undertaking myself).

In particular, I'm wondering whether there are any plans to implement the Layer and optimiser APIs from the TensorFlow package? Or if you have alternative suggestions? I'm aware that differentiable Swift is supposed to be bigger than "just" neural nets, but I'm finding it very difficult to get my head around the implications of that without any higher-level APIs to play around with first.

What I'd really like to do is recreate a CNN we originally trained using TF 'proper', and deploy it to devices (iOS, Wasm, Android) with the help of import _Differentiation. With that, I'm hoping we could:

Remove TFLite from our stack (provided the performance is same or better)
More easily allow on-device training
Further improve our cross-platform story

Does that seem realistic to you, given that our models do run on the CPU today? Is the language feature at a stage where we can expect to match performance of the likes of TFLite and its highly-optimised XNNPack CPU delegate? Is it even a good use case?

edit: I should add that I'm not at all averse to getting my hands dirty, going off the beaten track, etc. Mostly curious whether it's even in the ballpark of being worth it at this point. And whether there are standardised solutions to, e.g. dense layers, activation functions, convolutions, etc., that I should be aware of (or whether I should just try to copy/paste some TensorFlow code initially).

Brad_Larson · December 8, 2022, 8:54pm

Sorry for the slow reply, was on the road for a bit. Glad to hear about the interest.

There may be different concerns when talking about the performance and robustness of differentiable Swift as a language feature itself, versus higher-level frameworks built upon it. A higher-level framework oriented around a Tensor-like type that dispatches to an accelerator or has optimized vector operations might have different performance bottlenecks and may only use a subset of the language features differentiable Swift provides. The overhead that we're trying to eliminate at the language level in the backwards pass of compiled differentiable Swift functions might be trivial compared to that involved in the preparation and dispatch of big parallel operations.

For traditional CNNs, your performance will probably be determined by how well data can be batched for parallel operations, how well available hardware can be used, whether operations can be fused, whether memory can be reused, and so on. Established frameworks have put a lot of time and effort into those parts of the process, and they can be hard to beat for their sweet spot of traditional neural networks.

Differentiable Swift shines in the situations that need a fast host language, tremendous flexibility, and / or the ability to blur the lines between application logic and machine learning code. If a custom operation would let you perform a calculation much faster, you can simply write one in Swift instead of waiting for a framework's authors to build and deploy one. We've seen instances on mobile where that was a huge win over existing solutions. At PassiveLogic, differentiable Swift is absolutely vital to getting great performance at the edge for our physics-based simulations, where it is ~117X faster in one internal comparison vs. PyTorch for a representative simulation, and 189X faster than TensorFlow for the same. Our code seamlessly mixes application logic in Swift with optimization processes.

Also, while differentiable Swift currently can be used on iOS, Android, and SwiftWASM (I've deployed to all three), support is what I would call "highly experimental" on those platforms and may not pass your comfort threshold for everyday production use. It's certainly fun to play with there, though.

That's a bit of a roundabout way of saying that I'm not sure whether replacing your existing TensorFlow or TFLite CNN model with something built in a framework layered on differentiable Swift would justify the rewrite work today. There are certainly cases where we've seen large performance advantages by building something in Swift that is outside of the comfort zone of existing ML frameworks, but that may not be your situation from what you describe.

Geordie_J · December 8, 2022, 11:56pm

Thanks a lot for your reply and for the great practical-level info @Brad_Larson, much appreciated! And I've taken your disclaimer to heart.

Nevertheless, I believe it's worth a shot – above all for our Wasm platform – in order to reduce overheads and dependency complexity, and because a successful deployment there could mean further unifying our codepaths across all of our platforms.

rex-remind · December 12, 2022, 7:19pm

For clarity (and sorry if this was already covered or implied and I missed it): does your implementation still use the Accelerate framework or similar, or is everything hand-written? Are there internal benchmarks for leveraging Accelerate or similar?

And thank you to you and your team for all the incredible work here

fan · December 19, 2022, 8:14pm

I cannot speak for Brad, but I would infer that he means "program differentiation", i.e. differentiating the result of a simulator, etc.

BTW, I am also trying to bring the SwiftFusion project back to life by stripping out the TensorFlow Swift dependency. We beat (by a small threshold) a C++ impl for a sparse graph optimization workload without any matrix libraries, and that was before the S4TF project shut down... and I heard from Brad that they got a lot more speedups in the two years since :)

rex-remind · December 20, 2022, 7:16pm

Sorry, I could have been more clear. Though I too can't speak for him, I think it's clear from what he said that autodiff/program differentiation is at work, but my understanding is one could still use Accelerate/SIMD/what-have-you for basic operations even with autodiff. By hand written, I meant not using any acceleration at all and all basic operators are either written by hand (and then auto-differentiated) or use some std lib (i.e. not hardware accelerated at all) (and then auto-differentiated). That's what I'm curious about, though maybe that piece is also implied here.

We beat (by a small threshold) a C++ impl for a sparse graph optimization workload without any matrix libraries.

That's really promising

Brad_Larson · December 20, 2022, 9:04pm

Sorry for the slow reply, got behind after some travel.

For our current simulator code, we're targeting aarch64 Linux devices and doing development / testing on macOS. Therefore, we can't rely on a platform-specific framework like Accelerate. We do use Swift SIMD types where it makes sense, but there's no explicit underlying framework. It's straightforward Swift code that we're differentiating through for simulation and control path optimization. Our physics and control teams write the natural Swift code that describes the equations at work, and language-integrated automatic differentiation handles the rest.

That's not to say that we wouldn't start layering on top of another framework to simplify dispatch to accelerators at some point, just that we're not using an existing one right now.
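
To give a flavor of what "differentiating straight through natural Swift code" can look like, here is a toy sketch. It is not PassiveLogic's code: the cooling model, the names `simulatedTemperature` and `loss`, the target temperature, and the learning rate are all assumptions made up for this example. It assumes a toolchain with the experimental `_Differentiation` module.

```swift
import _Differentiation

// Toy physics model (hypothetical): a room at 20 degrees relaxes toward
// an outdoor temperature of 0 degrees over 10 time steps.
@differentiable(reverse)
func simulatedTemperature(conductance: Double) -> Double {
    var temperature = 20.0
    let outdoor = 0.0
    for _ in 0..<10 {
        temperature = temperature + conductance * (outdoor - temperature)
    }
    return temperature
}

// Squared error against a target final temperature of 10 degrees.
@differentiable(reverse)
func loss(conductance: Double) -> Double {
    let error = simulatedTemperature(conductance: conductance) - 10.0
    return error * error
}

// Plain gradient descent on the single physical parameter.
var conductance = 0.1
for _ in 0..<200 {
    let g = gradient(at: conductance, of: loss)
    conductance -= 2e-5 * g
}
// Analytically, the optimum is 1 - 0.5^(1/10), roughly 0.067.
print(conductance, simulatedTemperature(conductance: conductance))
```

The simulation is ordinary Swift with a loop and mutable state; no tensor type or framework is involved, and the compiler derives the backwards pass through the loop.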

rex-remind · December 21, 2022, 7:45pm

Incredible, thanks.

Brad_Larson · June 21, 2023, 6:03pm

It's been a few months since my last update, but a lot has still been going on around differentiable Swift on the compiler side. In line with our goals for differentiable Swift in 2023, many patches have been upstreamed to fix issues with and improve performance of differentiable Swift code. These include:

Correcting an issue with a peephole optimization for simple differentiable code: https://github.com/apple/swift/pull/62012
Fixing a performance regression in autodiff code due to missed inlining opportunities: "Ensure that partial_apply of partial_apply does not produce conservative global side effects" (apple/swift pull request #62351, by asl)
A robust fix for crashers involving duplicate debug info within differentiable functions containing control flow: "[AutoDiff] Refine debug info emitted for adjoint buffers" (apple/swift pull request #62779, by asl)
Adding a missing diagnosis for non-differentiable functions returning Void: "Diagnose differentiable functions returning Void w/o inout arguments." (apple/swift pull request #63080, by asl)
Removing linear map structs in favor of plain tuples, a really nice simplification that fixed some core issues and potentially unblocks performance improvements: https://github.com/apple/swift/pull/63444
Further simplifications of linear map tuples: "[AutoDiff] Unwrap the top level of linear map tuple when it is possible" (apple/swift pull request #63770, by asl)
Fixing broken derivative registration of accessors: "Enable propagation of @differentiable attribute from storage declarations to setters" (apple/swift pull request #63988, by asl)
Fixing bad activity analysis on terminal values: "Ensure autodiff code does not ignore `getSingleTerminatorOperands` return value" (apple/swift pull request #64200, by asl)
Adding missing VJP / JVPs for floating-point constructors: "Add missed vjp / jvp functions for floating-point constructors" (apple/swift pull request #64417, by asl)
Fixing a runtime segmentation fault when a pullback is used more than once: "[AutoDiff] Fix use after free when pullback is used multiple times" (apple/swift pull request #64647, by asl)
Removing replication of adjoint buffers in differentiable functions with control flow: "[AutoDiff] Do not propagate same adjoint buffer multiple times" (apple/swift pull request #64963, by asl)
Preserving substitution maps while calculating derivatives: "Preserve substituions maps while calculating derivative types" (apple/swift pull request #65451, by asl)
Fixing a runtime segfault with certain differentiable functions around loops: "Remove 'readnone' attribute from autoDiffCreateLinearMapContext" (apple/swift pull request #66203, by asl)
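
For context on the pullback item in the list above, this is a minimal sketch of what "using a pullback more than once" means at the source level. It is not the reproducer from the pull request, just an illustration of the pattern, and it assumes a toolchain with the experimental `_Differentiation` module.

```swift
import _Differentiation

// valueWithPullback returns the function's value together with a pullback
// closure that maps an upstream gradient to the gradient of the input.
let (cubed, pullback) = valueWithPullback(at: 3.0) { (x: Double) -> Double in
    x * x * x
}

// The same pullback is called twice with different upstream gradients.
// Analytically, d(x^3)/dx at x = 3 is 27, so these are 27 and 54.
let first = pullback(1.0)
let second = pullback(2.0)
print(cubed, first, second)
```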

I have to thank @asl for almost all of the effort on the fixes above; he did phenomenal work over the last few months. I may have also mentioned it before, but my coworker Martin was able to upstream his optimizations for keypath access as well as some new keypath-related benchmarks, an area of Swift that we're also heavy consumers of.

Differentiable Swift remains an incredibly important area of the Swift language for my team at PassiveLogic. We've been growing a team of compiler engineers to work on and upstream improvements for this and other areas of Swift that are key to our autonomous control systems. You should start seeing some new names on pull requests, and we're excited with how Swift is evolving. In particular, I think the intersection of Swift macros and differentiability presents opportunities to simplify code and add powerful new capabilities.

sspringer · June 21, 2023, 9:50pm

This is very interesting and I would like to hear more about it (maybe this could make an interesting Swift blog entry?). I know autodiff from machine learning, and Swift is not the obvious choice as a machine learning tool (to put it mildly), so it is interesting that autodiff in Swift is still useful.

Brad_Larson · June 22, 2023, 4:25pm

We've written a few posts about differentiable Swift itself and why it's neat, but we definitely could elaborate on the specific use cases that we find so exciting. I go into this in one of my posts above, but differentiable Swift really shines for machine learning and optimization applications that are not well served by existing frameworks. It definitely can be used for traditional neural networks (we did quite a bit of that in Swift for TensorFlow), but differentiable Swift uniquely enables certain classes of problems. It also lets you merge production and ML / optimization code such that you don't see where one begins and the other ends.

But yeah, I am behind in putting together a public repository of differentiable Swift examples that demonstrate use cases, following the excellent example of the Swift macros sample repository.

Troy_Harvey · June 22, 2023, 10:20pm

First I must say, well done Brad and PL compiler Team! The progress is so awesome.

Very good question. The reason PassiveLogic is investing so heavily in the differentiable Swift compiler is we are very focused on edge based generative autonomous systems. Swift has a unique set of qualities that don't exist in any other language or framework:

Industrial systems AI. How do we merge systems programming & AI into a scalable solution? Nobody would claim you're going to build an industrial system, application code, or system frameworks out of Python. We need a modern compiled systems language to do that. That means Swift or Rust. And Swift is the only language with a serious effort to build generalized differentiable computing compiler support at the moment.
Edge based inferencing AND training. There are many applications where backroom training and edge-based inferencing won't get us to where we need to go. We are particularly interested in systems that train themselves at the edge.
Post deep-learning AI. Current frameworks have a very narrow POV. If you don't fit into a tensor shaped homogeneous MATMUL matrix, your problem might be SOL. We are developing a generative platform for autonomous systems. This work is not going to happen with just conventional deep learning frameworks.
Speed. Did I say it was freaking fast? We don't need to worry about the dispatch limitations of TensorFlow, PyTorch, or JAX. Right now differentiable Swift is 2 to 3 orders of magnitude faster at solving our problem sets than those frameworks. Again, this is crucial at the edge.

Hopefully that helps!
Troy

marvin-hansen · August 4, 2023, 10:21am

Hi,

this is a super interesting project and I definitely appreciate all the great work happening here although I am not very familiar with all the details of differentiable programming given so much has happened in the meantime. I know my post is slightly off topic, but please hear me out.

Back in late 2020, I came across the original Google work on Swift for TensorFlow and started experimenting with applying Swift protocols to computational causality. When the Google project was archived in 2021, it wasn't clear to me whether differentiable programming had a future in Swift, so eventually I ported my project to Rust, which isn't as elegant as Swift, but traits in Rust are similar to protocols and support default implementations.

The original Swift Jupyter Notebook with the basic idea is still in the repo

Meanwhile, things have evolved quite a bit towards hypergeometric protocol based causality in Rust, so I just want to share my humble project here because, back in 2020, everything started with protocol based differentiable programming.

You never know who follows your traces when you walk in the sand. Thank you all.
Marvin