Differentiable Swift is an experimental language feature that is currently being pitched in "Differentiable programming for gradient-based machine learning". The original design grew out of the Swift for TensorFlow project at Google, and is now being maintained and advanced by the community. I'm a former member of the Swift for TensorFlow team, and I currently work for a company called PassiveLogic that makes heavy use of differentiable Swift in our physics simulations and control software for buildings (and more).
In reading comments in the pitch thread for differentiable Swift, I realized that we hadn't done a great job lately of publicly documenting the work that has been going on to advance this feature within the Swift toolchain. I'm hoping to start this thread as a place where we can post fixes and other enhancements as they're upstreamed, as well as track and discuss known issues.
Over the past year, significant work has gone into identifying, reproducing, and fixing bugs in differentiable Swift that stood in the way of specific applications. As a result of that effort, we at PassiveLogic are now able to deploy our Swift-based simulation and control software into production. It heavily relies on differentiable Swift to perform gradient descent through strongly-typed physics models. We've seen outstanding results compared to non-gradient-descent optimizers - if you like optimization and haven’t tried out automatic differentiation yet, you should!
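For readers who haven't tried it, here is a minimal sketch of gradient descent through a typed model using the experimental `_Differentiation` module. The `Wall` type, the measurements, and `fitWall` are hypothetical illustrations rather than anything from PassiveLogic's codebase, and the code assumes a toolchain where `import _Differentiation` is available.

```swift
import _Differentiation

// Hypothetical toy model: a wall whose thermal conductance (W/K) is unknown.
struct Wall: Differentiable {
    var conductance: Double
}

// Squared error between predicted and measured heat loss.
// Assumed measurement: 50 W of heat loss at a 20 K temperature difference.
@differentiable(reverse)
func squaredError(_ wall: Wall) -> Double {
    let predicted = wall.conductance * (21.0 - 1.0)
    let residual = predicted - 50.0
    return residual * residual
}

// Plain gradient descent: step the parameter against the gradient.
func fitWall(steps: Int, learningRate: Double) -> Wall {
    var wall = Wall(conductance: 1.0)
    for _ in 0..<steps {
        // The gradient has the same shape as the model (Wall.TangentVector).
        let grad = gradient(at: wall, of: squaredError)
        wall.conductance -= learningRate * grad.conductance
    }
    return wall
}

let fitted = fitWall(steps: 100, learningRate: 0.001)
```

With these assumed numbers the loss is minimized at a conductance of 2.5 W/K, and the loop converges toward that value. The point of the sketch is that the gradient comes back as a typed value with a `conductance` member, not as an untyped vector of parameters.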
I'm merely the one starting this thread, and these patches have been a community effort. Richard Wei, as one of the primary authors of differentiable Swift, has continued to drive the design and stabilize the interface. I'd personally like to thank Dan Zheng, another of the primary authors, for his ongoing advice and reviews. At PassiveLogic, we've partnered with Access Softek, Inc., and Anton Korobeynikov there has done an excellent job in upstreaming patches for many of the issues we've encountered. I'd also like to make special mention of the work that Philip Turner has done to isolate reproducers and to upstream tests and some fixes for issues that he's observed while updating the Swift for TensorFlow APIs for current Swift toolchains.
This is a rough list of outstanding differentiable Swift issues, if you're interested in areas that still may need work. I'll kick off this thread with a general overview of patches that have gone in over the last year or so, and I'd like to continue to update it with issues as they appear and are addressed. I'm hoping this will be a good resource for anyone else in the community who might be interested in following along with the progress of differentiable Swift. We'd also love to engage with community members who are using differentiable Swift in their own projects, or who might be interested in doing so.
I'd like to highlight some of the patches that have gone in over the last year, in a few different categories. Hopefully this will illustrate the work being put in to refine and extend differentiable Swift. I apologize up front if my categorization of the patches is imperfect; this is just a quick grouping of them.
To start with, these are some of the patches that have corrected numerical inaccuracies in gradients:
There are plenty more pull requests than those I've mentioned here, but I wanted to at least start with an overview of the work that has been happening lately. Again, thanks to everyone who has been contributing to the ongoing refinement of differentiable Swift.
The proposal linked above is only for bare-bones differentiation. Differentiability of Standard Library types will be an entirely separate proposal, and we have a long way to go on that front. I don't mean there hasn't been much effort there already. But imagine developing a derivative for almost every API-public function in the Swift Standard Library (or even 10% of it). That task has such a massive scope and has been barely discussed.
Based on past precedents, we have every reason to assume AutoDiff will be perpetually restricted to development toolchains. I hope that's not the case. But PassiveLogic is now the only primary contributor to AutoDiff, with far fewer people than Google's S4TF team. I don't see myself making major contributions to AutoDiff anymore, unless it blocks S4TF from compiling. If a new person in the Swift community came along and contributed, it would really speed things up. Even if you have little or no experience with the C++ side of the Swift compiler (like me a few months ago), there are areas to help. I would even volunteer a great deal of time to guide anyone through getting started with contributing to AutoDiff.
Recently, I have tried to use AutoDiff in Swift Playgrounds. With the recent activity around AutoDiff in the Swift compiler, Swift 5.7 should be the first time S4TF can compile on a release toolchain (using philipturner/differentiation to do so). However, my repo uses a compiler flag that's forbidden on Swift Playgrounds: -parse-stdlib. I tried every workaround possible, but this is a rare situation where I can't bypass a restriction on technology. If you own an iPad, you can't run libraries depending on AutoDiff natively unless there's a Mac tethered to it. In the link below, I explain the problem in much greater detail.
Previously, most users of AutoDiff could get away with it being experimental. That includes Google with the original S4TF, PassiveLogic, me with the new S4TF (mostly), and others in the community. But Swift Playgrounds is the first time I encountered a use case where the only option is getting AutoDiff all the way through Swift Evolution. Does anyone else have a use case that requires AutoDiff no longer being an "experimental" feature?
I'd like to highlight that we are very committed to Swift auto-diff as an organization. For the technology development we are doing in heterogeneous neural nets, deep physics, generalized autonomy, and edge-based compute... there aren't other languages or solutions that would easily fit as alternatives. Swift's infrastructure and feature set still stand alone in the AI world for solutions beyond generic deep learning.
This work is in support of a large team of developers and research scientists pushing the boundaries of Swift as an embedded compute language, platform language, and AI language... and features like auto-diff are key(path) to the team's work. :)
I'd like to note the PassiveLogic team working on Swift AI technology is many times larger than the S4TF team at its peak, with a much larger budget, and very committed backers (and more on the way). Our Swift frameworks development, AI, compiler, and modeling team in total is on the order of 50 people, and we expect that to double in the next year or so. We don't require others to join in the development effort... but we'd love it! If community members are interested in working on Swift compiler infrastructure in auto-diff, key paths & introspection, or the like... let us know.
The difference is we are highly focused on internal technology and product. The good news is we have a killer application for Swift AI (which Google was always in search of, struggling to justify S4TF when it didn't actually have much to do with TensorFlow). The bad news is we've been heads down developing cool stuff, and haven't had much time to be public about what we are doing and organize open source initiatives.
That being said, being more involved publicly in the open source community is a goal. We've recently been adding to the staff to support open source efforts. So expect more of that to happen, not only in the compiler but also with higher-level frameworks.
I'd like to hyperlink the Swift Numerical/ML Working Group to this thread. That would be a great place for people to discuss interest in using Swift for AI and AutoDiff. Personally, I would like to know more about PassiveLogic's plans to make open-source code. We have a meeting planned for tomorrow, open to anyone!
This is amazing progress @Brad_Larson and @Troy_Harvey, congratulations! I'm super curious what you're using for an ML backend - are you still tying into TF, or have you moved onto something more suitable for eager execution?
From my perspective, the major challenge S4TF faced was that the TF runtime was ... not great ... at Eager mode, and S4TF's design was truly great at being eager first. The S4TF team put in tons and tons of effort trying to graph-mode-ify things to work around the runtime issues, work on various alternate runtimes etc. However, the more logical solution (unachievable due to being part of google of course) would have been to use PyTorch or some other more cooperative runtime.
Hey Chris, thanks! As I said earlier, it's not just us, a number of people have been helping to push this forward. On our side, it helped to have a very large differentiable Swift codebase with just about every edge case you can find, so as to fuzz out crashers and numerical issues.
The models we're working on at PassiveLogic are strongly-typed Swift representations of the actual physics at work in buildings and all of their constituent equipment. These very heterogeneous models can train on far less data than a big bag of undifferentiated neurons and provide much safer bounds on control paths, but they don't present as immediately obvious a means of doing massively parallel calculations.
As a result, we're not yet dispatching a lot to accelerators and instead are first pushing hard to improve host-side differentiable Swift performance. I think we've got a decent path towards increasing compiled backwards pass speed of our models by two orders of magnitude via planned toolchain improvements. Most of that should also benefit differentiable Swift that involves accelerators. As one example, at Google we found that a surprising amount of the slowness in Swift for TensorFlow's eager mode around certain models was due to unnecessary zero tangent vectors being materialized and participating in meaningless addition operations. By preventing those additions from being dispatched, we more than tripled the GPU eager mode performance of specific models. I believe handling these zero additions at a more fundamental level would benefit Swift autodiff performance both CPU-side and on accelerators, and that's on our roadmap.
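To make the zero-tangent point concrete, here is a small sketch of the general idea: represent "no contribution" symbolically so that adding it costs nothing, instead of materializing an array of zeros and summing it element by element. The `Tangent` enum is a hypothetical illustration in plain Swift; it is not how the compiler or Swift for TensorFlow implements this.

```swift
// Hypothetical tangent representation with a symbolic zero.
enum Tangent: Equatable {
    case zero              // no contribution; nothing is allocated
    case dense([Double])   // materialized values

    static func + (lhs: Tangent, rhs: Tangent) -> Tangent {
        switch (lhs, rhs) {
        case (.zero, _):
            return rhs     // skip the addition entirely
        case (_, .zero):
            return lhs     // skip the addition entirely
        case let (.dense(a), .dense(b)):
            // Only here is real element-wise work performed.
            return .dense(zip(a, b).map { $0 + $1 })
        }
    }
}

let upstream = Tangent.dense([1.0, 2.0, 3.0])
let accumulated = Tangent.zero + upstream + Tangent.zero
```

In this sketch `accumulated` is just `upstream` again, and no element-wise additions ran to produce it. On an accelerator, each skipped addition would also be a skipped kernel dispatch.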
That said, we absolutely do want to take advantage of accelerators with our models, and have been exploring ways to do that. We definitely have the benefit of the lessons learned from Swift for TensorFlow and many other frameworks as we figure out what design would make the most sense for our needs.
Just to augment Brad's response: our implementation is pure Swift. We are taking a new approach, which is pretty different from deep learning, where models are monolithic. We use runtime-composable model fragments (like you would want to in any code base) that use typed interfaces to connect the pieces into an application model. Because these model fragments are pre-trained, the user, after composing their application, ends up with a largely pre-trained, complete application model. So our final training costs are very low. Then, secondarily, we continuously learn as we inference in the application.
These composable pre-trained kernels define the existential behavior of "things". Behaviors define the underlying requirements of actors (not the Swift kind), and actors can be composed into multiple higher level types: components, equipment, assemblies, sub-systems, systems, etc.
We have an in-house stack that handles the different layers in this process, and several different user applications for building, exploring, and composing the models. At runtime we compile these kernel functions, and link them into a runtime, and inference. We've been focused on CPUs first, but now that we have the compiler and frameworks integrating, we will be starting on accelerators. A few notable things on that front:
Dispatch. Our hierarchical typed networks give the dispatcher multiple levels of granularity to pick from, starting with flat graphs, to behavior graphs, to component graphs, to system graphs, etc. One of the challenges of a TensorFlow-like approach is that you have flat tensor graphs and must try to heuristically aggregate to make dispatch more efficient, and of course refining the heuristics for coalescing operators is an endless pursuit. We can pick the level of dispatch graphs from our recursively hierarchical format.
Accelerators. While tiling these graphs on GPUs will see some large gains, we see an eventual need for developing silicon that is built for graph processing and MIMD operations to get the most out of the work we are doing.
As far as current focus goes, the team is working on a bunch of ambitious goals:
AI Frameworks. We are working on Swift differentiability together with a group of AI frameworks that build out "4 legs of the stool": Navigation (deductive/inductive graph inferencing), Introspection (reflection, mutation, lensing, meta-graphing), MetaInferencing (abduction, runtime latent inferencing, constraints, etc), and Solving (chaining, competitions, generative learning, graph dispatch, distributed cluster management, multi-graph tie-ups). We have several assets we are targeting for open source here.
Digital Twins. We've been developing a computable digital twin language called Quantum. This is being open sourced, with many industry partners. It is a physics-based digital twin graph encoding that is the underpinnings of the AI frameworks above. It is broad in its ambitions to describe and compute real world things and how they interact in a generalized way.
Autonomous Platform. Our platform team is building on top of these frameworks for real-time autonomy, automation, sensor fusion, and I/O.
Edge Hardware. Our hardware team is building the edge compute platforms that run the whole stack.
User Software. Our user software team is building tools that make AI accessible to real people (not just developers). These tools enable engineers to make digital twins, and enable end customers to build their own custom autonomous systems and AI-as-a-service queries.
When you talk about GPU acceleration, will it just be CUDA (NVIDIA Jetson), or also OpenCL? I'm not a big fan of NVIDIA for making GPGPU synonymous with "NVIDIA-only" pretty much everywhere. It took a while to open up TensorFlow and PyTorch to Apple GPUs, but then it's still restricted to a small subset of platforms for the average consumer. Are you planning to use OpenCL anywhere for kernels? If so, the SwiftOpenCL repository I'm currently planning/prototyping might be insightful.
I have a diagram of how this fits into hardware-accelerating Swift numerical computing frameworks. OpenCL is a key component of that. I plan to present this to the Swift numerics working group that was recently established, and the "Swift Matrix Library" is a framework we are planning for the linear algebra capabilities that S4TF lacks.
@Chris_Lattner3 I share the feeling that TensorFlow's C++ code base is not that flexible, but the work Google put into X10 made it easier to generate graphs from eager-like code. Rather, I see CTensorFlow as problematic because it can't compile for iOS (except for inference-only with TFLite) and PluggableDevice only works when you're going to build the Python bindings. Making a more cooperative backend from scratch allows it to be compiled much faster, and through SwiftPM instead of Bazel. Also, since the new backends are written in Swift, they should be more maintainable and have a lower barrier to entry for anyone wanting to contribute. What are your thoughts on this - does that "lower barrier to entry for contributors" align with the vision you had for Swift when initially creating the language?
To add my 2 cents to the discussion: while developing SwiftFusion, one of the biggest problems we had was also with the performance of the Tensor interface - it is super slow, and I remember Marc and Dave writing a few fixed-size Matrix types, even without any BLAS backend, which boosted the performance tenfold. From my perspective, this is actually due to a suboptimal distribution of work: the basic tensor type should just be a plain data type like a numpy array, and the tensor fusion (XLA) stuff should be operating on a different type that can interact with plain tensors.
Other problems include the lack of const generics, so it is not easy to make fixed-size matrices of arbitrary dimensions (this is mitigated by using gyb). We also encountered some problems involving associated types, which should be a lot better now that Swift 5.7 includes primary associated types.
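As an illustration of what a fixed-size matrix type looks like, here is a minimal sketch. `Matrix2x2` is a hypothetical example, not SwiftFusion's actual type: the dimensions are baked into the stored properties, so there is no heap allocation or shape checking at runtime. It also shows why the lack of const generics hurts, since each size needs its own hand-written (or gyb-generated) type.

```swift
// Hypothetical fixed-size matrix: four stored scalars, no heap storage.
struct Matrix2x2: Equatable {
    var m00, m01, m10, m11: Double

    static let identity = Matrix2x2(m00: 1, m01: 0, m10: 0, m11: 1)

    // Matrix product written out by hand for the fixed 2x2 shape.
    static func * (a: Matrix2x2, b: Matrix2x2) -> Matrix2x2 {
        Matrix2x2(
            m00: a.m00 * b.m00 + a.m01 * b.m10,
            m01: a.m00 * b.m01 + a.m01 * b.m11,
            m10: a.m10 * b.m00 + a.m11 * b.m10,
            m11: a.m10 * b.m01 + a.m11 * b.m11
        )
    }
}

let rotation = Matrix2x2(m00: 0, m01: -1, m10: 1, m11: 0)  // 90 degree rotation
let halfTurn = rotation * rotation
```

Here `halfTurn` is the 180 degree rotation, with -1 on the diagonal. A 3x3 or 4x4 version would repeat the same pattern with more fields, which is the boilerplate that gyb templates generate.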
Swift is currently also not able to generate bare-metal code, despite the fact that LLVM can emit PTX and others. This is not too big of an issue though, as all the other ML frameworks rely on manually tuned kernels. Also, if we can do JIT, emitting kernels at runtime with LLVM is definitely a great way to improve performance.
Finally, I still see Swift as a very potent competitor in the market of scientific computing - while SwiftFusion cannot be called optimized in any way, it already has performance figures largely on par with optimization frameworks written in C++ using Eigen. I am looking forward to the next release :)
@fan a lot of the performance concerns you outlined have been discussed by the Swift Numerics Working Group on Slack. We are planning to create a Swift linear algebra library that compiles down to native code and has LLVM optimizations, eliminating the multiple layers of indirection present in S4TF’s Tensor. Would you mind hopping on the Slack channel (posted on the Numerical/ML working group thread started by @Troy_Harvey) so we can discuss making SwiftFusion a client library? Hearing about your experience with CPU-side performance would also help us figure out our priorities.
Also, the discussion on this thread is veering off from its intended purpose of discussing AutoDiff progress. How about future participants try to move discussions to the SNWG’s Slack (when appropriate)? We have a thread titled “autodiff” which could host discussions related to automatic differentiation.
Very cool, makes a ton of sense. Given full control over the stack, I can see how this would be a very nice design! Just to be clear, I wasn't trying to suggest that you "should" use GPUs or accelerators (I don't know enough about your domain etc), I was just saying that the forced tie-in to TensorFlow and XLA was a huge problem, and it seems like one you've deftly solved by ignoring it and building your own thing.
I mean, that "easier to generate graphs" is really the problem - you shouldn't want that. There is nothing inherently better about "graphs" for a differentiable-Swift-like approach. Once you have a fast host language (unlike Python) there is no reason that a "graph interpreter" is faster than the dynamic host program. You might as well be eager mode all the way, and get the benefits of dynamism etc. This allows Swift to be a great CPU language and compose with the offload approach of your choice (or not bother).
Good question. Our toolchain can be thought of as a superset of what Modelica tries to accomplish. There are several pieces in PassiveLogic's frameworks that work together to solve physics simulations, but offer a deep-learning-like approach to physics solving. Many of these are slated for open source, and Marin on our team will be leading. These parts are:
Quantum - A digital twin graph language. This is open sourced. We have a dozen large industrial, technology, and building partners (Nvidia, Belimo, Brookfield, the Department of Energy, PNNL, etc), and a growing roster of partners. You can think of this as encoding physics similar to Modelica, but in a graph structure, and organized as an existential ontology ("Who am I", "what do I do", "why do I do it", etc).
Quantum Solver - the Swift compute engine that solves graphs, physics, simulations, and AI problems with regard to graph network & GNNs.
IntrospectionKit - the Swift library that builds on keypaths to provide general object queries, metadata introspection, generators & caching, broadcasting, Lenses, Lens pipes (telescopes), and metagraph support.
Differentiable Swift Compiler - provides industry leading differentiation support for arbitrary code.
Apropos of your comment, I just had an inbound collaboration request from Berkeley National Labs to work with them on Quantum & Differentiable Swift interop with Modelica. This has been on our long-term roadmap, but this could make it more of a near-term initiative. If this is a topic area you'd be interested in, let me know. Berkeley drives much of the Department of Energy management on model predictive control, building & energy simulation tools, and the like.
We have on our roadmap to bring the nicely designed S4TF NN APIs back to life. But it hasn't been a high priority since we are focused on graph compute. For the same reason general purpose Matrix and BLAS hasn't been our current focus. Algebraic graph compute is more flexible and generic, but not as optimized for the tensor use cases.
One interesting data point we have is a specialized thermodynamic simulator we've written. We had a fast, hand-tuned C version. Our Swift version as of 3 years ago ran at 98% of the speed of C fixed arrays, using Swift contiguous array types. Then we optimized it further using Swift SIMD types, for a 3X speed gain over the C version. Recently we tested a standard Swift array version, and it equalled the performance of the hand-tuned Swift SIMD version. Compiler improvements have replaced months of custom optimizations. Not that this eliminates the need for fixed array types, but for CPU dispatch the compiler is now doing much of the heavy lifting.
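For readers unfamiliar with the two styles being compared, here is a small sketch of the same update written against a standard `Array` and against a standard library SIMD type. The functions and the temperature-update formula are hypothetical illustrations, not the simulator described above, and the sketch makes no performance claim of its own; it only shows what the hand-vectorized form looks like next to the plain loop that the optimizer may auto-vectorize.

```swift
// Plain array version: a simple loop the optimizer may auto-vectorize.
func stepArray(_ temps: [Double], heatFlow: [Double], dt: Double) -> [Double] {
    precondition(temps.count == heatFlow.count)
    var result = temps
    for i in result.indices {
        result[i] += dt * heatFlow[i]
    }
    return result
}

// Hand-written SIMD version: four lanes updated with one expression.
func stepSIMD(_ temps: SIMD4<Double>, heatFlow: SIMD4<Double>, dt: Double) -> SIMD4<Double> {
    temps + dt * heatFlow
}

let arrayResult = stepArray([20, 21, 22, 23], heatFlow: [1, 2, 3, 4], dt: 0.5)
let simdResult = stepSIMD(SIMD4(20, 21, 22, 23), heatFlow: SIMD4(1, 2, 3, 4), dt: 0.5)
```

Both versions compute the same four values. The array version has no fixed width, so it stays readable and general while leaving vectorization to the compiler.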
You'll need to clarify this more. What I'm doing is bringing S4TF back to life, hosting active repositories in the s4tf GitHub organization. That statement sounded like you are doing the same thing that I am. Do you mean to bring the TensorFlow Swift package itself back to life, or just a neural network API that's extremely similar to S4TF? If there's any overlap in our goals, perhaps my work on S4TF could contribute to the new NN framework that you are creating. It would be nice for my work to get out of the status of being "unofficial" or "a lost cause" not endorsed by Google, and have it be part of something official that's truly going to be used by a lot of people.
Thanks for the feedback, Chris. Accelerators are definitely going to be important; it was just first things first for getting all of the infrastructure in place. Simultaneous compiler, framework, and application development was an interesting juggle...
I'm interested in any input you have on the accelerator front, and tooling for it. We will be announcing a fun partnership in the next few weeks with one of the big accelerator companies, but things are just in the planning phase.
I was surprised when you said that, given that Google invested a lot of time making X10 and collaborating with PyTorch on the LazyTensor implementation. The MLIR graph compiler (which evolved from XLA) speeds up machine learning measurably, enough for Apple to use it inside Metal Performance Shaders Graph. Even though you're very experienced with compilers, I strongly doubted the quoted statement's veracity. It turns out that the S4TF graph interpreter only decreases computation time by 30-40% on average. This is just one example of ML, but PyTorch has survived for very long with eager mode as the default.
This explanation about graph optimizations is fairly long, so I'm condensing it into a drop-down.
It seems that LazyTensor (the graph interpreter) is tied to TPUs more than anything else. For these accelerators, you cannot dispatch operations eagerly. The XLA instruction set (which LazyTensor revolves around) was tailor-made for TPUs, not for CPUs and GPUs. In an old S4TF Colab notebook, they reduced execution time on CPU/GPU by 75% (a 4x speedup: 1 - 1/4 = 75%), which is impressive. But that's a best case, and the table above represents a more general case.
The improvement is modest, but why does the program run faster at all? My explanation pertains to ML, but I'm interested in whether the same rules apply to PassiveLogic's work with graph construction. When you have a chain of pointwise operators, like the Mish activation function, you dispatch multiple unique primitive operators to a backend (tanh, multiply, exponent, etc). For each operator, you incur overhead from calling into the GPU driver, and from storing tensors in RAM between each operator. In graph mode, a lot of those inefficiencies are optimized away. Sequences of unary operators can be "fused", or lowered down into GPU shader code. The big benefit doesn't come from creating native GPU shaders, but instead that data is staying in GPU registers between each operator.
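Here is a CPU-side sketch of the fusion idea, using Mish, which is `x * tanh(ln(1 + e^x))`. The function names are hypothetical and the arrays stand in for GPU buffers: the unfused version runs one pass per primitive and writes an intermediate array each time, while the fused version does all the work for an element before moving on, so nothing intermediate is stored.

```swift
import Foundation

// Unfused: one pass (and one temporary array) per primitive operator,
// analogous to dispatching exp, log, tanh, and multiply separately.
func mishUnfused(_ x: [Double]) -> [Double] {
    let exponentials = x.map { exp($0) }
    let softplus = exponentials.map { log(1 + $0) }
    let squashed = softplus.map { tanh($0) }
    return zip(x, squashed).map { $0 * $1 }
}

// Fused: a single pass; intermediates live only in local values,
// analogous to data staying in GPU registers between operators.
func mishFused(_ x: [Double]) -> [Double] {
    x.map { value in
        value * tanh(log(1 + exp(value)))
    }
}

let inputs: [Double] = [-2.0, -0.5, 0.0, 0.5, 2.0]
let unfusedOutput = mishUnfused(inputs)
let fusedOutput = mishFused(inputs)
```

Both functions apply the same operations in the same order to each element, so their outputs match. The difference is only in how many times memory is traversed and how many temporaries are allocated.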
This graph optimization is possible because you know whether a tensor is temporary. For example, the output of the "tanh" function is consumed by the "multiply" function. In Swift, ARC makes the output of "tanh" deallocate while calling into "multiply". I am working on a GPU backend for S4TF that can utilize this feature of ARC, applying graph optimizations on-the-fly. It exploits the delay between encoding and execution on the GPU to minimize driver overhead, maximizing sequential throughput*. Instead of encoding operators immediately, it stores them in a massive pipeline. The backend can scan up to 100 eagerly dispatched enqueued operations and apply graph optimizations before sending them to the GPU. A good analogy is out-of-order execution on a CPU, which scans a stream of instructions (reorder buffer) long before they execute, then does cool optimizations like vectorization. All of this happens under-the-hood without requiring things like LazyTensorBarrier() in the frontend.
*To clear up possible confusion, these are two separate optimizations. The first reduces driver overhead by two orders of magnitude, letting you run more primitive operations per second. The second is the opportunity to apply graph optimizations. The second optimization is possible because of how the first optimization is implemented.
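The deferred pipeline described above can be sketched in a few lines. `DeferredQueue` is a hypothetical toy, not the actual S4TF GPU backend: operations are recorded when called instead of being executed, and when the result is needed the whole backlog runs as one fused pass. The `launches` counter stands in for the number of kernel dispatches.

```swift
// Hypothetical toy pipeline: records unary operations, executes them lazily.
final class DeferredQueue {
    private var pending: [(Double) -> Double] = []
    private(set) var launches = 0   // stand-in for GPU kernel dispatches

    // "Eager" call site: returns immediately, nothing is executed yet.
    func enqueue(_ operation: @escaping (Double) -> Double) {
        pending.append(operation)
    }

    // Runs every queued operation in a single pass over the data.
    func flush(_ input: [Double]) -> [Double] {
        guard !pending.isEmpty else { return input }
        let operations = pending
        pending.removeAll()
        launches += 1               // one fused launch instead of one per operation
        return input.map { element in
            operations.reduce(element) { value, operation in operation(value) }
        }
    }
}

let queue = DeferredQueue()
queue.enqueue { $0 + 1 }
queue.enqueue { $0 * 2 }
queue.enqueue { $0 - 3 }
let output = queue.flush([1.0, 2.0])
```

Three operations were enqueued but only one launch was counted. A real backend would also need to track which tensors are temporary and cap the queue depth, which this sketch leaves out.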
This also solves the problem of compiling control flow in X10. Because the graph optimizations take negligible time to apply, you can unroll a massive loop and re-apply the optimization on every iteration. You also don't have to wait for the XLA compiler to take an extremely long time to process your graph. The first few operations execute almost immediately, and graph optimizations start happening when the operator queue gets backlogged.
Earlier on this thread, I tried steering discussion away from tangential topics, and back to its main purpose: AutoDiff. This comment is going very far down a tangent, but here's my reasoning: Swift's AutoDiff meant you no longer had to construct a graph to apply automatic differentiation. Now, you don't even need to construct a graph to perform optimizations. Graphs have become basically obsolete for ML, unless you're using a TPU or some highly optimized setup like multi-GPU, which the average person doesn't have access to.
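As a small illustration of differentiating without building a graph, the sketch below takes the gradient of an ordinary Swift function containing a loop. The function name is a hypothetical example, and the code assumes a toolchain where the experimental `_Differentiation` module is available.

```swift
import _Differentiation

// Ordinary Swift control flow; no graph or tracing step is constructed by hand.
@differentiable(reverse)
func cubeByLoop(_ x: Double) -> Double {
    var result = x
    for _ in 1..<3 {
        result = result * x   // runs twice, so result ends up as x * x * x
    }
    return result
}

// The compiler generates the derivative of the loop body directly.
let slope = gradient(at: 3.0, of: cubeByLoop)   // d/dx x^3 = 3x^2 = 27 at x = 3
```

The derivative is produced at compile time from the function's own code, so the loop is differentiated as a loop rather than being unrolled into a recorded graph first.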
TL;DR - Because of Swift's unique characteristics as a host language, you can apply the optimizations that make graph mode fast without having an actual graph interpreter.