Differentiable programming for gradient-based machine learning

@Brad_Hilton the post is currently blocked by the spam filter, but I'm trying to gain momentum for a resurrection of Swift for TensorFlow.


@philipturner You probably won't, as it seems the S4TF team had no problem with Swift as a language or with community support; it was some Google vs. Apple concern. [Considering Apple is closing everything inside its own ecosystem, we can speculate about concerns over Swift's future, or a lack of interest from Apple in deep learning, since Apple machines to date lack proper modern machine learning capabilities compared to NVIDIA and even AMD.]

Here is something from the S4TF repo:

Thanks for the question @Willian-Zhang and we're also sorry that this project has been archived. I won't directly disagree with @ProfFan 's assessment, but I will say that the decision to archive this project was not due to a lack of interest or resources at Google, but was unfortunately due to concerns outside Google's control. Alas, I believe that's all that can be said at this point. Sorry!

You probably won't, as it seems the S4TF team had no problem with Swift as a language or with community support; it was some Google vs. Apple concern.

I know that the S4TF team was very dedicated to their work. I'm working to resurrect S4TF precisely because it's supposed to be impossible to overturn Google's decision. @BradLarson can attest to pulling off something "impossible". I am hoping that community support can help spread the word about this and bring in more contributors.

a lack of interest from Apple in deep learning, since Apple machines to date lack proper modern machine learning capabilities compared to NVIDIA and even AMD

Apple added ML hardware acceleration to both the CPU (A13 - AMX) and the GPU (A14/M1 - simdgroup_matrix), just like NVIDIA added tensor cores and Intel's Xe GPUs have matrix cores. At the M1 event, Apple made a big deal about the M1 having Python TensorFlow acceleration. The CPU acceleration is not documented because it's an extension to the ARM instruction set, which licensees of ARM's CPU designs are not officially allowed to add.

And for reference, NVIDIA only added hardware acceleration for ML with the RTX 20-series (circa 2018). The push for ML hardware acceleration is a very recent phenomenon.


Good luck with that, I hope you can bring S4TF back to life :smiley:

NVIDIA has had CUDA (and cuDNN) for what, 13 years? I'm not talking about pure on-device compute accelerators, as GPUs have been good enough for DL training for the last 6 years with proper software support (at least as far as I can remember :slight_smile: ).

Just saying, as a person who works on deploying DL models in the Apple ecosystem: there are enormous challenges for those of us who want performance comparable to even low/mid-range modern AMD/NVIDIA GPUs while using Apple devices for inference (training is almost impossible unless models are extremely small). When performance is a must-have, deploying on macOS is just not possible at the moment, and the performance gap has been growing over the last few GPU generations.

I love Swift, but it is cursed (for us ML/DL folks) by having Apple as the main force pushing it forward. Outside of the small Apple-focused bubble there is a VERY small community, and I'm pretty sure the deep learning community won't grow for years after the drop of S4TF (which was surely the peak of Swift for ML). Seeing that our open source community is extremely small compared to other modern languages (Rust, Julia, Kotlin, even Nim and Elixir seem to have bigger ML communities), I don't see how we can convince people from outside to give it another try :confused:

But if you have a clear path, good luck with that, and I hope it'll be possible for me to pick up Swift without feeling guilty about wasting my time :smiley:

We're going for a highly optimized Metal backend that could perform on par with Apple's Python TensorFlow. We'll also add ops Apple's backend doesn't support, including 3D convolutions and FFTs. Swift definitely could become a contender to Python again in the ML community if the resurrection succeeds.


Let's first check whether Swift can even be a contender to Julia; there is a long way ahead before aiming at Python :smiley:

About the Metal backend: yes, it's cool and all, but the M1 GPU is still not worth using for most modern DL models, as training would just be too slow. Even inference can be slow compared to much cheaper NVIDIA devices. For example, top performance on Python TF with Metal:

ResNet-50, M1 Max 32-core, batch size 128: 140 img/sec

RTX 3090, basic fine-tuned PyTorch: 2584 img/sec

That's an 18.5x difference, and ResNet-50 is still a small model in terms of size.

For the price of one Apple device I can get 1.5x 3090s even at current prices, and 2.5x at MSRP.

Maybe in the next few years they can catch up, if NVIDIA doesn't improve by much :confused:

Still, if we're going for a highly optimized Metal backend, it's already over for most of the deep learning community, as CUDA is the name of the game for almost everyone outside Google and TPUs.

Could you run the NVIDIA GPU in Python TensorFlow as well? Different frameworks might have different performance. Also, try ensuring that the NVIDIA GPU runs in 32-bit IEEE single precision instead of TF32 or BF16. Apple's backend might be underutilizing the GPU, so a more direct comparison would determine why there's such a massive difference.

In theory, running the test as I described above would shrink the gap to a factor of 2.5 to 3.5.

I mean, we should compare real performance, not some imaginary peak. TF32 is always used when possible in PyTorch; I can confirm it's TF32. (Float16 has even higher performance, of course.)

There is a repo comparing max-tuned TF-Metal to untuned TF: GitHub - tlkh/tf-metal-experiments. You can compare results there.

I'm trying to figure out if there's a problem with MPS that I can solve in the S4TF backend. The GitHub repository might not have given the two GPUs an accurate comparison (there could be other factors they didn't realize were affecting performance), which makes it difficult for me to draw conclusions for the purpose of optimization.

@philipturner Would you mind moving this discussion to the Related Projects forum? We'd like to keep this thread focused on autodiff. Thanks!

Sorry for the spam, guys; feel free to move my posts to some other discussion topic :stuck_out_tongue:

@rxwei I have one question: is there a timeline for adding autodiff into a stable Swift release (5.7, maybe?), or is it still too early to tell?

We are working on that right now. Stable autodiff is a requirement before we take any action to resurrect S4TF. There are plenty of bugs, but they should be fixed in a month or two once I start devoting time to the effort.

We (Apple) do not yet have a timeline for autodiff to share, but if members from the community like @philipturner would like to help push forward the implementation, I'm more than happy to provide guidance. Tackling bugs that are blocking third-party clients would be a great starting point, but note that in order for autodiff to be eligible for Swift Evolution, we still need to:

  • Complete the ABI [0]
  • Fix any issues when compiling with library evolution
  • Reorganize the _Differentiation module [1]
  • Support differentiating additional language constructs such as _modify [2] and @_alwaysEmitIntoClient functions [3]
  • Support specifying availability on differentiability [4]
  • Most importantly, discover more major use cases
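For readers who haven't tried the feature: the reverse-mode core described above is already usable today. Here is a minimal sketch, assuming a development toolchain that ships the experimental `_Differentiation` module (the function and variable names are my own, not from the implementation):

```swift
import _Differentiation

// A function the compiler differentiates in reverse mode.
@differentiable(reverse)
func loss(_ w: Double) -> Double {
    (w - 3) * (w - 3)
}

// gradient(at:of:) runs the compiler-generated pullback with a seed of 1.
let g = gradient(at: 0.0, of: loss)  // d/dw (w - 3)^2 at w = 0 is -6
```

Everything here is compiler-generated; no tape or graph-building library is involved, which is the point of doing autodiff at the language level.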

@rxwei from what @BradLarson told me, it’s fine to proceed with work on S4TF once at least VJP works 100% fine. However, will we also need JVP and linear and transpose maps in order to get it fully merged like @machineko is asking about?

I don't want to go into discussing clients outside of the Swift open source project here, but for autodiff itself to be eligible for Swift Evolution, we do not need to implement transposition or forward-mode differentiation that was described in the manifesto. The initial implementation should be in line with the proposal in this thread, that is, reverse-mode differentiation (based on VJPs) that we have already implemented today.
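For readers unfamiliar with the terminology: a VJP (vector-Jacobian product) is the building block of reverse mode, and the implementation also lets you register one by hand. A sketch, again assuming the experimental `_Differentiation` module (names are illustrative):

```swift
import _Differentiation

func square(_ x: Double) -> Double { x * x }

// A hand-written reverse-mode derivative (VJP) for `square`.
// The pullback maps an upstream cotangent v to v * d(square)/dx.
@derivative(of: square)
func vjpSquare(_ x: Double)
    -> (value: Double, pullback: (Double) -> Double) {
    (value: x * x, pullback: { v in 2 * x * v })
}

// gradient(at: 3.0, of: square) applies the pullback to a seed of 1,
// yielding 6.0.
```

Custom VJPs like this are how non-differentiable primitives (C libraries, hand-optimized kernels) get plugged into the differentiable world.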


For more major use cases, I think we should look at physics. My AR demo was about that.

I spent two months in summer 2020 creating a simulation called MultiPendulum. It required solving a system of linear equations while taking the derivative of the solve operation. Very complex; it ended up with O(n^4) algorithmic complexity and a lot of hand-tuned optimizations. It took days to come up with that algorithm.

That specific derivative example isn't something Swift autodiff could have optimized on its own, but it shows how ingrained derivatives are in physics simulations. The O(n^4) part was for finding partial derivatives of momentum with respect to all the other pendulums. I was also using adaptive-step Runge-Kutta, where momentum and position were treated as distinct entities, with their derivatives used to find their change each time step. There was also an error correction function computed from a division of differentials of the Hamiltonian and something else.
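To make the pattern concrete, here is a toy sketch (not the MultiPendulum code) of getting Hamilton's equations from autodiff for a small-angle single pendulum, assuming the experimental `_Differentiation` module:

```swift
import _Differentiation

// Small-angle single pendulum with unit mass, length, and gravity:
// H(q, p) = p^2/2 + q^2/2, where q is position and p is momentum.
@differentiable(reverse)
func hamiltonian(_ q: Double, _ p: Double) -> Double {
    p * p / 2 + q * q / 2
}

// Hamilton's equations: dq/dt = ∂H/∂p, dp/dt = -∂H/∂q.
// One gradient call yields both partials at once.
let (dHdq, dHdp) = gradient(at: 0.5, 0.2, of: hamiltonian)
let qDot = dHdp   // 0.2
let pDot = -dHdq  // -0.5
```

Hand-deriving these partials is trivial here, but for a chain of n coupled pendulums the same one-liner replaces pages of error-prone calculus.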

This post is very long, but I just mentioned three unique examples of derivatives in my simulation. If I could rebuild some of the higher-level components of the physics simulation part of AR MultiPendulum and push an update to the App Store, that would constitute a new application. We could also apply autodiff to an entirely different physics simulation, as there are many examples out there to choose from.


Very cool feature... vs very niche use...

I wish there was a formula we could use to formally assess features, e.g. "a feature of coolness X that requires added language complexity Y and will be used in Z percent of real-world apps; combine X, Y, and Z into an overall mark N and check N against a threshold T; if it's bigger, the feature should be included in the language."

One possible way forward is to release the cut-down feature in library form (whatever is expressible without language support), ship it, and see how well it is received. Then, based on feedback and frequency of use, make an executive decision on whether it's worth including as a language feature.

As a question from a potential user: Does the “no timeline to share” mean that things will move slowly in the worst case, and we don’t have to fear that autodiff features are going to break when other Swift features evolve?

The background is that I am currently evaluating for my research lab whether we should bet on the unique selling point of "autodiff for a compiled language" for some of our projects. We could probably live with the current rough state for some time, but breakage of features would be fatal. To some extent, this is a chicken-and-egg problem: we could potentially provide some visible use cases, but I am uncertain whether I should really invest at this time.


@binderh could you give more specific detail on which projects you think autodiff could be used for?


In machine learning, combining neural networks with models that impose some structure, such as differential equations, is currently a big topic. Among many other application domains, this is very useful in biomedicine for incorporating knowledge about the processes to be modeled. Yet differential equations are only one example of a suitable model class, and probably many others will become available in the next few years.

One limitation on more flexibly incorporating model classes into neural networks is that specification in PyTorch can be challenging, and a real differentiable programming language is needed. In addition, other frameworks typically require that estimation work without updating array elements (see, e.g., the "mutating arrays not supported" topic in the Julia/Zygote community, but also the corresponding issue with JAX). This is a limitation that Swift autodiff doesn't seem to have.

Thus a much broader class of models would potentially be feasible with Swift, and we have several biomedical examples (ranging from single-cell data to large clinical registries) where this could be demonstrated. In addition, implementations in Swift can be compiled into libraries that can easily be deployed into R, Python, Julia, or even web/WASM-based workflows.
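As a small illustration of that difference, here is a sketch of differentiating through in-place updates in a plain loop, the kind of pattern Zygote and JAX reject by default. It assumes the experimental `_Differentiation` module, and the function names are my own:

```swift
import _Differentiation

// Explicit Euler for the ODE dy/dt = -k*y, written with ordinary
// mutation of a local variable inside a loop.
@differentiable(reverse)
func eulerDecay(rate k: Double) -> Double {
    var y = 1.0
    for _ in 0..<10 {
        y += -0.1 * k * y  // in-place update, differentiated by the compiler
    }
    return y
}

// Sensitivity of the final state with respect to the decay rate k.
let dydk = gradient(at: 0.5, of: eulerDecay)
```

Because differentiation happens at the language level, the loop and the mutation are handled by the compiler rather than rejected by a tracing framework, which is exactly what structured scientific models tend to need.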