Differentiable programming for gradient-based machine learning

We're going for a highly optimized Metal backend that could perform on par with Apple's Python TensorFlow. We'll also add ops Apple's backend doesn't support, including 3D convolutions and FFTs. Swift definitely could become a contender to Python again in the ML community if the resurrection succeeds.


Let's first check whether Swift can even be a contender against Julia; there is a long way to go. Then maybe aim at Python :smiley:

About the Metal backend: yes, it's cool and all, but the M1 GPU is still not worth using for most modern DL models. Training would be just too slow, and even inference can be slow compared to much cheaper NVIDIA devices. For example, here is the top performance on Python TF Metal:

ResNet50, M1 Max (32-core GPU), batch size 128: 140 img/sec

3090, basic tuned PyTorch: 2584 img/sec

That's an 18.5x difference, and it's still a small model in terms of size.

For the price of one Apple device I can get 1.5 3090s even at current prices, and 2.5 at MSRP.
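
As a rough sanity check on those numbers, here is the arithmetic as a small Swift sketch. The throughput figures and price ratios are the ones quoted above; the variable names and the "throughput for the same money" framing are only an illustration, and it assumes the two benchmarks are comparable and that multiple cards scale perfectly, neither of which is guaranteed.

```swift
// Throughput figures quoted above (ResNet50, images per second).
let m1MaxImagesPerSecond = 140.0
let rtx3090ImagesPerSecond = 2584.0

// How many 3090s the price of one Apple device buys, as quoted above.
let cardsPerAppleDeviceCurrent = 1.5
let cardsPerAppleDeviceMSRP = 2.5

// Raw throughput ratio between the two devices.
let speedup = rtx3090ImagesPerSecond / m1MaxImagesPerSecond

// Throughput ratio for the same money (hypothetical framing that assumes
// perfect scaling across multiple cards).
let perPriceCurrent = speedup * cardsPerAppleDeviceCurrent
let perPriceMSRP = speedup * cardsPerAppleDeviceMSRP

// Round to one decimal place without needing Foundation.
func roundedToTenth(_ x: Double) -> Double { (x * 10).rounded() / 10 }

print(roundedToTenth(speedup))          // 18.5
print(roundedToTenth(perPriceCurrent))  // 27.7
print(roundedToTenth(perPriceMSRP))     // 46.1
```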

Maybe in the next few years they can catch up, if NVIDIA does not improve by much :confused:

Still, if we are going for a highly optimized Metal backend, it's already over for most of the deep learning community, as CUDA is the name of the game for almost everyone outside Google and TPUs.

Could you run the NVIDIA GPU in Python TensorFlow as well? Different frameworks might have different performance. Also, try ensuring that the NVIDIA GPU runs in 32-bit IEEE single precision instead of TF32 or BF16. Apple's backend might be underutilizing the GPU, so a more direct comparison would determine why there's such a massive difference.

In theory, running the test the way I described above should bring the performance difference down to a factor of 2.5 to 3.5.

I mean we should compare real performance, not some imaginary peak. TF32 is always used when possible in PyTorch, and I can confirm this run used TF32. (Float16 has even higher performance, of course.)

There is a repo comparing max-tuned TF-metal to untuned TF (GitHub - tlkh/tf-metal-experiments, at pythonrepo.com); you can compare results there.

I'm trying to figure out if there's a problem with MPS that I can solve in the S4TF backend. The GitHub repository might not have given the two GPUs an accurate comparison (there could be other factors they didn't realize were affecting performance), which makes it difficult for me to draw conclusions for the purpose of optimization.

@philipturner Would you mind moving this discussion to the Related Projects forum? We'd like to keep this thread focused on autodiff. Thanks!

Sorry for the spam, guys. Feel free to move my posts to some other discussion topic :stuck_out_tongue:

@rxwei I have one question: is there a timeline for adding autodiff to a stable Swift release (5.7 maybe?), or is it still too early to tell?

We are working on that right now. Stable autodiff is a requirement before we take any action to resurrect S4TF. There are plenty of bugs, but they should be fixed in a month or two once I start devoting time to the effort.

We (Apple) do not yet have a timeline for autodiff to share, but if members from the community like @philipturner would like to help push forward the implementation, I'm more than happy to provide guidance. Tackling bugs that are blocking third-party clients would be a great starting point, but note that in order for autodiff to be eligible for Swift Evolution, we still need to:

  • Complete the ABI [0]
  • Fix any issues when compiling with library evolution
  • Reorganize the _Differentiation module [1]
  • Support differentiating additional language constructs such as _modify [2] and @_alwaysEmitIntoClient functions [3]
  • Support specifying availability on differentiability [4]
  • Most importantly, discover more major use cases

@rxwei from what @BradLarson told me, it’s fine to proceed with work on S4TF once at least VJP works 100% fine. However, will we also need JVP and linear and transpose maps in order to get it fully merged like @machineko is asking about?

I don't want to go into discussing clients outside of the Swift open source project here, but for autodiff itself to be eligible for Swift Evolution, we do not need to implement transposition or forward-mode differentiation that was described in the manifesto. The initial implementation should be in line with the proposal in this thread, that is, reverse-mode differentiation (based on VJPs) that we have already implemented today.
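
To make "reverse-mode differentiation based on VJPs" concrete, here is a minimal hand-written sketch in plain Swift. It does not use the compiler feature or the `_Differentiation` module; the function names (`vjpSquarePlusLinear`, `vjpCube`, `vjpComposed`) are hypothetical and only illustrate the shape of a VJP: a function that returns its value together with a pullback closure, with pullbacks chained in reverse order for a composition.

```swift
// VJP of f(x) = x^2 + 3x: returns the value and a pullback that maps an
// upstream gradient v to v * f'(x).
func vjpSquarePlusLinear(_ x: Double) -> (value: Double, pullback: (Double) -> Double) {
    (x * x + 3 * x, { v in v * (2 * x + 3) })
}

// VJP of h(y) = y^3.
func vjpCube(_ y: Double) -> (value: Double, pullback: (Double) -> Double) {
    (y * y * y, { v in v * 3 * y * y })
}

// VJP of the composition g(x) = h(f(x)). The forward pass runs f then h;
// the pullback runs in reverse: first h's pullback, then f's.
func vjpComposed(_ x: Double) -> (value: Double, pullback: (Double) -> Double) {
    let (y, pullbackInner) = vjpSquarePlusLinear(x)
    let (z, pullbackOuter) = vjpCube(y)
    return (z, { v in pullbackInner(pullbackOuter(v)) })
}

let (value, pullback) = vjpComposed(2)
print(value, pullback(1))  // 1000.0 2100.0
```

With compiler support, the per-function VJPs are derived automatically rather than written by hand; the chaining pattern is the same idea.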


For more major use cases, I think we should look at physics. My AR demo was about that.

I spent two months in summer 2020 creating a simulation called MultiPendulum. It required solving a system of linear equations and then taking the derivative of that solve operation. It was very complex and ended up with O(n^4) algorithmic complexity and a lot of hand-tuned optimizations; it took days to come up with that algorithm.

That specific derivative example isn’t something Swift autodiff could have optimized on its own, but it shows how ingrained derivatives are in physics simulations. The O(n^4) part was to find partial derivatives of momentum with respect to all other pendulums. I was also using adaptive-step Runge-Kutta, where momentum and position were treated as distinct entities, with their derivatives being used to find their change each time step. There was also an error correction function computed from a division of differentials of the Hamiltonian and something else.

This post is very long, but I just mentioned 3 unique examples of derivatives in my simulation. If I could rebuild some of the higher-level components of the physics simulation part of AR MultiPendulum and push an update to the App Store, that would constitute a new application. We could also apply autodiff to some entirely different physics simulation, as there are many examples out there to choose from.
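
For readers wondering what differentiating a linear solve looks like in general, the textbook identity is below. This is only a sketch, not the algorithm used in MultiPendulum; the symbols $A$, $b$, $x$, and $\theta$ are assumed notation, with the matrix and right-hand side depending smoothly on a scalar parameter $\theta$ and $A$ invertible.

```latex
A(\theta)\,x(\theta) = b(\theta)
\;\Longrightarrow\;
\frac{\partial A}{\partial \theta}\,x + A\,\frac{\partial x}{\partial \theta}
  = \frac{\partial b}{\partial \theta}
\;\Longrightarrow\;
\frac{\partial x}{\partial \theta}
  = A^{-1}\!\left(\frac{\partial b}{\partial \theta}
  - \frac{\partial A}{\partial \theta}\,x\right)
```

Each parameter therefore needs one more solve against the same matrix, which may be part of why derivatives with respect to every pendulum get expensive quickly.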


Very cool feature... vs very niche use...

I wish there was a formula we could use to formally assess features. e.g. "a feature of coolness X, that requires an added language complexity Y and that will be used in Z percent of real world apps. combine X, Y, Z into an overall mark N, and check N against threshold T, if bigger the feature should be included into the language."

One possible way forward is to release a cut-down version of the feature in library form (whatever is expressible without language support), ship it, and see how well it is received. Then, based on feedback and how often it is used, make an executive decision on whether it's worth including as a language feature.

As a question from a potential user: Does the “no timeline to share” mean that things will move slowly in the worst case, and we don’t have to fear that autodiff features are going to break when other Swift features evolve?

The background is that I am currently evaluating, for my research lab, whether we should bet on the unique selling point of “autodiff for a compiled language” for some of our projects. We could probably live with the current rough state for some time, but breakage of features would be fatal. To some extent, this is a chicken-and-egg problem: we could potentially provide some visible use cases, but I am uncertain whether I really should invest at this time.

@binderh could you give more specific detail on which projects you think autodiff could be used for?

In machine learning, combining neural networks with models that impose some structure, such as differential equations, is currently a big topic. Among many other application domains, this is very useful in biomedicine for incorporating knowledge about the processes to be modeled. Yet differential equations are only one example of a suitable model class, and probably many others will become available in the next few years. One obstacle to incorporating model classes into neural networks more flexibly is that specifying them in PyTorch can be challenging, and a real differentiable programming language is needed. In addition, other frameworks typically require that estimation work without updating array elements (see, e.g., the “mutating arrays not supported” topic in the Julia/Zygote community, but also the corresponding issue with JAX). This is a limitation that Swift autodiff doesn’t seem to have. Thus a much broader class of models would potentially be feasible with Swift, and we have several biomedical examples (ranging from single-cell data to large clinical registries) where this could be demonstrated. In addition, implementations in Swift can be compiled into libraries that can easily be deployed into R, Python, Julia, or even web/WASM-based workflows.


That is definitely an important use case, and it could help Swift autodiff/S4TF stand out from other languages/frameworks. @rxwei we now have complex physics simulations and biomedicine as additional major use cases for autodiff.

@binderh have you checked out my thread on enabling differentiation in Swift release toolchains? Swift for TensorFlow Resurrection: Differentiation running on iOS


I have great respect for your tireless efforts in this area.

Don't take this as a discouragement, just one humble opinion of a member of this list.

I have no idea how others feel about this. My view on this and other features like it:

  • Swift is already quite a large and complicated language.
  • We can't add everything into Swift to satisfy everyone.
  • There are costs to every new language feature (compiler complexity, app complexity, user education, etc)
  • Swift should stay a general-purpose language without extending into niche domains and areas that would only be used in, say, 0.1% of all apps.
  • Can a feature be done in library form (even if it's not totally perfect that way compared to being built into the language itself)? If yes, it should stay in the form of a separate library.

All IMHO above, and I'm happy to be proven wrong.