Differentiable programming for gradient-based machine learning

Without .zeroTangentVector, I'm not sure how to implement something like this (reading from a KeyPath location differentiably):

import _Differentiation

@differentiable(where Object: Differentiable, Object == Object.TangentVector,
                Member: Differentiable, Member == Member.TangentVector)
public func readFrom<Object, Member>(
    _ object: Object,
    at member: WritableKeyPath<Object, Member>
) -> Member {
    return object[keyPath: member]
}

@derivative(of: readFrom)
public func vjpReadFrom<Object, Member>(
    _ object: Object,
    at member: WritableKeyPath<Object, Member>
) -> (value: Member, pullback: (Member.TangentVector) -> Object.TangentVector)
where Object: Differentiable, Object == Object.TangentVector,
      Member: Differentiable, Member == Member.TangentVector
{
    return (value: object[keyPath: member], pullback: { downstream in
        // Start from an all-zero tangent for the whole object, then deposit
        // the incoming derivative only at the location that was read.
        var zeroes = object.zeroTangentVector
        zeroes[keyPath: member] = downstream
        return zeroes
    })
}

On the backward pass, I use object.zeroTangentVector to materialize a zeroed tangent object, which is then updated with a partial derivative only at the read location. A similar situation arises with array subscript reads. Does anyone see a way to do this without zeroTangentVector?

In this case you would probably need to define a protocol that inherits from Differentiable and requires var zeroTangentVector: TangentVector. The Object generic parameter in vjpReadFrom(_:at:) would need to be constrained to that protocol.
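
A minimal sketch of that suggestion, assuming a hypothetical protocol name (it is not part of the proposal):

import _Differentiation

// Hypothetical protocol adding the zero-tangent requirement on top of Differentiable.
public protocol ZeroTangentProviding: Differentiable {
    // A tangent vector whose components are all zero.
    var zeroTangentVector: TangentVector { get }
}

// vjpReadFrom(_:at:) would then be constrained as:
// where Object: ZeroTangentProviding, Object == Object.TangentVector, ...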

I'm not sure how useful it would be to support CUDA at the language level if it can be solved at the library level with something like SwiftRT.

3 Likes

Yep, I agree with Max above.

Deep learning involves (1) automatic differentiation and (2) hardware acceleration. The two are orthogonal, especially for Swift's language-integrated differentiable programming approach – and I believe for Julia too (where composition of orthogonal math + autodiff + acceleration libraries seems to be used to good effect).

To achieve (2): you can write math functions that are either GPU-specific or that have a parameterized backend (CPU vs GPU vs X).

Then, to achieve (1): you can register derivatives for those functions, and automatic differentiation works compositionally for composite functions calling those differentiable primitives.
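
As a rough sketch of how those two pieces compose (Backend and squared are hypothetical names, and only scalar math is shown; this is not from the proposal itself):

import _Differentiation

// (2) A math primitive whose implementation is backend-parameterized.
enum Backend { case cpu, gpu }

func squared(_ x: Float, on backend: Backend) -> Float {
    switch backend {
    case .cpu: return x * x
    case .gpu: return x * x  // imagine a GPU kernel launch here instead
    }
}

// (1) Register a derivative for the primitive. `backend` is not Differentiable,
// so differentiation is with respect to `x` only; AD then composes through any
// function that calls `squared`.
@derivative(of: squared)
func vjpSquared(_ x: Float, on backend: Backend)
    -> (value: Float, pullback: (Float) -> Float) {
    (squared(x, on: backend), { v in 2 * x * v })
}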

3 Likes

To echo both Machineko's and Dan's perspectives:

I have worked on another project called Terra, a very small DSL built on top of Lua that uses LLVM to generate very efficient computation kernels for GPUs. Obviously Lua does not have any GPU codegen, but the combination of a DSL and Lua makes a very powerful GPU language that currently powers many big HPC applications. Note that this is more like a hybrid of having codegen and not having one - we are not lowering LuaJIT bitcode, but generating scheduled LLVM bitcode that is then compiled by calling the CUDA libraries.

One key difference with Swift is that we are dealing with a compiled language, with no run-time access to the AST. Thus, I don't know whether it is possible to create a DSL in Swift that can both act as real Swift code (which executes on the CPU) and be lowered to MLIR/bitcode. I believe @GeorgeL (on the LLVM Discourse) is working on MLIR bindings for Swift, so that may be of great interest to everyone here.

Oh, I just realized that all of this has already been covered by the amazing (but suspended) Swift as Syntactic Sugar for MLIR by @burmako.

2 Likes

Thanks for sharing your perspectives. However, I might suggest starting a new thread about it, because this thread is focused on the differentiable programming feature proposal.

3 Likes

Apologies if this question is off topic, but I'm coming from the S4TF side of things and am catching up. Is a deep learning VM image for GCP in the works, similar to the experimental S4TF version (e.g. swift-latest-gpu-ubuntu-1804)? In other words, will it be possible to run a deep learning model that uses the differentiable facilities proposed here on GCP?

I'd recommend asking this question on their mailing list. This thread is about the differentiable programming feature proposal.

Will do. Thanks!

The Enzyme group at MIT (enzyme.mit.edu) recently proposed their project to LLVM. They claim that automatic differentiation is 4.2 times faster when AD is performed after LLVM's optimization passes. I wonder whether the approach proposed here can be combined with AD in LLVM itself?

4 Likes

In the fullness of time, absolutely. What's proposed here is the high-level language syntax and semantics, including its ABI implications and such, not the underlying transformation. Although we perform the AD transformation in SIL today, it will be possible to delegate this transformation to LLVM as a performance improvement. I've had some conversations with Enzyme's authors earlier and it seems possible to start some experiments today (contributors welcome!).

13 Likes

Sorry to derail a bit: I'm curious about details here.

Differentiable programming in Swift today seems to focus on language integration, idiomatic design, good UX, and compiler diagnostics. This is different from Enzyme, which seems less language-integrated (operating on a shared IR) and more performance-oriented (operating on optimized LLVM IR).

Today, Swift automatic differentiation has multiple phases: (1) differentiation analyses (namely activity analysis), (2) differentiation diagnostics, and (3) derivative code generation.

I wonder how best to integrate with transformation tech external to Swift's IRs, like Enzyme. Is the idea that we should:

  • Preserve the existing informative ahead-of-time differentiation diagnostics by keeping the existing Swift-aware (SIL) analyses and diagnostic checks ((1) and (2))? This acts as verification of differentiability.
  • Use Enzyme just for (3), generating derivative functions at the LLVM IR level. This requires Enzyme to do its own analyses (like activity analysis) – it feels like there's superficial redundancy (not true redundancy), but it's hard for me to think of a more efficient approach.
    • One theoretical concern is that (2) differentiation verification & diagnostics and (3) transformation will become further distanced, which may lead to impedance-mismatch bugs. I thought about this concern when considering whether to move activity analysis to operate on the Swift AST instead of SIL (to avoid dealing with difficult details of SIL like Array-related intrinsics, etc.). I suspect this "distance" may not be an issue in practice, with good engineering.

5 Likes

That's the idea. Even when everything happens in SIL, I think our current implementation could have benefited from separating the diagnostics and the transformation into two passes for better hygiene. If we are to delegate the transformation to LLVM, it should still be possible to lower activity analysis results using metadata and/or intrinsics. Making LLVM/Enzyme understand our activity analysis is essentially required, because we have user-level syntax such as @noDerivative that directly affects activity analysis. After all, the idea is to delegate the transformation, not the semantics.
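
For context, a minimal sketch of that kind of user-level annotation: @noDerivative marks a stored property as not participating in differentiation, which activity analysis must respect (the Model type here is made up for illustration).

import _Differentiation

struct Model: Differentiable {
    var weight: Float                      // active: part of the synthesized TangentVector
    @noDerivative var stepCount: Int = 0   // inactive: excluded from differentiation
}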

2 Likes

I haven't looked closely at Enzyme - how does it handle the annoyances of ABI exposure in LLVM IR? Things like byref/byval attributes, etc.?

Has the work on differentiable programming stopped?

1 Like

Work continues; you can see the latest AutoDiff PRs here.

8 Likes

I got differentiation working on an iOS demo app. More info is in this post and the GitHub repo.

7 Likes

Nice!

@Brad_Hilton the post is currently blocked by the spam filter, but I'm trying to gain momentum for a resurrection of Swift for TensorFlow.

2 Likes

You probably won't, as it seems the S4TF team had no problem with Swift as a language or with community support; it was more about Google vs. Apple concerns.

I know that the S4TF team was very dedicated to their work. I'm working to resurrect S4TF precisely because it's supposed to be impossible to overturn Google's decision. @BradLarson can attest to pulling off something "impossible". I am hoping that community support can help spread the word about this and bring in more contributors.

There's a lack of interest from Apple in the deep learning department, as Apple machines to date lack any proper modern machine learning capabilities compared to NVIDIA and even AMD.

Apple added ML hardware acceleration to both the CPU (A13 - AMX) and the GPU (A14/M1 - simdgroup_matrix), just as NVIDIA added tensor cores and Intel's Xe GPUs have matrix engines. At the M1 event, Apple made a big deal about the M1 accelerating Python TensorFlow. The CPU acceleration is not documented because it's an extension to the ARM instruction set, which is not officially allowed for companies using ARM's CPU designs.

And for reference, NVIDIA only added hardware acceleration for ML with the RTX 20 series (circa 2018). The push for ML hardware acceleration is a very recent phenomenon.

8 Likes