Differentiable programming for gradient-based machine learning

Without .zeroTangentVector, I'm not sure how to implement something like this (reading from a KeyPath location differentiably):

@differentiable(where Object: Differentiable, Object == Object.TangentVector, Member: Differentiable)
public func readFrom<Object, Member>(
    _ object: Object,
    at member: WritableKeyPath<Object, Member>
) -> Member {
    return object[keyPath: member]
}

@derivative(of: readFrom)
public func vjpReadFrom<Object, Member>(
    _ object: Object,
    at member: WritableKeyPath<Object, Member>
) -> (value: Member, pullback: (Member.TangentVector) -> Object.TangentVector)
where Object: Differentiable, Object == Object.TangentVector, Member: Differentiable
{
    return (value: object[keyPath: member], pullback: { downstream in
        var zeroes = object.zeroTangentVector
        zeroes[keyPath: member] = downstream
        return zeroes
    })
}

On the backward pass, I use object.zeroTangentVector to materialize a zeroed object, which is then updated with a partial derivative only at the read location. A similar situation should occur with array subscript reads. Does someone see a way to do this without zeroTangentVector?

In this case you would probably need to define a protocol that inherits from Differentiable and requires var zeroTangentVector: TangentVector. The Object generic parameter in vjpReadFrom(_:at:) would need to be constrained to that protocol.
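
A rough sketch of that suggestion, assuming a toolchain that ships the `_Differentiation` module. The protocol name `ZeroTangentProviding`, the helper `readWithPullback`, and the `Point` type are hypothetical and only for illustration. The extra constraint `Member == Member.TangentVector` is an added assumption so that the key-path write type-checks. The helper returns the same value/pullback pair that a function registered with `@derivative(of:)` would return.

```swift
import _Differentiation

// Hypothetical protocol (not part of the standard library): a Differentiable
// type that can produce a zero tangent vector from an existing instance.
protocol ZeroTangentProviding: Differentiable {
    var zeroTangentVector: TangentVector { get }
}

// Value/pullback pair for a key-path read, constrained to the protocol above.
// `Member == Member.TangentVector` is an assumption added for this sketch.
func readWithPullback<Object, Member>(
    _ object: Object,
    at member: WritableKeyPath<Object, Member>
) -> (value: Member, pullback: (Member.TangentVector) -> Object.TangentVector)
where Object: ZeroTangentProviding, Object == Object.TangentVector,
      Member: Differentiable, Member == Member.TangentVector
{
    return (value: object[keyPath: member], pullback: { downstream in
        // Start from zero and write the incoming derivative at the read location only.
        var zeroes = object.zeroTangentVector
        zeroes[keyPath: member] = downstream
        return zeroes
    })
}

// Illustrative type that is its own tangent vector. `move(by:)` is assumed to
// come from the default implementation for types where TangentVector == Self.
struct Point: ZeroTangentProviding, AdditiveArithmetic {
    typealias TangentVector = Point
    var x: Float
    var y: Float
    static var zero: Point { Point(x: 0, y: 0) }
    static func + (a: Point, b: Point) -> Point { Point(x: a.x + b.x, y: a.y + b.y) }
    static func - (a: Point, b: Point) -> Point { Point(x: a.x - b.x, y: a.y - b.y) }
    var zeroTangentVector: Point { .zero }
}

let (value, pullback) = readWithPullback(Point(x: 1, y: 2), at: \.y)
let tangent = pullback(5)  // zero everywhere except at `y`
```

Here `value` is the member that was read, and `tangent` carries the downstream derivative only in the `y` slot.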

Are there any plans to add GPU support to Swift (probably CUDA, though I know Nvidia and Apple don't like each other :smiley:)? Without it, a lot of cool use cases of differentiable programming still won't be possible.

Maybe through a separate sub-organization, like JuliaGPU?

I'm not sure how useful it would be to operate with CUDA on the language level, if it can be solved on the library level with something like SwiftRT.

Yep, I agree with Max above.

Deep learning involves (1) automatic differentiation and (2) hardware acceleration. The two are orthogonal, especially for Swift's language-integrated differentiable programming approach – and I believe for Julia too (where composing orthogonal math, autodiff, and acceleration libraries seems to be used to good effect).

To achieve (2): you can write math functions that are either GPU-specific or that have a parameterized backend (CPU vs GPU vs X).

Then, to achieve (1): you can register derivatives for those functions, and automatic differentiation works compositionally for composite functions calling those differentiable primitives.
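
As a small illustration of those two steps, here is a sketch that assumes a recent toolchain with the `_Differentiation` module. `cube` is a hypothetical stand-in for a backend-specific primitive (a real one might dispatch to a GPU kernel), and `model` is a made-up composite function.

```swift
import _Differentiation

// Hypothetical primitive; imagine the body calling into a GPU or other backend.
func cube(_ x: Float) -> Float {
    return x * x * x
}

// Register a hand-written derivative (VJP) for the primitive.
@derivative(of: cube)
func vjpCube(_ x: Float) -> (value: Float, pullback: (Float) -> Float) {
    return (value: cube(x), pullback: { v in 3 * x * x * v })
}

// A composite function: automatic differentiation chains the registered
// derivative of `cube` with the built-in derivatives of `+` and `*`.
@differentiable(reverse)
func model(_ x: Float) -> Float {
    return cube(x) + 2 * x
}

// d/dx (x^3 + 2x) at x = 2 is 3 * 4 + 2 = 14.
let slope = gradient(at: 2, of: model)
```

Nothing in `model` knows where `cube` runs; only the primitive and its registered derivative would need backend-specific code.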

To echo both Machineko's and Dan's perspectives:

I have worked on another project called Terra, a very small DSL built on top of Lua that uses LLVM to generate very efficient computation kernels for GPUs. Obviously Lua does not have any GPU codegen, but the combination of a DSL and Lua creates a very powerful GPU language that currently powers many big HPC applications. Note that this is more like a hybrid between having a codegen and not having one: we are not lowering LuaJIT bitcode, but generating scheduled LLVM bitcode that is then compiled by calling the CUDA libs.

One key difference here with Swift is that we are dealing with a compiled language, with no run-time access to the AST. Thus, I don't know if it is possible to create a DSL in Swift that can both act as real Swift code (which executes on CPU) and can be lowered to MLIR/bitcode. I believe @GeorgeL (in LLVM discourse) is working on MLIR bindings for Swift, so that may be of great interest to everyone here.

Oh, I just realized that all of this has already been covered by the amazing (but suspended) Swift as Syntactic Sugar for MLIR by @burmako.

Thanks for sharing your perspectives. However, I'd suggest starting a new thread about it, because this thread is focused on the differentiable programming feature proposal.

Just one last reply: I'm talking about libraries created by the "core" Swift team (as in the case of JuliaGPU), not about adding GPU support to the language :slight_smile:

Apologies if this question is off topic but I'm coming from the S4TF side of things and am catching up. Is a deep learning VM image in the works for GCP similar to the experimental S4TF version (e.g. swift-latest-gpu-ubuntu-1804)? In other words, will it be possible for a deep learning model using the differentiable facilities proposed here to run on GCP?

I'd recommend asking this question on their mailing list. This thread is about the differentiable programming feature proposal.

Will do. Thanks!

The Enzyme group at MIT (enzyme.mit.edu) recently proposed their project to LLVM. In it they claim that automatic differentiation is 4.2 times faster if AD is performed after LLVM's optimization passes. I wonder if the approach proposed here can be combined with AD in LLVM itself?

In the fullness of time, absolutely. What's proposed here is the high-level language syntax and semantics, including its ABI implications and such, not the underlying transformation. Although we perform the AD transformation in SIL today, it will be possible to delegate this transformation to LLVM as a performance improvement. I've had some conversations with Enzyme's authors earlier and it seems possible to start some experiments today (contributors welcome!).

Sorry to derail a bit: I'm curious about details here.

Differentiable programming in Swift today seems to focus on language-integration & idiomaticness and good UX & compiler diagnostics. This is different from Enzyme, which seems less language-integrated (operating on a shared IR) and more performance-oriented (operating on optimized LLVM IR).

Today, Swift automatic differentiation has multiple phases: ⓵ differentiation analyses (namely activity analysis), ② differentiation diagnostics, and ⓷ derivative code generation.

I wonder how to best integrate with Swift-IR-external transformation tech like Enzyme. Is the idea that we should:

  • Preserve existing informative ahead-of-time differentiation diagnostics by keeping the existing Swift-aware (SIL) analyses and diagnostic checks (⓵ and ②)? This acts as verification of differentiability.
  • Use Enzyme just for ⓷, generating derivative functions – at the LLVM IR level. This requires Enzyme to do its own analyses (like activity analysis) – feels like there's superficial redundancy (not true redundancy), but it's hard for me to think of a more efficient approach.
    • One theoretical concern is that ② differentiation verification & diagnostics and ⓷ transformation will become further distanced, which may lead to impedance mismatch bugs. I thought about this concern when entertaining whether to move activity analysis to operate on the Swift AST instead of SIL (to avoid dealing with difficult details of SIL like Array-related intrinsics, etc). I suspect this "distance" may not be an issue in practice, with good engineering.

That's the idea. Even when everything happens in SIL, I think our current implementation could have benefited from separating the diagnostics and the transformation as two passes for better hygiene. If we are to delegate the transformation to LLVM, it should still be possible to lower activity analysis results using metadata and/or intrinsics. Making LLVM/Enzyme understand our activity analysis is kind of required, because we have user-level syntaxes such as @noDerivative which directly affect activity analysis. After all, the idea is to delegate the transformation, not the semantics.
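
For readers less familiar with `@noDerivative`, here is a minimal user-level sketch of why it matters for activity analysis, assuming a toolchain with the `_Differentiation` module. The `Model` type and its property names are made up for illustration. A property marked `@noDerivative` is treated as a constant during differentiation and gets no slot in the synthesized `TangentVector`, so any delegated transformation would have to respect that.

```swift
import _Differentiation

// Hypothetical model type, for illustration only.
struct Model: Differentiable {
    var weight: Float
    // Excluded from differentiation: treated as a constant, and absent
    // from the synthesized `Model.TangentVector`.
    @noDerivative var offset: Float = 1
}

// Only `weight` is active here, so the gradient has a single component.
let grad = gradient(at: Model(weight: 2)) { m in
    m.weight * 3 + m.offset
}
// `grad` is a `Model.TangentVector`; its only stored property is `weight`.
```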

I haven't looked closely at Enzyme - how does it handle the annoyances of ABI exposure in LLVM IR? Things like byref/byval attributes etc?