Without .zeroTangentVector, I'm not sure how to implement something like this (reading from a KeyPath location differentiably):
@differentiable(where Object: Differentiable, Object == Object.TangentVector,
                Member: Differentiable, Member == Member.TangentVector)
public func readFrom<Object, Member>(
    _ object: Object,
    at member: WritableKeyPath<Object, Member>
) -> Member {
    return object[keyPath: member]
}

@derivative(of: readFrom)
public func vjpReadFrom<Object, Member>(
    _ object: Object,
    at member: WritableKeyPath<Object, Member>
) -> (value: Member, pullback: (Member.TangentVector) -> Object.TangentVector)
where Object: Differentiable, Object == Object.TangentVector,
      Member: Differentiable, Member == Member.TangentVector
{
    return (value: object[keyPath: member], pullback: { downstream in
        // Materialize a zeroed tangent, then write the incoming derivative at the read location.
        var zeroes = object.zeroTangentVector
        zeroes[keyPath: member] = downstream
        return zeroes
    })
}
On the backward pass, I use object.zeroTangentVector to materialize a zeroed object, which is then updated with a partial derivative only at the read location. A similar situation arises with array subscript reads. Does anyone see a way to do this without zeroTangentVector?
In this case you would probably need to define a protocol that inherits from Differentiable and requires var zeroTangentVector: TangentVector. The Object generic parameter in vjpReadFrom(_:at:) would need to be constrained to that protocol.
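Here's a minimal sketch of that approach; the protocol name ZeroTangentVectorProviding is an illustrative choice, not something from the proposal:

public protocol ZeroTangentVectorProviding: Differentiable {
    // The zero tangent for this particular value (useful for types whose
    // tangent shape depends on the instance, e.g. arrays).
    var zeroTangentVector: TangentVector { get }
}

@derivative(of: readFrom)
public func vjpReadFrom<Object, Member>(
    _ object: Object,
    at member: WritableKeyPath<Object, Member>
) -> (value: Member, pullback: (Member.TangentVector) -> Object.TangentVector)
where Object: ZeroTangentVectorProviding, Object == Object.TangentVector,
      Member: Differentiable, Member == Member.TangentVector
{
    return (value: object[keyPath: member], pullback: { downstream in
        var zeroes = object.zeroTangentVector  // zero tangent supplied by the protocol
        zeroes[keyPath: member] = downstream
        return zeroes
    })
}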
Deep learning involves (1) automatic differentiation and (2) hardware acceleration. The two are orthogonal, especially for Swift's language-integrated differentiable programming approach – and I believe for Julia too (where the composition of orthogonal math, autodiff, and acceleration libraries seems to be used to good effect).
To achieve (2): you can write math functions that are either GPU-specific or that have a parameterized backend (CPU vs GPU vs X).
Then, to achieve (1): you can register derivatives for those functions, and automatic differentiation works compositionally for composite functions calling those differentiable primitives.
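As a minimal sketch of (1) layered on top of (2), with a hypothetical mySquare standing in for a backend-dispatched primitive:

import _Differentiation

// Hypothetical primitive: imagine the body dispatching to a CPU or GPU kernel.
func mySquare(_ x: Float) -> Float {
    return x * x
}

// Register a derivative for the primitive.
@derivative(of: mySquare)
func mySquareVJP(_ x: Float) -> (value: Float, pullback: (Float) -> Float) {
    return (value: mySquare(x), pullback: { v in 2 * x * v })
}

// Automatic differentiation then composes through any function that calls it.
@differentiable
func poly(_ x: Float) -> Float {
    return 3 * mySquare(x) + x
}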
To echo both Machineko's and Dan's perspectives:
I have worked on another project called Terra, a very small DSL built on top of Lua that uses LLVM to generate very efficient computation kernels for GPUs. Lua obviously has no GPU codegen of its own, but the combination of a DSL and Lua creates a very powerful GPU language currently powering many big HPC applications. Note that this is more of a hybrid between having codegen and not having it: we are not lowering LuaJIT bitcode, but generating scheduled LLVM bitcode that is then compiled by calling the CUDA libraries.
One key difference here with Swift is that we are dealing with a compiled language, with no run-time access to the AST. Thus, I don't know whether it is possible to create a DSL in Swift that can both act as real Swift code (which executes on the CPU) and be lowered to MLIR/bitcode. I believe @GeorgeL (on the LLVM Discourse) is working on MLIR bindings for Swift, so that may be of great interest to everyone here.
Thanks for sharing your perspectives. However, I might suggest starting a new thread about it, because this thread is focused on the differentiable programming feature proposal.
Apologies if this question is off topic but I'm coming from the S4TF side of things and am catching up. Is a deep learning VM image in the works for GCP similar to the experimental S4TF version (e.g. swift-latest-gpu-ubuntu-1804)? In other words, will it be possible for a deep learning model using the differentiable facilities proposed here to run on GCP?
The Enzyme group at MIT (enzyme.mit.edu) recently proposed their project to LLVM. In it, they claim that automatic differentiation is 4.2 times faster if AD is performed after LLVM's optimization passes. I wonder if the approach proposed here can be combined with AD in LLVM itself?
In the fullness of time, absolutely. What's proposed here is the high-level language syntax and semantics, including its ABI implications and such, not the underlying transformation. Although we perform the AD transformation in SIL today, it will be possible to delegate this transformation to LLVM as a performance improvement. I've had some conversations with Enzyme's authors earlier and it seems possible to start some experiments today (contributors welcome!).
Sorry to derail a bit: I'm curious about details here.
Differentiable programming in Swift today seems to focus on language integration, idiomatic usage, good UX, and compiler diagnostics. This is different from Enzyme, which seems less language-integrated (operating on a shared IR) and more performance-oriented (operating on optimized LLVM IR).
Today, Swift automatic differentiation has multiple phases: ① differentiation analyses (namely activity analysis), ② differentiation diagnostics, and ③ derivative code generation.
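For context, here is a minimal sketch of the kind of ahead-of-time check phase ② performs; the exact diagnostic wording is illustrative:

import _Differentiation

@differentiable
func roundTrip(_ x: Float) -> Float {
    // Rejected at compile time: the round-trip through Int is not differentiable,
    // and the diagnostic points at this expression.
    return Float(Int(x))
}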
I wonder how to best integrate with Swift-IR-external transformation tech like Enzyme. Is the idea that we should:
Preserve existing informative ahead-of-time differentiation diagnostics by keeping the existing Swift-aware (SIL) analyses and diagnostic checks (① and ②)? This acts as verification of differentiability.
Use Enzyme just for ③, generating derivative functions – at the LLVM IR level. This requires Enzyme to do its own analyses (like activity analysis) – it feels like there's superficial redundancy (not true redundancy), but it's hard for me to think of a more efficient approach.
One theoretical concern is that ② differentiation verification & diagnostics and ③ transformation will become further distanced, which may lead to impedance mismatch bugs. I thought about this concern when entertaining whether to move activity analysis to operate on the Swift AST instead of SIL (to avoid dealing with difficult details of SIL like Array-related intrinsics, etc.). I suspect this "distance" may not be an issue in practice, with good engineering.
That's the idea. Even when everything happens in SIL, I think our current implementation could have benefited from separating the diagnostics and the transformation as two passes for better hygiene. If we are to delegate the transformation to LLVM, it should still be possible to lower activity analysis results using metadata and/or intrinsics. Making LLVM/Enzyme understand our activity analysis is kind of required, because we have user-level syntaxes such as @noDerivative which directly affect activity analysis. After all, the idea is to delegate the transformation, not the semantics.
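For illustration, a minimal sketch of how @noDerivative feeds activity analysis (the Model type is hypothetical):

import _Differentiation

struct Model: Differentiable {
    var weight: Float
    // Excluded from the synthesized TangentVector; activity analysis treats
    // uses of `scale` as inactive, so no derivative flows through it.
    @noDerivative var scale: Float

    @differentiable
    func callAsFunction(_ x: Float) -> Float {
        return weight * x * scale
    }
}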
You probably won't, as it seems the S4TF team had no problem with Swift as a language or with community support; it was more about some Google vs. Apple concerns.
I know that the S4TF team was very dedicated to their work. I'm working to resurrect S4TF precisely because it's supposed to be impossible to overturn Google's decision. @BradLarson can attest to pulling off something "impossible". I am hoping that community support can help spread the word about this and bring in more contributors.
Lack of interest from Apple in the deep learning department, as Apple machines to date lack any proper modern machine learning capabilities compared to NVIDIA and even AMD.
Apple added ML hardware acceleration to both the CPU (AMX, starting with the A13) and the GPU (simdgroup_matrix, starting with the A14/M1), just as NVIDIA added tensor cores and Intel's Xe GPUs have matrix cores. At the M1 event, Apple made a big deal about the M1 accelerating Python TensorFlow. The CPU acceleration is not documented because it's an extension to the ARM instruction set, which is not officially permitted for licensees of ARM's CPU designs.
And for reference, NVIDIA only added hardware acceleration for ML with the RTX 20 series (circa 2018). The push for ML hardware acceleration is a very recent phenomenon.