Differentiable programming for gradient-based machine learning

I'm not sure what exactly you're referring to by "primitive pullbacks". These two methods aren't pullbacks and aren't yet defined as differentiable. They already handle zero values correctly by checking whether one side is zero, and would therefore work correctly when called by compiler-generated tangent accumulations.

Sure, they can be made differentiable. Note that array differentiation is out of scope for this proposal due to the size of the proposed language components. If this proposal is accepted, it would be great to work out the details of array differentiation in a follow-up proposal.

Yes. The same applies to things like SIMD <-> Array conversion. Linear algebra functions such as matmul also need some attention: if you define the pullback of matmul in terms of matmul itself, you'd need to check for a zero incoming tangent and broadcast it to the original result shape before matmul'ing it with an original argument.
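For concreteness, here's a sketch of that zero-check-and-broadcast pattern using a hypothetical row-major `Matrix` type (`Tensor` and shaped arrays are out of scope here). The incoming tangent is modeled as an `Optional`, with `nil` standing in for a symbolic zero — a modeling choice for this sketch, not the actual AD representation:

```swift
// Hypothetical minimal row-major matrix type, for illustration only.
struct Matrix: Equatable {
    var rows: Int, cols: Int
    var scalars: [Double]  // row-major storage

    static func zero(rows: Int, cols: Int) -> Matrix {
        Matrix(rows: rows, cols: cols,
               scalars: Array(repeating: 0, count: rows * cols))
    }

    func transposed() -> Matrix {
        var result = Matrix.zero(rows: cols, cols: rows)
        for i in 0..<rows {
            for j in 0..<cols {
                result.scalars[j * rows + i] = scalars[i * cols + j]
            }
        }
        return result
    }
}

// Naive matrix multiplication, used both forward and in the pullback.
func matmul(_ a: Matrix, _ b: Matrix) -> Matrix {
    precondition(a.cols == b.rows, "shape mismatch")
    var c = Matrix.zero(rows: a.rows, cols: b.cols)
    for i in 0..<a.rows {
        for k in 0..<a.cols {
            for j in 0..<b.cols {
                c.scalars[i * b.cols + j] +=
                    a.scalars[i * a.cols + k] * b.scalars[k * b.cols + j]
            }
        }
    }
    return c
}

// Pullback of matmul(a, b) with respect to both arguments:
// dA = v · Bᵀ and dB = Aᵀ · v, where v is the incoming tangent.
func matmulPullback(a: Matrix, b: Matrix, tangent v: Matrix?) -> (Matrix, Matrix) {
    // A nil tangent stands in for a symbolic zero: broadcast it to the
    // original result shape before matmul'ing with the original arguments.
    let tangent = v ?? .zero(rows: a.rows, cols: b.cols)
    return (matmul(tangent, b.transposed()), matmul(a.transposed(), tangent))
}
```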

@scanon would it make sense to add an isZero computed property to AdditiveArithmetic to make this more discoverable?
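A sketch of what that might look like as a protocol extension (hypothetical; `isZero` on `AdditiveArithmetic` is not part of the standard library), together with an accumulation helper that uses it:

```swift
extension AdditiveArithmetic {
    // Hypothetical convenience property; AdditiveArithmetic refines
    // Equatable, so comparing against .zero is always available.
    var isZero: Bool { self == .zero }
}

// Example use in a tangent accumulation that skips redundant work
// when either side is zero.
func accumulated<T: AdditiveArithmetic>(_ lhs: T, _ rhs: T) -> T {
    if lhs.isZero { return rhs }
    if rhs.isZero { return lhs }
    return lhs + rhs
}
```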

2 Likes

This is a great change, Richard, thank you! Does this eliminate the bump-pointer-allocated closure instances?

-Chris

1 Like

The word “direction” goes with “along.” Something like “offset” would be more appropriate here.

Should we give a blessed name to the TangentVector associated type conformance (i.e. Differentiable & AdditiveArithmetic where TangentVector == Self), just to make it slightly easier for people to think about / write?

Agreed. step would be OK too, I think.

1 Like

I believe a blessed name for Self == Self.TangentVector cannot be represented as a typealias and must be represented as a protocol. A marker protocol requires explicit conformances, which significantly hurts usability.

Differentiability marker protocols can be useful, though. An example of such a "differentiable scalar type" marker protocol is TensorFlowFloatingPoint from Swift for TensorFlow. It is useful as the de facto differentiability generic requirement in @differentiable declaration attributes for Tensor operations:

// Much less verbose than:
// @differentiable(
//   where Scalar: Differentiable & BinaryFloatingPoint,
//   Scalar.TangentVector == Scalar
// )
@differentiable(where Scalar: TensorFlowFloatingPoint)
func +<Scalar: Numeric>(
  lhs: Tensor<Scalar>, rhs: Tensor<Scalar>
) -> Tensor<Scalar> { ... }

// The convention: all functions on `Tensor<Scalar>` are differentiable
// `where Scalar: TensorFlowFloatingPoint`. This makes it easier to
// reason about differentiability generic requirements, since they're
// uniform across the board.

Thanks for pointing this out. The proposal has been updated to adopt "offset".

1 Like

A quick question and observation: I wonder if there is some mathematical basis for understanding the name "offset"?

I think I've been colored by Swift and its standard library and compiler codebase, where "offset" typically refers to integer offsets used for e.g. pointer arithmetic and collection indexing. Is it the best name for the argument to Differentiable.move(by:)?

"Offset" is certainly math terminology too: b is the offset in y = mx + b. References to "offset" in math: [0] [1]. Maybe there exists a different parameter name that is less overloaded and more math-appropriate. ("step"?)

I thought the old name "direction" was more math-appropriate - were it not for the fact that along: in move(along: direction) is not the best parameter label.

IMO offset is the right word because an offset has both a direction and a magnitude. There's already some "mathy" precedent, e.g. offsetBy(dx:dy:), where the offset is not an integer but a vector represented by two floating-point scalars.

5 Likes

It seems to me that the only reason such protocols exist is because type alias declarations don't support where clauses. I don't believe that we should introduce an additional protocol for this purpose. The burden of explicit conformances aside, it will likely be impossible to simply change the protocol to a type alias with the same name when type aliases do support where clauses one day.

Yep, adding an extra protocol is certainly not ideal.

About typealiases with generic requirements: check out this thread. Seems like the feature could be implemented with a moderate amount of compiler hacking.

The bump-pointer allocated pullback structs are actually unrelated to Differentiable.zeroTangentVectorInitializer.

zeroTangentVectorInitializer was never used in automatic differentiation. We had ideas for using it in the autodiff transform, but I'm glad we decided not to pursue that direction and to instead exclusively use "universal zeros" (static var TangentVector.zero), which are easier to reason about (for implementers and users) and much easier for code transformation, generation, and optimization than per-instance zero-initializing closures (e.g. var Tensor.zeroTangentVectorInitializer).
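To illustrate the contrast (the `Vector` type below is a hypothetical stand-in for a shape-carrying type like `Tensor`):

```swift
// A universal zero is a single static value per type
// (AdditiveArithmetic.zero), known without any instance.
let universalZero = Double.zero

// A per-instance zero initializer must capture instance information:
// for a shape-carrying type, the zero tangent's shape depends on the
// instance's element count, which is unknown statically.
struct Vector {
    var scalars: [Double]
    var zeroTangentVectorInitializer: () -> Vector {
        let count = scalars.count  // captured per instance
        return { Vector(scalars: Array(repeating: 0, count: count)) }
    }
}

let v = Vector(scalars: [1, 2, 3])
let zeroTangent = v.zeroTangentVectorInitializer()
```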


Here's the PR adding bump-pointer allocated pullback structs:

The PR implements a (seemingly significant) autodiff performance optimization via custom runtime support: a more efficient autodiff trace representation that replaces Swift's existing indirect enum value representation. The PR description gives a nice overview of the approach and its benefits (future work would be to also optimize the autodiff closure representation):

@Chris_Lattner3: I wonder if you have reservations about this direction for some reason? I was actually excited by the PR because it represents progress towards autodiff performance, which has long been awaiting engineering effort and attention (and would continue to benefit from ergonomic, metrics-driven benchmarking).

Some related tangents:

  • Another autodiff performance win is fixing SR-13390: even simple calls to gradient(at:in:) are not fully inlined and optimized away. I ~fixed this and will try to create a PR within the month, just missing autodiff benchmark suite additions.
  • I've promised to port "The Swift Differentiable Programming Implementation Overview" to Markdown and to contribute it to apple/swift docs. I'd also like to finish this within the month - I hope the doc will help inform autodiff-related technical discussions, as well as onboarding autodiff compiler contributors.
1 Like

Thank you for raising this question. There are a few ways to optimize closure allocations in differentiation, and I have been considering different approaches. Bump-pointer allocation for all pullback closures is possible, but it would be very heavyweight and ad hoc if it were applied only to AD's pullback closures.

Pullbacks returned by reverse-mode derivative functions are "non-escaping result closures". In all compiler-generated code, these closures never escape the caller unless captured by a parent non-escaping result. Pullback closures capture inner pullback closures recursively, almost never escape the call site of the function that returned them (e.g. a user's call site of a derivative function), and follow a strict stack allocation/deallocation discipline. This special property makes things easy to optimize, because you can do a continuation transform and turn all closure allocations into stack allocations. Swift is missing a formal syntax and semantics for noescape results, but here's an example (assuming that we could use @noescape for results):

func innerDerivative1<T: Differentiable>(x: T) -> (value: T, pullback: @noescape (T.TangentVector) -> T.TangentVector) {
    ...
}

func innerDerivative2<T: Differentiable>(x: T) -> (value: T, pullback: @noescape (T.TangentVector) -> T.TangentVector) {
    ...
}

func outerDerivative<T: Differentiable>(x: T) -> (value: T, pullback: @noescape (T.TangentVector) -> T.TangentVector) {
    let (v1, pb1) = innerDerivative1(x: x) // allocation of pb1
    let (v2, pb2) = innerDerivative2(x: v1) // allocation of pb2
    // In the returned pullback below:
    // `pb1` and `pb2` are captured by the parent @noescape result closure.
    return (value: v2, pullback: { dv2 in
        let dv1 = pb2(dv2) // deallocation of pb2
        let dx = pb1(dv1) // deallocation of pb1
        return dx
    })
}

If @noescape were available in the user syntax, it could allow us to define a CPS-based ABI for functions with @noescape result closures. The above functions would then be compiled to the following (it would be done in SIL, but here's the equivalent in Swift):

// Swift equivalent of the lowered version of the code above (take 1)

func innerDerivative1<T: Differentiable>(x: T, pullbackContinuation: (T, (T.TangentVector) -> T.TangentVector) -> T) -> T

func innerDerivative2<T: Differentiable>(x: T, pullbackContinuation: (T, (T.TangentVector) -> T.TangentVector) -> T) -> T

func outerDerivative<T: Differentiable>(x: T, pullbackContinuation: (T, (T.TangentVector) -> T.TangentVector) -> T) -> T {
    // Each continuation receives the value along with the pullback, so
    // nested scopes can see intermediate results and no pullback escapes.
    return innerDerivative1(x: x) { v1, pb1 in
        innerDerivative2(x: v1) { v2, pb2 in
            pullbackContinuation(v2, { dv2 in
                let dv1 = pb2(dv2)
                let dx = pb1(dv1)
                return dx
            })
        }
    }
}
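For intuition, here is a runnable instantiation of this nesting pattern, specialized to `Double` with hypothetical constant-scaling derivatives (no `@noescape` is needed for the illustration, since immediate closure arguments are already non-escaping in Swift):

```swift
// Derivative of `x * k`: passes the value and pullback to a continuation,
// so the pullback closure never needs to escape.
func scaleDerivative(
    _ x: Double, by k: Double,
    _ continuation: (Double, (Double) -> Double) -> Double
) -> Double {
    continuation(x * k, { dv in dv * k })
}

// Derivative of `(x * 2) * 3`, composed via nested continuations; the
// inner pullbacks are captured only by other non-escaping closures.
func composedDerivative(
    _ x: Double,
    _ continuation: (Double, (Double) -> Double) -> Double
) -> Double {
    scaleDerivative(x, by: 2) { v1, pb1 in
        scaleDerivative(v1, by: 3) { v2, pb2 in
            continuation(v2, { dv2 in pb1(pb2(dv2)) })
        }
    }
}

let gradient = composedDerivative(5) { value, pullback in
    // value == 30; applying the pullback to 1 yields d/dx == 6.
    pullback(1)
}
```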

This approach would make all pullback closures stack-allocated, resolving the main performance problem, and it can be extended to other library APIs that frequently return closures satisfying this property. While stack-allocated pullbacks are possible for AD, pullback closures created in differentiated loops could benefit from a stack-disciplined bump-pointer allocator, because we don't want loop derivatives to overflow the stack.

Having said that, I think doing this would be a huge amount of compiler work, and there may be a lightweight approach that achieves similar performance. Perhaps we can instead define the ABI of functions with @noescape result closures as taking an additional argument representing a stack-disciplined allocator (which already exists in the runtime), and the compiler can simply compile the creation of @noescape result closures into allocations on that allocator. Such an ABI would be easy to adopt in the differentiation transform as well, since all we need to do is use the contextual allocator to allocate pullback closures in our generated derivative functions.

// Swift equivalent of the lowered version of the code above (take 2)

func innerDerivative1<T: Differentiable>(x: T, stackAllocator: StackAllocator) -> (value: T, pullback: (T.TangentVector) -> T.TangentVector) {
    ...
}

func innerDerivative2<T: Differentiable>(x: T, stackAllocator: StackAllocator) -> (value: T, pullback: (T.TangentVector) -> T.TangentVector) {
    ...
}

func outerDerivative<T: Differentiable>(x: T, stackAllocator: StackAllocator) -> (value: T, pullback: (T.TangentVector) -> T.TangentVector) {
    let (v1, pb1) = innerDerivative1(x: x, stackAllocator: stackAllocator) // allocation of pb1
    let (v2, pb2) = innerDerivative2(x: v1, stackAllocator: stackAllocator) // allocation of pb2
    // In the returned pullback below:
    // `pb1` and `pb2` are captured by the parent non-escaping result closure.
    return (value: v2, pullback: { dv2 in
        let dv1 = pb2(dv2) // deallocation of pb2
        let dx = pb1(dv1) // deallocation of pb1
        return dx
    })
}
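For intuition, here is a minimal, purely illustrative sketch of such a stack-disciplined allocator. The real one lives in the Swift runtime, and the name `StackAllocator` and its API are hypothetical:

```swift
// LIFO bump-pointer allocator: allocations are freed strictly in reverse
// order, so freeing just rewinds the bump offset.
final class StackAllocator {
    private let buffer: UnsafeMutableRawPointer
    private let capacity: Int
    private var offset = 0
    private var marks: [Int] = []

    init(capacity: Int = 1 << 16) {
        self.capacity = capacity
        self.buffer = .allocate(byteCount: capacity, alignment: 16)
    }
    deinit { buffer.deallocate() }

    // Bump-allocates `byteCount` bytes; must be popped in LIFO order.
    func push(byteCount: Int) -> UnsafeMutableRawPointer {
        precondition(offset + byteCount <= capacity, "allocator overflow")
        marks.append(offset)
        // Round the bump up so every allocation stays 16-byte aligned.
        defer { offset += (byteCount + 15) & ~15 }
        return buffer + offset
    }

    // Frees the most recent allocation by rewinding the bump pointer.
    func pop() {
        offset = marks.removeLast()
    }
}
```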

Overall, I believe that defining derivative functions as returning pullback closures has been the right mathematical abstraction all along. The key missing part for performance is marking them as returning non-escaping result closures — we need an attribute and ABI (e.g. @noescape) that lets the compiler guarantee that those returned closures would be allocated efficiently, either on the stack or in a stack-disciplined pool allocator.

(My thanks to @Joe_Groff and @Michael_Gottesman for recent discussions on this.)

2 Likes

Are you saying that you still want to use this for pullback closures? I have been thinking about our discussions with respect to pullback closures. If you think about it at a high level, you are mapping control flow onto an enum and then using that enum as a state machine to drive execution. Would it be possible to have a special way of encoding a loop into an enum? Then you could recognize the loop structurally and allocate a fixed amount of memory outside the loop that is reused between loop iterations.

Fundamentally, using a bump-pointer allocator for that case reveals a representation problem that you are resolving by using (theoretically) infinite memory. We should see if we can represent the loop in the representation itself, so that we can understand the loop's scoping and reuse memory between iterations.

Any time something needs infinite memory due to a loop, it should give one pause. That is why I am pushing back so hard on this.

2 Likes

In a loop's pullback, each iteration depends on the last iteration's results, unless the variables in use were loop-invariant in the first place (in the original function). It is definitely not the case that memory can always be reused, because values captured by intermediate pullbacks have to be preserved to apply the chain rule of differentiation in the reverse order.

The current data structure (enum with payloads) used in the implementation is equivalent to what the traditional autodiff community calls a "tape", which stores not just a trace of the execution path but also necessary intermediate values produced by each iteration. Those intermediate values (captured in pullback closures in the case of Swift) need to be used in the reverse order to form a linear combination with cotangent vectors as required by math.
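A minimal sketch of such a tape as an indirect enum (names hypothetical): each node holds one iteration's pullback, which captures the intermediate values it needs, and the reverse pass walks the tape last-to-first, applying the chain rule in reverse order:

```swift
// Each tape node stores one iteration's pullback and a link to the
// previous iteration; the pullback's captures are the preserved
// intermediate values.
indirect enum Tape {
    case start
    case iteration(pullback: (Double) -> Double, previous: Tape)
}

// Reverse pass: walk from the last iteration back to the first.
func runPullbacks(_ tape: Tape, seed: Double) -> Double {
    var gradient = seed
    var node = tape
    while case let .iteration(pullback, previous) = node {
        gradient = pullback(gradient)
        node = previous
    }
    return gradient
}

// Example: three iterations of `x *= x` (i.e. x ↦ x⁸ overall). Each
// pullback must capture that iteration's input value; none of this
// memory can be reused across iterations.
var tape = Tape.start
var x = 2.0
for _ in 0..<3 {
    let captured = x  // intermediate value preserved by the pullback
    tape = .iteration(pullback: { dv in dv * 2 * captured }, previous: tape)
    x *= x
}
// d/dx of ((x²)²)² = 8·x⁷; at x = 2 this is 1024.
let derivative = runPullbacks(tape, seed: 1)
```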

4 Likes

I think the ownership system, when fully elaborated, may effectively give us noescape results. A yielded borrow of a non-copyable type is, for example, non-escaping. The problem is that closures are copyable. I haven't read up on the plans for making closures with mutable captures thread-safe, but one way would be to make them move-only. I bring this all up just to point out that there may be some interesting design synergies here that should be explored.

5 Likes

I agree with this. While it is interesting that you've reduced the cost of these in a specific case, we shouldn't have codegen for a core language feature rely on a heroic runtime implementation, and having compilation be fragile w.r.t. the number of loop iterations or the execution path seems unacceptable. We need guaranteed correctness out of core language features, and ideally a predictable execution cost model.

I think what you need here is a way in SIL to guarantee stack allocation of these closures. This will naturally solve the problem you're seeing here, because the closures in each loop iteration will get reused (since the alloc/dealloc stack is scoped to the loop body). It will also make everything else in autodiff cheaper, even more performant than a bump-pointer allocator, and it will enable secondary optimizations to clean up the IR a lot better, since many of the closures will be inlined and the alloc/dealloc stacks will be removed by existing optimizations.

-Chris

2 Likes

The Google Brain team just announced that Swift for TensorFlow is being sunset. The repos have been archived and its team members are being transitioned to new projects. :frowning_face:

@rxwei we need you and your team's work now more than ever :exclamation:

11 Likes

Oops, what happened?

The work definitely continues! We are currently working on differentiable programming's ABI stability and module stability, and fixing various bugs. We'd welcome contributors too — here are some project ideas about differentiable programming for GSoC: Swift.org - Project Ideas for GSoC 2021.

34 Likes

Why am I not surprised?

:slightly_frowning_face:

4 Likes