Differentiable programming for gradient-based machine learning

I'm not sure what exactly you're referring to by "primitive pullbacks". These two methods aren't pullbacks and aren't yet defined as differentiable. They already handle zero values correctly by checking whether one side is zero, and would therefore work correctly when called by compiler-generated tangent accumulations.

Sure, they can be made differentiable. Note that array differentiation is out of scope for this proposal due to the size of the proposed language components. If this proposal is accepted, it would be great to work out the details of array differentiation in a follow-up proposal.

Yes. The same applies to things like SIMD <-> Array conversion. Linear algebra functions such as matmul also need some attention: if you define the pullback of matmul in terms of matmul itself, you'd need to check for a zero incoming tangent and broadcast it to the original result shape before matmul'ing it with an original argument.
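For concreteness, here's a sketch of that zero-check-and-broadcast pattern using a hypothetical row-major `Matrix` type (`Tensor` and shaped arrays are out of scope here). The incoming tangent is modeled as an `Optional`, with `nil` standing in for a symbolic zero — a modeling choice for this sketch, not the actual AD representation:

```swift
// Hypothetical minimal row-major matrix type, for illustration only.
struct Matrix: Equatable {
    var rows: Int, cols: Int
    var scalars: [Double]  // row-major storage

    static func zero(rows: Int, cols: Int) -> Matrix {
        Matrix(rows: rows, cols: cols,
               scalars: Array(repeating: 0, count: rows * cols))
    }

    func transposed() -> Matrix {
        var result = Matrix.zero(rows: cols, cols: rows)
        for i in 0..<rows {
            for j in 0..<cols {
                result.scalars[j * rows + i] = scalars[i * cols + j]
            }
        }
        return result
    }
}

// Naive matrix multiplication, used both forward and in the pullback.
func matmul(_ a: Matrix, _ b: Matrix) -> Matrix {
    precondition(a.cols == b.rows, "shape mismatch")
    var c = Matrix.zero(rows: a.rows, cols: b.cols)
    for i in 0..<a.rows {
        for k in 0..<a.cols {
            for j in 0..<b.cols {
                c.scalars[i * b.cols + j] +=
                    a.scalars[i * a.cols + k] * b.scalars[k * b.cols + j]
            }
        }
    }
    return c
}

// Pullback of matmul(a, b) with respect to both arguments:
// dA = v · Bᵀ and dB = Aᵀ · v, where v is the incoming tangent.
func matmulPullback(a: Matrix, b: Matrix, tangent v: Matrix?) -> (Matrix, Matrix) {
    // A nil tangent stands in for a symbolic zero: broadcast it to the
    // original result shape before matmul'ing with the original arguments.
    let tangent = v ?? .zero(rows: a.rows, cols: b.cols)
    return (matmul(tangent, b.transposed()), matmul(a.transposed(), tangent))
}
```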

@scanon would it make sense to add an isZero computed property to AdditiveArithmetic to make this more discoverable?
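A sketch of what that might look like as a protocol extension (hypothetical; `isZero` on `AdditiveArithmetic` is not part of the standard library), together with an accumulation helper that uses it:

```swift
extension AdditiveArithmetic {
    // Hypothetical convenience property; AdditiveArithmetic refines
    // Equatable, so comparing against .zero is always available.
    var isZero: Bool { self == .zero }
}

// Example use in a tangent accumulation that skips redundant work
// when either side is zero.
func accumulated<T: AdditiveArithmetic>(_ lhs: T, _ rhs: T) -> T {
    if lhs.isZero { return rhs }
    if rhs.isZero { return lhs }
    return lhs + rhs
}
```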

2 Likes

This is a great change, Richard, thank you! Does this eliminate the bump-pointer-allocated closure instances?

-Chris

1 Like

The word “direction” goes with “along.” Something like “offset” would be more appropriate here.

Should we give a blessed name to the TangentVector associated type conformance (i.e. Differentiable & AdditiveArithmetic where TangentVector == Self), just to make it slightly easier for people to think about / write?

Agreed. step would be OK too, I think.

1 Like

I believe a blessed name for Self == Self.TangentVector cannot be represented as a typealias and must be represented as a protocol. A marker protocol requires explicit conformances, which significantly hurts usability.

Differentiability marker protocols can be useful, though. An example of such a "differentiable scalar type" marker protocol is TensorFlowFloatingPoint from Swift for TensorFlow. It is useful as the de facto differentiability generic requirement in @differentiable declaration attributes for Tensor operations:

// Much less verbose than:
// @differentiable(
//   where Scalar: Differentiable & BinaryFloatingPoint,
//   Scalar.TangentVector == Scalar
// )
@differentiable(where Scalar: TensorFlowFloatingPoint)
func +<Scalar: Numeric>(
  lhs: Tensor<Scalar>, rhs: Tensor<Scalar>
) -> Tensor<Scalar> { ... }

// The convention: all functions on `Tensor<Scalar>` are differentiable
// `where Scalar: TensorFlowFloatingPoint`. This makes it easier to
// reason about differentiability generic requirements, since they're
// uniform across the board.

Thanks for pointing this out. The proposal has been updated to adopt "offset".

1 Like

A quick question and observation: I wonder if there is some mathematical basis for understanding the name "offset"?

I think I've been colored by Swift and its standard library and compiler codebase, where "offset" typically refers to integer offsets used for e.g. pointer arithmetic and collection indexing. Is it the best name for the argument to Differentiable.move(by:)?

"Offset" is certainly math terminology too: b is the offset in y = mx + b. References to "offset" in math: [0] [1]. Maybe there exists a different parameter name that is less overloaded and more math-appropriate. ("step"?)

I thought the old name "direction" was more math-appropriate - were it not for the fact that along: in move(along: direction) is not the best parameter label.

IMO offset is the right word because an offset has both a direction and a magnitude. There's already some "mathy" precedent, e.g. offsetBy(dx:dy:), where the offset is not an integer but a vector represented by two floating-point scalars.

5 Likes

It seems to me that the only reason such protocols exist is because type alias declarations don't support where clauses. I don't believe that we should introduce an additional protocol for this purpose. The burden of explicit conformances aside, it will likely be impossible to simply change the protocol to a type alias with the same name when type aliases do support where clauses one day.

Yep, adding an extra protocol is certainly not ideal.

About typealiases with generic requirements: check out this thread. Seems like the feature could be implemented with a moderate amount of compiler hacking.

The bump-pointer allocated pullback structs are actually unrelated to Differentiable.zeroTangentVectorInitializer.

zeroTangentVectorInitializer was never used in automatic differentiation. We had ideas for using it in the autodiff transform, but I'm glad we decided not to pursue that direction and to instead exclusively use "universal zeros" (static var TangentVector.zero), which are easier to reason about (for implementers and users) and much easier for code transformation, generation, and optimization than per-instance zero-initializing closures (e.g. var Tensor.zeroTangentVectorInitializer).
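To illustrate the contrast (the `Vector` type below is a hypothetical stand-in for a shape-carrying type like `Tensor`):

```swift
// A universal zero is a single static value per type
// (AdditiveArithmetic.zero), known without any instance.
let universalZero = Double.zero

// A per-instance zero initializer must capture instance information:
// for a shape-carrying type, the zero tangent's shape depends on the
// instance's element count, which is unknown statically.
struct Vector {
    var scalars: [Double]
    var zeroTangentVectorInitializer: () -> Vector {
        let count = scalars.count  // captured per instance
        return { Vector(scalars: Array(repeating: 0, count: count)) }
    }
}

let v = Vector(scalars: [1, 2, 3])
let zeroTangent = v.zeroTangentVectorInitializer()
```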


Here's the PR adding bump-pointer allocated pullback structs:

The PR implements a (seemingly significant) autodiff performance optimization via custom runtime support: a more efficient autodiff trace representation that replaces Swift's existing indirect enum value representation. The PR description gives a nice overview of the approach and its benefits (future work would be to also optimize the autodiff closure representation):

@Chris_Lattner3: I wonder if you have reservations about this direction for some reason? I was actually excited by the PR because it represents progress towards autodiff performance, which has long been awaiting engineering effort and attention (and would continue to benefit from ergonomic, metrics-driven benchmarking).

Some related tangents:

  • Another autodiff performance win is fixing SR-13390: even simple calls to gradient(at:in:) are not fully inlined and optimized away. I ~fixed this and will try to create a PR within the month, just missing autodiff benchmark suite additions.
  • I've promised to port "The Swift Differentiable Programming Implementation Overview" to Markdown and to contribute it to apple/swift docs. I'd also like to finish this within the month - I hope the doc will help inform autodiff-related technical discussions, as well as onboarding autodiff compiler contributors.
1 Like

Thank you for raising this question. There are a few ways to optimize closure allocations in differentiation, and I have been considering different approaches. Bump-pointer allocation for all pullback closures is possible, but it would be very heavyweight and ad hoc if it were applied only to AD's pullback closures.

Pullbacks returned by reverse-mode derivative functions are "non-escaping result closures". In all compiler-generated code, these closures never escape the caller unless captured by a parent non-escaping result. Pullback closures capture inner pullback closures recursively, almost never escape the call site of the function that returned them (e.g. a user's call site of a derivative function), and follow a strict stack allocation/deallocation discipline. This special property makes things easy to optimize, because you can do a continuation transform and turn all closure allocations into stack allocations. Swift is missing a formal syntax and semantics for noescape results, but here's an example (assuming that we could use @noescape for results):

func innerDerivative1<T: Differentiable>(x: T) -> (value: T, pullback: @noescape (T.TangentVector) -> T.TangentVector) {
    ...
}

func innerDerivative2<T: Differentiable>(x: T) -> (value: T, pullback: @noescape (T.TangentVector) -> T.TangentVector) {
    ...
}

func outerDerivative<T: Differentiable>(x: T) -> (value: T, pullback: @noescape (T.TangentVector) -> T.TangentVector) {
    let (v1, pb1) = innerDerivative1(x: x) // allocation of pb1
    let (v2, pb2) = innerDerivative2(x: v1) // allocation of pb2
    // In the returned pullback below:
    // `pb1` and `pb2` are captured by the parent @noescape result closure.
    return (value: v2, pullback: { dv2 in
        let dv1 = pb2(dv2) // deallocation of pb2
        let dx = pb1(dv1) // deallocation of pb1
        return dx
    })
}

If @noescape were available in the user syntax, it could allow us to define a CPS-based ABI for functions with @noescape result closures. The above functions would then be compiled to the following (it would be done in SIL, but here's the equivalent in Swift):

// Swift equivalent of the lowered version of the code above (take 1)

func innerDerivative1<T: Differentiable>(x: T, pullbackContinuation: (T, (T.TangentVector) -> T.TangentVector) -> T) -> T

func innerDerivative2<T: Differentiable>(x: T, pullbackContinuation: (T, (T.TangentVector) -> T.TangentVector) -> T) -> T

func outerDerivative<T: Differentiable>(x: T, pullbackContinuation: (T, (T.TangentVector) -> T.TangentVector) -> T) -> T {
    // Each continuation receives the value along with the pullback, so
    // nested scopes can see intermediate results and no pullback escapes.
    return innerDerivative1(x: x) { v1, pb1 in
        innerDerivative2(x: v1) { v2, pb2 in
            pullbackContinuation(v2, { dv2 in
                let dv1 = pb2(dv2)
                let dx = pb1(dv1)
                return dx
            })
        }
    }
}
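For intuition, here is a runnable instantiation of this nesting pattern, specialized to `Double` with hypothetical constant-scaling derivatives (no `@noescape` is needed for the illustration, since immediate closure arguments are already non-escaping in Swift):

```swift
// Derivative of `x * k`: passes the value and pullback to a continuation,
// so the pullback closure never needs to escape.
func scaleDerivative(
    _ x: Double, by k: Double,
    _ continuation: (Double, (Double) -> Double) -> Double
) -> Double {
    continuation(x * k, { dv in dv * k })
}

// Derivative of `(x * 2) * 3`, composed via nested continuations; the
// inner pullbacks are captured only by other non-escaping closures.
func composedDerivative(
    _ x: Double,
    _ continuation: (Double, (Double) -> Double) -> Double
) -> Double {
    scaleDerivative(x, by: 2) { v1, pb1 in
        scaleDerivative(v1, by: 3) { v2, pb2 in
            continuation(v2, { dv2 in pb1(pb2(dv2)) })
        }
    }
}

let gradient = composedDerivative(5) { value, pullback in
    // value == 30; applying the pullback to 1 yields d/dx == 6.
    pullback(1)
}
```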

This approach would make all pullback closures stack-allocated, resolving the main performance problem, and it can be extended to other library APIs that frequently return closures satisfying this property. While stack-allocated pullbacks are possible for AD, pullback closures created in differentiated loops could benefit from a stack-disciplined bump-pointer allocator, because we don't want loop derivatives to overflow the stack.

Having said that, I think doing this would be a huge amount of compiler work, and there may be a lightweight approach that achieves similar performance. Perhaps we can instead define the ABI of functions with @noescape result closures as taking an additional argument representing a stack-disciplined allocator (which already exists in the runtime), and the compiler can simply compile the creation of @noescape result closures into allocations on that allocator. Such an ABI would be easy to adopt in the differentiation transform as well, since all we need to do is use the contextual allocator to allocate pullback closures in our generated derivative functions.

// Swift equivalent of the lowered version of the code above (take 2)

func innerDerivative1<T: Differentiable>(x: T, stackAllocator: StackAllocator) -> (value: T, pullback: (T.TangentVector) -> T.TangentVector) {
    ...
}

func innerDerivative2<T: Differentiable>(x: T, stackAllocator: StackAllocator) -> (value: T, pullback: (T.TangentVector) -> T.TangentVector) {
    ...
}

func outerDerivative<T: Differentiable>(x: T, stackAllocator: StackAllocator) -> (value: T, pullback: (T.TangentVector) -> T.TangentVector) {
    let (v1, pb1) = innerDerivative1(x: x, stackAllocator: stackAllocator) // allocation of pb1
    let (v2, pb2) = innerDerivative2(x: v1, stackAllocator: stackAllocator) // allocation of pb2
    // In the returned pullback below:
    // `pb1` and `pb2` are captured by the parent non-escaping result closure.
    return (value: v2, pullback: { dv2 in
        let dv1 = pb2(dv2) // deallocation of pb2
        let dx = pb1(dv1) // deallocation of pb1
        return dx
    })
}
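For intuition, here is a minimal, purely illustrative sketch of such a stack-disciplined allocator. The real one lives in the Swift runtime, and the name `StackAllocator` and its API are hypothetical:

```swift
// LIFO bump-pointer allocator: allocations are freed strictly in reverse
// order, so freeing just rewinds the bump offset.
final class StackAllocator {
    private let buffer: UnsafeMutableRawPointer
    private let capacity: Int
    private var offset = 0
    private var marks: [Int] = []

    init(capacity: Int = 1 << 16) {
        self.capacity = capacity
        self.buffer = .allocate(byteCount: capacity, alignment: 16)
    }
    deinit { buffer.deallocate() }

    // Bump-allocates `byteCount` bytes; must be popped in LIFO order.
    func push(byteCount: Int) -> UnsafeMutableRawPointer {
        precondition(offset + byteCount <= capacity, "allocator overflow")
        marks.append(offset)
        // Round the bump up so every allocation stays 16-byte aligned.
        defer { offset += (byteCount + 15) & ~15 }
        return buffer + offset
    }

    // Frees the most recent allocation by rewinding the bump pointer.
    func pop() {
        offset = marks.removeLast()
    }
}
```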

Overall, I believe that defining derivative functions as returning pullback closures has been the right mathematical abstraction all along. The key missing part for performance is marking them as returning non-escaping result closures — we need an attribute and ABI (e.g. @noescape) that lets the compiler guarantee that those returned closures would be allocated efficiently, either on the stack or in a stack-disciplined pool allocator.

(My thanks to @Joe_Groff and @Michael_Gottesman for recent discussions on this.)

2 Likes

Are you saying that you still want to use this for pullback closures? I have been thinking about our discussions with respect to pullback closures. If you think about it at a high level, you are mapping control flow onto an enum and then using that enum as a state machine to drive execution. Would it be possible to have a special way of encoding a loop into an enum? Then you could recognize the loop structurally and allocate a fixed amount of memory outside the loop that is reused between loop iterations.

Fundamentally, using a bump-pointer allocator for that case reveals a representation problem that you are resolving by using (theoretically) infinite memory. We should see if we can represent the loop in the representation itself, so that we can understand the loop's scoping and reuse memory between iterations.

Any time something needs infinite memory due to a loop, it should give one pause. That is why I am pushing back so hard on this.

2 Likes

In a loop's pullback, each iteration depends on the last iteration's results, unless the variables in use were loop-invariant in the first place (in the original function). It is definitely not the case that memory can always be reused, because values captured by intermediate pullbacks have to be preserved to apply the chain rule of differentiation in the reverse order.

The current data structure (enum with payloads) used in the implementation is equivalent to what the traditional autodiff community calls a "tape", which stores not just a trace of the execution path but also necessary intermediate values produced by each iteration. Those intermediate values (captured in pullback closures in the case of Swift) need to be used in the reverse order to form a linear combination with cotangent vectors as required by math.
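A minimal sketch of such a tape as an indirect enum (names hypothetical): each node holds one iteration's pullback, which captures the intermediate values it needs, and the reverse pass walks the tape last-to-first, applying the chain rule in reverse order:

```swift
// Each tape node stores one iteration's pullback and a link to the
// previous iteration; the pullback's captures are the preserved
// intermediate values.
indirect enum Tape {
    case start
    case iteration(pullback: (Double) -> Double, previous: Tape)
}

// Reverse pass: walk from the last iteration back to the first.
func runPullbacks(_ tape: Tape, seed: Double) -> Double {
    var gradient = seed
    var node = tape
    while case let .iteration(pullback, previous) = node {
        gradient = pullback(gradient)
        node = previous
    }
    return gradient
}

// Example: three iterations of `x *= x` (i.e. x ↦ x⁸ overall). Each
// pullback must capture that iteration's input value; none of this
// memory can be reused across iterations.
var tape = Tape.start
var x = 2.0
for _ in 0..<3 {
    let captured = x  // intermediate value preserved by the pullback
    tape = .iteration(pullback: { dv in dv * 2 * captured }, previous: tape)
    x *= x
}
// d/dx of ((x²)²)² = 8·x⁷; at x = 2 this is 1024.
let derivative = runPullbacks(tape, seed: 1)
```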

4 Likes

I think the ownership system, when fully elaborated, may effectively give us noescape results. A yielded borrow of a non-copyable type is, for example, non-escaping. The problem is that closures are copyable. I haven't read up on the plans for making closures with mutable captures thread-safe, but one way would be to make them move-only. I bring this all up just to point out that there may be some interesting design synergies here that should be explored.

5 Likes

I agree with this. While it is interesting that you've reduced the cost of these in a specific case, we shouldn't have codegen for a core language feature rely on a heroic runtime implementation, and having compilation be fragile w.r.t. the number of loop iterations or the execution path seems unacceptable. We need guaranteed correctness out of core language features, and ideally a predictable execution cost model.

I think what you need here is a way in SIL to guarantee stack allocation of these closures. This will naturally solve the problem you're seeing here, because the closures in each loop iteration will get reused (since the alloc/dealloc stack is scoped to the loop body). It will also make everything else in autodiff cheaper, even more performant than a bump-pointer allocator, and it will enable secondary optimizations to clean up the IR a lot better, since many of the closures will be inlined and the alloc/dealloc stacks will be removed by existing optimizations.

-Chris

2 Likes

The Google Brain team just announced that Swift for TensorFlow is being sunset. The repos have been archived and its team members are being transitioned to new projects. :frowning_face:

@rxwei we need you and your team's work now more than ever :exclamation:

11 Likes

Oops, what happened?

The work definitely continues! We are currently working on differentiable programming's ABI stability and module stability, and fixing various bugs. We'd welcome contributors too — here are some project ideas about differentiable programming for GSoC: Swift.org - Project Ideas for GSoC 2021.

34 Likes

Why am I not surprised?

:slightly_frowning_face:

4 Likes