Thank you for raising this question. There are a few ways to optimize closure allocations in differentiation, and I have been considering different approaches. Bump-pointer allocation for all pullback closures is possible, but it would be heavyweight and ad hoc if it were applied only to AD's pullback closures.
Pullbacks returned by reverse-mode derivative functions are "non-escaping result closures". In all compiler-generated code, these closures never escape the caller unless they are captured by a parent non-escaping result closure. Pullback closures capture inner pullback closures recursively, almost never escape the call site of the function that returned them (e.g. a user's call site of a derivative function), and follow a strict stack allocation/deallocation discipline. This property makes them easy to optimize, because you can apply a continuation transform and turn all closure allocations into stack allocations. Swift lacks a formal syntax and semantics for noescape results, but here's an example (assuming that we could use @noescape for results):
func innerDerivative1(x: T) -> (value: T, pullback: @noescape (T.TangentVector) -> T.TangentVector) {
    ...
}
func innerDerivative2(x: T) -> (value: T, pullback: @noescape (T.TangentVector) -> T.TangentVector) {
    ...
}
func outerDerivative(x: T) -> (value: T, pullback: @noescape (T.TangentVector) -> T.TangentVector) {
    let (v1, pb1) = innerDerivative1(x: x) // allocation of pb1
    let (v2, pb2) = innerDerivative2(x: v1) // allocation of pb2
    // In the returned pullback below:
    // `pb1` and `pb2` captured by parent @noescape result closure
    return (value: v2, pullback: { dv2 in
        let dv1 = pb2(dv2) // deallocation of pb2
        let dx = pb1(dv1) // deallocation of pb1
        return dx
    })
}
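To make the shape of this code concrete, here is a minimal runnable sketch in today's Swift, assuming `T` and `T.TangentVector` are both `Double`. The names `squareDerivative`, `affineDerivative`, and `composedDerivative` are made up for illustration. Since `@noescape` results do not exist, the returned closures here are ordinary escaping closures, so this only shows the capture and call order, not the proposed allocation behavior.

```swift
// Illustrative only: T == Double and TangentVector == Double are assumptions.
typealias Pullback = (Double) -> Double

// f1(x) = x * x, so the pullback scales the incoming cotangent by 2x.
func squareDerivative(x: Double) -> (value: Double, pullback: Pullback) {
    return (value: x * x, pullback: { dy in 2 * x * dy })
}

// f2(x) = 3x + 1, so the pullback scales the incoming cotangent by 3.
func affineDerivative(x: Double) -> (value: Double, pullback: Pullback) {
    return (value: 3 * x + 1, pullback: { dy in 3 * dy })
}

// f2(f1(x)): the outer pullback captures both inner pullbacks and
// calls them in the reverse of the order in which they were created.
func composedDerivative(x: Double) -> (value: Double, pullback: Pullback) {
    let (v1, pb1) = squareDerivative(x: x)   // pb1 created first
    let (v2, pb2) = affineDerivative(x: v1)  // pb2 created second
    return (value: v2, pullback: { dv2 in
        let dv1 = pb2(dv2)                   // pb2 used first
        return pb1(dv1)                      // pb1 used last
    })
}

let result = composedDerivative(x: 2)
let gradient = result.pullback(1)
print(result.value, gradient)  // 3 * 2^2 + 1 = 13.0 and 6 * 2 = 12.0
```

The last-created, first-used order of `pb1` and `pb2` is the stack discipline that the rest of this post relies on.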
If @noescape were available in the user syntax, it would allow us to define a CPS-based ABI for functions with @noescape result closures. The above functions would then be compiled to the following (this would be done in SIL, but here's the equivalent in Swift):
// Swift equivalent of the lowered version of the code above (take 1)
func innerDerivative1(x: T, pullbackContinuation: ((T.TangentVector) -> T.TangentVector) -> Void) -> T
func innerDerivative2(x: T, pullbackContinuation: ((T.TangentVector) -> T.TangentVector) -> Void) -> T
func outerDerivative(x: T, pullbackContinuation: ((T.TangentVector) -> T.TangentVector) -> Void) -> T {
    let v1 = innerDerivative1(x: x) { pb1 in
        let v2 = innerDerivative2(x: v1) { pb2 in
            pullbackContinuation { v in
                let dv1 = pb2(v)
                let dx = pb1(dv1)
                return dx
            }
        }
    }
    // (omitted part: return v2 to outer scope.)
}
This approach lets all pullback closures be stack-allocated, which resolves the main performance problem, and it can be extended to other library APIs that frequently return closures satisfying this property. Although stack-allocated pullbacks are possible for AD, pullback closures created in differentiated loops could benefit from a stack-disciplined bump-pointer allocator instead, because we don't want loop derivatives to overflow the stack.
Having said that, I think doing this would be a huge amount of compiler work, and there may be a lightweight approach that achieves similar performance. Perhaps we can instead define the ABI of functions with @noescape result closures as taking an additional argument that represents a stack-disciplined allocator (which already exists in the runtime), so the compiler can simply compile the creation of @noescape result closures into code that allocates them with that allocator. Such an ABI would be easy to adopt in the differentiation transform as well, since all we need to do is use the contextual allocator to allocate pullback closures in our generated derivative functions.
// Swift equivalent of the lowered version of the code above (take 2)
func innerDerivative1(x: T, stackAllocator: StackAllocator) -> (value: T, pullback: (T.TangentVector) -> T.TangentVector) {
    ...
}
func innerDerivative2(x: T, stackAllocator: StackAllocator) -> (value: T, pullback: (T.TangentVector) -> T.TangentVector) {
    ...
}
func outerDerivative(x: T, stackAllocator: StackAllocator) -> (value: T, pullback: (T.TangentVector) -> T.TangentVector) {
    let (v1, pb1) = innerDerivative1(x: x, stackAllocator: stackAllocator) // allocation of pb1
    let (v2, pb2) = innerDerivative2(x: v1, stackAllocator: stackAllocator) // allocation of pb2
    // In the returned pullback below:
    // `pb1` and `pb2` captured by parent @noescape result closure
    return (value: v2, pullback: { dv2 in
        let dv1 = pb2(dv2) // deallocation of pb2
        let dx = pb1(dv1) // deallocation of pb1
        return dx
    })
}
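For intuition about what a stack-disciplined allocator buys in a differentiated loop, here is a rough sketch. Everything in it is hypothetical: this `StackAllocator` class, its `push`/`pop` methods, and `repeatedSquare` are illustrative names, not the runtime's allocator or its API. It also stores each iteration's pullback context (just the input value) directly rather than a real closure, and rounds every slot up to 16 bytes to keep the alignment logic trivial.

```swift
// Hypothetical LIFO bump allocator over a single heap slab.
// Assumes every pushed value is popped before the allocator is destroyed.
final class StackAllocator {
    private static let slotAlignment = 16
    private let base: UnsafeMutableRawPointer
    private let capacity: Int
    private var offset = 0

    var isEmpty: Bool { return offset == 0 }

    init(capacity: Int) {
        self.capacity = capacity
        self.base = UnsafeMutableRawPointer.allocate(
            byteCount: capacity, alignment: StackAllocator.slotAlignment)
    }

    deinit { base.deallocate() }

    // Size of T rounded up to a whole number of 16-byte slots.
    private static func slotSize<T>(for type: T.Type) -> Int {
        let a = slotAlignment
        return (MemoryLayout<T>.size + a - 1) / a * a
    }

    // Allocation is a bounds check plus a pointer bump.
    func push<T>(_ value: T) -> Bool {
        precondition(MemoryLayout<T>.alignment <= StackAllocator.slotAlignment)
        let size = StackAllocator.slotSize(for: T.self)
        guard offset + size <= capacity else { return false }
        (base + offset).bindMemory(to: T.self, capacity: 1).initialize(to: value)
        offset += size
        return true
    }

    // Deallocation must mirror allocation order exactly (last in, first out).
    func pop<T>(_ type: T.Type) -> T {
        let size = StackAllocator.slotSize(for: T.self)
        precondition(offset >= size, "pop without a matching push")
        offset -= size
        return (base + offset).assumingMemoryBound(to: T.self).move()
    }
}

// Value and derivative of applying x -> x * x `iterations` times.
// Returns nil (after unwinding) if the slab runs out of space.
func repeatedSquare(x: Double, iterations: Int,
                    allocator: StackAllocator) -> (value: Double, gradient: Double)? {
    var value = x
    // Forward pass: save the context each iteration's pullback needs.
    for pushed in 0..<iterations {
        guard allocator.push(value) else {
            for _ in 0..<pushed { _ = allocator.pop(Double.self) }
            return nil
        }
        value = value * value
    }
    // Reverse pass: consume the contexts in the opposite order.
    var gradient = 1.0
    for _ in 0..<iterations {
        let input = allocator.pop(Double.self)
        gradient *= 2 * input
    }
    return (value: value, gradient: gradient)
}

let allocator = StackAllocator(capacity: 1024)
let loopResult = repeatedSquare(x: 2, iterations: 3, allocator: allocator)
// x^8 at x = 2: value 256.0, gradient 8 * 2^7 = 1024.0
```

The loop's memory use grows with the iteration count but lives in the slab rather than on the call stack, and running out of space is a recoverable condition instead of a stack overflow.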
Overall, I believe that defining derivative functions as returning pullback closures has been the right mathematical abstraction all along. The key missing part for performance is marking them as returning non-escaping result closures: we need an attribute and ABI (e.g. @noescape) that lets the compiler guarantee that those returned closures are allocated efficiently, either on the stack or in a stack-disciplined pool allocator.
(My thanks to @Joe_Groff and @Michael_Gottesman for recent discussions on this.)