Differentiable programming for gradient-based machine learning

dabrahams · December 15, 2020, 7:17pm

Aside from trivial things like the name move, I should talk about what I see as some of the more substantial weaknesses with the autodiff feature as currently proposed, since I've used it pretty extensively in SwiftFusion.

First, I have found the ergonomics of the system really painful. In particular, it has been incredibly hard to build new Differentiable types, even when composing other types that are Differentiable. As a very experienced generic programmer with a non-wizard-but-stronger-than-most math background, I would have expected to have a small learning curve, but it's not turning out that way. Some of this surely comes down to diagnostic QOI, but I think that's only a small part of the issue.

One of the normal strategies is to build up conformances protocol-by-inherited-protocol (e.g. conform to BidirectionalCollection by starting with Sequence conformance, adding Collection conformance, and finally adding conformance to BidirectionalCollection), getting it to compile at each step. That strategy breaks down badly for me here, which is especially bad because the protocol refinement hierarchy required for a Differentiable type's TangentVector is quite deep.
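For readers who haven't used that strategy, here is a minimal sketch of it with standard library protocols only. The `Countdown` type is hypothetical and exists purely for illustration. The point is that each extension compiles and can be tested before the next one is written.

```swift
// Hypothetical example type: the values start, start - 1, ..., 1.
// Assumes `start` is non-negative.
struct Countdown {
    let start: Int
}

// Step 1: Sequence conformance alone. This compiles and is usable as-is.
extension Countdown: Sequence {
    struct Iterator: IteratorProtocol {
        var remaining: Int
        mutating func next() -> Int? {
            guard remaining > 0 else { return nil }
            defer { remaining -= 1 }
            return remaining
        }
    }
    func makeIterator() -> Iterator { Iterator(remaining: start) }
}

// Step 2: add Collection on top of the working Sequence conformance.
extension Countdown: Collection {
    var startIndex: Int { 0 }
    var endIndex: Int { start }
    func index(after i: Int) -> Int { i + 1 }
    subscript(position: Int) -> Int { start - position }
}

// Step 3: add BidirectionalCollection, which needs only one more requirement.
extension Countdown: BidirectionalCollection {
    func index(before i: Int) -> Int { i - 1 }
}

print(Array(Countdown(start: 3)))            // prints "[3, 2, 1]"
print(Array(Countdown(start: 3).reversed())) // prints "[1, 2, 3]"
```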

Maybe part of the problem is that you ultimately always end up with a generic TangentVector type that itself has to be Differentiable, with Self as its own TangentVector type.

    associatedtype TangentVector: Differentiable & AdditiveArithmetic
        where TangentVector == TangentVector.TangentVector

As a result, you can't make anything beyond AdditiveArithmetic conformance work without standing up the whole system of conformances for the TangentVector type. Maybe part of the issue is the way Differentiable adds new @differentiable constraints to the AdditiveArithmetic requirements. I'm not 100% sure. This is the sense I have, but maybe something more is at play here. @rxwei It might be instructive for you to look at the TypeKeyedArrayBuffers type which I've struggled for weeks to make Differentiable and see what you run into.
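To make the shape of that requirement concrete, here is a rough sketch that uses a stand-in protocol instead of the real one, so it compiles without the differentiation module. `DifferentiableSketch`, `move(by:)`, `Point2`, and `Point2Tangent` are all names made up for this example, and the stand-in leaves out the @differentiable constraints and zeroTangentVectorInitializer entirely. Even in this reduced form, the tangent type has to be AdditiveArithmetic and conform to the protocol itself, with itself as its own tangent, before the outer type can conform.

```swift
// Stand-in for the real protocol; only the recursive associated type
// constraint is mirrored here. Everything else is an assumption.
protocol DifferentiableSketch {
    associatedtype TangentVector: DifferentiableSketch & AdditiveArithmetic
        where TangentVector == TangentVector.TangentVector
    mutating func move(by offset: TangentVector)
}

// Hypothetical value type we want to make "differentiable".
struct Point2 {
    var x, y: Double
}

// Its tangent type must first be AdditiveArithmetic...
struct Point2Tangent: AdditiveArithmetic {
    var dx, dy: Double
    static var zero: Point2Tangent { Point2Tangent(dx: 0, dy: 0) }
    static func + (a: Point2Tangent, b: Point2Tangent) -> Point2Tangent {
        Point2Tangent(dx: a.dx + b.dx, dy: a.dy + b.dy)
    }
    static func - (a: Point2Tangent, b: Point2Tangent) -> Point2Tangent {
        Point2Tangent(dx: a.dx - b.dx, dy: a.dy - b.dy)
    }
}

// ...and must itself conform, with itself as its own tangent type.
extension Point2Tangent: DifferentiableSketch {
    typealias TangentVector = Point2Tangent
    mutating func move(by offset: Point2Tangent) { self = self + offset }
}

// Only now can the original type conform.
extension Point2: DifferentiableSketch {
    typealias TangentVector = Point2Tangent
    mutating func move(by offset: Point2Tangent) {
        x += offset.dx
        y += offset.dy
    }
}
```

With a flat struct of scalars this is manageable. The trouble described above shows up when the tangent type is itself generic or composed of other tangent types.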

I have a hunch that my idea of using a Manifold protocol might help with this part of the ergonomics somewhat, but that really is a wild guess.

My second major concern centers around the handling of zero, which turns out to have an incredibly important role in differentiable programming. zeroTangentVectorInitializer is awkward, but that's not the biggest problem. First, it seems to be based on the premise that you can only come up with a zero vector for a reshapable type (e.g. Array) if you know its shape, which I'm not sure is true. I've had some success creating universal zero values that are compatible with all shapes. This not only makes it possible to use a cleaner API (like the one from AdditiveArithmetic), but these universal zeros also tend to be very efficient because they don't require any dynamic storage. Second, because zeroTangentVectorInitializer is a closure, if you have a tensor with a “ragged” shape like the top-level data structure of SwiftFusion, you usually end up capturing a fairly heavyweight value in that closure to reconstruct the right zero, which I imagine is really hard to optimize. Lastly, if you buy into the premise of zeroTangentVectorInitializer that you can't build a zero without an instance, you still have the static zero from AdditiveArithmetic, in which you have to unconditionally fatalError, which makes the AdditiveArithmetic conformance a lie.
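As a rough illustration of the universal-zero idea, here is one way it could look for an array-shaped tangent. This is a sketch, not the SwiftFusion implementation: `ArrayTangent` and its case names are hypothetical, and treating the universal zero as equal to any all-zero array is a design choice made for this example.

```swift
// Hypothetical tangent type for an array-like value with a dynamic shape.
enum ArrayTangent: AdditiveArithmetic {
    /// A zero that is compatible with every shape and needs no storage.
    case universalZero
    /// A tangent with a concrete shape.
    case elements([Double])

    // The static zero from AdditiveArithmetic is a real value, not a fatalError.
    static var zero: ArrayTangent { .universalZero }

    static func + (lhs: ArrayTangent, rhs: ArrayTangent) -> ArrayTangent {
        switch (lhs, rhs) {
        case (.universalZero, let x), (let x, .universalZero):
            return x
        case let (.elements(a), .elements(b)):
            precondition(a.count == b.count, "shape mismatch")
            return .elements(zip(a, b).map { $0 + $1 })
        }
    }

    static func - (lhs: ArrayTangent, rhs: ArrayTangent) -> ArrayTangent {
        switch (lhs, rhs) {
        case (let x, .universalZero):
            return x
        case (.universalZero, .elements(let b)):
            return .elements(b.map { -$0 })
        case let (.elements(a), .elements(b)):
            precondition(a.count == b.count, "shape mismatch")
            return .elements(zip(a, b).map { $0 - $1 })
        }
    }

    // Assumption for this sketch: the universal zero compares equal to
    // any concrete tangent whose elements are all zero.
    static func == (lhs: ArrayTangent, rhs: ArrayTangent) -> Bool {
        switch (lhs, rhs) {
        case (.universalZero, .universalZero):
            return true
        case (.universalZero, .elements(let a)), (.elements(let a), .universalZero):
            return a.allSatisfy { $0 == 0 }
        case let (.elements(a), .elements(b)):
            return a == b
        }
    }
}
```

Nothing here captures an instance or a shape in a closure, and adding the zero to a concrete tangent returns that tangent unchanged without allocating.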

IMO the zero-handling part of the proposal is truly a mess and in no shape to be locked down until it's sorted out. However, the ergonomics of creating a differentiable type need some careful attention, too. If a generic programming expert like me can't create a new differentiable data structure, pity the poor ML researcher who needs to do it.

Thanks for your time,
Dave

/cc @saeta @dan-zheng