Differentiable programming for gradient-based machine learning

dabrahams · November 23, 2020, 8:01pm

This is why I like having @scanon around

OK, this is a good point. There are two possible approaches:

1. Don't use +/+=.
2. Say that the TangentVector type has to be different from the Differentiable type in cases like this.

#2 might still be the right answer, because surely there are manifolds where the normal matrix addition is the right semantics for move(along:). The implication is that you build wrappers for the underlying representation types.

A more principled approach 3) might be to have something like a Manifold protocol that has associated Vector and TangentVector types, and requires all the operations that work appropriately to that manifold. This is essentially a multi-type concept in the generic programming sense. That said, I haven't tried to build such things in Swift and I don't know how well they work out in practice. IMO making the right choice between 1, 2, and 3 is going to require some experimentation.
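A minimal sketch of what approach 3 could look like, to make the multi-type idea concrete. Everything here is hypothetical: the Manifold protocol, its move(_:by:) requirement, and the PlaneRotations example are illustrations, not names from the proposal, and Point stands in for what the post calls the Vector type.

```swift
// Hypothetical sketch: none of these names come from the proposal.
protocol Manifold {
    associatedtype Point
    associatedtype TangentVector: AdditiveArithmetic

    /// Returns the point reached by starting at `point` and moving by `vector`.
    static func move(_ point: Point, by vector: TangentVector) -> Point
}

/// Plane rotations, with each rotation stored as an angle in radians.
enum PlaneRotations: Manifold {
    typealias Point = Double
    typealias TangentVector = Double

    static func move(_ point: Double, by vector: Double) -> Double {
        // Wrap into [0, 2π) so that equal rotations share one representation.
        let turn = 2 * Double.pi
        let wrapped = (point + vector).truncatingRemainder(dividingBy: turn)
        return wrapped < 0 ? wrapped + turn : wrapped
    }
}

// Generic code names the manifold, and both associated types follow from it.
func step<M: Manifold>(on manifold: M.Type, from p: M.Point, by v: M.TangentVector) -> M.Point {
    M.move(p, by: v)
}

let quarterTurn = step(on: PlaneRotations.self, from: 0, by: Double.pi / 2)
let wrappedTurn = PlaneRotations.move(0, by: -Double.pi / 2)
```

The point and tangent types here happen to be the same underlying Double, but the protocol keeps their roles separate, which is the property the wrapper approach in #2 is also after.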

scanon · November 23, 2020, 8:12pm

The other thing that gives me some pause is that the move(along:) operation is not (in the general case) associative, which people usually expect an operation called + to be. Specifically:

(p + v) + u means "Move p along the vector v in the tangent bundle at p. This gives a point q; move q along the vector u in the tangent bundle at q." Note that v and u are not even in the same tangent bundle, so p + (v + u) wouldn't make any sense.*
p + (v + u) means "Add vectors v and u in the tangent bundle at p, then move p along that vector."

One way to fix this (and possibly help typechecking too?) would be to only provide +=, but not +. That has its own weirdness, however.

[*] You can kinda make sense of this via parallel transport, but you probably wouldn't spell what's going on there as just "+".
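To make the non-associativity concrete, here is a minimal sketch using a hypothetical Pose type (a planar position plus a heading) whose tangent vectors are expressed in the pose's own frame. Neither the type nor its moved(by:) method comes from the proposal, and the update rule is a simple retraction rather than a true exponential map. It is only meant to show that moving by v and then by u need not match moving once by v + u.

```swift
import Foundation

/// Hypothetical tangent vector: displacement in the pose's own frame plus a turn.
struct PoseTangent {
    var forward, left, turn: Double

    static func + (a: PoseTangent, b: PoseTangent) -> PoseTangent {
        PoseTangent(forward: a.forward + b.forward, left: a.left + b.left, turn: a.turn + b.turn)
    }
}

/// Hypothetical point on the manifold of planar poses.
struct Pose {
    var x, y, heading: Double

    /// Translates by the tangent's displacement, rotated into the world frame
    /// using the current heading, then applies the turn.
    func moved(by v: PoseTangent) -> Pose {
        let c = cos(heading), s = sin(heading)
        return Pose(x: x + c * v.forward - s * v.left,
                    y: y + s * v.forward + c * v.left,
                    heading: heading + v.turn)
    }
}

let p = Pose(x: 0, y: 0, heading: 0)
let v = PoseTangent(forward: 0, left: 0, turn: .pi / 2)  // quarter turn in place
let u = PoseTangent(forward: 1, left: 0, turn: 0)        // one unit straight ahead

// (p + v) + u: turn first, then drive forward in the new heading.
let sequential = p.moved(by: v).moved(by: u)
// p + (v + u): add the tangents at p, then move once.
let combined = p.moved(by: v + u)

print(sequential.x, sequential.y)
print(combined.x, combined.y)
```

Up to rounding, the sequential version ends one unit along the y axis while the combined version ends one unit along the x axis, because u is interpreted in a different frame once p has moved.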

dabrahams · November 23, 2020, 9:13pm

non-associativity seems like another good reason to avoid +=

IMO we should stop saying “move along” though. How would you phrase this naturally so that it accounts for the magnitude of the tangent vector?

dabrahams · November 23, 2020, 9:19pm

Dumb question time: doesn't it technically have to keep you on a linear approximation of the manifold at that point, and only actually keeps you on the manifold at the limit where the magnitude of the tangent vector is zero? Or have I just proven once again that I don't really understand the math?

scanon · November 23, 2020, 10:07pm

Yeah, I agree that a better name would be great.

This is actually a really good question, and gets precisely at the heart of the issue. I can write up a simple example that may explain it, but I'll see if I can find an already-existing explanation of what's happening that I can just point you at instead (right after I go pick up my toddler from day care and make dinner).

Troy_Harvey · November 24, 2020, 5:51am

First, we are excited to see this proposal to mainline Auto-diff. We've been using Swift's auto-diff in our embedded ML system frameworks for the last year. They have become core to our infrastructure at this point.

A few notes. We love to see a clear plan towards forward-mode support, and improvements to loop management. I'll let some of our team members chime in. But thanks to Richard and Dan for all your work.

scanon · November 24, 2020, 4:02pm

Ok, so:

Having failed to find a good explanation online that answers this without dragging in a lot of notational mumbo-jumbo, let's look at the simplest example that kinda illustrates what's going on: the manifold of rotations on R². The usual way to visualize this manifold is to embed it into the plane as the unit circle, but it's important to keep in mind that that's just an embedding. The manifold itself is the abstract space of rotations.

At each point on the circle, there's a tangent space, which is just a copy of the real line. When we draw a picture of the embedding in the plane, we usually draw a line tangent to the unit circle at each point, but again, that's just an embedding. The actual tangent bundle is the abstract collection of all these copies of the real line, one for each point on the manifold.

Pick a point p on the manifold; the tangent space at p is a copy of the real line. Pick a vector from that tangent space, v. In the embedding into the plane, the operation that Richard calls p.move(along: v) is the operation of moving p around the circle by distance v. It does not move you off the circle along the tangent line (because, by definition, it has to stay on the manifold--the manifold is an abstract object that exists independent of the embedding into the plane; there is no "off the manifold" to move onto. The manifold is the whole space).
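A small sketch of the paragraph above, assuming a hypothetical PlaneRotation type that stores the rotation as an angle. The type and its members are illustrative only. The code compares a step taken along the tangent line drawn in the plane with the move(along:) operation on the manifold itself.

```swift
import Foundation

/// A plane rotation stored as an angle in radians (hypothetical type).
struct PlaneRotation {
    var angle: Double

    /// The embedding into the plane as a point on the unit circle.
    var embedded: (x: Double, y: Double) { (cos(angle), sin(angle)) }

    /// Moves around the circle by the signed arc length `v`.
    mutating func move(along v: Double) { angle += v }
}

func radius(_ point: (x: Double, y: Double)) -> Double {
    (point.x * point.x + point.y * point.y).squareRoot()
}

var p = PlaneRotation(angle: 0.3)
let v = 2.0

// Stepping along the drawn tangent line is an operation on the embedding,
// and it lands off the unit circle: the radius becomes sqrt(1 + v * v).
let start = p.embedded
let tangentLineStep = (x: start.x - v * sin(p.angle), y: start.y + v * cos(p.angle))
let offRadius = radius(tangentLineStep)

// move(along:) acts on the rotation itself, so the result is still a rotation
// and its embedding still has radius 1.
p.move(along: v)
let onRadius = radius(p.embedded)
```

The first computation only exists because of the embedding; the manifold operation never needs the plane at all.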

This is a very brief overview of a very large subject, so at some point we'll run up against the limits of what makes sense to communicate in the forums, but does that kind of make sense?

dabrahams · November 24, 2020, 4:11pm

Absolutely; great explanation, thanks!

porterchild · November 24, 2020, 4:36pm

So happy this is moving forward, amazing work!

I'm excited about the inherent promise of new potential that comes with adding a whole class of functionality to the language that is "just math". There are so many applications just waiting to happen! Image processing, physics simulation, graphics, animation, "normal" programming (I love the podcast playback speed example above). From what I'm told, math is pretty general.

I'm excited to see autodiff continue to develop in terms of language integration (there are still some sharp edges). The recent addition of optional differentiability has been much appreciated!

The initial driving application for autodiff was deep learning. It seems evident from swift-models (Thanks Swift for TensorFlow Team!) that autodiff is becoming quite mature for those purposes. Personally, I would like to see speed get better in non-deep-learning situations. I'm using autodiff for optimization of physics simulations (with ecstasy-inducing efficiency over derivative-free optimizers when in high-dimensional spaces btw). For my part, I'd like to see autodiff speed improve for code with tight loops and lightweight operands - the exact opposite of deep learning. Autodiff speed for this kind of code is currently pretty bad.

Thanks for ushering us out of the weird timeline where programming had a gaping hole where derivatives ought to have been! Now that I've been using first-class autodiff for a while, to give it up would feel like giving up the modulo operator or something :)

scanon · November 24, 2020, 4:36pm

The thing that I'm glossing over a little bit is that in the plane-rotation example, it's pretty obvious how to carry the tangent space with you as you move around the circle, so that you can easily make sense of "p + u + v". In the most general setting, it's much less obvious how that works; there may be multiple paths between two points, and even the handedness of the tangent coordinate frame may not be preserved if you try to carry it along the paths (e.g. on a Mobius strip, if I travel from p back to p by going around the strip, the tangent coordinate frame would be flipped).

clarkdobson · November 24, 2020, 5:36pm

The proposal looks great! I'd just like to add my support for this work, and thank the team for the fantastic development so far.

As yet another user developing with Swift for optimization outside standard deep learning applications, I'd also very much like to see speed improvements for derivatives and gradients, especially reverse-mode gradients of functions involving loops. Substantial improvements in this direction would make some really exciting applications feasible that would be difficult to pull off in any existing frameworks.

bartchr808 · November 24, 2020, 8:23pm

Also fully agree with along specifically not being the right word here. I was thinking of alternatives and I thought of the following.

At first, I thought of possibly falling back to the proper mathematical definition and calling it what it is: an exponential map. For example, maybe calling the function exponentialMap(_:). But this didn't really seem Swifty.

I tried taking some inspiration from offset(by:) and renaming it to translate(by:), but that has its own issues WRT what it means in differential geometry (see translation surfaces).

But then I thought of simply changing move(along:) to move(by:). Perusing other APIs like index(_:offsetBy:) and offsetBy(dx:dy:), this new name follows a similar trend of using a scalar to specify how much you want to shift something by. And when we think of AutoDiff in 1D spaces like Float, then this definition/analogy still holds. However with AutoDiff, we are expanding to N-D spaces and many other more complex spaces, but at least when we think of tangent vectors and Cartesian coordinate systems, having something like myVector.move(by: tangentVector) seems okay, and the same should hold for other complex manifolds.

Given all this, I still prefer the ring move(along:) has, and haven't found an ideal replacement!

dabrahams · November 24, 2020, 10:17pm

Not to pick on you—to a first approximation everybody uses that expression—but I am not a fan of using the word “Swifty” in design discussions. It usually means, “I have a sense this is inconsistent with other things in Swift, probably even with something written in the API guidelines, but I haven't figured out what,” and it ends up being a way of getting off the hook for thinking rigorously about how/why things should be named. Very often the statement is indistinguishable from someone expressing a personal preference.

In this case I'd say your instincts are right: “exponential map” is needlessly technical, which is covered by “Avoid obscure terms.”

> I tried taking some inspiration from offset(by:) and renaming it to translate(by:), but that has its own issues WRT what it means in differential geometry (see translation surfaces).
>
> But then I thought of simply changing move(along:) to move(by:). Perusing other APIs like index(_:offsetBy:) and offsetBy(dx:dy:), this new name follows a similar trend of using a scalar to specify how much you want to shift something by. And when we think of AutoDiff in 1D spaces like Float, then this definition/analogy still holds. However with AutoDiff, we are expanding to N-D spaces and many other more complex spaces, but at least when we think of tangent vectors and Cartesian coordinate systems, having something like myVector.move(by: tangentVector) seems okay, and the same should hold for other complex manifolds.

Yes, if we were going to stick with something that means “move” or “translate,” I agree that by: is the right preposition. I'm not convinced move is the best base word.

scanon · November 25, 2020, 3:18am

Another way to think about it is that we're moving the base point along a straight line on the manifold with velocity given by the vector for unit time. So perhaps something along the lines of move(velocity:) or similar could work.
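The "velocity for unit time" reading can be sketched on the unit sphere, where straight lines on the manifold are great circles. This is an illustration under stated assumptions: Vec3 and geodesic(from:velocity:time:) are hypothetical helpers, the point is assumed to have unit length, and the velocity is assumed to be orthogonal to it. Evaluating the path at time 1 gives the moved point.

```swift
import Foundation

/// Minimal 3-vector for the embedding of the sphere in space (hypothetical helper).
struct Vec3 {
    var x, y, z: Double

    static func + (a: Vec3, b: Vec3) -> Vec3 { Vec3(x: a.x + b.x, y: a.y + b.y, z: a.z + b.z) }
    static func * (s: Double, a: Vec3) -> Vec3 { Vec3(x: s * a.x, y: s * a.y, z: s * a.z) }
    var length: Double { (x * x + y * y + z * z).squareRoot() }
}

/// Point reached after travelling for `time` along the great circle that leaves
/// `p` with initial velocity `v`. Assumes `p` has unit length and `v` is orthogonal to `p`.
func geodesic(from p: Vec3, velocity v: Vec3, time: Double) -> Vec3 {
    let speed = v.length
    if speed == 0 { return p }          // zero velocity: stay where we are
    let angle = speed * time            // arc length covered on the unit sphere
    return cos(angle) * p + (sin(angle) / speed) * v
}

let northPole = Vec3(x: 0, y: 0, z: 1)
let v = Vec3(x: Double.pi / 2, y: 0, z: 0)   // tangent at the pole, pointing along x

let halfway = geodesic(from: northPole, velocity: v, time: 0.5)
let moved = geodesic(from: northPole, velocity: v, time: 1)   // the "move" result
```

The magnitude of v sets how far the point travels and its direction sets which great circle it follows, so the same vector carries both the amount and the direction of movement.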

dabrahams · November 28, 2020, 3:51am

“velocity” comes with the implication that time is somehow involved. This vector really is the amount and direction of movement, right?

Chris_Lattner3 · November 28, 2020, 4:08am

I'm really excited to see this proposal making progress -- congratulations. I also am happy to see this cut down a bit to make the first step more reasonable. You've all obviously put a lot of thought and consideration into this, but here are a few thoughts and questions:

I don't see the @memberwise Differentiable concept explored much, why a new attribute and what does it mean? The existing codable and equatable autosynthesized conformances are also memberwise and don't use an attribute, why is something new required here? The only mention implies it has something to do with the TangentVector synthesis, but I don't understand why it is required - why not synthesize it when absent like codable and equatable do for their requirements?

Is += differentiable? if so, the first Perceptron example can use it in the loss calculation. If not, why not?

Random annoyance but the term "differential operators" is really weird in Swift since these are methods, not operators in the Swift sense. I'd love to find another word to describe these things.

Should Differentiable require AdditiveArithmetic? This would make generics code simpler (e.g. the definition of TangentVector in several places, Point and multiple examples in this section) and it isn't clear to me if anything of merit is Differentiable without being additive. It might simplify the system overall, as well as its description in the document.

... oh, this breaks Array I guess? Yuck. Maybe we should make the base protocol be "DifferentiableButNotNecessarilyAdditive" and make "Differentiable" be that plus Additive, since the combination seems like vastly the most common case. Maybe this isn't the right way to go, but it seems like we need a single name for the combination of these two protocols.
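A rough sketch of the two-protocol layering suggested above. These are hypothetical stand-ins declared from scratch, not the protocol from the Differentiation module, and AdditiveDifferentiable is a placeholder for the combined name (the suggestion above would call it Differentiable).

```swift
// Hypothetical stand-ins; the real protocol lives in the Differentiation module.
protocol DifferentiableButNotNecessarilyAdditive {
    associatedtype TangentVector: AdditiveArithmetic
    mutating func move(along direction: TangentVector)
}

// The common case: differentiable and additive under a single name.
protocol AdditiveDifferentiable: DifferentiableButNotNecessarilyAdditive, AdditiveArithmetic {}

extension Double: AdditiveDifferentiable {
    typealias TangentVector = Double
    mutating func move(along direction: Double) { self += direction }
}

// Generic code needs only one constraint to get both capabilities.
func nudged<T: AdditiveDifferentiable>(_ value: T, along direction: T.TangentVector) -> T {
    var copy = value + .zero        // uses the AdditiveArithmetic requirement
    copy.move(along: direction)     // uses the differentiable requirement
    return copy
}

let result = nudged(1.5, along: 0.25)
```

A type like Array would conform only to the base protocol in this layering, since it has no single zero value of its own.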

I'm used to "move(along:)" but perhaps offset(by:) is worth considering. I agree with others in the thread that += would be the wrong thing here given the asymmetric nature of the operation.

Why isn't zeroTangentVectorInitializer a func requirement? The behavior definition is a bit weird so it must be intentional. Please capture the rationale in the proposal. I'm curious to know what the efficiency implications of this are vs a standard func requirement. In the absence of inlining, is the closure allocated each time the getter is called and has its lifetime managed by ARC?
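A possible motivation, offered as an assumption rather than as the proposal's stated rationale, is that a closure can capture just the shape information needed to build a zero tangent vector later, without keeping the whole value alive. A minimal sketch with a hypothetical Weights type, where only the property name comes from the discussion:

```swift
// Hypothetical model type; it does not conform to any real protocol here.
struct Weights {
    var values: [Double]

    /// Returns a closure that builds a zero tangent vector of the right shape.
    var zeroTangentVectorInitializer: () -> [Double] {
        let count = values.count   // capture the shape only, not `values` itself
        return { [Double](repeating: 0, count: count) }
    }
}

let makeZero = Weights(values: [1, 2, 3]).zeroTangentVectorInitializer
let zero = makeZero()   // built later, after the original value is gone
```

Whether this beats a plain func requirement in allocation or ARC cost is exactly the open question raised above; the sketch only shows the capture pattern.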

The "not required but warned about" behavior of the @noDerivative attribute makes sense to me. The name still isn't great - is there any way to turn this into a word with a positive sense, e.g. @stationary, @discrete or something like that? I believe that our prior art for negative things is the "non" prefix, and @nondifferentiated is weird.

I think the second example in this section has a minor bug, the input parameter isn't declared in Layer but is referred to by the @differentiable attribute.

I like the integration of the @transpose attribute to make @differentiable(linear) functions. I think that the (already extensive) background above could mention linear functions and why they are an important special case to model in the function type system. Dropping the distinction between linear types and other differentiable functions would simplify things, so there should be clear rationale for their inclusion.

I think that func _ should be a separate proposal, I'd recommend using underscore prefixed names (like _foo) in this proposal to avoid distraction.

Trivial typo in this section: you're missing a ) in the "Complex differentiation is representable" paragraph.

The @noDerivative attribute on function parameters is a bit weird to me, it is more like a @notWRT attribute or something. I'm not sure what the right name is here though.

What evolution limitations will be caused by taking this proposal, e.g. without higher order differentiation? If the differential operators will have to change, does that cause ABI or other problems? Will these be defined as "always inline into client" in the meantime or something to mitigate any of these problems?

I'd recommend moving the "future directions" section after the "effect on ABI/API" sections since they apply to the base proposal, not the future directions.

Overall, I'm very excited to see the years of work on this coming together!

-Chris

rxwei · November 28, 2020, 4:24am

Thanks for the comments, @Chris_Lattner3! It looks like your comments (e.g. @memberwise, func _, linear functions, etc) and links to sections are based on the manifesto, not this proposal. Most of these things have already been addressed in the current proposal -- here's a list of all the changes from the manifesto. The link to the current proposal is swift-evolution/0000-differentiable-programming.md at autodiff · rxwei/swift-evolution · GitHub.

Chris_Lattner3 · November 28, 2020, 4:48am

Aha, my mistake, sigh. I'll give it another look. Thanks.

Chris_Lattner3 · November 28, 2020, 6:19am

Second try, commenting about the proposal instead of the manifesto (many points are common), here are some detail questions:

Is += differentiable? if so, the first Perceptron example can use it in the loss calculation. If not, why not? Are inout functions supported?
The term "differential operators" is really weird in Swift since these are methods, not operators in the Swift sense. I'd love to find another word to describe these things.
Should Differentiable require AdditiveArithmetic? This would make generics code simpler (e.g. the definition of TangentVector in several places, Point and multiple examples in this section) and it isn't clear to me if anything of merit is Differentiable without being additive. It might simplify the system overall, as well as its description in the document.

... oh, this breaks Array I guess? Yuck. Maybe we should make the base protocol be "DifferentiableButNotNecessarilyAdditive" and make "Differentiable" be that plus Additive, since the combination seems like vastly the most common case. Maybe this isn't the right way to go, but it seems like we need a single name for the combination of these two protocols.
I'm used to "move(along:)" but perhaps offset(by:) is worth considering. I agree with others in the thread that += would be the wrong thing here given the asymmetric nature of the operation.
Why isn't zeroTangentVectorInitializer a func requirement? The behavior definition is a bit weird so it must be intentional. Please capture the rationale in the proposal. I'm curious to know what the efficiency implications of this are vs a standard func requirement. In the absence of inlining, is the closure allocated each time the getter is called and has its lifetime managed by ARC?
The "not required but warned about" behavior of the @noDerivative attribute makes sense to me. The name still isn't great - is there any way to turn this into a word with a positive sense, e.g. @stationary, @discrete or something like that? I believe that our prior art for negative things is the "non" prefix, and @nondifferentiated is weird.
The @noDerivative attribute on function parameters is a bit weird to me, it is more like a @notWRT attribute or something. I'm not sure what the right name is here though.
What evolution limitations will be caused by taking this proposal, e.g. without higher order differentiation? If the differential operators will have to change, does that cause ABI or other problems? Will these be defined as "always inline into client" in the meantime or something to mitigate any of these problems?
What are the ABI guarantees of the Differentiation module? This proposal doesn't provide transpose/linear and other things that are pretty core to the future of the feature, how will those fit in?

Finally, I think there is a much bigger question here: this proposal is introducing a new @differentiable(reverse) feature that is different than the basic @differentiable feature and not aligned with the manifesto. I appreciate the goal of subsetting and simplifying the proposal to make incremental progress here, but this doesn't appear to be a subset - it is a different direction. Taking this proposal and then further baking out the rest of the model seems like it will lead to apparent redundancy.

I'm not sure what the right answer here is, but I think we need to pick from one of these options:

If this is really the ultimate fate of swift autodiff and we will never get the bigger goals, then you should own it and just drop the "reverse" word, calling this @differentiable.
If this is part of a coherent plan, then I think it makes sense to revise the manifesto to show how the bigger picture fits coherently with this as a base proposal.

Do you see this concern as well? Am I missing something here?

-Chris

scanon · November 29, 2020, 5:56pm

Yeah. Are we overthinking this? Should it simply be move(by: v) or move(v)?