Differentiable programming for gradient-based machine learning

While "tangent" is short and is obvious to differentiation users, its meaning may not be obvious to people who don't use differentiable programming. Most standard library numeric types will conform to Differentiable, so I think it's best if the name is self-documenting and if it can be quickly disambiguated by doing a simple web search. When you search for "tangent", the definition that pops up is "tangent line" in geometry. But when you search for "tangent vector", what shows up is a very accurate definition of what it means in Swift differentiable programming:

In mathematics, a tangent vector is a vector that is tangent to a curve or surface at a given point. Tangent vectors are described in the differential geometry of curves in the context of curves in Rⁿ. More generally, tangent vectors are elements of a tangent space of a differentiable manifold. Wikipedia

Reverse-mode AD's derivatives produce pullbacks. Forward-mode AD's derivatives produce differentials. They are mathematically transposes of each other but are produced by different compiler implementations. Currently, only reverse-mode differentiation is stably implemented in the compiler. While we believe differentiation would be most complete with both modes (ideally unified), forward mode will require a significant amount of engineering, and its use cases are not nearly as common as gradient-based machine learning. With this proposal we are hoping to enable Swift to deliver a good experience for ML use cases, in a way that's forward-compatible with more general abstractions (@differentiable functions, for example).

At the type level, they are very different because they have different ABIs. A reverse-mode differentiable function's ABI is a tuple of the original function and a derivative function that produces a pullback ((R'...) -> (T'...)):

original: (T...) -> (R...)
derivative: (T...) -> ((R...), (R'...) -> (T'...))

* An apostrophe stands for the associated tangent vector. For example, T' means T.TangentVector.
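
To make that shape concrete, here is a hand-written sketch (not compiler-generated code, and the names are made up) for a simple Float-to-Float function, where T' and R' both reduce to Float because Float is its own tangent vector:

func square(_ x: Float) -> Float { x * x }

// Matches the reverse-mode shape (T...) -> ((R...), (R'...) -> (T'...)).
func derivativeOfSquare(_ x: Float) -> (value: Float, pullback: (Float) -> Float) {
    (value: x * x, pullback: { v in 2 * x * v })
}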

A forward-mode differentiable function would be a tuple of the original function and a derivative function that produces a differential ((T'...) -> (R'...)):

original: (T...) -> (R...)
derivative: (T...) -> ((R...), (T'...) -> (R'...))
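
As a similar hand-written sketch (again with made-up names), the forward-mode counterpart of the same function produces a differential instead of a pullback:

// Matches the forward-mode shape (T...) -> ((R...), (T'...) -> (R'...)).
func forwardDerivativeOfSquare(_ x: Float) -> (value: Float, differential: (Float) -> Float) {
    (value: x * x, differential: { dx in 2 * x * dx })
}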

In the manifesto, @differentiable functions are defined with an efficient and compact representation in which a derivative function produces all three things: the original result, the differential, and the pullback. (It uses @differentiable(linear) to represent the bundle of the differential and the pullback because they are transposes of each other.) But in binary, this is what it looks like:

original: (T...) -> (R...)
derivative: (T...) -> ((R...), ((T'...) -> (R'...), (R'...) -> (T'...)))
                                ^~~~~~~~~~~~~~~~~~  ^~~~~~~~~~~~~~~~~~
                                differential        pullback

Every function type needs to have a stable ABI. A differentiable function that doesn't have a differential-producing derivative is not the same as a general @differentiable function at the representation level, and therefore not at the type level.

4 Likes

Traditionally (or presently), many ML frameworks have a graph representation on which they perform both differentiation (altering the semantics) and execution-related optimizations (preserving the semantics) at the library level. The approaches to automatic differentiation section has a detailed overview of how AD is done in library-based approaches.

In Swift's differentiable programming feature proposal, derivative code generation operates on Swift code at compile time, which enables developers to differentiate any type and any function (not restricted to a single library) and have compile-time diagnostics. This is very different from existing machine learning frameworks and introduces separation of concerns.

ML libraries that use Swift's differentiable programming can choose to create whatever representation is most suitable for their execution at runtime. They can either perform eager execution (i.e. no graph at all), running operations right in their implementations, or build a graph in order to stage later computation until the data is actually needed (i.e. lazy evaluation). Execution-related transformations and optimizations will operate on a library-defined graph, performed by a library-defined runtime or compiler. Therefore, they happen at a much later stage than differentiation and are not differentiation's concern.

In other words, Swift AD brings differentiation ahead of runtime. The heterogeneous compute optimization and dispatch parts of the pipeline still reside where they currently are in the various ML libraries and are performed at runtime.

2 Likes

... right :woman_facepalming: sorry for the noise.

Would it be the case that reverse mode only makes sense on the input side, and forward mode on the output side? In that case, we might still be able to use @differentiable to refer to both reverse and forward mode:

// This
@differentiable (Float) -> Float
// Desugar to
(@differentiable(reverse) Float) -> (@differentiable(forward) Float)
1 Like

It makes sense. With Swift AD, there is no need for a backward() method in the aforementioned libraries, and they can construct the graph based on the calling sequence of their respective @differentiable(reverse) and @derivative(of:) functions. I think it is pretty straightforward for eager execution. I need to think a bit more about how the bookkeeping can happen on the lazy execution side. Thanks for the detailed reply. It helps a lot to understand the details!
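
For reference, a minimal sketch of the registration pattern mentioned above, assuming the experimental _Differentiation module (myExp and myExpDerivative are made-up names):

import _Differentiation
import Foundation

// A library operation that the library wants to make differentiable.
func myExp(_ x: Float) -> Float {
    exp(x)
}

// Registers the reverse-mode derivative. An eager library could also record this
// call into its own graph here before returning the pullback.
@derivative(of: myExp)
func myExpDerivative(_ x: Float) -> (value: Float, pullback: (Float) -> Float) {
    let y = exp(x)
    return (value: y, pullback: { v in v * y })
}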

1 Like

As far as implementation is concerned, we need to express function type variations, and function type attributes are a common way to do that (e.g. @convention(c)). If an attribute is applied to a parameter, it would be parsed as a function type parameter attribute or a function type result attribute (which doesn't exist today). Then the rest of the compiler would need to look at parameters and results to determine whether a function type is a differentiable one. Well yes, I believe this can be done at a technical level. But stepping back from that, I'm not sure (@differentiable(reverse) Float) -> Float has better clarity than @differentiable(reverse) (Float) -> Float when desugared. @noDerivative would be very rarely used in function types, though.

I do feel that @differentiable(reverse) is a very long attribute to type by hand, especially on a function type. I am interested in shorter attributes to express the idea that "this function type is differentiable and you can get the gradient or pullback of it", but I haven't found one that's as clear as @differentiable(reverse). But the good thing is that most users will not need to type a @differentiable(reverse) function type. Libraries that define higher-order functions on differentiable functions will need to type it out. For end users of machine learning APIs, it would be rarely used.
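
For illustration, this is roughly the kind of library-side higher-order function where the attribute would actually be spelled out (a sketch assuming the experimental _Differentiation module; gradientDescentStep and learningRate are made-up names, and the gradient(at:of:) argument label has been spelled in: on some toolchains):

import _Differentiation

// Only the library author writes out the @differentiable(reverse) function type;
// end users just pass a closure.
func gradientDescentStep(
    at x: Float,
    of f: @differentiable(reverse) (Float) -> Float,
    learningRate: Float
) -> Float {
    x - learningRate * gradient(at: x, of: f)
}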

1 Like

To be clear, I'm not trying to make an easy-to-parse type definition. That's the job for sugar.

It's just that this is currently allowed:

@diffable(rev) (@noDiff Float, @noDiff Float) -> Float

and I think it should just be the same as the regular function (w/o any diff attribute).

Given that @noDiff is already a type-significant parameter attribute, it would make for a more sound system for the canonical type to put everything in parameter attributes, and have the type attribute be sugar for repeated parameter attributes.

1 Like

Since parameter names are not quite part of the user-visible interface and won't appear at call sites, I'm not sure using parameter names to distinguish between parameters is a good idea. If you define a derivative for an imported function, it can break easily if the original function changes the parameter name in their future release. For that reason, we have allowed both parameter names and parameter indices in wrt:. I think parameter names should only be allowed in the current module. For derivatives of external functions, I think using parameter indices (e.g. wrt: (0, 1)) would give imported modules the freedom to change parameter names like they have today. As a result, I think implying differentiability parameter selection from pullback result tuple element labels has library evolution concerns.
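
For example, a sketch of the index spelling described above, assuming the experimental _Differentiation module (scaleAndShift and its derivative are made-up names):

import _Differentiation

func scaleAndShift(_ x: Float, _ scale: Float, _ shift: Float) -> Float {
    x * scale + shift
}

// wrt: (0, 1) selects the first two parameters by index rather than by name,
// so renaming them later cannot break this derivative registration.
@derivative(of: scaleAndShift, wrt: (0, 1))
func scaleAndShiftDerivative(_ x: Float, _ scale: Float, _ shift: Float)
    -> (value: Float, pullback: (Float) -> (Float, Float)) {
    (value: x * scale + shift, pullback: { v in (v * scale, v * x) })
}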

1 Like

A few shorter, easier-to-type suggestions: have you ever considered alternatives to @differentiable and @noDerivative like the ones below?

@del @der @dif @diff @df
with reverse -> rev; forward -> fwd
or just super-short @D(rev) @D(fwd) and @noD

This is actually not allowed. It currently crashes (ha!) and I'll fix it. There needs to be at least one parameter that conforms to Differentiable and is not marked as @noDerivative. I will update the proposal to reflect that.

That is possible, and I agree that removing ad-hoc rules makes for a better type system. However, I'm curious: in what scenarios do you think the proposed canonical form would be the better alternative for a user to write, or for the compiler to print in a diagnostic? Since (IIUC) you implied that today's syntax would be sugar on top of the proposed canonical type syntax, the canonical syntax would be an addition to today's proposed features, so I think it can be deferred to a future proposal.

Since the attributes are to be added to the core language, IMO these alternatives would be confusing to people who don't know about (or who are not using) differentiation. Swift's design and Swift Evolution proposals don't seem to have a history of choosing brevity over clarity.

Well, I don't have a lot of qualms about the types displayed to the user, since they should use whatever is written at the function declaration anyway, whether or not it's canonical.

It all stems from my misconception that @diff (@noDiff, @noDiff) -> Float is allowed. Having two types that are indistinguishable (that and (Float, Float) -> Float) doesn't sound quite right. Rejecting this case seems ad hoc, but I have far less of a problem with it than I did a moment ago.


Perhaps. That reminds me: do we put the @diffable type attribute in the ABI, or do we infer it from the existence of the @diffable parameter attribute, since the type attribute is quite redundant?

To correct my earlier point, the proposal does resolve the library evolution concerns (I forgot!) by requiring parameter names to be the ones from the derivative function:

... a wrt: argument in a @derivative attribute, when referring to parameters by names, must use parameter names in the derivative function.

But thanks for pointing it out! I've modified the proposal to fix this hole.

The ABI for a @differentiable(reverse) function type is always a bundle of two functions (4 words long in memory). This is entirely determined from the existence of the @differentiable(reverse) function type attribute.

2 Likes

Overall, I'm very excited to see this stuff being proposed and moving forward!

I may have more of substance to add later, but for now just these comparatively minor points about the API surface and documentation, some of which we discussed over at Google but never actually did anything about:

gradient(of:) is a higher-order function which behaves exactly like the 𝛁 (Del) operator in mathematics.

OK…

It takes a differentiable closure that returns a scalar and its gradient function

  • Nit: it takes a (reverse-)differentiable function (“closure” is a kind of literal) as a parameter.
  • but according to the signature, that parameter doesn't return “a scalar and a gradient function” as the text seems to imply. It takes a differentiable value (which the text doesn't mention) and returns a scalar.
  • According to the signature, that “gradient function” appears to be the return value of the gradient function we're documenting.
  • This thing only takes one parameter, so “the argument” is the same thing as “the given closure.”
  • Parameter clauses are overused and in this case it is unhelpful.
  • No, gradient doesn't return a gradient vector.
  • f(x) already reads as “f of x,” so we should never be using of: as an argument label. gradient(f) is just fine.
  • body is an inappropriate name for the parameter; that name only makes sense in cases where the parameter's side-effects are as likely to be their main semantics as their return value. A descriptive name would explain its role at the use site, but in this case it has no role other than “the function whose gradient we're computing,” so f would probably be a better name than any attempt to add semantic value would yield.

Is this wrong?

/// Returns 𝛁`f`, the gradient function of `f`.
///
/// The gradient function returns the slope of `f` at *x*,
/// given a value *x* in `f`'s domain.
func gradient<T: Differentiable, R: FloatingPoint & Differentiable>(
    _ f: @escaping @differentiable(reverse) (T) -> R
) -> (T) -> T.TangentVector where R.TangentVector: FloatingPoint

I know Apple doesn't like to use “code voice” in doc summaries, but that rule hurts more often than it helps, and in cases like this, makes it almost impossible to document the function clearly.


In the expression gradient(at: 3.0, in: f), the in: doesn't make sense to me. Isn't this “the gradient of f at 3.0?” Seems to me the usage should be

gradient(f, at: 3.0)

or, if we want to be cute and really play up the “uncurried” relationship,

gradient(at: 3.0)(f)
3 Likes

I'd like to express my enthusiastic support for this proposal: we started the SwiftFusion project, initially a collaboration between Google Research and Georgia Tech, precisely because the idea of having differentiable functions as first-class citizens is so appealing. In our case, it allowed us to unify non-linear optimization based on factor graphs (what gtsam.org is all about) and deep learning, allowing us to learn factors in a data-driven way.

I am not enough of a Swift guru to provide deep technical feedback, but I am glad to see there is substantial discussion.

Forward differentiation would be a great future contribution for non-ML applications, specifically those where second-order optimizers can be used efficiently, as opposed to gradient descent, which only needs the gradient and not the Jacobians. Thinking about sparse Jacobians would also be interesting.

8 Likes

I think I need to review the full proposal again to find all of the surface critiques I had, but in the meantime, here's another one I just remembered: move(along:) doesn't just move the tensor along some tangent vector, which would imply only the direction of movement. The tangent vector has direction and magnitude, and represents an offset that will be applied to the tensor. The relationship is an N-dimensional generalization of Strideable, similar to adding an integer to a pointer or a duration to a point in time. It has always seemed to me that += would be a better way to express this.

If we had to use a method, naming becomes challenging because offset(by: x) could be either the mutating or non-mutating version. Obviously you could go with add and adding but that seems pretty silly to me when we could use += and +.
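
For concreteness, this is what the two spellings under discussion look like for a plain Float (a sketch assuming the experimental _Differentiation module; the proposal spells the method move(along:), while later toolchains spell it move(by:)):

import _Differentiation

var x: Float = 3.0
let offset: Float.TangentVector = 0.5   // Float is its own TangentVector

x.move(along: offset)   // the proposal's spelling
// x += offset          // the spelling argued for above; same effect for Float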

2 Likes

Isn't the convention to use the form prefix, as in union and formUnion? That would suggest using offset and formOffset.

No, that happens because union is not a verb.

Offset is both a noun and a verb. Do we have precedent for words like this elsewhere?

Yes, that was my point.

Do we have precedent for words like this elsewhere?

Yeah, “avoid using them in ways that would be ambiguous.” I don't see why we should spend energy on this, though, since +/+= already have the right meaning.

+/+= doesn't have the right meaning, though (or is at least ambiguous[1]). It also introduces another non-homogeneous overload of +/+= with type constraints, and I'm not sure what the type checker implications of that change would be (@hborla, @xedin, any guesses where we are on this now?)

You're absolutely right that the scenario is precisely analogous to Strideable (I would argue that we shouldn't have used + there either, for basically the same reasons, but clearly that ship has sailed).

[1] Consider e.g. the manifold of rotation matrices of dimension n; the tangent space at any point is also a space of n×n matrices, but it's a different subspace, and move(along:) is not normal matrix addition: move(along:) has to keep you on the manifold, but adding (in the normal matrix-addition sense) a non-zero element of the tangent space always takes you off the manifold.
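
To spell out the footnote (standard differential geometry, not something specific to the proposal): for a rotation matrix R with RᵀR = I, tangent vectors at R have the form RΩ with Ω skew-symmetric, and plain matrix addition leaves the manifold while an exponential-map retraction does not:

(R + R\Omega)^\top (R + R\Omega) = (I + \Omega)^\top R^\top R (I + \Omega) = (I - \Omega)(I + \Omega) = I - \Omega^2 \ne I \quad \text{for } \Omega \ne 0,

(R \exp\Omega)^\top (R \exp\Omega) = \exp(-\Omega)\exp(\Omega) = I.

So move(along:) for this manifold has to do something like R ↦ R exp Ω rather than +.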

4 Likes