Differentiable programming for gradient-based machine learning

The former, if we don't want to pursue the Manifold protocol. I have not found a better base word than move, but I still think Manifold might be worth looking at (or maybe it's DifferentiableManifold?). It makes usage a bit uglier:

SomeManifold.move(&somePoint, by: someTangentVector)

but this operation doesn't appear often in user code AFAICT (only 8 times in tensorflow/swift-apis and tensorflow/swift-models combined) and it would avoid wrapper types for cases like the one you described where a given pair of point and tangent vector types could naturally be used for many different manifolds.
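
Concretely, I'm imagining something roughly like this (just a sketch; the exact associated types and constraints would need working out):

protocol Manifold {
    // The point and tangent types are deliberately decoupled from `Self`,
    // so one point type can be reused by many different manifolds.
    associatedtype Point
    associatedtype TangentVector: AdditiveArithmetic
    static func move(_ point: inout Point, by offset: TangentVector)
}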

1 Like

Thanks for reviewing the documentation details :) I've reworded it. The documentation I wrote in this proposal is mostly for explanation purposes, and I'm sure it will be rewritten if the proposal is accepted and released. As for "closure is kind of literal", I've actually standardized on "reverse-differentiable closure" in my edits because

  1. According to The Swift Programming Language v5.3, "closure" actually seems to be the most appropriate term to describe this parameter. "Closure expression" refers to a literal, but "closure" doesn't.

    Global and nested functions, as introduced in Functions, are actually special cases of closures. Closures take one of three forms:

    • Global functions are closures that have a name and don’t capture any values.
    • Nested functions are closures that have a name and can capture values from their enclosing function.
    • Closure expressions are unnamed closures written in a lightweight syntax that can capture values from their surrounding context.
  2. The existing documentation of higher-order functions in the standard library (except for those taking predicates) uses "closure" instead of "function" for their function-typed parameters: map(_:) ("a mapping closure ..."), reduce(_:_:) ("a closure ...").

While symbolic math expressions such as f(x) would obviously read as "f of x", as soon as function names start being words/phrases instead of letters, and functions become imperative procedures to call instead of just abstract models, there get to be fewer implicit words in the syntax. IMO this applies to gradient(of:) the way it applies to existing APIs such as index(of:).

Good point. I agree that in: is not a good label, and I'll rename these functions to gradient(at:of:) and pullback(at:of:).

1 Like

On an alternative for move(along:)

I agree with Dave and others that move and along: are not quite right. by: seems to me like the best option for the argument label. As for using move(by:) as an alternative, I've actually become less comfortable with move. move(by:) does seem okay when we think in terms of manifolds, but it is super unclear when defined as a member of primitive math types like Float and Double — developers rarely think of values of those types as points on a manifold, so they could be confused by seeing a move method under Float or an expression like 3.0.move(by: 1.0), even when they have knowingly imported Differentiation.

offset is an accurate but not overly domain-specific description of this operation. While it is true that all precedents using offset as the base name in Apple’s SDKs use it as a past participle (therefore indicating a non-mutating operation), the mutating-ness of this operation is already unambiguously conveyed in the type signature, so I'm feeling much better about offset(by:) than about the other alternatives.

mutating func offset(by tangentVector: TangentVector)

IMO we shouldn't pursue Manifold or DifferentiableManifold as the protocol name because Differentiable is much more approachable and relevant to the differentiation feature.

A static method is a good idea, but it will reduce the discoverability of the API. While this API is almost never called by ML users directly (it's only called by some library-defined optimizers), general manifold optimization use cases will be calling this much more often and therefore I think it would be best presented as an instance method.

Yes, functions with inout arguments are supported. I will explain the rules here a bit and add the details to the proposal. Differentiating such a function, when the function has a Void result, means differentiating the data flow from the inputs to the mutated inout parameter, which is treated as the result. Just like for functions without inout parameters, the “wrt parameters” (or “differentiability parameters”) of a differentiable function are used to determine its derivative function type. There are three scenarios involving inout arguments:

  1. If a function has both an inout parameter and a non-Void result, the inout parameter will not be treated as a differentiability result. Therefore, its derivative function type is calculated using the standard rules treating the inout parameter as a normal parameter.
    func adding(_ x: Float, _ y: Float, _ z: inout Float) -> Float
       
    // Differentiating w.r.t. x and y.
    @derivative(of: adding(_:_:_:), wrt: (x, y))
    func derivativeOfAdding(_ x: Float, _ y: Float, _ z: inout Float) -> (value: Float, pullback: (Float) -> (Float, Float))
    
    // Differentiating w.r.t. all parameters.
    @derivative(of: adding(_:_:_:))
    func derivativeOfAdding(_ x: Float, _ y: Float, _ z: inout Float) -> (value: Float, pullback: (Float) -> (Float, Float, Float))
    
  2. If a function has a Void result and a single inout parameter that is not being differentiated with respect to, the pullback in its derivative function type has a corresponding non-inout parameter.
    func foo(_ x: Float, _ y: inout Float)
       
    @derivative(of: foo, wrt: x)
    func derivativeOfFoo(_ x: Float, _ y: inout Float) -> (value: Float, pullback: (Float) -> Float)
    
  3. If a function has a Void result and a single inout parameter that is being differentiated with respect to, its pullback function type has a corresponding inout parameter. Some canonical examples are the derivatives of Array.append(_:) and Float.+=(_:_:).
    extension Array {
        public mutating func append(_ newElement: Element)

        @usableFromInline
        @derivative(of: append, wrt: (self, newElement))
        mutating func derivativeOfAppend(_ newElement: Element) -> (value: Void, pullback: (inout TangentVector) -> Element.TangentVector)
            where Element: Differentiable
    }

    extension Float {
        public static func +=(_ lhs: inout Float, _ rhs: Float)

        @usableFromInline
        @derivative(of: +=)
        static func derivativeOfAdd(_ lhs: inout Float, _ rhs: Float) -> (value: Void, pullback: (inout Float) -> Float)
    }
    

I agree that the term is not ideal since none of these APIs are Swift operators, but I like “differential operators” because its precise definition can be looked up with a simple web search: Differential operator - Wikipedia. I’ll switch to a more descriptive alternative in the proposal, maybe “higher-order functions for differentiation”.

While real vector spaces are the most common differentiable types (e.g. SIMD32<Float>, Float, Double, etc.), they are not the types that (non-library) developers will most commonly end up writing in practice. For example, neural network layers and models (which developers will create a lot of) are often not vector spaces, as they may contain configuration variables as stored properties.

// This is not additive.
struct SomeLayer: Differentiable {
    var weight: SIMD32<Float>
    var isBatched: Bool
}
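
With isBatched marked @noDerivative, conformance synthesis would give roughly the following (my sketch of the derived conformance): the tangent vector carries only the differentiable stored properties, so it cannot simply be Self.

struct SomeLayer: Differentiable {
    var weight: SIMD32<Float>
    @noDerivative var isBatched: Bool

    // Synthesized (roughly):
    //     struct TangentVector: Differentiable, AdditiveArithmetic {
    //         var weight: SIMD32<Float>.TangentVector
    //     }
    //     mutating func move(along direction: TangentVector) { ... }
}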

I agree with you completely that the design of this protocol should optimize for the most common case. But as explained above, the most common data structures to be created by developers are not real vector spaces, but aggregates of library-defined real vector space types plus arbitrary configuration variables users wish to add. As such, making Differentiable require AdditiveArithmetic would IMO be doing the exact opposite of optimizing for the most common case.

Plus, the conformance synthesis rules today make it very easy to define an additive differentiable type — one can just declare an AdditiveArithmetic conformance and a TangentVector would be synthesized to equal Self.

struct MyProductVector: Differentiable, AdditiveArithmetic {
    var x: SIMD4<Float>
    var y: SIMD2<Double>
    
    // Synthesized:
    //     typealias TangentVector = MyProductVector
}

A closure is needed to reduce the memory footprint of the pullback closure context. A derivative value w.r.t. a function argument should have the same shape as the original argument, and that shape is runtime-defined in most cases (e.g. tensors). If we made this a method or a computed property that simply returns a TangentVector, the pullback function would need to capture the original value even when it’s not mathematically needed. Since tensors can be very large objects, this would be concerning in memory-constrained environments.

zeroTangentVectorInitializer is called by AD-generated code in the derivative function (VJP) to capture all of the information needed for creating a zero tangent vector when needed, so that the pullback closure won’t keep unnecessary values alive. Implementers of zeroTangentVectorInitializer can choose what they capture. In the doc comment for this requirement we give developers recommendations to avoid capturing self.
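
For a dynamically shaped type, the intended pattern looks roughly like this (a sketch assuming a hypothetical Tensor type with a shape property and a TangentVector(zeros:) initializer; this is not part of the proposed API surface):

extension Tensor: Differentiable {
    public var zeroTangentVectorInitializer: () -> TangentVector {
        // Capture only the (small) shape, not `self`, so the returned closure
        // doesn't keep a potentially huge tensor alive in the pullback context.
        let shape = self.shape
        return { TangentVector(zeros: shape) }
    }
}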

For scalars and statically shaped vectors such as Float and SIMD32<Float>, the closure will be @convention(thin), so there are no allocations. In the future we can consider adding a refining protocol which asserts that the conforming type’s TangentVector.zero is equivalent to zeroTangentVectorInitializer(), i.e. having type-defined shapes, so that the pullback can just call .zero instead of using zeroTangentVectorInitializer.

For tensors and other dynamically shaped types, we can thunk the zeroTangentVectorInitializer (if serialized) so that it takes an AD-specific bump-pointer allocator as an argument and allocates the closure context there instead. We are working on moving all pullback closure allocations in derivative functions to a stack-disciplined bump-pointer allocator.

@noDerivative has three use cases:

  • Opt out of conformance synthesis for a stored property
  • Mark function declarations (e.g. Array.count) as knowingly producing a zero derivative so that the compiler won’t error
  • Mark a @differentiable function’s parameter as non-wrt.

Given all of its use cases, IMO the name @noDerivative captures the semantics of all three cases. I haven’t been able to come up with a good single-word name. IMO the bottom line is to be sure to mention “derivative” or “differentiation” in the name so that the attribute's feature domain is clear — with this in mind, the only alternative I can come up with is @nondifferentiable, but it doesn’t seem like a good fit for all three use cases above.

We initially used a @nondiff attribute for this purpose, but later standardized on @noDerivative since it has the right meaning in this case — the @noDerivative in @differentiable(reverse) (T, @noDerivative U) -> V means there’s no derivative result for U in the resulting pullback function (V.TangentVector) -> T.TangentVector.

To optimize for the most common use cases, we intend to define @differentiable(reverse) (as well as @differentiable and @differentiable(linear) in the future) for first-order differentiation. Encoding an arbitrary-order differentiable function in the ABI can lead to performance compromises and unpredictable optimizability, and would be a very challenging task to implement (FWIW, it may need the compiler to emit a self-generating linear map closure that represents a recursive form of Faà di Bruno’s formula).

In the future, I think supporting fixed-order differentiable functions in the ABI is much more likely to happen than supporting arbitrary-order ones. But either approach can be done in a way that is compatible with the APIs proposed here and in the manifesto. At the syntax level, @differentiable can be made an alias of @differentiable(1), and @differentiable(n) can be implicitly converted to @differentiable(n-1). Therefore, today’s differential operators will be ABI compatible and don’t have to be declared as @_alwaysEmitIntoClient.

ABI stability for Differentiation can be guaranteed. Higher-order functions introduced in this proposal (those operating on @differentiable(reverse) closures) will work with @differentiable and @differentiable(linear) closures because they have subtyping relations. The following implicit conversions will be possible (implemented as a thunk application):

let f: @differentiable(linear) (T) -> U
let g = f as @differentiable (T) -> U
let h = g as @differentiable(reverse) (T) -> U

// Proposed reverse-mode differentiation API can be called on any of the closures above.
pullback(at: x, of: f)
pullback(at: x, of: g)
pullback(at: x, of: h)

What makes me consider @differentiable(reverse) as a subset is that it’s naturally a subtype of @differentiable, which is then a subtype of @differentiable(linear). When we have general @differentiable functions, the reverse-mode differentiation APIs proposed today will be fully compatible with those functions. Additionally, @differentiable(reverse) is a smaller representation than @differentiable, so it will be the ideal type to use for gradient-only use cases under memory constraints and won't become redundant. Similarly, it would make sense to have a @differentiable(forward) in the future as well. I don’t think @differentiable(reverse) will become “deprecated” even if the full picture is in place.

I don't think @differentiable(reverse) will be the end of this journey. Many people from the community have requested forward-mode differentiation use cases and I think it will be completed someday. I’ll definitely update the manifesto to reflect a coherent plan as requested :) Thanks for raising these questions!

1 Like

Yes, I know they will rewrite your language eventually. I'm worried about making sure I understand the proposal. That's why it's so important to get these straightened out.

(I'll also make the underappreciated point that API design flaws often don't become apparent until you try to document the API simply, clearly, and tersely. If you're finding that hard, it often means you have an API that can't be documented simply and clearly, which means it can't be easily explained, used, or understood.)

According to The Swift Programming Language v5.3, "closure" actually seems to be the most appropriate term to describe this parameter. "Closure expression" is referring to a literal but "closure" isn't.

I stand corrected, thanks!

Eh, good point, but I don't think the analogy is a very strong one. gradient(x) is very much a math expression, just like sin(x) is; both use words that are mathematical terms of art.

IMO some consideration should be given to why we are choosing not to use signatures that keep the primary argument in the first position. From what I've seen, to a first approximation nobody wants to use trailing closure syntax with these functions.

But every argument you've given against move(by:) applies equally to offset(by:). I never loved “move,” but there's nothing manifold-specific about it: it's an ordinary everyday word in English. Combined with the fact that it's shorter than offset and has an unambiguous part of speech, it seems unambiguously better than offset to me (this is exactly the thought process I went through when making my first post BTW).

The point here was not to reconsider the protocol name, which is fine on its own, but to create a protocol that allows common math types to be used with different manifolds. As @scanon has pointed out, for many of the types we'd like to differentiate, there is no single manifold implied, and this “move” operation we're trying to name would have to be implemented differently for each manifold. As soon as someone makes a conformance of Matrix2x2<Float> to Differentiable and uses += to implement move, you can't use that matrix type to represent rotations in R², because the meaning of move has been locked to the type. Instead you'd need to create a new wrapper type around it.

But that is all to the good when considering problems like the confusion induced by seeing move in code completion for Float or used in (3.0).move(by: 2).

IMO you're giving up a bit too quickly on the idea of separating the manifold from the differentiable type. It may turn out to be the wrong choice, but it needs some serious thinking through in the context of real use cases.

I suppose it's also worth asking whether manifolds should be dynamically parameterized, so you'd create a manifold instance containing the parameters, and use regular methods on it:
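
Something like the following, say (entirely hypothetical; just to make the shape of the idea concrete):

// The manifold's parameters are instance data, separate from the point type.
struct SphereManifold {
    var radius: Double
    func move(_ point: inout SIMD3<Double>, by tangent: SIMD3<Double>) {
        // Naive retraction: take the step, then project back onto the sphere.
        let stepped = point + tangent
        let length = (stepped.x * stepped.x + stepped.y * stepped.y + stepped.z * stepped.z).squareRoot()
        point = stepped * (radius / length)
    }
}

let sphere = SphereManifold(radius: 2.0)
var p = SIMD3<Double>(2, 0, 0)
sphere.move(&p, by: SIMD3<Double>(0, 0.1, 0))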

This approach would of course further increase the burden of use, but we should ask if there are important use cases that need this capability, because the space cost of carrying the dynamic parameters around inside each instance of the differentiable or tangent vector types might be prohibitive. IMO differentiable programming will always be something of an expert-level feature, so it can bear a slightly higher ergonomic cost if that enables important applications.

1 Like

I don't think the sin(x) analogy is a strong one either. The spelling of sin(x) has an established precedent in math and in programming. gradient(x) does not have nearly as strong of a precedent. The term of art for the gradient operator is 𝛁f, not gradient(f). Plus, the "sin" in sin(x) is treated (and typeset) more like a symbol than a word/phrase.

I do not believe that is true. In fact, I've rarely seen gradient(at:in:) used without a trailing closure except within our compiler test cases. In the vast majority of use cases in ML, a trailing closure will be used. ML developers tend to think of this as feeding an input forward through the function that is going to be differentiated and getting its gradient back. It's much more natural to write:

let modelGradient = gradient(at: model) { model in
    let y = model(x)
    return meanSquaredError(y, label)
}

than the following:

// Odd, especially when the closure gets bigger, which can be
// very common in models such as GAN.
let modelGradient = gradient(of: { model in
    let y = model(x)
    return meanSquaredError(y, label)
}, at: input)

... Not using a closure expression at all can lead to an even less optimal use site.

func loss(_ model: Model) -> Float {
    let y = model(x)
    return meanSquaredError(y, label)
}
let modelGradient = gradient(of: loss, at: model)

We want to encourage the use of closure expressions with this API since it can take advantage of type inference (and it's important especially because many ML developers come from Python). Trailing closures are the best form of that.

8 Likes

Okay, I'm sold! Thanks for the patient discussion.

3 Likes
let modelGradient = gradient(at: model) { model in
    let y = model(x)
    return meanSquaredError(y, label)
}

This may be an unpopular opinion, but my eye struggles to parse trailing closure syntax even after 3 years of Swift (unless the argument is extremely obvious).

The inputs and outputs being implicit makes it harder to read because I have to figure them out myself. I never use this style for differentiable functions, and I remember it making S4TF code hard for me to read when I was first exposed to it.

I find this much more readable on a first pass:

func loss(_ model: Model) -> Float {
    let y = model(x)
    return meanSquaredError(y, label)
}
let modelGradient = gradient(at: model, of: loss)

Just my 2 cents. Granted I'm not using it for neural nets but physics, so the arguments and outputs are constantly changing and my brain doesn't settle into expecting the argument to always be a model, and the output to always be a loss.

I see nothing wrong with maintaining the current order of arguments to gradient, but I did want to express my slight discomfort with the ubiquitous trailing closure syntax. In general it seems true that type inference is good unless it sacrifices clarity. In this specific case, maybe I'm the only one who feels it is less clear. However, I can't imagine a beginner appreciating the ambiguity.
Then again, I can only imagine one beginner with much fidelity, which is me :)

Sorry! The error was mine actually. input should be model since we are taking the model gradient. I'll fix the examples above.

It should be:

let modelGradient = gradient(at: model) { model in
    let y = model(x)
    return meanSquaredError(y, label)
}

and

func loss(_ model: Model) -> Float {
    let y = model(x)
    return meanSquaredError(y, label)
}
let modelGradient = gradient(at: model, of: loss)

Thanks for the correction. I didn't actually read into the example much, because this is an opinion I held long before reading it.

Aside from trivial things like the name move, I should talk about what I see as some of the more substantial weaknesses with the autodiff feature as currently proposed, since I've used it pretty extensively in SwiftFusion.

First, I have found the ergonomics of the system really painful. In particular, it has been incredibly hard to build new Differentiable types, even when composing other types that are Differentiable. As a very experienced generic programmer with a non-wizard-but-stronger-than-most math background, I would have expected to have a small learning curve, but it's not turning out that way. Some of this surely comes down to diagnostic QOI, but I think that's only a small part of the issue.

One of the normal strategies, of building up conformances protocol-by-inherited-protocol (e.g. conform to BidirectionalCollection by starting with Sequence conformance, adding Collection conformance, and finally adding conformance to BidirectionalCollection), and getting it to compile at each step, breaks down badly for me, and that's especially bad because the protocol refinement hierarchy required for a Differentiable type's TangentVector is quite deep.

Maybe part of the problem is that you ultimately always end up with a generic TangentVector type that itself has to be Differentiable, with Self as its own TangentVector type.

    associatedtype TangentVector: Differentiable & AdditiveArithmetic
        where TangentVector == TangentVector.TangentVector

As a result, you can't make anything beyond AdditiveArithmetic conformance work without standing up the whole system of conformances for the TangentVector type. Maybe part of the issue is the way Differentiable adds new @differentiable constraints to the AdditiveArithmetic requirements. I'm not 100% sure. This is the sense I have, but maybe something more is at play here. @rxwei It might be instructive for you to look at the TypeKeyedArrayBuffers type which I've struggled for weeks to make Differentiable and see what you run into.

I have a hunch that my idea of using a Manifold protocol might help with this part of the ergonomics somewhat, but that really is a wild guess.

My second major concern centers around the handling of zero, which turns out to have an incredibly important role in differentiable programming. zeroTangentVectorInitializer is awkward, but that's not the biggest problem. First, it seems to be based on the premise that you can only come up with a zero vector for a reshapable type (e.g. Array) if you know its shape, which I'm not sure is true. I've had some success creating universal zero values that are compatible with all shapes. This not only makes it possible to use a cleaner API (like the one from AdditiveArithmetic), but these universal zeros tend to be very efficient because they don't require any dynamic storage. Because zeroTangentVectorInitializer is a closure, if you have a tensor with a "ragged" shape like the top-level data structure of SwiftFusion, you usually end up capturing a fairly heavyweight value in that closure to reconstruct the right zero, which I imagine is really hard to optimize. Lastly, if you buy into the premise of zeroTangentVectorInitializer that you can't build a zero without an instance, you still have the static zero from AdditiveArithmetic, in which you have to unconditionally fatalError, which makes the AdditiveArithmetic conformance a lie.
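
To make "universal zero" concrete, the trick is roughly the following (a simplified sketch, not SwiftFusion's actual code): the universal zero case carries no dynamic storage and is absorbed by whatever shape it meets.

enum ArrayTangent<Element: AdditiveArithmetic>: Equatable, AdditiveArithmetic {
    case universalZero             // shape-independent, no dynamic storage
    case elements([Element])

    static var zero: Self { .universalZero }

    static func + (lhs: Self, rhs: Self) -> Self {
        switch (lhs, rhs) {
        case (.universalZero, _): return rhs
        case (_, .universalZero): return lhs
        case let (.elements(a), .elements(b)):
            precondition(a.count == b.count, "shape mismatch")
            return .elements((0..<a.count).map { a[$0] + b[$0] })
        }
    }

    static func - (lhs: Self, rhs: Self) -> Self {
        switch (lhs, rhs) {
        case (_, .universalZero): return lhs
        case let (.universalZero, .elements(b)):
            return .elements(b.map { Element.zero - $0 })
        case let (.elements(a), .elements(b)):
            precondition(a.count == b.count, "shape mismatch")
            return .elements((0..<a.count).map { a[$0] - b[$0] })
        }
    }
}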

IMO the zero-handling part of the proposal is truly a mess and in no shape to be locked down until it's sorted out. However, the ergonomics of creating a differentiable type need some careful attention, too. If a generic programming expert like me can't create a new differentiable data structure, pity the poor ML researcher who needs to do it.

Thanks for your time,
Dave

/cc @saeta @dan-zheng

8 Likes

I'm not quite seeing the issue you mentioned. It would help if you could post concrete code examples where there are sharp edges. I took a brief look at TypeKeyedArrayBuffers, but it looks like a very low-level, non-mathematical type. Could you elaborate a bit more on why it should be differentiable? TangentVector being required to conform to Differentiable doesn't seem problematic to me, specifically because its TangentVector is equal to itself, in which case all other protocol requirements have a default implementation.

From what I've observed in machine learning use cases, creating a custom differentiable type costs little effort when parts of it can leverage conformance derivation. If the user defines a custom tangent vector, other protocol requirements can still be derived automatically. Moreover, defining fully custom differentiable types (custom exponential map, etc) is really an "advanced feature" to be utilized by libraries; ML developers who develop neural network models almost never need to do this.
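
For instance, a custom differentiable type typically ends up looking something like this (a sketch with made-up names): only the tangent vector is written by hand, and the remaining requirements are derived because the stored properties correspond member-wise.

struct SolverState: Differentiable {
    var estimate: Double
    var stepSize: Double
    @noDerivative var iterationCount: Int    // configuration, not differentiated

    // Hand-written tangent vector; the remaining requirements (such as the
    // `move` method) can still be derived member-wise.
    struct TangentVector: Differentiable, AdditiveArithmetic {
        var estimate: Double
        var stepSize: Double
    }
}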

I'm having trouble understanding the concrete details of this idea and would appreciate some details. If we define a separate Manifold protocol in addition to Differentiable, which protocol is responsible for defining a tangent vector type? How would one define a default implementation for Manifold.move(by:) when the tangent vector type is equal to the differentiable type?

First of all, I completely agree that zeroTangentVectorInitializer is a weird protocol requirement, and that it is an important issue to resolve before the proposed feature becomes final.

The proposal has not claimed that one can only come up with a zero vector from an instance, nor has it denied that one can define a universal zero for certain types. The proposed design provides zeroTangentVectorInitializer as a customization point so that it enables library developers to define the semantics they need. Specifically, some existing ML libraries (e.g. Autograd and JAX) have the invariant that the gradient w.r.t. an input has the same shape as the input.

>>> import autograd.numpy as np
>>> from autograd import grad
>>> def f(x):
...     return np.zeros((), np.float)
...
>>> x = np.ones((2, 2))
>>> f(x)
array(0.)
>>> grad(f)(x)
array([[0., 0.],
       [0., 0.]]) # has the same shape as `x`

One can certainly use a universal zero tangent vector if they wish to, in which case zeroTangentVectorInitializer would return { .zero } and capture nothing (and therefore be efficient). There are precedents in ML as well: TensorFlow and PyTorch simply use None as their zero tangent vector. For the example you named above, i.e. ragged tensors, zeroTangentVectorInitializer never forces you to define zero tangent vector as having the same shape; I would define a universal zero tangent vector myself for efficiency. The current design just provides an option for libraries that do need the said semantics.

If we change the semantics to always use TangentVector.zero as the zero, it would surely work for a number of use cases (ML included, but not all established ML libraries' semantics can be implemented using the Swift autodiff feature as I explained above). Removing zeroTangentVectorInitializer amounts to requiring that all dynamically shaped types define a universal zero tangent vector. I really don't think it is a future-proof design, but I would be happy to do it if there is strong consensus. I'm also interested in seeing alternative designs that would allow certain libraries to define their zero tangent vectors to have the shaping semantics they need.

Not sure I would call it "truly a mess" and I would appreciate going into details about concrete use cases. I want to make clear that end users almost never have to implement zeroTangentVectorInitializer — it is only up for authors of differentiable types (with dynamic shapes or scalar types) to implement. The design question seems to be whether we want to allow this customization point or not.

To the point about creating custom differentiable structures, as I mentioned earlier, ML developers can define custom tangent vectors fairly easily today and leverage derived conformances for most things when they make sense. Your counterexample of TypeKeyedArrayBuffers seems like a complicated and advanced example, which therefore doesn't seem like a good argument against the design which is optimized for the vast majority of use cases. In any case, I'd like to understand it a bit more.

1 Like

Awesome!

Why are there all these limitations? Functions can take more than one inout argument. I would expect something taking one or more inout arguments to be just as differentiable as one that takes an equal number of values in and returns the same number of results. You support multiple parameters and results, so why do we need corner cases around whether the result is Void or not?

I still strongly disagree with this rationale. The body of Swift programmers is much larger than the number of people who are familiar with this terminology. We can introduce new terminology here for Swift programmers (and perhaps equate that terminology to the term of art) and be in a strong place with nothing lost and a more consistent programming model.

Ok this goes beyond my expertise, but my intuition leads me to agree and share Dave's concerns that further exploration could uncover more accessible design points. This seems worth continued exploration.

Ok......

This doesn't add up. You've turned a simple problem into something with three subcases, one of which bottoms out into extreme runtime complexity. My general goal for Swift is to build simple and composable features without "magic". The rationale here is that magic seems great ... until it fails. Its failure turns a promise of clean abstractions into leaky ones, and forces programmers into an entirely new realm of conceptual complexity that they must now own and reason about.

As others have mentioned, the handling of Zero is surprisingly complicated in AD systems and I think it is worth exploring a range of different options here to get from "ok to great" in the design. I don't think that a design point that requires "subsystem specific bump pointer allocators to provide acceptable performance for closure allocations induced by a weird protocol design" is on the right path. We should aim to make the func requirement work in place of the "closure returning computed property requirement" since that is the language affordance used by effectively everyone for all the things.

Your post here doesn't address my observation that we don't name attributes this way; remaining consistent with the extant language design is pretty important to me. I can explain more about why I find language consistency to be important if that would help.

You seem very confident about that, but as a third party reviewer it is hard for me to feel that based on your assurance. You're anticipating major future extensions to the model and those could affect layout and API in fairly substantial ways. Have you considered making the new module be an "inline only" sort of thing like SwiftUI was?

Thank you for the offer. When you do update the manifesto, it will make it possible to properly evaluate this proposal, thanks!

-Chris

4 Likes

Functions with multiple results are not supported, because we require differentiable functions to have a differentiable return type. We hope this will fall out once tuples can conform to Differentiable one day. The special handling of Void helps us decide where to look for the "mathematical output" to treat as the result.

@differentiable(reverse)
func foo(x: Float) -> (Float, Float)
// error: '(Float, Float)' doesn't conform

@differentiable(reverse)
func foo(x: Float, y: inout Float)
// Okay. No results so we treat the only `inout` argument as the
// mathematical result.

@differentiable(reverse)
func foo(x: inout Float, y: Float, z: inout Float)
// error: Which inout parameter is the mathematical result?

Are you suggesting that we support multiple results by breaking apart result tuples in type checking and checking each return tuple element's conformance individually? It can be done, but I think it will be ABI-breaking if tuples conform to Differentiable one day.

Or are you suggesting that we treat all inout parameters and the single Differentiable-conforming result as mathematical outputs so that they behave like a single product space result? That can be done but I've never seen such use cases. It also seems purely additive — we could make future proposals remove the special cases in type checking rules, if that's an acceptable direction.

I didn't argue for using "differential operator" nor intend to provide a rationale for keeping the name. I've actually already switched to using the term "higher-order functions for differentiation" in the proposal. Are you happy with this name? Or do you have an alternative suggestion?

I agree and I would like to sort this out before this goes into further review. Thanks for the feedback. Let me try to summarize the issues here.

How zero tangent vectors are created in a differentiation-enabled library has traditionally been a choice made by the library since they generally have an in-house AD implementation. Some of these libraries (e.g. autograd and JAX) make sure that zeros have the same shape and scalar type as the input that was differentiated wrt. Other libraries (e.g. TensorFlow and PyTorch) use a universal zero value (often the Python None) to represent zero tangent vectors.

The reason we proposed something like zeroTangentVectorInitializer is because we want the system to support both cases above, as well as any other differentiable programming use cases that require gradient values to have the same shape (scalar type, or whatever else) as the original arguments.

So what options do we have next? To me it seems like we have two options:

  • If we are confident that we won't run into cases where universal zeros will cause problems, we can drop it and just use TangentVector.zero (which as @dabrahams pointed out would make conformances to AdditiveArithmetic not a lie). It is clear to me that universal zeros will work for ML, and it seems that @marcrasi and others have built a number of such universal zeros for non-ML-focused data structures.
  • If we think universal zeros aren't future-proof, we can find an alternative to the closure-returning property. For example, it could be a combination of associatedtype Shape, var shape: Shape, and static func makeZeroTangentVector(shape: Shape) -> TangentVector (roughly sketched after this list). But "shape" feels too specific — there could be other dynamic metadata such as scalar type.
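
Roughly, that second option would look something like this (an exploratory sketch only, using the names from the bullet above):

public protocol Differentiable {
    associatedtype TangentVector: Differentiable & AdditiveArithmetic
        where TangentVector == TangentVector.TangentVector
    associatedtype Shape
    var shape: Shape { get }
    static func makeZeroTangentVector(shape: Shape) -> TangentVector
    mutating func move(by direction: TangentVector)
}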

I am getting increasingly convinced that universal zeros can work. @marcrasi, we discussed these options years ago when we came up with this closure-returning property design. Maybe you can shed some light on this based on your experience creating universal zero tangent vectors for non-ML use cases? What are the tradeoffs and pain points?

I understand your concerns about language consistency, but I don't have a great alternative in mind and would love to hear suggestions. While words like @stationary and @discrete that you suggested carry a positive sense, they seem strictly less clear than @noDerivative. @noDerivative behaves exactly like it looks, i.e. opting out of having derivatives. Maybe @derivativeless, if I can invent a word. Do you or the community have a suggestion?

Sorry, I don't mean to overpromise ABI guarantees. I have considered making things inline-only, but doing this for just the Differentiable module doesn't seem able to resolve any ABI incompatibility issues in practice if the fear of ABI breakage is about the use of @differentiable(reverse) functions. Assumptions will be made about the @differentiable(reverse) ABI when it is used by shipping libraries, if they have any @differentiable(reverse) protocol witnesses or storage of @differentiable(reverse) closures. If we were to expose the initial feature ultra-conservatively, wouldn't we need to require that all code using @differentiable(reverse) also be @_alwaysEmitIntoClient? If the concern is only about differential operators and the Differentiable protocol's extension methods, yes, they can be @_alwaysEmitIntoClient, and I'm interested in knowing the Core Team's recommendations on this topic.

3 Likes

FWIW, one way to think about zeroTangentVectorInitializer is that it is an empty pullback, the result of differentiating nothing (or a function of type (Self) -> Void) w.r.t. self. There's no mathematical result, so the derivative is zero. I wonder if we can somehow align this (currently very strange looking) protocol requirement with the typing rules of derivative functions of normal functions and give it a more approachable name.

// Example of zero initializers' similarity to derivative functions
struct Foo: Differentiable {
    ...
    // A normal function with some differentiable result.
    func somefunc() -> Bar {
        ...
    }

    // `somefunc`'s derivative is expected to have the following type:
    @derivative(of: somefunc)
    func somefuncDerivative() -> (value: Bar, pullback: (Bar.TangentVector) -> Foo.TangentVector) {
       ...
    }

    // The zero tangent vector initializer ≈ derivative of `(Foo) -> Void`.
    func derivativeOfNothingWrtSelf() -> (value: Void, pullback: () -> Foo.TangentVector)
}
1 Like

I have updated the definition of Differentiable in the proposal to the following. Namely, zeroTangentVectorInitializer has been removed, and move(along:) has been renamed to move(by:). Thanks for everyone's feedback!

public protocol Differentiable {
    /// A type that can be used to represent derivatives with respect to a
    /// value whose type is `Self`. Mathematically, this is equivalent to the
    /// tangent bundle of the differentiable manifold represented by the
    /// differentiable type.
    associatedtype TangentVector: Differentiable & AdditiveArithmetic
        where TangentVector == TangentVector.TangentVector

    /// Moves `self` by the given direction. In Riemannian geometry, this is
    /// equivalent to the exponential map, which moves `self` on the geodesic
    /// surface by the given tangent vector.
    mutating func move(by direction: TangentVector)
}

The compiler-generated pullback code will always use TangentVector.zero as the zero tangent vector. This makes everything consistent with the contract given by AdditiveArithmetic conformances. It will also force developers of dynamically shaped mathematical types (for good reasons) to design a universal zero value for efficiency, just like the AdditiveArithmetic protocol already does today. If a "zero tangent of the same dynamic shape" is required by some use cases in the future, we can then explore the idea of introducing a new protocol that refines Differentiable and provides a customization point for zero tangent values.

7 Likes

Interesting simplification!

Have you thought about how to change the definition of TangentVector.zero for types like Array.TangentVector today? Array.TangentVector.zero is currently .init([]) and loses dynamic shape information, leading to autodiff issues like Issues · apple/swift-issues · GitHub (shape assertion failure in autodiff-generated derivative function). I'm curious how we can best fix those without using Array.zeroTangentVectorInitializer.

I recall we briefly discussed (1) a "sparse Array.TangentVector representation" and (2) a "symbolic broadcasted Array.TangentVector.zero value" as potential alternatives, but I'm not sure either of these has been explored in detail.

(The JAX project indicates potential performance problems from sparsity in (1). Maybe (2) is more useful and representationally preferable - "universal zero value" in your response seems to indicate so.)

1 Like

Since today's implementation of AD is already using TangentVector.zero throughout, I'd expect derivatives of array operations to be already assuming universal zeros. In the issue you linked to, it seems like an issue with how Array.+(_:_:)'s derivative was defined, not with the design of Differentiable.

To use universal zeros (AdditiveArithmetic.zero), all pullbacks need to consistently accept and return universal zeros. That is, a pullback needs to handle cases where the incoming tangent vector equals .zero and return .zero for zero tangent vectors (especially for dynamically sized tangent vectors). The derivative of array concatenation is not doing this currently. I believe that it should be changed to the following:

  @usableFromInline
  @derivative(of: +)
  static func _vjpConcatenate(_ lhs: Self, _ rhs: Self) -> (
    value: Self,
    pullback: (TangentVector) -> (TangentVector, TangentVector)
  ) {
    return (value: lhs + rhs, pullback: { v in
      if v.base.isEmpty { return (.zero, .zero) }
      return (
        TangentVector(.init(v.base[0..<lhs.count])),
        TangentVector(.init(v.base[lhs.count...]))
      )
    })
  }
2 Likes

This makes sense! I'd like to verify whether your Array.+ derivative implementation indeed fixes Issues · apple/swift-issues · GitHub, that would be so neat.

I wonder if any other primitive pullbacks for differentiable operations taking Array arguments also need updating. Maybe it's all of the pullbacks whose original functions have shape-related preconditions inside.

Functions taking n-d array types (e.g. Tensor from Swift for TensorFlow) also need their derivatives updated to check for the Tensor.zero case. I think this includes "functions" like var Tensor.scalars: (Tensor<Scalar>) -> [Scalar].