The Automatic Differentiation Manifesto

Hi all,

I wrote an Automatic Differentiation Manifesto as a starting point for pushing toward the world's first general-purpose differentiable programming language.


Thanks for writing this up, this sounds like a really powerful feature! Excited to see first class support for this in Swift!

I'm pretty new to ML and even newer to AD, so this is a naive question, but I recently watched this talk and it had one of the most digestible models of AD I've come across. Curious if/how this fits into the manifesto?

Much longer version of same talk with more explanation here.

Lots of things in the manifesto are actually consistent with the model described in Conal's paper. For example, function D is like function differential(at:in:) in the manifesto, except that differential(at:in:)'s type signature generalizes over general differentiable manifolds.

I intentionally made the AD model align a bit closer to differential geometry ("pullback", "tangent space", etc). I'm not sure introducing category-theory-inspired protocols in the Swift standard library is a good idea.


EDIT: I think this is answered in section 6, actually. Thanks!

Very cool. And amazing that you got to speak to (the!) Gordon Plotkin about it.

Is there to be any support for differentiation at the datatype declaration level? Of course it's possible to represent all data in terms of Vector{N} and, in some circumstances, it may make sense to do so. But for a lot of programming tasks it is nicer to work with say, a Features struct instead. To put it another way, we have structs because we know that tuples alone just won't do.

If you'd like to discuss further, please PM me. :slight_smile:

What do you mean exactly by support at the declaration level? One can already make a type differentiable by declaring a conformance to Differentiable.

struct T : Differentiable { ... }

func foo(x: T) -> Float { ... }

gradient(of: foo)          // (T) -> T.CotangentVector
gradient(at: ..., in: foo) // T.CotangentVector

Ah okay. I'm surprised I didn't see this...

First, thanks for getting this started, @rxwei. Exciting!

I'm on vacation, so any notes from me are going to come in spurts while the baby is sleeping. Here's the first batch, focusing on VectorNumeric:

    associatedtype ScalarElement

Shouldn't ScalarElement be constrained to be Arithmetic? If not, why not?

    associatedtype Dimensionality

This is really a shape, rather than dimension; having an associated type called Dimensionality seems misleading, because the dimension of a (finite-dimensional) vector space is always an integer. The dimension of the vector space of 2x3 matrices over the real numbers is 6, not [2,3]. Can we call this associatedtype Shape instead? Or is there some reason you are avoiding that term?

    /// Create a scalar in the real vector space that the type represents.
    /// - Parameter scalar: the scalar
    init(_ scalar: ScalarElement)

I don't understand what this does. Scalars aren't in "the vector space that the type represents." They're objects of a different type entirely. Also, you use "real vector space" fairly pervasively in the comments for this protocol, but AFAIK you want to represent vector spaces (or even modules) over arbitrary fields (rings).

    init(repeating repeatedValue: ScalarElement, dimensionality: Dimensionality)

    /// The dimensionality of this vector.
    var dimensionality: Dimensionality { get }

Again, these would make more sense as shape: Shape.
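To make the suggestion concrete, here is a hedged sketch of the renaming; this `VectorNumeric` is a minimal stand-in for the manifesto's protocol, not its exact definition, and `Matrix` is a made-up conforming type:

```swift
protocol VectorNumeric {
    associatedtype ScalarElement
    associatedtype Shape  // formerly `Dimensionality`

    init(_ scalar: ScalarElement)
    init(repeating repeatedValue: ScalarElement, shape: Shape)

    /// The shape of this vector.
    var shape: Shape { get }
}

// A 2x3 matrix has shape [2, 3] but dimension 6, which is why `Shape`
// is the less misleading name for this associated type.
struct Matrix: VectorNumeric {
    var shape: [Int]
    var elements: [Float]

    init(_ scalar: Float) {
        shape = []
        elements = [scalar]
    }

    init(repeating repeatedValue: Float, shape: [Int]) {
        self.shape = shape
        elements = Array(repeating: repeatedValue, count: shape.reduce(1, *))
    }

    /// The dimension of the underlying vector space: a single integer,
    /// the product of the shape's extents.
    var dimension: Int { return elements.count }
}
```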

    /// Returns the scalar product of the vector.
    static func * (scale: ScalarElement, value: Self) -> Self

I'm assuming that Self * ScalarElement and Self *= ScalarElement would be defaulted as well, is that correct?

It's an oversight. It definitely should!
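For what it's worth, both forms could be defaulted from the single `Scalar * Self` requirement in a protocol extension. A hedged sketch, where `VectorNumeric` is again a minimal stand-in and `V2` is a toy conforming type:

```swift
protocol VectorNumeric {
    associatedtype ScalarElement
    static func * (scale: ScalarElement, value: Self) -> Self
}

extension VectorNumeric {
    // Self * Scalar, defaulted in terms of the Scalar * Self requirement.
    static func * (value: Self, scale: ScalarElement) -> Self {
        return scale * value
    }
    // Self *= Scalar, likewise.
    static func *= (value: inout Self, scale: ScalarElement) {
        value = scale * value
    }
}

// A toy conforming type to exercise the defaults.
struct V2: VectorNumeric {
    var x: Float
    var y: Float
    static func * (scale: Float, value: V2) -> V2 {
        return V2(x: scale * value.x, y: scale * value.y)
    }
}
```

With this, `V2(x: 1, y: 2) * 3` and `v *= 3` both come for free from the one required operator.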

Would it make sense to conform to Numeric though?

I agree that shape is more straightforward. I was just worried that "shape" is an unfamiliar concept to Swift so I chose a word closer to "dimension". Shape WFM.

Ok, initially I made gradient(of:) support vector-valued functions by supplying a vector of default ones. This is sometimes useful because the user can write a loss function that returns a Tensor which is actually a scalar.

func foo<T: VectorNumeric>(x: T) -> T { ... }
gradient(of: foo) // equivalently: { x in pullback(at: x, in: foo)(T(1)) }

But I later changed gradient(of:) to only support functions that return a scalar. So this requirement can be removed.
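The seeding-with-one equivalence can be illustrated with a purely numerical stand-in. The finite-difference `pullback` below is just a toy for demonstration; the real pullback would be produced by the compiler, not computed this way:

```swift
// Toy pullback for a scalar function at x: multiplies the incoming
// cotangent by f'(x), estimated here by a central finite difference.
func pullback(at x: Float, in f: (Float) -> Float) -> (Float) -> Float {
    let eps: Float = 1e-3
    let dfdx = (f(x + eps) - f(x - eps)) / (2 * eps)
    return { v in v * dfdx }
}

// Seeding the pullback of a scalar-valued function with 1 recovers
// the gradient, as in the equivalence noted above.
func gradient(of f: @escaping (Float) -> Float) -> (Float) -> Float {
    return { x in pullback(at: x, in: f)(1) }
}
```

For example, `gradient(of: { $0 * $0 })(3)` evaluates to approximately 6.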

Also, definitely shouldn't have mentioned "real". The code comment was copied from an old design.



I have a bigger question though. I think it makes sense to have Float and Double conform to VectorNumeric and Differentiable so that they'll work with AD, but there are two problems:

  • Having a scalar type conform to a vector protocol in stdlib may be confusing.
  • The only sensible Shape type for a scalar is perhaps ().

Conforming scalars to a vector-space protocol makes perfect sense mathematically, but it introduces an ambiguity between *(_:Self,_:Self) and *(_:Scalar,_:Self). Ideally Swift would have a way for us to tell the compiler that the ambiguity is purely syntactic (the two operations are semantically equivalent) and to just fuse them, but that doesn't exist today (this would also resolve the problem you have with ExpressibleByIntegerLiteral if it existed, of course).

I think this ambiguity can be worked around for the most common cases. We can define a default implementation of *(_:Scalar,_:Self) in a conditional protocol extension on Numeric, and define *(_:Self,_:Self) on the concrete type. I can't think of a case where a user would want to define algorithms generic over the composition Numeric & VectorNumeric, so in most cases the concrete implementation gets called.
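A compilable sketch of the idea, with a stand-in `VectorNumeric` that has just the two colliding requirements (not the manifesto's full protocol):

```swift
protocol VectorNumeric {
    associatedtype ScalarElement
    static func * (lhs: Self, rhs: Self) -> Self               // elementwise
    static func * (scale: ScalarElement, value: Self) -> Self  // scalar product
}

// For a scalar, ScalarElement == Self, so both requirements collapse to
// (Float, Float) -> Float, and the single concrete stdlib `*` witnesses
// both.  At call sites on the concrete type, that concrete operator is
// chosen, so the ambiguity stays purely syntactic.
extension Float: VectorNumeric {
    typealias ScalarElement = Float
}

// In a generic context the two requirements have distinct parameter types,
// so uses like this are unambiguous as well.
func scaled<T: VectorNumeric>(_ s: T.ScalarElement, _ v: T) -> T {
    return s * v
}
```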


I finally had time to read this. I don't have any feedback of substance to add but just want to say that I find this incredibly exciting! Thank you for pushing this forward @rxwei!


Here's something I haven't talked about in the @differentiable attribute chapter. It concerns the syntax of generic constraints.

Here's a vector type whose + is differentiable.

public extension Vector {
    @differentiable(tangent: tangentAdd, adjoint: adjointAdd)
    static func + (lhs: Vector, rhs: Vector) -> Vector { ... }

    static internal func tangentAdd(lhs: (Vector, Vector), rhs: (Vector, Vector), originalValue: Vector) -> Vector
    static internal func adjointAdd(lhs: Vector, rhs: Vector, originalValue: Vector, direction: Vector) -> (Vector, Vector)
}

Note that this doesn't make mathematical sense, because the arguments and the result are not constrained to conform to Differentiable (they can be Int, for example). So we need a way to attach generic constraints that gate differentiability, so that tangentAdd and adjointAdd can have different generic signatures. The syntax could look like @_specialize(where ...).

public extension Vector {
    @differentiable(tangent: tangentAdd, adjoint: adjointAdd, where Scalar: FloatingPoint)
    static func + (lhs: Vector, rhs: Vector) -> Vector { ... }
}

public extension Vector where Scalar: FloatingPoint {
    static internal func tangentAdd(lhs: (Vector, Vector), rhs: (Vector, Vector), originalValue: Vector) -> Vector
    static internal func adjointAdd(lhs: Vector, rhs: Vector, originalValue: Vector, direction: Vector) -> (Vector, Vector)
}

Hello! During the summer I developed EquationKit, which supports partial differentiation of multivariate polynomials.

So you can write stuff like this:

let polynomial = (3*x + 5*y - 17) * (7*x - 9*y + 23)
print(polynomial) // 21x² + 8xy - 50x - 45y² + 268y - 391
let number = polynomial.evaluate() {[ x <- 4, y <- 1 ]}
print(number) // 0

let dpdx = polynomial.differentiateWithRespectTo(x)
print(dpdx) // 42x + 8y - 50
dpdx.evaluate() {[ x <- 1, y <- 1 ]} // 0

let dpdy = polynomial.differentiateWithRespectTo(y)
print(dpdy) // 8x - 90y + 268
dpdy.evaluate() {[ x <- 11.5, y <- 4 ]} // 0

Thanks for sharing. This looks like a typical library implementation of symbolic differentiation.

This is very different from AD, though, especially first-class AD.

Hi Richard,

Thanks a lot for putting together this great document. I finally got to reading it, even if super late. I think the ideas are great! :slight_smile:

I wanted to point out something that came up when I was working out types for AD, and that I think is not currently dealt with (unless I'm missing something). Given a function of type (Float, Float) -> (Float, Float), you define the gradient as a function of type (Float, Float) -> (Float, Float). I believe a bit more flexibility is needed with respect to the gradient types in some cases. To make this concrete, consider the gather op in TensorFlow: given an input tensor of type Tensor&lt;Float&gt;, you would ideally want the gradient tensor to have type TensorIndexedSlices&lt;Float&gt; or SparseTensor&lt;Float&gt;, since the gradients can be very sparse.

I know that Swift for TF does not necessarily support sparse tensors yet, but I bring this up because it relates to AD more generally and can be an important issue. For example, in some NLP models I tried, densifying the gradients of the gather op was detrimental to performance (e.g., densifying gradient updates for a word embedding lookup table with a large vocabulary).

One way around this would be to allow gradient functions to return any type of gradient they want for each of their arguments, and only make sure the number of arguments is consistent.

Not sure how useful this is, but wanted to throw it out there, given that it can impact performance quite severely. :slight_smile:
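A hypothetical sketch of the flexibility being requested: each differentiable type picks its own cotangent (gradient) representation via an associated type, so a dense value can pair with a sparse gradient. All names here (`DifferentiableValue`, `SparseSlices`, `gatherAdjoint`, ...) are illustrative, not real APIs from the manifesto or Swift for TensorFlow:

```swift
protocol DifferentiableValue {
    associatedtype CotangentVector
}

// A sparse gradient representation: values at a handful of indices.
struct SparseSlices {
    var indices: [Int]
    var values: [Float]
}

struct DenseVector: DifferentiableValue {
    var values: [Float]
    // The cotangent need not match the value's representation: for ops
    // like gather, a sparse cotangent avoids densifying the gradient.
    typealias CotangentVector = SparseSlices
}

// Hand-written adjoint for a toy 1-D gather: the gradient w.r.t. the
// input is nonzero only at the gathered indices, so it stays sparse.
func gatherAdjoint(indices: [Int], seed: [Float]) -> SparseSlices {
    return SparseSlices(indices: indices, values: seed)
}
```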

