Differentiable Programming Mega-Proposal

Hi Swift community,

We have completed a comprehensive proposal for the differentiable programming feature we’ve been incubating over the last 1.5 years. We’ve gone over many iterations on the feature design, and have partially completed the implementation. Now we are ready to start a discussion on Swift Evolution, specifically on upstreaming and standardizing the feature.

Since this proposal is overly long (~60 pages), we hope to start by merging it into the docs/ directory in apple/swift via apple/swift#27034, and draft bite-sized proposals that contain references to the mega-proposal.

We look forward to your feedback!

- Richard, Dan, Marc and Bart

Full text: See external markdown.
Abridged text: See below. This is to fit 115,603 characters into the 32,000-character limit on the forum.

Differentiable Programming Mega-Proposal

Table of contents

Introduction

This proposal introduces first-class differentiable programming to Swift. First-class differentiable programming includes five core additions:

  • The Differentiable protocol.
  • @differentiable function types.
  • The @differentiable declaration attribute for defining differentiable
    functions.
  • The @differentiating and @transposing attributes for defining custom
    derivatives.
  • Differential operators (e.g. derivative(of:)) in the standard library.

Differentiable programming is a new paradigm for programming in which programs can be differentiated throughout. At a glance, differentiable programming lets you take the derivative of functions whose parameters and results conform to the Differentiable protocol.

@differentiable
func f(_ x: Float) -> Float {
    x * x
}
let dfdx = derivative(of: f)
dfdx(3) // 6

The ability to get derivatives of programs enables a new world of numerical computing applications, notably machine learning. With first-class support, gradient-based learning algorithms can even be built using standard library types such as Float and SIMD64<Float> and be differentiated using protocol-oriented APIs such as valueWithGradient(at:in:).

struct Perceptron: @memberwise Differentiable {
    var weight: SIMD2<Float> = .random(in: -1..<1)
    var bias: Float = 0

    @differentiable
    func callAsFunction(_ input: SIMD2<Float>) -> Float {
        (weight * input).sum() + bias
    }
}

var model = Perceptron()
let andGateData: [(x: SIMD2<Float>, y: Float)] = [
    (x: [0, 0], y: 0),
    (x: [0, 1], y: 0),
    (x: [1, 0], y: 0),
    (x: [1, 1], y: 1),
]
for _ in 0..<100 {
    let (loss, 𝛁loss) = valueWithGradient(at: model) { model -> Float in
        var loss: Float = 0
        for (x, y) in andGateData {
            let ŷ = model(x)
            let error = y - ŷ
            loss = loss + error * error / 2
        }
        return loss
    }
    print(loss)
    model.weight -= 𝛁loss.weight * 0.02
    model.bias -= 𝛁loss.bias * 0.02
}

Differentiable programming scales up to full machine learning models, built with third-party libraries like TensorFlow.

import TensorFlow

var classifier = Sequential {
    var layer1 = Dense<Float>(inputSize: 784, outputSize: 100, activation: relu)
    var layer2 = Dense<Float>(inputSize: 100, outputSize: 30, activation: relu)
    var layer3 = Dense<Float>(inputSize: 30, outputSize: 3, activation: identity)
}

let optimizer = SGD(for: classifier, learningRate: 0.02)
Context.local.learningPhase = .training
let x: Tensor<Float> = ...
let y: Tensor<Int32> = ...

for _ in 0..<1000 {
    let 𝛁model = gradient(at: classifier) { classifier -> Tensor<Float> in
        let ŷ = classifier(x)
        let loss = softmaxCrossEntropy(logits: ŷ, labels: y)
        print("Loss: \(loss)")
        return loss
    }
    optimizer.update(&classifier, along: 𝛁model)
}

While the differentiation APIs are flexible and fully dynamic, differentiation is based on a program transformation that happens at compile-time. This enables many static analyses that not only help produce more efficient programs, but also detect common numerical programming mistakes such as non-differentiable functions and zero derivatives.

let grad = gradient(at: 1.0) { x in
    3.squareRoot()
}
test.swift:2:18: warning: result does not depend on differentiation arguments and will always have a zero derivative; do you want to add 'withoutDerivative(at:)' to make it explicit?
    3.squareRoot()
    ^
     withoutDerivative(at:)

With a first-class differentiable programming language, some of the most common runtime errors in machine learning become directly debuggable without library boundaries. Simply step through backpropagation using LLDB to debug derivatives.


Backpropagation debugging demo using LLDB.

Motivation

Background

In mathematics, the derivative of a function of a real variable is another function that measures how sensitive the original function's output is to changes in its argument. Differentiation is the process of computing derivatives. See the "Math Introduction" section below for more details.

Derivatives are a fundamental tool in calculus and have applications in many domains, notably deep learning.

Numerical computing in Swift

Swift is an expressive, high-performance language that is a great fit for numerical applications. Recent proposals have paved the way for low-level numerical computing in Swift: [AdditiveArithmetic][SE-0233], SIMD [[1][SE-0229]] [[2][SE-0251]], [generic math functions][SE-0246]. However, high-level numerical computing applications, including machine learning and artificial intelligence, require more work.

We believe that first-class differentiable programming is a big step towards high-level numerical computing support and will make Swift a real contender in the numerical computing and machine learning landscape. Differentiable programming will enable intelligent applications, machine learning models, scientific experiments, physical simulations, and more.

Intelligent applications

Intelligent applications are smart: they use machine learning techniques to enhance user experiences. Intelligent applications can make predictions, provide suggestions, and learn user preferences: all of these can be powered by differentiable programming.

The core of an intelligent application is a function with real-valued parameters. Differentiation can be used to systematically optimize (i.e. find "good" values for) these parameters via gradient descent. (Optimizing these parameters via conventional algorithms is typically difficult or intractable.)

For example, consider a podcast player that tries to automatically adjust the playback speed based on the podcast type and the podcast section.

enum PodcastCategory {
    case comedy
    case news
    ...
}

enum PodcastSection {
    case advertisement
    case introduction
    case body
    case conclusion
}

struct PodcastState {
    let category: PodcastCategory
    let section: PodcastSection
}

struct PodcastSpeedModel {
    var minSpeed, maxSpeed: Float
    var categoryMultipliers: [PodcastCategory: Float]
    var sectionMultipliers: [PodcastSection: Float]

    /// Returns a podcast speed multiplier prediction for the given podcast category
    /// and section.
    func prediction(for state: PodcastState) -> Float {
        let speed = categoryMultipliers[state.category, default: 1] * sectionMultipliers[state.section, default: 1]
        if speed < minSpeed { return minSpeed }
        if speed > maxSpeed { return maxSpeed }
        return speed
    }
}

This podcast speed model has parameters that determine how quickly the podcast should play under different circumstances: minSpeed, maxSpeed, categoryMultipliers, and sectionMultipliers. A priori, it is not clear what good parameter values are, and different users may prefer different parameter values.

An intelligent application could determine personalized parameter values as follows:

  1. Let the user set the speed manually, and record observations whenever the user changes the speed.

  2. After collecting enough observations, search for parameter values such that the model predicts speeds close to the user's preferred speed. If such values are found, offer to start automatically setting the speed.

"Gradient descent" is an algorithm that performs this search, and a language that supports differentiable programming makes it easy to implement gradient descent. Here is some pseudocode illustrating gradient descent.

First, we need an objective function for gradient descent to minimize. Mean absolute error is used here:

struct Observation {
    var podcastState: PodcastState
    var userSpeed: Float
}

func meanError(for model: PodcastSpeedModel, _ observations: [Observation]) -> Float {
    var error: Float = 0
    for observation in observations {
        error += abs(model.prediction(for: observation.podcastState) - observation.userSpeed)
    }
    return error / Float(observations.count)
}

Next, we implement the gradient descent algorithm.

var model = PodcastSpeedModel()
let observations = storage.observations()
for _ in 0..<1000 {
    // The language differentiates `meanError` to get a "gradient", which is a value indicating
    // how to change `model` in order to decrease the value of `meanError`.
    let gradient = gradient(at: model) { meanError(for: $0, observations) }

    // Change `model` in the direction that decreased the value of `meanError`.
    model -= 0.01 * gradient
}

Type-safe machine learning

Today, machine learning is predominantly done in dynamically-typed languages like Python: these languages are concise and easy to use. However, some people prefer safer programming: features like type checking and static diagnostics help catch errors early and improve productivity.

Differentiable programming in Swift enables safe, powerful machine learning. Custom differentiable data structures can be declared and checked at compile-time. Thanks to protocol-oriented programming, differentiable types are generalized by a protocol, enabling differential operators to be defined as higher-order functions constrained on such a protocol. Mathematical optimization algorithms such as neural network optimizers can also be defined generically over such a protocol and work with all differentiable types.
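As a rough illustration of what "defined generically over such a protocol" can mean, here is a minimal sketch. It assumes the Differentiable protocol proposed below (with a move(along:) requirement and an AdditiveArithmetic TangentVector); the scaleGradient closure is a hypothetical stand-in for whatever learning-rate scaling a real optimizer protocol would provide.

// A minimal sketch of a generic gradient-descent step over any Differentiable
// model. Assumes the proposed `Differentiable` protocol (with `move(along:)`
// and an `AdditiveArithmetic` tangent vector); `scaleGradient` is a
// hypothetical stand-in for learning-rate scaling.
func gradientDescentStep<Model: Differentiable>(
    _ model: inout Model,
    gradient: Model.TangentVector,
    scaleGradient: (Model.TangentVector) -> Model.TangentVector
) {
    // Move against the gradient; `AdditiveArithmetic` provides `.zero` and `-`.
    model.move(along: Model.TangentVector.zero - scaleGradient(gradient))
}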

Calculus is fun

Calculus is fun, and differentiation in the Swift toolbox will let programmers explore that fun. Here are some interesting applications:

Animations

Easing functions specify the rate of change of parameters for animations. Differentiation enables easy manipulation of these functions.

Games

Physics equations can be modeled using differentiable functions in game engines. Intelligent agents in games can be trained using techniques like machine learning that are enabled by differentiation.

Simulations

Many simulation techniques for fluids and other physical processes are based on approximate solutions to equations defined in terms of derivatives, like the Euler equations and Navier-Stokes. Being able to differentiate functions is an important building block for implementing algorithms to solve these equations.

Robotics

Control algorithms used in robotics and mechanical engineering rely on (often higher-order) derivatives of functions that model the behavior of joints and other physical systems. A language like Swift that can efficiently compute these derivatives without incurring the unpredictable runtime overhead of garbage collection may be well-placed to run aboard robots.

Rendering and ray tracing

Traditional rendering systems are black boxes that consume data structures with scene geometry and produce images, but the physical processes they simulate are made up of differentiable functions. Building a ray tracer out of differentiable building blocks unlocks applications like inverse rendering (going from an image to scene geometry). [1] [2]

History of differentiation algorithms

This section is abridged! Please see the corresponding section in the external Markdown linked above.

Approaches to automatic differentiation

In practice, automatic differentiation is the most common differentiation algorithm because it is precise and efficient. This section summarizes approaches to automatic differentiation.

Embedded domain-specific languages

This section is abridged! Please see the corresponding section in the external Markdown linked above.

Source code transformation tools

Source code transformation tools are another approach to differentiable programming. Tool users write code, select various differentiation configuration options (the name of the function-to-differentiate, the independent and dependent variables, etc.), and provide them to the tool. The tool analyzes the input code and generates output code that computes derivatives according to the options.

Historically, this is one of the oldest approaches for automatic differentiation. Tools like Tapenade and ADIC/ADIFOR compute derivatives of Fortran and C code.

An advantage of source code transformation tools is that they are essentially static compilers: they can perform static analyses on input code to generate optimized derivative-computing output code. For example, Tapenade performs "activity analysis" to determine variables that do not need a derivative and "TBR (to-be-recorded) analysis" to remove unnecessary intermediate variables during differentiation.

However, these tools are not ideal for usability: users must interact with an external GUI to specify inputs and they receive a textual program as output. This external workflow is an extra indirection that takes users out of their natural programming environment. Exposing the tool-provided differentiation features within a language would be more ergonomic.


Image of Tapenade web interface.
User specifies input program and configuration options.
Tapenade generates derivative-computing output program.

First-class language support

Another class of approaches to differentiable programming integrates differentiation semantics and code transformations into a programming language itself, to varying degrees. While there are no mainstream programming languages that support differentiable programming, research systems like Stalin∇ add first-class differential operators (e.g. grad) into the language and the reverse-mode automatic differentiation transformation into the compiler.

First-class language support for differentiation can reap the benefits of source code transformation techniques (e.g. language coverage, performant derivative code) without requiring programmers to use an external tool. Well-designed, powerful differentiation primitives enable users to define their own custom differentiation APIs that would otherwise not be possible in differentiation libraries.

Why bake differentiation into Swift?

First-class language support for differentiation will enable convenient, extensible, and performant differentiable programming in Swift.

Maximal coverage of Swift language features

First-class support for differentiation in Swift enables differentiation to work nicely with a maximal number of Swift language features, including mutation and control flow. Users of differentiable programming do not need to write in a restricted subset of Swift: just write normal code and use differentiation.

Extensibility

First-class language support enables an extensible differentiable programming system.

Custom types can be extended to be differentiable with minimal boilerplate. Custom derivative functions can be retroactively registered for existing functions. Users can define custom differentiation APIs using the powerful primitive operators defined in the standard library and supported by the type system.

Static warnings and errors

Some functions perform non-differentiable operations (on the path from parameters to result) and thus cannot be differentiated. Functions that do not use their parameters to compute the result are technically differentiable, but the derivative is trivially always zero.

With language support for differentiation, the compiler can identify these cases statically via data flow analysis and produce a non-differentiability error or warning. These diagnostics improve productivity and help users catch errors ahead of time. Library-based differentiation approaches cannot generally provide these diagnostics.

For details on static warnings and errors, see the "Static analysis" section in the detailed design below.
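As a hypothetical illustration of the first case (the exact diagnostics and their wording are specified in that section), a function whose result only reaches its parameter through an integer conversion would be rejected:

// Hypothetical example, not from the proposal: converting through `Int`
// severs the differentiable data flow from `x` to the result, so the
// compiler would reject this declaration with a non-differentiability error.
@differentiable
func truncate(_ x: Float) -> Float {
    Float(Int(x))
}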

The pursuit for user-defined code transformations

The key code transformation enabling differentiable programming is "derivative code generation". Derivative code generation implements automatic differentiation: given an "original function" to differentiate, a derivative function is generated by replacing function applications in the original function with corresponding derivative function applications. The algorithm is described in detail in the Swift Differentiable Programming Implementation Overview document.
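For intuition only, here is a hand-written sketch of the kind of code this transformation produces. It is not compiler output, and the exact form of generated derivatives is described in the implementation overview document.

import Foundation

// Original function: g(x) = sin(x * x).
func g(_ x: Double) -> Double {
    sin(x * x)
}

// Hand-written analogue of a generated derivative: each operation in the
// original body is paired with its derivative and composed via the chain
// rule. The returned differential maps a small input change `dx` to the
// corresponding output change.
func gDerivative(_ x: Double) -> (value: Double, differential: (Double) -> Double) {
    let y = x * x
    return (value: sin(y), differential: { dx in cos(y) * (2 * x * dx) })
}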

Some languages provide the ability to define custom code transformations:

  • Macros enable syntax-based code transformations at compile-time. Hygienic macros (macro systems that avoid accidental variable capture) are available in a variety of languages, including Lisp, Julia, Rust, and Scala, to name a few. As an example: generated type-safe schema wrappers can be implemented using hygienic macros in Scala.

  • Compiler plugin systems enable programmers to write plugins that extend the behavior of a compiler. Compiler plugins are more popular in bootstrapped languages, like Haskell, Rust and Scala, where the plugin can be written in the language itself. As an example: a continuation-passing-style code transformation can be implemented as a compiler plugin in Scala.

One might make the case that derivative code generation for differentiation is better implemented as a custom code transformation. While that may be true in theory, Swift does not yet support custom code transformations in practice. This proposal presents differentiable programming as a system of high-level language features and semantics; derivative code generation is an implementation detail. If a system for custom code transformations is added to Swift one day, it may be possible to reimplement derivative code generation using that system without changing the high-level differentiable programming features proposed here.

Math introduction

What is a derivative?

The derivative of a function f measures how quickly the function's output changes when you make small changes to the function's input. The value of this measurement depends on the input x that you start with, and we call the value of the measurement starting at that input "the derivative of f at x".

For a single variable real function (a function with a single real input and a single real output), the derivative of f at x can be summarized as a single real number f'(x) such that f(x + ε) ~= f(x) + f'(x) * ε. In other words, changing the input by a tiny amount ε changes the output by approximately f'(x) * ε.


f(x) = x changes by exactly ε whenever you change its input by ε, so its derivative is 1 everywhere.


Near x = 0, f(x) = x^2 changes very little when you change its input, so its derivative at x = 0 is 0 (see orange line).
Near x = 1, f(x) = x^2 changes by approximately 2*ε when you change its input by ε, so its derivative at x = 1 is 2 (see green line).
In general, the derivative of f(x) = x^2 at x is 2*x.
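As a quick numerical sanity check of the approximation f(x + ε) ~= f(x) + f'(x) * ε, plain Swift suffices:

// Finite-difference check: (f(x + ε) - f(x)) / ε approaches f'(x) = 2x.
func f(_ x: Double) -> Double { x * x }

let x = 1.0
let epsilon = 1e-6
print((f(x + epsilon) - f(x)) / epsilon)   // ≈ 2.000001, close to the exact derivative 2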

Iterative optimization

Iterative optimization algorithms use derivatives to optimize functions (i.e. find the inputs that minimize or maximize the output of the function). For example, the simple "gradient descent" algorithm starts with an arbitrary input x and uses the derivative of the function at x to determine whether it needs to increase or decrease x to decrease the output of the function. Then it mutates x slightly along the appropriate direction and repeats until the output stops decreasing.
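A minimal sketch of that loop, for f(x) = (x - 3)^2 with the hand-written derivative f'(x) = 2 * (x - 3), might look like this:

// Gradient descent on f(x) = (x - 3)^2 using its hand-written derivative.
func fPrime(_ x: Double) -> Double { 2 * (x - 3) }

var x = 0.0
for _ in 0..<200 {
    x -= 0.1 * fPrime(x)   // step against the derivative to decrease f
}
print(x)   // close to 3, the minimizer of f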

Derivatives of functions with arbitrary inputs

Real world programs deal with data more complicated than single real variables. Fortunately, there are mathematical theories that extend derivatives to functions with nearly arbitrary inputs and outputs.

Recall our original description of derivative: "The derivative of a function f measures how quickly the function's output changes when you make small changes to the function's input." This makes sense for arbitrary input and output types, as long as we can describe small changes in them.

It is easy to describe small changes in nested structures of real numbers: they are just small changes in all the components' real numbers. For example, consider:

struct Point {
    var x, y: Float
}

struct PointPair {
    var p1, p2: Point
}

A small change in Point might be "add 0.01 to x and add 0.02 to y". A small change in PointPair might be "add 0.01 to p1.x and add 0.01 to p2.x".

We can define new types that capture the values of these small changes. We call these types "tangent vectors", a term from math. For example:

extension Point {
    struct TangentVector {
        // `dx` and `dy` are small changes in `x` and `y`, respectively.
        var dx, dy: Float
    }
}

extension PointPair {
    struct TangentVector {
        // `dp1` and `dp2` are small changes in `p1` and `p2`, respectively.
        var dp1, dp2: Point.TangentVector
    }
}

In terms of these tangent vectors, the small changes that we described in words above would be:

Point.TangentVector(dx: 0.01, dy: 0.02)

PointPair.TangentVector(
    dp1: Point.TangentVector(dx: 0.01, dy: 0),
    dp2: Point.TangentVector(dx: 0.01, dy: 0))

In terms of tangent vectors, the derivative of a function f: (A) -> B is a function df: (A, A.TangentVector) -> B.TangentVector. In other words, df takes a starting value of type A and a small change A.TangentVector and tells you what the resulting small change in B is.
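For example, here is a hand-written sketch (not compiler-generated code) of such a derivative for a concrete function on Point, treating Float as its own tangent vector:

// f(p) = p.x * p.y
func f(_ p: Point) -> Float {
    p.x * p.y
}

// The derivative maps a starting point and a small change in it to the
// resulting small change in the output (product rule: d(x*y) = x*dy + y*dx).
func df(_ p: Point, _ change: Point.TangentVector) -> Float {
    p.x * change.dy + p.y * change.dx
}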

The gradient descent iterative optimization algorithm can run on any function f: (A) -> Float as long as A is a type for which we can define a tangent vector. It iteratively walks around different values of A, searching for a value that minimizes the output of f.

Proposed solution

To push Swift's capabilities to the next level in numerics and machine learning, we introduce differentiable programming as a new language feature, which includes standard library APIs and small additive changes to the type system.

The Differentiable protocol

Differentiable is a standard library protocol that generalizes all data structures that can be a parameter or result of a differentiable function. The compiler derives protocol requirement implementations when a @memberwise conformance is declared.

extension Float: Differentiable {
    typealias TangentVector = Self
}
struct Perceptron: @memberwise Differentiable {
    var weight: SIMD64<Float>
    var bias: Float
}

The @differentiable declaration attribute

The @differentiable declaration attribute is an attribute that marks function-like declarations (function declarations, initializers, properties, and subscripts) as being differentiable.

@differentiable
func cubed(_ x: Float) -> Float {
    x * x * x
}
extension Perceptron {
    @differentiable
    func callAsFunction(_ input: SIMD64<Float>) -> Float {
        (weight * input).sum() + bias
    }
}

@differentiable function types

@differentiable function types are a subtype of normal function types, with a different runtime representation that stores metadata allowing their values to be differentiated anywhere.

func addOne(_ x: Float) -> Float { x + 1 }
let _: @differentiable (Float) -> Float = addOne
let _: @differentiable(linear) (Float) -> Float = addOne

@differentiating and @transposing attributes

@differentiating and @transposing attributes are used for declaring custom derivative functions for some other function declaration.

import Glibc

@differentiating(expf)
func _(_ x: Float) -> (value: Float,
                       differential: @differentiable(linear) (Float) -> Float) {
    let y = expf(x)
    return (value: y, differential: { v in v * y })
}

Differential operators

Standard library differentiation APIs that take @differentiable functions and return derivative functions or compute derivative values.

// In the standard library:
//
//     func derivative<T: FloatingPoint, R>(
//         of body: @escaping @differentiable (T) -> R
//     ) -> (T) -> R where T.TangentVector: FloatingPoint

@differentiable
func f(_ x: Float) -> Float {
    x * x
}
let dfdx = derivative(of: f)
dfdx(3) // 6

Detailed design

This section is abridged! Please see the corresponding section in the external Markdown linked above.

Examples of differentiable programming

This section is abridged! Please see the corresponding section in the external Markdown linked above.

Future directions

This section is abridged! Please see the corresponding section in the external Markdown linked above.

Source compatibility

This section is abridged! Please see the corresponding section in the external Markdown linked above.

Alternatives considered

This section is abridged! Please see the corresponding section in the external Markdown linked above.

Acknowledgements

This section is abridged! Please see the corresponding section in the external Markdown linked above.

65 Likes

Exciting stuff! It's been great having this project go on and be continually contributing great things back to Swift!

As for this feature, I doubt most people who currently use Swift or who are just learning (and haven't come from an ML background) will ever really need to put this into their repertoire for everyday use. (Unless ML programming becomes the next bread and butter).

So in my mind this is a pretty expert-level feature, with what I imagine has some pretty specific domain knowledge required to implement.

If that is true, one of my worries is going to be the ongoing maintenance cost of this feature. If this is upstreamed, is there a real danger of eventually losing the domain knowledge required from the TensorFlow team to maintain this? Should we even factor that in as a community, or is this something the Apple core team will have to debate internally? I'm inclined to say we should just ignore this concern, and leave it for someone who has the authority to speak, speak. But I'm just throwing it out there.

Another of my concerns is teachability. Has some thought been put into how this will be documented/taught to someone who is learning Swift, but might not have any experience with DP concepts, but still wants to wet their feet? I imagine fitting something into the Swift book will be a bit challenging given some of the background knowledge required.

But circling back to the idea of adding such a large expert feature, I would be really excited to see this added, as it expands the domain of Swift into exciting ML concepts at a first-class level.

16 Likes

This feature going into the core is going to be amazing! Up until now, TensorFlow has been the only DL framework in the ecosystem, given they developed the compiler, but once this is out in the open, others can come and make their own frameworks. This could be a huge boost for the Swift AI community :slight_smile:

4 Likes

Another of my concerns is teachability...how this will be documented/taught to someone who is learning Swift.

I think this is a great point, and feel a lot of good work can be done in this area once we have really planned out well how we want to bring this feature into Swift. As background, I joined the team as an intern over the Summer, working on various parts of the AD project, and am back in school now. I came in with absolutely no experience in AD (only basic data science and iOS development), and I totally understand the challenges of learning AD.

With this, I think we need to create several stages/chapters of explanations to new users - use terms and examples that seem more familiar to new users with basic calculus understanding (think simple Float -> Float functions) that may not be 100% correct (for example don't talk about differentiable manifolds). But build on with a more accurate representation each time. As an analogy, it's like how in high-school science classes, students learn a new, more accurate representation of the atom every couple months, from Thompson, to Rutherford, to Bohr, and finally to Schrodinger.

Of course, someone still needs to do all this! This has not been done yet, and that's mainly since we have been focused on getting this feature fully baked out and wanted to start working with you and the rest of the Swift community on this.

As for this feature, I doubt most people who currently use Swift or who are just learning (and haven't come from an ML background) will ever really need to put this into their repertoire for everyday use

Definitely agree that today, a small subset of people who are currently using Swift will use this. As we know, as general-purpose as Swift is, the main set of people using it today are iOS developers on Mac based machines. But with a lot of great work being done such as adding web/server-side capabilities (Vapor, SwiftNIO, etc), making Swift work great on Linux and Windows devices (running a program and also tooling like text editors and IDEs), and people making Swift a more performant scientific programming language, I think more people will start to use Swift in other areas. And as such, in anything numerically heavy, AD can begin to be used by more people in those areas. We need to, of course, build these communities, and I believe this can be done by introducing a powerful tool like AD into the language so people have an enticing reason to switch over to Swift to build their libraries and projects.

is there a real danger of eventually losing the domain knowledge required from the TensorFlow team to maintain this?

This is a fair concern, and I think through great documentation we can mitigate the issue. For example, the Implementation Overview document is a great resource for this. One thing to keep in mind is that the documentation reflects the current implementation, not the final implementation (for example it refers to VJP and JVP functions, which are not brought up in this proposal). Additionally, we want to make sure that open source contributors can work on this in the Swift compiler as well - not just the TensorFlow and Apple core team - so documentation and explanations are definitely a concern!

On a similar note, as part of the broader Swift for TensorFlow project, we have had open design meetings, two of which have talked about the AD implementation details of the project which can also be helpful on this front. We deeply care about making sure AD doesn't become a black box, and helping others learn about exactly how it works and is implemented.

8 Likes

What are examples of practical differentiable types where the tangent vector type is different from the type itself? When I grepped through the proposal for all occurrences of "TangentVector =", it seems it's always Self, except for the Point/PointDiff example (but these are structurally isomorphic; is that always the case?)

1 Like

An example would be the following:

struct MyStruct: Differentiable {
  var toTrain: Bool
  var weight: Float
  var bias: Float
}

In this case, the weight and bias fields are differentiable since Float is differentiable. However, what about the toTrain boolean flag here? Booleans aren't differentiable, so only a subset of the fields can participate in differentiation. Thus, the compiler will actually synthesize the following TangentVector for MyStruct:

struct MyStruct {
  // ...
  struct MyStructTangentVector {
    var weight: Float
    var bias: Float
  }
  typealias TangentVector = MyStructTangentVector
  // ...
}

Instead of having TangentVector = Self.

The TangentVector is almost always not equal to Self when the object consists of non-differentiable fields. I currently can't think of other instances, maybe @rxwei @dan-zheng @marcrasi might have an example!

EDIT:
Got 2 more examples!

First is orthogonal matrices, whose TangentVector is a skew-symmetric matrix (skew-symmetric matrices form a vector space, so they are their own tangent vectors):

struct SkewSymmetricMatrix: Differentiable {
  typealias TangentVector = Self
  // ...
}
struct OrthogonalMatrix: Differentiable {
  typealias TangentVector = SkewSymmetricMatrix
  // ...
}

Second is quantization. Whenever we quantize values, we use the de-quantized type for derivatives since the quantized type is a discrete value:

// `T` is the de-quantized type.
public struct Quantized<T: Quantizable, QuantizedScalar: FixedWidthInteger> {
    var data: QuantizedScalar
    var range: Range<T.Scalar>
    var scale: QuantizedScalar
    var zeroPoint: Int
}

// `Quantized` is a differentiable manifold when type `T` is also a differentiable manifold.
extension Quantized: Differentiable where T: Differentiable {
  typealias TangentVector = T.TangentVector
}

As background, quantization is used in machine learning for more efficient training and inference of models on smaller devices, helping save memory by storing fewer bits for weights.

Thus, even though the actual value is discrete (like Int), we approximate its tangent to be continuous (like Float) since for a lot of machine learning applications, we don't need a 100% accurate value for weights when training and for inference.

7 Likes

Even in cases where TangentVector is naturally isomorphic to Self, if we have a newtype-like feature at some point it would be great to be able to use it to suppress operations that don't make sense; when Self is Float and TangentVector is naturally also Float, it may still be desirable to make it a type error to multiply two tangent vectors, for example.
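For illustration, a single-field wrapper along those lines might look like this sketch (hypothetical, not part of the proposal):

// A hypothetical single-field wrapper that behaves like a tangent vector:
// values can be added, subtracted, and scaled by a Float, but multiplying
// two tangent values together is deliberately not defined.
struct Tangent: Equatable, AdditiveArithmetic {
    var value: Float
    static var zero: Tangent { Tangent(value: 0) }
    static func + (lhs: Tangent, rhs: Tangent) -> Tangent { Tangent(value: lhs.value + rhs.value) }
    static func - (lhs: Tangent, rhs: Tangent) -> Tangent { Tangent(value: lhs.value - rhs.value) }
    static func * (scale: Float, vector: Tangent) -> Tangent { Tangent(value: scale * vector.value) }
}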

7 Likes

This seems like a very special-case solution, introducing a lot of complexity, in order to solve a very specific problem faced (almost exclusively?) by TensorFlow. Am I missing the other likely users of this feature in the near term?

One might make the case that derivative code generation for differentiation is better implemented as a custom code transformation. While that may be true in theory, Swift does not yet support custom code transformations in practice.

This seems the most important paragraph in the proposal. Is there a reason to pursue this special case now, locking ourselves into one specific implementation, rather than the general solution that the proposal acknowledges would be superior? The Swift team has deferred or rejected numerous requests because they might interfere with more general solutions in the future. This feels like it's in that camp. Is there something that makes differentiable programming more central to Swift's future than, say, DSL programming?

IMO, this proposal should be a list of general changes to the attribute system such that TensorFlow can implement @differentiable without special-casing by the compiler. It might then propose that, as a PoC, it be special-cased temporarily. But without that roadmap, this is -1 IMO.

Have I misread the proposal in some way?

34 Likes

I think it’d be extremely difficult to implement this building on something else. I feel like any higher-level implementation would lose out on a lot of the low level wins implementing it in the compiler gives.

And I’m not really even sure how exactly you’d build this using something else, given how intertwined it is with the actual language.

That may be true, but the proposal suggests the opposite in the same paragraph:

If a system for custom code transformations is added to Swift one day, it may be possible to reimplement derivative code generation using that system without changing the high-level differentiable programming features proposed here.

In either case, this requires much more explanation in the Alternatives Considered section, because I think the whole proposal hinges on why this should be a special case, both why it can't be made general, and why it's so broadly valuable that's worth so much weight.

13 Likes
Synthesis conditions

The compiler automatically synthesizes implementations of Differentiable protocol requirements for struct and class types. Here are the conditions for synthesis:

  • The type must declare a conformance to Differentiable with a @memberwise attribute before the protocol name, either on the type declaration or on an extension in the same file.
  • All stored properties of the conforming type must either be a var that conforms to Differentiable or be marked with the @noDerivative attribute. If a non-Differentiable or a let stored property is not marked with @noDerivative, then it is treated as if it has @noDerivative and the compiler emits a warning (with a fix-it in IDEs) asking the user to make the attribute explicit.
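For concreteness, the quoted conditions could play out like this (an illustrative sketch, not an example from the proposal text):

struct Settings: @memberwise Differentiable {
    // `id` is a `let` and `name` is not Differentiable, so both must be
    // `@noDerivative` (otherwise the compiler warns and treats them as such).
    @noDerivative let id: Int
    @noDerivative var name: String
    // Only `threshold` contributes to the synthesized `TangentVector`.
    var threshold: Float
}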

  1. Is it ok if the compiler doesn't automatically synthesize an implementation for enums?

  2. Is it ok that the compiler-synthesized implementation for recursive types (if there is one) might be incorrect, and hence one would have to implement it manually?

I don't know how often these come up in ML, but these seem like cases not talked about here, so I figured I'd ask.

1 Like

Firstly +1. Great work Tensorflow team!

@cocoaphony I'm going to strongly disagree with your assessment. Calculus has nothing to do with ML. It is a basic foundation of mathematics, and therefore computing.

In fact, most high school kids take it — and often by 10th grade. And yet once we graduate, we throw it out in favor of simple discrete math. Not because it wasn't better at solving problems, but because of the mismatch with our (old-fashioned) tools and programming languages, which only support basic algebraics. Don't keep us in the past.

To make a programming language that supports key foundations of calculus is necessary for the future, and long past due. It's not obscure, it's high school math — no different than sin, cos, and tan.

In our company's code base alone, there are dozens of applications for differentiable computation. From simulation, physics calculations, graphing, estimators, optimizers, regression, convergence problems, even working out rate problems in networking and packet management... and that is before we dig into ML.

33 Likes

@Troy_Harvey, I agree with you AND with @cocoaphony :slight_smile:

My take on this would be that it IS very fundamental, but that there are actually a rather small number of programs out there that need this (completely my imagination, no proof).

So I would rather see this not be part of the compiler (or rather have the compiler be bent specifically to support only this), but rather an "addon" if you wish. If that has serious trade-offs in performance or usability, then maybe it would be better to remove those from the compiler infrastructure but still keep the content of this proposal in a separate package.

Also, kudos to the team who worked on this proposal and the code behind it. That is amazing work!

@scanon My understanding is that Swift does have a newtype-like feature--a struct with one field.

To comment on the main proposal: I find this really exciting despite a complete lack of machine learning background. It's a very cool feature, and I'm really interested in seeing what unconventional applications the community might find for it (ala the example of animation libraries given in the proposal).

This proposal also makes dipping my toe into ML even more appealing personally--this feels like it would be really fun to play around with.

2 Likes

I'm -1 on this proposal. I fail to see why this needs to be baked into the language as a specific case of, what should essentially be, an application of meta-programming.

I think this is a great example of something that should be allowed to be built on-top of Swift, but not something that should be built into Swift. This is not the only application of code generation and compile-time static introspection that is desired.

I also agree with Rob. In my opinion, this statement:

Swift does not yet support custom code transformations in practice.

Does not warrant baking in a construct, but rather warrants further discussion as to why Swift does not allow for that. For instance, if Swift did support custom code transformations, this proposal would be about adding this functionality to the standard library, not the language itself.

Why is adding support for code transformations not part of the Alternatives section?

Also, I found the section for "Source code transformation tools" to be misleading... it's possible to integrate custom build rules to transform code without having to "take[s] users out of their natural programming environment". I've done this for custom enum generation, and I've seen others do it for other code transformations.

14 Likes

Two brief tangents (ha!):

"simple numerical methods" are calculus. Literally the entire motivation of the development of differential and integral calculus was to allow us to rigorously solve complex problems using simple numerical methods. Differentiable programming doesn't change that--it's another tool to abstract the techniques we use and make them a little bit nicer to write down; the actual computation that we end up performing underneath those abstractions is equivalent.

Absolutely--this is a point that I make often--but the ergonomics of using it for something like Float that has a bazillion methods and static operators are unpleasant; you need to provide wrappers for an enormous amount of functionality. This is something that some compiler support could go a long way to help with, and make it much more likely that people would do this sort of thing regularly.

6 Likes

Steve, yes you are right. Should read simple "discrete mathematics" or "algebraics".

While the derivative functions may be equivalent, the time to develop them is not. It is equally true that if it is Turing complete, and has assembly language, we can do anything. But that is reductive, otherwise we've been wasting our time on higher-level better quality languages.

This has been a topic of discussion in offices for decades. Leave school, never use calculus again, solve everything in step-wise integration and simplified curve fits for approximate derivatives.

Requisite cartoon added for proof.

11 Likes

Unlike many comments here, I think the integration of concepts like AD, which are driving the next generation of CS forward, leads to democratization and uptake. At this point, I wouldn't use the words "special case". Heading forward, it may be the dominant case.

I am amazed at the huge shift towards Python in the last 3 years, from interviewing employees, traveling the country, and talking with industry. From building techs, to bankers, to movie animators, to industrial analysts, MBAs, and of course CS — people are using Python data science tools to solve problems in their industry. It's been a huge shift in a small timeframe. The democratization of computer science seems to be finally happening — and it's happening around the universal need for analysis, not traditional "programming". Swift has the opportunity to bridge traditional systems/app development with data science, where the "rest" of the people are headed. Much to my dismay, many of the CS students show up to an interview with only deep learning, and no programming skills (yikes). Welcome to 2019.

6 Likes

I strongly agree with this. This sounds like an argument for the general-purpose newtype that has been discussed repeatedly. Would this be a good motivating example to get that newtype built, and then see if we need additional differential-specific features that newtype cannot address?

I'd like to break this proposal up into smaller pieces, focusing first on those that are most useful to the broadest groups, while paving the way for novel applications like AD. newtype seems an obvious starting point for that, and one that I would strongly support.

This entire idea is extremely exciting. I don't want to downplay that. It should absolutely be pursued as a way to explore weaknesses in Swift that can be solved with general-purpose solutions. I just object to one-off solutions that favor this problem over server-side Swift, proof engines, text parsing, DSLs, embedded programming, or the dozens of other types of programming that also would benefit from syntax tweaks, new language features, or explicit compiler support.

14 Likes

This seems to be a non sequitur - none of the reasons Python has become [more] popular of late seem to involve support for this pitch’s “differential programming”. Supporting this in Swift doesn’t appear to do anything to address the actual blockers for Swift replacing Python in these applications.

Furthermore, I’ve never seen anything like this “differentiables” stuff in Python as used for data analysis. Python certainly doesn’t have any built-in magic specifically for this, though it presumably could be implemented because Python does have support for expressive metaprogramming. Which echoes the prior point about things Swift is missing that are much more fundamental and much bigger limiters to its broader adoption.

I don’t have the Swift compiler & language knowledge to judge this pitch, but as a “lay programmer” I’m essentially befuddled by it. I understand the calculus, I know intellectually what this is all doing, but I can’t think of a single situation in which this stuff would have significantly helped me with anything I’ve faced in my career thus far. I humbly suggest it’s on the pitch’s authors to better explain the real-world problems this solves - or, if it be the case, be more forthcoming about how domain-specific this is. And to clarify if this is really of use broadly for applying machine learning, or is it really just going to make writing the handful of machine learning libraries easier?

6 Likes