Hello Swift community,
The development of differentiable programming in Swift (“Differentiable Swift”, “AutoDiff”) has come a long way since its beginning almost three years ago. Earlier this year, following the Core Team’s interest in evaluating this capability for incorporation into Swift, @dan-zheng and @marcrasi drove and completed a major effort to upstream the entire implementation to the main branch of Swift.
Today, we would like to take differentiable programming in Swift to the Pitch phase with a new proposal. This proposal is derived from the Differentiable Programming Manifesto, but has been scoped down to a forward-compatible (ABI-compatible and optimizable) subset of features to support differentiable programming’s dominant use case — machine learning — as well as other gradient-based numerical computing. We look forward to your feedback!
Differentiable programming for gradient-based machine learning
- Proposal: SE-NNNN
- Authors: Richard Wei, Dan Zheng, Marc Rasi, Bart Chrzaszcz, Aleksandr Efremov
- Review Manager: TBD
- Status: Pitch
- Implementation: On the `main` branch, behind `import _Differentiation`
Introduction
Derivatives are a fundamental tool in calculus and have applications in many
domains, notably gradient-based machine learning (ML). As an easy-to-use,
high-performance language, Swift is a great fit for both highly expressive
algorithms and numerical computations. Meanwhile, ML is one of the fastest
growing technologies today, but mainstream ML development tools are mostly based
on dynamic languages, where it can be challenging for developers to take
advantage of software debugging tools and compile-time code diagnostics, or to
maintain type safety in large-scale software.
As a compiled programming language with a modern type system, Swift has a unique
opportunity to develop its own numerical computing and ML ecosystem. Driven by
the growing needs of ML libraries and algorithms, we believe one key technology,
differentiable programming, will help push ML development experience and
developer productivity to a whole new level.
We propose adding differentiable programming as a first-class,
language-integrated feature in Swift, making Swift the first general-purpose,
statically typed programming language to have automatic differentiation
capabilities.
At a glance, this feature includes the following additions:
- A `@differentiable(reverse)` declaration attribute for declaring differentiable functions.
- `@differentiable(reverse)` function types.
- A `@derivative(of:)` attribute for defining custom derivatives.
- A `Differentiation` module to be distributed in Swift releases, containing:
  - A `Differentiable` protocol, generalizing data structures that are differentiable.
  - Differential operators (e.g. `gradient(of:)`), for evaluating the derivatives of functions.
Differentiable programming is a new paradigm for programming in which programs
can be differentiated throughout. At a glance, differentiable programming lets
you take the derivative of functions whose parameters and results conform to the
`Differentiable` protocol.
import Differentiation
func f(_ x: SIMD32<Float>) -> Float {
    (x * x).sum()
}
let dfdx = gradient(of: f)
dfdx(SIMD32(repeating: 3)) // SIMD32([6, 6, 6, 6, ...])
The ability to get derivatives of programs enables a new world of numerical
computing applications, notably machine learning. With first-class support,
gradient-based learning algorithms can even be built using standard library
types such as `Float` and `SIMD64<Float>` and be differentiated using
protocol-oriented APIs such as `valueWithGradient(at:in:)`.
import Differentiation
struct Perceptron: Differentiable {
    var weight: SIMD2<Float> = .random(in: -1..<1)
    var bias: Float = 0

    func callAsFunction(_ input: SIMD2<Float>) -> Float {
        (weight * input).sum() + bias
    }
}

var model = Perceptron()
let andGateData: [(x: SIMD2<Float>, y: Float)] = [
    (x: [0, 0], y: 0),
    (x: [0, 1], y: 0),
    (x: [1, 0], y: 0),
    (x: [1, 1], y: 1),
]
for _ in 0..<100 {
    let (loss, modelGradient) = valueWithGradient(at: model) { model -> Float in
        var loss: Float = 0
        for (x, y) in andGateData {
            let prediction = model(x)
            let error = y - prediction
            loss = loss + error * error / 2
        }
        return loss
    }
    print(loss)
    model.weight -= modelGradient.weight * 0.02
    model.bias -= modelGradient.bias * 0.02
}
Differentiable programming scales up from simple examples like this to
full-fledged machine learning models using neural networks. Neural networks are
similar to the `Perceptron` example above in that they contain trainable
parameters (commonly organized into neural network layers), and each parameter
can be updated based on the gradient of a loss with respect to that parameter.
Neural network layers can be generalized by a protocol that inherits from
`Differentiable`:
// Example library:
public protocol Layer: Differentiable {
    associatedtype Input: Differentiable
    associatedtype Output: Differentiable
    @differentiable(reverse)
    func callAsFunction(_ input: Input) -> Output
}
public class Dense: Layer { ... }
public class Convolution: Layer { ... }
public struct NDArray<Scalar>: Differentiable { ... }

// Client code:
final class MyModel: Layer {
    let dense1: Dense
    let dense2: Dense

    func callAsFunction(_ input: NDArray<Float>) -> NDArray<Float> {
        dense2(dense1(input))
    }
}
While the differentiation APIs are flexible and fully dynamic, differentiation
is based on a program transformation that happens at compile time. This enables
many static analyses that not only help produce more efficient code but also
detect common numerical programming mistakes such as non-differentiable
functions and zero derivatives.
let grad = gradient(at: 1.0) { x in
    3.0.squareRoot()
}
test.swift:2:4: warning: result does not depend on differentiation arguments and will always have a zero derivative
    3.0.squareRoot()
    ^
test.swift:2:4: note: add 'withoutDerivative(at:)' to silence the warning if zero derivatives are intentional
    3.0.squareRoot()
    ^
    withoutDerivative(at: )
Unlike library-based automatic differentiation, differentiable programming makes
many common runtime errors in machine learning directly debuggable using LLDB,
without library boundaries getting in the way. Also, contrary to library-based
approaches, the differential operators offered in the `Differentiation` library
can be used to take the derivative of functions on any type that conforms to the
`Differentiable` protocol, such as `Float`, `SIMD4<Double>`, `Complex<Double>`,
`[Float]`, and custom types. This enables programmers to integrate gradient-based
learning algorithms, physical simulations, and scientific experiments directly
into their applications without having to incorporate an embedded domain-specific
language or an automatic differentiation algorithm.
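To illustrate this flexibility, here is a small, hedged sketch (the names `mass` and `kineticEnergyGradient` are purely illustrative) that differentiates a simple physics-style function on `Double` using the `gradient(at:in:)` operator described later in this proposal:

import Differentiation

// d/dv (½·m·v²) = m·v; with m = 2 and v = 3, the derivative is 6.
let mass = 2.0
let kineticEnergyGradient = gradient(at: 3.0) { (v: Double) -> Double in
    0.5 * mass * v * v
}
// kineticEnergyGradient == 6.0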
Example: Intelligent apps
One example of using gradient-based machine learning techniques to enhance the
user experience of an app is providing intelligence based on learned user
behavior. Intelligent apps can make predictions, provide suggestions, and learn
user preferences: all of these can be powered by differentiable programming.
The core of such an intelligent app is a function with real-valued "trainable
parameters". Differentiation can be used to systematically optimize (i.e. find
"good" values for) these parameters via gradient descent. (Optimizing these
parameters via conventional algorithms is typically difficult or intractable.)
Consider a podcast player that tries to automatically adjust the playback speed
based on the podcast type and the podcast section. We can define its business
logic as follows, together with a "model" that contains real-valued parameters
controlling how inputs get mapped to outputs.
enum PodcastCategory: Int {
    case comedy
    case news
    ...
}

enum PodcastSection: Int {
    case advertisement
    case introduction
    case body
    case conclusion
}

struct PodcastState {
    let category: PodcastCategory
    let section: PodcastSection
}

struct PodcastSpeedModel: Differentiable {
    var minSpeed, maxSpeed: Float
    /// The multiplier for each podcast category.
    var categoryMultipliers: [Float]
    /// The multiplier for each podcast section.
    var sectionMultipliers: [Float]

    /// Returns a podcast speed multiplier prediction for the given podcast category
    /// and section.
    func prediction(for state: PodcastState) -> Float {
        let speed = categoryMultipliers[state.category.rawValue] * sectionMultipliers[state.section.rawValue]
        if speed < minSpeed { return minSpeed }
        if speed > maxSpeed { return maxSpeed }
        return speed
    }
}
Parameters in this podcast speed model, represented as stored properties in the
struct, determine how quickly the podcast should play under different
circumstances: `minSpeed`, `maxSpeed`, `categoryMultipliers`, and
`sectionMultipliers`. A priori, it is not clear what good parameter values are,
and different users may prefer different parameter values.
An intelligent application could determine personalized parameter values as
follows:
1. Let the user set the speed manually, and record observations whenever the
   user changes the speed.
2. After collecting enough observations, search for parameter values such that
   the model predicts speeds close to the user's preferred speed. If such
   values are found, offer to start automatically setting the speed.
"Gradient descent" is an algorithm that performs this search, and a language
that supports differentiable programming makes it easy to implement gradient
descent. Here is some pseudocode illustrating gradient descent.
First, we need an objective function for gradient descent to minimize. Mean
absolute error is used here:
struct Observation {
    var podcastState: PodcastState
    var userSpeed: Float
}

func meanError(for model: PodcastSpeedModel, _ observations: [Observation]) -> Float {
    var error: Float = 0
    for observation in observations {
        error += abs(model.prediction(for: observation.podcastState) - observation.userSpeed)
    }
    return error / Float(observations.count)
}
Next, we implement the gradient descent algorithm. In the loop, we take the
gradient of the mean error with respect to the model (i.e. with respect to its
properties such as `minSpeed` and `categoryMultipliers`). After some iterations,
the mean error will be minimized and the model will produce more "correct"
results based on what it has learned.
var model = PodcastSpeedModel()
let observations = storage.observations()
for _ in 0..<1000 {
    // The language differentiates `meanError` to get a "gradient", which is a value indicating
    // how to change `model` in order to decrease the value of `meanError`.
    let modelGradient = gradient(at: model) { meanError(for: $0, observations) }

    // Change `model` in the direction that decreases the value of `meanError`.
    let learningRate: Float = 0.01
    model.minSpeed -= learningRate * modelGradient.minSpeed
    model.maxSpeed -= learningRate * modelGradient.maxSpeed
    for i in model.categoryMultipliers.indices {
        model.categoryMultipliers[i] -= learningRate * modelGradient.categoryMultipliers[i]
    }
    for i in model.sectionMultipliers.indices {
        model.sectionMultipliers[i] -= learningRate * modelGradient.sectionMultipliers[i]
    }
}
As we can see, differentiable programming enables developers to effortlessly
incorporate extremely lightweight gradient-based learning algorithms into
applications, while having derivative code synthesized automatically by Swift.
Language-integrated differentiable programming benefits not only ML
practitioners and app developers, but also developers of ML and scientific
computing frameworks. Relying on a single language-integrated differentiable
programming feature eliminates the burden of separately maintaining an automatic
differentiation algorithm and a domain-specific language, reducing development
and maintenance overhead.
Motivation
This section is abridged! Please follow the link above to see the full text.
Math introduction
This section is abridged! Please follow the link above to see the full text.
History of differentiation algorithms
This section is abridged! Please follow the link above to see the full text.
Proposed solution
To push Swift's capabilities to the next level in numerics and machine learning,
we introduce differentiable programming as a new language feature, which
includes standard library APIs and small additive changes to the type system.
The `Differentiable` protocol
`Differentiable` is a protocol defined in the standard library that generalizes
all data structures that can be a parameter or result of a differentiable
function. The compiler derives implementations of the protocol requirements when
a conformance is declared and implementations are missing.
extension Float: Differentiable {
    typealias TangentVector = Self
}
struct Perceptron: Differentiable {
    var weight: SIMD64<Float>
    var bias: Float
}
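To give a rough idea of what the derived requirements look like, the following sketch approximates what the compiler synthesizes for `Perceptron`; it is illustrative only, not the exact generated code (the synthesized `zeroTangentVectorInitializer` is omitted):

extension Perceptron {
    // A memberwise tangent vector, with one component per differentiable stored property.
    struct TangentVector: Differentiable, AdditiveArithmetic {
        var weight: SIMD64<Float>.TangentVector
        var bias: Float.TangentVector
    }

    // Moves each stored property along the corresponding tangent component.
    mutating func move(along direction: TangentVector) {
        weight.move(along: direction.weight)
        bias.move(along: direction.bias)
    }
}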
The `@differentiable(reverse)` declaration attribute
The `@differentiable(reverse)` declaration attribute marks function-like
declarations (function declarations, initializers, properties, and subscripts)
as being differentiable.
@differentiable(reverse)
func cubed(_ x: Float) -> Float {
    x * x * x
}

extension Perceptron {
    @differentiable(reverse)
    func callAsFunction(_ input: SIMD64<Float>) -> Float {
        (weight * input).sum() + bias
    }
}
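The attribute is not limited to ordinary functions. As a hedged sketch (the `Scale` type and its members are hypothetical), it can also mark computed properties and subscripts as differentiable:

struct Scale: Differentiable {
    var factor: Float

    // A differentiable computed property; the attribute applies to its getter.
    @differentiable(reverse)
    var doubled: Float { factor * 2 }

    // A differentiable subscript.
    @differentiable(reverse)
    subscript(x: Float) -> Float { factor * x }
}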
In the Differentiable Programming Manifesto, the differentiable programming
feature is described as using `@differentiable` without `(reverse)`. However, we
choose not to use plain `@differentiable` here because the initial set of
proposed features does not include forward-mode differentiation. Adding
`(reverse)` makes room for future feature additions without ABI breakage.
`@differentiable(reverse)` function types
Differentiable functions are first-class values, identified by a
`@differentiable(reverse)` attribute in the function type. A
`@differentiable(reverse)` function type is a subtype of its corresponding
normal function type (i.e. the type without the `@differentiable(reverse)`
attribute) with an extended ABI, which stores extra information that allows its
values to be differentiated anywhere the function is passed. A normal function
can be implicitly converted to a `@differentiable(reverse)` function with
appropriate compile-time checks.
func addOne(_ x: Float) -> Float { x + 1 }
let _: @differentiable(reverse) (Float) -> Float = addOne
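Because the attribute is part of the type, a `@differentiable(reverse)` function value can be passed around and differentiated wherever it ends up. The following is a minimal sketch (the `slope(of:at:)` helper is hypothetical) showing both the implicit conversion at a call site and differentiation through the function-typed parameter:

// Differentiates any reverse-differentiable (Float) -> Float at a point.
func slope(of f: @differentiable(reverse) (Float) -> Float, at x: Float) -> Float {
    gradient(at: x, in: f)
}

slope(of: addOne, at: 3)       // 1 (implicit conversion of `addOne` happens here)
slope(of: { $0 * $0 }, at: 3)  // 6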
`@derivative` attribute
The `@derivative` attribute is used for declaring custom derivative functions
for some other function declaration. This attribute can be used by libraries to
define differentiable functions that are "primitives", i.e. ones that the
compiler cannot differentiate automatically, or by the user to define special
behavior for debugging and performance tuning purposes.
The `Differentiation` library uses this attribute to define derivatives for math
functions, such as `expf(_:)` in the C standard library.
import Darwin // Or 'Glibc' on Linux

@usableFromInline
@derivative(of: expf)
func derivativeOfExpf(_ x: Float) -> (value: Float, pullback: (Float) -> Float) {
    let y = expf(x)
    return (value: y, pullback: { v in v * y })
}
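Once this derivative is registered, `expf` participates in differentiation like any other differentiable function. A brief, hedged usage sketch:

// d/dx eˣ = eˣ, so the gradient at 0 is e⁰ = 1.
let gradientAtZero = gradient(at: 0 as Float) { x in expf(x) }
// gradientAtZero == 1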
Differential operators
Standard library differentiation APIs that take `@differentiable(reverse)`
functions and return derivative functions or compute derivative values.
// In the standard library:
//
// public func gradient<T, R: FloatingPoint>(
//     of body: @differentiable(reverse) (T) -> R
// ) -> (T) -> T.TangentVector where R.TangentVector == R

func f(_ x: Float) -> Float {
    x * x
}
let dfdx = gradient(of: f)
dfdx(3) // 6
Detailed design
Differentiable data structures
This section is abridged! Please follow the link above to see the full text.
The `Differentiable` protocol
The `Differentiable` protocol defines the operations and structures required for
a type to be differentiated.
public protocol Differentiable {
    /// A type that can be used to represent derivatives with respect to a
    /// value whose type is `Self`. Mathematically, this is equivalent to the
    /// tangent bundle of the differentiable manifold represented by the
    /// differentiable type.
    associatedtype TangentVector: Differentiable & AdditiveArithmetic
        where TangentVector == TangentVector.TangentVector

    /// Moves `self` along the given direction. In Riemannian geometry, this is
    /// equivalent to exponential map, which moves `self` on the geodesic
    /// surface along the given tangent vector.
    mutating func move(along direction: TangentVector)

    /// A closure that produces a zero tangent vector and does not capture `self`.
    ///
    /// In some cases, the zero tangent vector of `self` is equal to
    /// `TangentVector.zero`. In other cases, the zero tangent vector depends on
    /// information in `self`, such as shape for an n-dimensional array type.
    /// For differentiable programming, it is more memory-efficient to define a
    /// custom `zeroTangentVectorInitializer` property which returns a closure
    /// that captures and uses only the necessary information to create a zero
    /// tangent vector. For example:
    ///
    /// ```swift
    /// struct Vector {
    ///     var scalars: [Float]
    ///     var count: Int { scalars.count }
    ///     init(repeating repeatedElement: Float, count: Int) { ... }
    /// }
    ///
    /// extension Vector: Differentiable {
    ///     typealias TangentVector = Vector
    ///
    ///     @noDerivative
    ///     var zeroTangentVectorInitializer: () -> TangentVector {
    ///         let count = self.count
    ///         return { TangentVector(repeating: 0, count: count) }
    ///     }
    /// }
    /// ```
    @noDerivative
    var zeroTangentVectorInitializer: () -> TangentVector { get }
}

extension Differentiable {
    /// A tangent vector such that `move(along: zeroTangentVector)` will not modify
    /// `self`.
    @noDerivative
    var zeroTangentVector: TangentVector { zeroTangentVectorInitializer() }
}
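For types whose zero tangent vector depends on runtime information, such as the `Vector` type in the documentation comment above, the conformance can also be written out by hand. The following is a minimal sketch under the assumption that `Vector` additionally conforms to `AdditiveArithmetic` (required of tangent vectors, elided here) and that `move(along:)` is only called with a tangent vector of matching count:

struct Vector {
    var scalars: [Float]
    var count: Int { scalars.count }
    init(repeating repeatedElement: Float, count: Int) {
        scalars = Array(repeating: repeatedElement, count: count)
    }
}

extension Vector: Differentiable {
    typealias TangentVector = Vector

    // Adds each tangent component to the corresponding scalar.
    mutating func move(along direction: TangentVector) {
        for i in scalars.indices {
            scalars[i] += direction.scalars[i]
        }
    }

    // Captures only `count`, not `self`, to build a zero tangent vector cheaply.
    @noDerivative
    var zeroTangentVectorInitializer: () -> TangentVector {
        let count = self.count
        return { TangentVector(repeating: 0, count: count) }
    }
}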
This section is abridged! Please follow the link above to see the full text.
Differentiable function declarations
This section is abridged! Please follow the link above to see the full text.
Make a function differentiable using `@derivative`
This section is abridged! Please follow the link above to see the full text.
Differentiable function types
This section is abridged! Please follow the link above to see the full text.
Differential operators
The `Differentiation` module will provide APIs which developers can use to
obtain gradient functions, gradient vectors, and pullback closures, along with
efficiently computed original results, from a given `@differentiable(reverse)`
closure. These APIs are called "differential operators".
gradient(of:)
`gradient(of:)` is a higher-order function which behaves exactly like the 𝛁
(Del) operator in mathematics. It takes a differentiable closure that returns a
scalar and returns that closure's gradient function, i.e. a closure which
accepts the same arguments as the input closure but returns gradient vectors
with respect to the input closure's parameter.
/// Returns the gradient function of the given closure with respect to its argument.
/// - Parameter body: A closure whose derivative function will be evaluated.
/// - Returns: The gradient function of `body`.
func gradient<T: Differentiable, R: FloatingPoint & Differentiable>(
    of body: @escaping @differentiable(reverse) (T) -> R
) -> (T) -> T.TangentVector where R.TangentVector: FloatingPoint
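A brief usage sketch: the returned value is an ordinary closure, so the gradient function can be stored and reused at many points.

let dSquare = gradient(of: { (x: Float) in x * x })
dSquare(3)  // 6
dSquare(5)  // 10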
gradient(at:in:)
`gradient(at:in:)` is the "uncurried" form of `gradient(of:)`. It takes a value
and a differentiable closure that returns a scalar, and evaluates the closure's
gradient function at the value.
/// Returns the gradient vector with respect to the argument by evaluating the
/// provided closure's derivative at the argument.
/// - Parameters:
///   - x: An argument to be passed to `body`.
///   - body: A closure whose derivative function will be evaluated.
/// - Returns: A gradient vector with respect to `x`.
func gradient<T: Differentiable, R: FloatingPoint & Differentiable>(
    at x: T, in body: @differentiable(reverse) (T) -> R
) -> T.TangentVector where R.TangentVector: FloatingPoint
The call sites of this API read as if the call is feeding an argument into the
trailing closure and getting back a gradient vector. This API is consistent with
developers' mental model of taking the gradient of an algorithm, and will
therefore be the most commonly used API. For example, a deep learning model's
training loop may look like the following.
for _ in 0..<1000 {
    // Differentiate the loss with respect to the model `classifier` itself, producing a
    // tangent vector `modelGradient` that represents partial derivatives with respect to
    // all trainable model parameters in the model.
    let modelGradient = gradient(at: classifier) { classifier in
        let prediction = classifier(x)
        let loss = softmaxCrossEntropy(logits: prediction, labels: y)
        print("Loss: \(loss)")
        return loss
    }
    optimizer.performStep(for: classifier, along: modelGradient)
}
valueWithGradient(at:in:)
Sometimes the developer needs to obtain both the original result and the
gradient vector. While it is possible to call the differentiable closure and
`gradient(at:in:)` separately, doing so would lead to significant recomputation
overhead, because computing the gradient vector of a differentiable closure at a
value already computes the closure's original result.
`valueWithGradient(at:in:)` is an API for efficiently computing both the
original result and the gradient vector.
/// Returns the result and gradient vector with respect to the argument by evaluating the
/// provided closure's derivative at the argument.
/// - Parameters:
///   - x: An argument to be passed to `body`.
///   - body: A closure whose derivative function will be evaluated.
/// - Returns: The result of `body` evaluated on `x`, equivalent to `body(x)`, and
///   a gradient vector with respect to `x`.
func valueWithGradient<T: Differentiable, R: FloatingPoint & Differentiable>(
    at x: T, in body: @differentiable(reverse) (T) -> R
) -> (value: R, gradient: T.TangentVector) where R.TangentVector: FloatingPoint
// Example: we want both the result and the gradient of `foo(x)`.
func foo(_ x: Double) -> Double {
    tanh(tanh(exp(x)))
}
let x = 2.0

// Slow way: evaluates `foo(x)` twice.
let y1 = foo(x)
let dydx1 = gradient(at: x, in: foo)

// Efficient way: evaluates `foo(x)` once.
let (y2, dydx2) = valueWithGradient(at: x, in: foo)
valueWithPullback(at:in:)
`valueWithPullback(at:in:)` is the most general form of differential operator
for reverse-mode automatic differentiation. Unlike `valueWithGradient(at:in:)`,
which directly computes the gradient vector, `valueWithPullback(at:in:)` returns
a pullback closure that represents a linear approximation of the input closure
at the given value. This formulation corresponds exactly to derivative functions
that are defined with `@derivative`, and enables the most flexibility and
composability. In fact, all other differential operators discussed above are
implemented in terms of `valueWithPullback(at:in:)`.
/// Returns the result and pullback closure by evaluating the provided closure's
/// derivative at the argument.
/// - Parameters:
///   - x: An argument to be passed to `body`.
///   - body: A closure whose derivative function will be evaluated.
/// - Returns: The result of `body` evaluated on `x`, equivalent to `body(x)`, and
///   a pullback closure, which represents a transposed linear combination that
///   approximates `body` at `x`. When evaluated on a tangent vector, the pullback
///   evaluates the linear combination on the tangent vector and returns a gradient
///   vector with respect to `x`.
func valueWithPullback<T: Differentiable, R: Differentiable>(
    at x: T, in body: @differentiable(reverse) (T) -> R
) -> (value: R, pullback: (R.TangentVector) -> T.TangentVector)
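As a hedged sketch of how the more specific operators can be recovered from this general form, seeding the pullback with `1` reproduces the gradient that `gradient(at:in:)` would compute:

func square(_ x: Float) -> Float { x * x }

let (value, pullback) = valueWithPullback(at: 3 as Float, in: square)
// value == 9
let grad = pullback(1)  // 6, same as gradient(at: 3, in: square)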
Static analysis
Differentiable programming in Swift aims to provide the best static compiler
diagnostics to help users catch mistakes. Beyond error diagnostics, the compiler
and the standard library are equipped with static analyses and marker APIs that
help the user write differentiable code with explicit annotations about
non-obvious non-differentiable cases.
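One such marker API is `withoutDerivative(at:)`, which appears in the diagnostic shown earlier. A small, hedged sketch of silencing the warning when a zero derivative is intentional:

let grad = gradient(at: 1.0) { x in
    x + withoutDerivative(at: 3.0.squareRoot())
}
// grad == 1; the constant term is intentionally excluded from differentiation.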
This section is abridged! Please follow the link above to see the full text.
Source compatibility
This feature does not change any existing APIs. While the addition of
`@differentiable(reverse)` function types changes the implicit function
conversion rules in the type checker, the relevant code paths are only triggered
when a `@differentiable(reverse)` function type is involved in a contextual
type.
Effect on ABI stability
The proposed ABI changes are purely additive. Protocols with requirements marked
with `@differentiable(reverse)` will contain an extra entry storing the
corresponding derivative function, provided by conforming types. Similarly,
`@differentiable(reverse)` is a new function representation that represents a
bundle of two functions: the original function and the derivative function.
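Conceptually, and only as an illustration rather than the actual ABI layout, such a value can be pictured as a pair of function values:

// Illustrative model only; the real representation is lower-level than this.
struct ConceptualReverseDifferentiableFunction<T: Differentiable, R: Differentiable> {
    /// The original function.
    var original: (T) -> R
    /// The derivative function: returns the original result together with a pullback
    /// that maps output tangent vectors to input tangent vectors.
    var derivative: (T) -> (value: R, pullback: (R.TangentVector) -> T.TangentVector)
}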
Effect on API resilience
This feature adds the `Differentiable` protocol and differential operators to
the standard library as public APIs. They are purely additive changes to the
standard library.
`Differentiable` protocol
The `Differentiable` protocol contains all necessary requirements for a type to
be differentiated. Without breaking API, it will be possible to add extensions
to the `Differentiable` protocol and add new requirements with default
implementations.
Differential operators
Differential operators (e.g. `derivative(of:)` and `gradient(of:)`) are added to
the standard library as lightweight top-level higher-order functions. These APIs
can be renamed or moved under some namespace without breaking ABI.
Alternatives considered
Not support differentiable programming
We believe first-class differentiable programming is a big step towards making
Swift a real contender in the numerical computing and machine learning
landscape. Differentiable programming will enable intelligent applications,
machine learning models, scientific experiments, physical simulations, and more.
Use another language or framework for differentiable programming
Dynamic languages, like Python and Julia, have established library support for
differentiable programming. While it is possible to interoperate with these
libraries via Swift, we feel that first-class differentiable programming in
Swift is leaps ahead in expressivity, usability, and safety.
Other approaches to differentiable programming
See "Approaches to automatic differentiation" above for an overview and
comparison of automatic differentiation approaches. First-class language support
for differentiation will enable convenient, extensible, and performant
differentiable programming in Swift, more so than library-based approaches.
Acknowledgements
The development of this feature started in early 2018 as part of the Swift for
TensorFlow project and has been pioneered by engineers from Google. The authors
would like to thank everybody involved. See the Acknowledgements section of the
manifesto.