Agreement of valueWithDifferential and value

This is to continue a discussion I was having with @rxwei at NeurIPS, because it matters for the similar work we are doing in JuliaLang, and because it is generally interesting.

Consider that one has v, dv = valueWithDifferential(body, x)
(or valueWithPullback, valueWithGradient, etc.),
where the general promise is that v = body(x).
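(For the JuliaLang side of things, the rough analogue I have in mind is Zygote's pullback; just a quick sketch assuming current Zygote naming:)

using Zygote

body(x) = sin(x) * x
v, pb = Zygote.pullback(body, 1.0)   # the promise in question: v agrees with body(1.0)
dv = pb(1.0)[1]                      # derivative of body at 1.0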

As I recollect the discussion:

It was @rxwei's position that v had to be exactly equal to body(x).

It was my position that it had to be fairly close, but that some error due to floating point math was permitted.
In particular, if I had to quantify it,
I would argue that for v_true being the actual result of body(x) according to the math, and v being the value according to v, dv = valueWithDifferential(body, x),
then | body(x) - v_true | >= | v - v_true | for all x.

i.e. if you have some operation like normal sin from a good LibM, which is accurate to 1 ulp,
then the value according to some operation that also calculates or tracks derivative information
should also be accurate to 1 ulp,
but its error could be in the opposite direction.
If however it was something like SLEEF's fast implementation of sin, which is only accurate to 3 ulp, then it too only has to be accurate to 3 ulp, and again the error could be in the opposite direction.

The reason for allowing this disagreement is that, to allow efficient computation of the derivatives, it is sometimes useful to change how one computes the value, so as to calculate useful intermediary values that get reused when calculating the derivative information.
And it is really fiddly to ensure the floating point math remains the same,
given that it is not even associative (i.e. a + (b + c) != (a + b) + c).
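A minimal Julia demonstration of that non-associativity:

julia> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
false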

An example of such a reorganisation is /; here is code from some of our AD tooling for Julia showing how that is useful.
I just would have no confidence that this will give identical answers to performing the operation directly.
The math is correct, but floating point is not to be trusted.

That final point is why I think this is OK:
everyone doing this kind of thing should be aware that floating point is not to be trusted.
TensorFlow, for instance, documents that the exact floating point values returned by functions are excluded from its SemVer promise.

Anyway, this is an interesting conversation worth having.

3 Likes

Thanks for starting this interesting discussion!

I also assumed that valueWithDifferential(at: x, in: f).value == f(x) and valueWithPullback(at: x, in: f).value == f(x) are reasonable semantic requirements, but it seems there may be performance-related motivations for relaxing these requirements.

I'm not familiar with Julia, so here's my interpretation of the linked Julia code:

function rrule(::typeof(/), A::AbstractVecOrMat{<:Real}, B::AbstractVecOrMat{<:Real})
    Aᵀ, dA_pb = rrule(adjoint, A)
    Bᵀ, dB_pb = rrule(adjoint, B)
    Cᵀ, dS_pb = rrule(\, Bᵀ, Aᵀ)
    C, dC_pb = rrule(adjoint, Cᵀ)
    function slash_pullback(Ȳ)
        # Optimization note: dAᵀ, dBᵀ, dC are calculated no matter which partial you want
        # this is not a problem if you want the 2nd or 3rd, but if you want the first, it
        # is fairly wasteful
        _, dC = dC_pb(Ȳ)
        _, dBᵀ, dAᵀ = dS_pb(extern(dC))

        # need to extern as  dAᵀ, dBᵀ  are generally `Thunk`s, which don't support adjoint
        ∂A = @thunk last(dA_pb(extern(dAᵀ)))
        ∂B = @thunk last(dB_pb(extern(dBᵀ)))

        (NO_FIELDS, ∂A, ∂B)
    end
    return C, slash_pullback
end

The derivative rule for matrix right division (A / B) is defined in terms of the adjoint (conjugate transpose) and matrix left division (A \ B, i.e. inv(A) * B).

The original result C is computed as adjoint(adjoint(B) \ adjoint(A)). This C may differ slightly from C_true = A / B, but the computed intermediate values d{A,B,S,C}_pb enable a more efficient slash_pullback implementation. This motivates relaxing the strict C == C_true requirement.

I wonder if you meant to flip the inequality? i.e. ∀x. abs(v - v_true) <= abs(f(x) - v_true)

The flipped inequality makes sense to me: it's okay for valueWithDifferential(at: x, in: f).value to be more accurate than f(x) for all x. The original inequality doesn't quite make sense to me, since it doesn't restrict abs(v - v_true).

1 Like

Ah yes, I had; I have edited the opening post to correct it.

We should be bounding the operation that is not yet defined with the operation that is.

1 Like

That sounds correct to me.
(It's a bit hard as I am not familiar with Swift; else I would write the Swift equivalent.
Maybe when I am done with Dex, I will move on to Swift.)

Interesting! Thanks for bringing this up. My main concern was that there are no static compiler diagnostics or runtime assertions to guarantee value is calculating the right thing, and that's an issue regardless of whether it's an approximation or an exact equality. Allowing an approximated result definitely makes this problem harder, though.

2 Likes

I don't actually have a fantastic example handy.
In retrospect, the \ example is not actually more performant than doing it directly -- it is equally performant, and it just avoids duplicating the fiddly part of the code.

We've long had the opinion that there were intermediate values that were useful in defining the pullback,
but maybe there are not, or at least they are kind of rare.

I guess one obvious one, though, would be the various kinds of triple product f(a,b,c) = a*b*c, which I think show up in various forms for vectors, matrices, and transposes thereof.
There are various efficient ways of doing this all at once,
but if you know you are going to want the pullback for a, you really want to compute and store b*c for later use.
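Something like this hypothetical ChainRules-style rule for a three-argument matrix product (just a sketch, ignoring Thunks, and reusing the NO_FIELDS convention from the rrule quoted above -- not code from any actual package):

using LinearAlgebra
import ChainRulesCore: rrule, NO_FIELDS

# Hypothetical rrule for Y = A * B * C that caches the intermediate B * C
# from the forward pass for reuse in the pullback with respect to A.
function rrule(::typeof(*), A::AbstractMatrix, B::AbstractMatrix, C::AbstractMatrix)
    BC = B * C            # intermediate value, stored for the pullback
    Y = A * BC            # primal result
    function triple_times_pullback(Ȳ)
        ∂A = Ȳ * BC'      # reuses the cached B * C
        ∂B = A' * Ȳ * C'
        ∂C = (A * B)' * Ȳ
        return (NO_FIELDS, ∂A, ∂B, ∂C)
    end
    return Y, triple_times_pullback
end

Whether caching BC is actually a win depends on the shapes involved, but it is the kind of intermediate-value capture I mean.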

One example of "efficient pullback captures intermediate value" is the sigmoid function:

// `σ(x) = 1 / (1 + exp(-x))`
// Could also be defined for a multi-dimensional array type.
func sigmoid(_ x: Float) -> Float {
    return 1 / (1 + exp(-x))
}

// Optimized derivative: `σ'(x) = σ(x) * (1 - σ(x))`
// Derivation: https://math.stackexchange.com/a/1225116/277269
@derivative(of: sigmoid)
func derivative(_ x: Float) -> (value: Float, pullback: (Float) -> Float) {
    let y = sigmoid(x)
    return (y, { dy in
        // Capture intermediate value `y` to avoid recomputation.
        dy * y * (1 - y)
    })
}

Pullbacks of n-d array reduction operations often need to perform broadcasting. It's more efficient for these pullbacks to selectively capture just the shape of the input array (an "intermediate value") rather than the entire input array.

Example for a sum reduction operation (adapted from here):

extension Tensor where Scalar: Numeric {
    /// Returns the sum along the specified axes. The reduced dimensions are retained with value 1.
    /// - Parameter axes: The dimensions to reduce.
    /// - Precondition: Each value in `axes` must be in the range `-rank..<rank`.
    func sum(alongAxes axes: Tensor<Int32>) -> Tensor { ... }
}

internal extension Tensor where Scalar: TensorFlowFloatingPoint {
    @derivative(of: sum(alongAxes:))
    func derivative(alongAxes axes: Tensor<Int32>) -> (Tensor, (Tensor) -> Tensor) {
        let value = sum(alongAxes: axes)
        return (value, { [shape = shapeTensor] in
            // Use capture list to capture just `self.shapeTensor` instead
            // of all of `self`.
            $0.broadcasted(toShape: shape)
        })
    }
}

1 Like

In your sigmoid example, y is not an intermediary value -- it's the output.
You can thus compute it with the same function as if you were not differentiating, so it will be exactly equal.

Discovering the axes is valid, though, I think, if the shape is used in both the forward and reverse paths.
I don't speak Swift well enough to be sure.

Capturing tensor/array shapes in the pullback is a good example of this. The pullback needs the shape to create a zero tangent vector of the same shape, but it should not capture the original tensor/array value because it may be very large.

A more general use case is to allow users to write arbitrary custom derivatives for any function. They should be able to capture anything they need in the pullback to achieve the best performance.

In response to @oxinabox, a good example where we want to cache computations from the forward pass for use in the pullback is in our \ operator for abstract matrices, here -- the results of triu / tril / lu / qr should all be stored for the reverse pass (if they're computed). We're not currently doing that, but only because no one has gotten around to doing it yet.
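For the lu case, a rough sketch of what that caching might look like (hypothetical, not what the package currently does, and again borrowing the ChainRules conventions from the rrule quoted earlier):

using LinearAlgebra
import ChainRulesCore: rrule, NO_FIELDS

# Hypothetical sketch for square A: factorise once in the forward pass and
# reuse the factorisation in the pullback instead of refactorising there.
function rrule(::typeof(\), A::AbstractMatrix{<:Real}, B::AbstractVecOrMat{<:Real})
    F = lu(A)            # factorisation cached from the forward pass
    X = F \ B            # primal result, X == A \ B (up to floating point)
    function backslash_pullback(X̄)
        ∂B = F' \ X̄      # reuses the cached factorisation
        ∂A = -∂B * X'
        return (NO_FIELDS, ∂A, ∂B)
    end
    return X, backslash_pullback
end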

3 Likes

Seth Axen raised a good example of this on our testing package.

Here is the Zygote.jl implementation of the custom sensitivity for eigvals:

@adjoint function LinearAlgebra.eigvals(A::LinearAlgebra.RealHermSymComplexHerm)
  d, U = eigen(A)
  return d, d̄ -> (U * Diagonal(d̄) * U',)
end

Rather than computing the primal value by calling eigvals,
it computes the eigenvalues and eigenvectors by using eigen,
because the eigenvectors are needed in the pullback.

However, eigen actually uses a different (but also good) algorithm for computing the eigenvalues.
(A little playing around found differences on the scale of 1e-15 for a 10x10 random matrix.)
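A quick way to see this, along the lines of the playing around mentioned above (the exact magnitude will vary with the random matrix):

using LinearAlgebra

A = Symmetric(randn(10, 10))
# eigvals takes the values-only LAPACK path; eigen takes the values-and-vectors path.
maximum(abs.(eigvals(A) .- eigen(A).values))  # typically nonzero, on the order of 1e-15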

eigvals / LAPACK.syevr!('N', ...)
goes through DSYEVR
and eventually calls DSTERF,
which uses the "Pal-Walker-Kahan variant of the QL or QR algorithm" to compute the eigenvalues,

vs

eigen / LAPACK.syevr!('V', ...),
which also goes through DSYEVR
but eventually calls DSTEMR,
which calculates the eigenvalues "either by bisection or the dqds algorithm".

I think this is probably one of the best simple examples of why the primal result should be allowed to differ.
I feel like we can definitely trust LAPACK to be doing correct and efficient things.

1 Like

Thanks @oxinabox for sharing!

To clarify: is "efficient implementation" the reason for eigvals's adjoint to call eigen?

I guess relaxing the f(x) == valueWithDifferential(at: x, in: f).value requirement is practically motivated by specific derivative registration use cases. It's not possible to enforce this requirement via the language or type system; it's only verifiable via derivative correctness tests (finite differences).