Is this a bug?
julia> using Zygote
julia> f(x, y) = y
f (generic function with 1 method)
julia> gradient(f, 0, 0)
(nothing, 1.0)
Shouldn't the result be (0.0, 1.0) instead?
Reported here: https://github.com/FluxML/Zygote.jl/issues/1538
Zygote uses nothing as a "hard" zero, i.e. a differential that is known at compile time to be zero is represented as nothing.
That is somewhat unexpected, mathematically speaking. It makes it difficult to write generic code. Is there a good practice to handle nothing in this context?
I guess you could do something like
denothing(x) = x
denothing(::Nothing) = false
my_gradient(args...; kwargs...) = denothing.(gradient(args...; kwargs...))
julia> sum(my_gradient(f, 0, 0))
1.0
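(If I'm not mistaken, false acts as a "strong zero" in Julia arithmetic, e.g. false * Inf == 0.0, which is why it's a reasonable stand-in for a hard zero here.)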
I wonder why this is not done automatically for end-users
Shouldn't it be fixed in Zygote.jl?
One reason for a special flag is that Zygote can avoid some work in the backward pass, as the gradient of any operations done before f is certain to be zero. Whereas with a runtime 0.0 it can't tell & must do the work.
The other is that for larger things like x::Array, allocating zero(x) is expensive.
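For example, a tiny sketch of what that buys you (the exact container type of the y-gradient may differ between Zygote versions):

using Zygote
g(x, y) = sum(y)           # the result does not depend on x
x = randn(10_000)
gradient(g, x, ones(3))    # (nothing, [1.0, 1.0, 1.0])
# no 10_000-element zero array is ever allocated for x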
FWIW, Enzyme's design is:
julia> Enzyme.gradient(Reverse, (x,y) -> sum(abs2, x .* y), [1, 2.], [3 4 5.])
([100.0, 200.0], [30.0 40.0 50.0])
julia> Enzyme.gradient(Reverse, (x,y) -> sum(abs2, x .* x), [1, 2.], [3 4 5.])
([4.0, 32.0], [0.0 0.0 0.0])
julia> Enzyme.gradient(Reverse, (x,y) -> sum(abs2, x .* y), [1, 2.], Const([3 4 5.]))
([100.0, 200.0], nothing)
It could have had a design like ChainRulesCore.ZeroTangent(), since that at least supports math ops, but yeah, it's mostly just historical reasons and a ton of work to overhaul it.
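For reference, a quick sketch of the difference (relying on ChainRulesCore's AbstractZero arithmetic):

using ChainRulesCore
ZeroTangent() + 1.0    # 1.0
ZeroTangent() * 2.0    # ZeroTangent()
nothing + 1.0          # MethodError -- nothing doesn't support math ops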
FWIW, Enzyme's design is:
Enzyme's gradient is pretty unlike Zygote's gradient. I'd say the equivalent in Enzyme is instead
julia> let x = Ref(0.0), y = Ref(0.0)
           dx, dy = make_zero(x), make_zero(y)
           autodiff(Reverse, Duplicated(x, dx), Duplicated(y, dy)) do x, y
               f(x[], y[])
           end
           dx[], dy[]
       end
(0.0, 1.0)
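For plain scalars I believe you could also skip the Refs and mark the arguments Active (if I remember the return convention right, the derivatives come back as a nested tuple):

julia> autodiff(Reverse, f, Active, Active(0.0), Active(0.0))
((0.0, 1.0),)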
I'm considering moving to Enzyme.jl because of this design of Zygote.jl. It is pretty counterintuitive to have a mathematical gradient with nothing entries.
Does Enzyme.jl support all platforms that Julia supports?
I understand it is a wrapper package
And another question: is Zygote.jl the recommended package for autodiff in native Julia or there is something new?
https://discourse.julialang.org/t/state-of-ad-in-2024/112601
Can you say what problem nothing causes, more narrowly than just being surprising?
I think he wants to be able to do math with the result of gradient.
For instance, I agree that the fact that x + dx won't always work is a bit sad. (I think + needs to be replaced with Zygote.accum, which knows about nothing.) ChainRules.jl took making this work as an axiom, and the result was massive complexity of Tangent, which has all kinds of sharp edges. (Not to mention several kinds of zeros which nobody knows how to use correctly, and resulting type-instabilities.) So there are trade-offs, and nothing (plus NamedTuple for any struct) has the advantage of being very simple.
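For example (from memory, so double-check):

Zygote.accum(nothing, 1.0)      # 1.0
Zygote.accum(nothing, nothing)  # nothing
Zygote.accum(2.0, 3.0)          # 5.0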
Michael Abbott said:
Can you say what problem nothing causes, more narrowly than just being surprising?
We are simply doing Newton-Raphson iteration with automatic gradients. The problem with this nothing design is that it relies on all third-party packages handling it. Even if we work around the situation in our own package, this solution doesn't compose well.
Wouldn't Enzyme be a much better fit for stuff like Newton-Raphson because it supports mutation?
We are doing Newton-Raphson with 2 scalar values. There are no allocations.
Ah. In that case, maybe just use ForwardDiff?
Will take a look. I am assuming that ForwardDiff.jl provides autodiff like Zygote.jl but without the nothing.
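Something like this should give plain zeros (quick sketch; ForwardDiff.gradient takes a single array argument, so the two scalars get packed into a vector):

using ForwardDiff
f(x, y) = y
ForwardDiff.gradient(v -> f(v[1], v[2]), [0.0, 0.0])   # [0.0, 1.0] -- a real zero, no nothing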
You really only want to reach for reverse-mode AD like Zygote if you need the derivatives of functions from N dimensions to M dimensions where N >> M.
And in terms of maturity, ForwardDiff.jl is mature, actively maintained, etc?
The classic use-case for reverse mode is deep learning, where N might be in the many thousands and M = 1.
ForwardDiff is very mature.
I'd say it's actively maintained, but I wouldn't say it's actively developed (on account of said maturity)
I already like that it has a much smaller list of dependencies compared to Zygote.jl
forward mode AD is just fundamentally much much much simpler than reverse mode
If you feel like trying out something bleeding edge instead, Diffractor.jl actually has a pretty well working forwards mode nowadays (probably don't actually do this)
As a general rule, you should avoid reverse mode like the plague unless you are absolutely sure you need it.
Thank you. That is very helpful.
Also, since it hasn't been mentioned yet, I highly recommend using DifferentiationInterface.jl which makes it trivial to swap out AD back-ends and has no performance penalty in simple cases.
It'd be nice if DI turned the nothings into some sort of zero <:Number.
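For what it's worth, with array inputs the nothing issue mostly doesn't come up anyway; a rough sketch of the DI usage (assuming the ADTypes backend constructors are re-exported):

using DifferentiationInterface
import ForwardDiff, Zygote
h(v) = v[2]
gradient(h, AutoForwardDiff(), [0.0, 0.0])   # [0.0, 1.0]
gradient(h, AutoZygote(), [0.0, 0.0])        # [0.0, 1.0] (possibly a lazy array type)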
I think in this case we will go ahead with the ForwardDiff.jl package directly. There are no plans to swap the backend given that it is ideal for the application at hand.
For some reason ForwardDiff.jl is generating slower code compared to Zygote.jl.
Can you try to reproduce this benchmark on the main branch (Zygote) and on the forwarddiff branch?
https://github.com/JuliaEarth/CoordRefSystems.jl/tree/main/benchmark
Do you also see a massive slowdown in the last line of the output.csv? The last column has the speedup metric.
For me the Zygote.jl result is 0.28 and the ForwardDiff.jl result is 0.06 (larger is better).
We are simply doing Newton-Raphson iteration with automatic gradients.
If it's scalar, then you don't want to be diffing through it anyways. BracketingNonlinearSolve or SimpleNonlinearSolve with Zygote/ForwardDiff overloads would just skip the implicit part.
But I would almost guarantee for scalar that ForwardDiff will be faster here.
With forward mode you want to essentially always do this trick: https://github.com/SciML/NonlinearSolve.jl/blob/master/lib/NonlinearSolveBase/ext/NonlinearSolveBaseForwardDiffExt.jl
Thank you @Christopher Rackauckas. Can you please elaborate on that?
The PR that replaces Zygote.jl by ForwardDiff.jl has a small diff that you can read here: https://github.com/JuliaEarth/CoordRefSystems.jl/pull/199/files
What do we need to do differently to get the expected superior performance of forward diff?
How are you solving the nonlinear system?
The diff has the formulas. Basically, given two residual functions fx and fy and two target values, we perform Newton iteration to find the two unknowns. These two formulas are decoupled in the diff above, as you can see.
Yeah so if it's using SimpleNonlinearSolve it should automatically apply the implicit rule
If you did it by hand then you'll need to copy that code / do a similar implicit function push through on the duals
For scalar it's almost equivalent to not differentiating the first n steps of the Newton method, re-applying the duals, and then applying it on the (n+1)th step.
I am not sure I am following. As an end-user of ForwardDiff.jl it is not clear what I am doing wrong.
Optimally handling implicit equations is not something automatic differentiation as a tool can do on its own. It requires that the solver library that you're using for the implicit system overloads the AD to avoid differentiation through the method
So Zygote.jl is doing something more that guarantees better performance?
The derivative of Newton-Raphson w.r.t. u0 is 0, since the solution is independent of the initial condition (or undefined if it moves to a different solution). So you need to not differentiate the solve and then only differentiate effectively the last step. If the implicit solve is the expensive part of the code, then doing this trick turns O(n) expensive calls differentiating each step into exactly 1. That's hard to beat.
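In code, the trick looks roughly like this for a scalar root g(u, p) = 0 (my own minimal sketch with made-up names solve_newton / dsol_dp, not the NonlinearSolve implementation):

using ForwardDiff

# plain Newton iteration; no duals are pushed through the loop
function solve_newton(g, u, p; iters = 20)
    for _ in 1:iters
        u -= g(u, p) / ForwardDiff.derivative(u -> g(u, p), u)
    end
    return u
end

# implicit-function rule: du*/dp = -(dg/dp) / (dg/du), evaluated at the solution u*
function dsol_dp(g, u0, p)
    u = solve_newton(g, u0, p)
    -ForwardDiff.derivative(p -> g(u, p), p) /
        ForwardDiff.derivative(u -> g(u, p), u)
end

g(u, p) = u^2 - p
dsol_dp(g, 1.0, 4.0)   # ≈ 0.25, i.e. the derivative of sqrt(p) at p = 4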
It's not really up to the AD libraries. It's up to the solver libraries, i.e. whomever writes the Newton method (NonlinearSolve) to supply rules for ForwardDiff/Zygote/etc. to do this
You mean that there is a small package that we could take as a dependency that already defines Newton-Raphson inversion with AD rules?
Since an AD library cannot really know by looking at code that it should have this convergence property, i.e. that the solution is independent of the previous steps: not in code (since in the code, each step of Newton depends on the previous step), but in the solution (since it converges to the same value regardless of where you start).
It's a split out of NonlinearSolve that is focused only on very simple Newton-Raphson + the required AD rules.
It looks like the list of dependencies is very large?
Ideally, we would just retain the performance of Zygote.jl, but with ForwardDiff.jl
Well with a scalar nonlinear solve you probably want to be using ITP instead of Newton for stability if you have bounds. In that case, BracketingNonlinearSolve would then be an even smaller dep.
What exactly do you need different in the size? The import time is ~200ms and most of that is the precompilation load of the Newton method itself.
It beggars belief that the code in the diff you pasted is much slower in forwarddiff than in zygote, though of course I don't know what functions you are running through it. I think there is something else wrong.
Most likely yes, Zygote shouldn't ever be faster in this kind of case.
But even then, the next thing you'd want to do is do the implicit rule for either ForwardDiff or Zygote :shrug:
Perhaps I didn't run the benchmark properly. Let me try to isolate the issue.
using SimpleNonlinearSolve
f(u, p::Number) = u * u - p
f(u, p::Vector) = u * u - p[1]
u0 = 1.0
p = 1.0
const cprob = NonlinearProblem(f, u0, p)
sol = solve(cprob, SimpleNewtonRaphson())
function loss(p)
    solve(remake(cprob, p = p), SimpleNewtonRaphson()).u - 4.0
end
using ForwardDiff, BenchmarkTools
@btime ForwardDiff.derivative(loss, p)
16.741 ns (1 allocation: 16 bytes)
For a scalar problem you should be able to optimize most stuff out of it.
Though a bracketing method is almost certainly going to be more robust
using BracketingNonlinearSolve
f(u, p::Number) = u * u - p
u0 = 1.0
p = 1.0
uspan = (1.0, 2.0) # brackets
const cprob_int = IntervalNonlinearProblem(f, uspan, p)
sol = solve(cprob_int)
function loss(p)
    solve(remake(cprob_int, p = p)).u - 4.0
end
using ForwardDiff, BenchmarkTools
@btime ForwardDiff.derivative(loss, p);
18.495 ns (1 allocation: 16 bytes)
You can probably specialize on a lot of other properties too though. What kind of system is it? Is it polynomial? Rational polynomial?
I am creating a MWE with the exact code that is slower. Will share here in a few minutes...
using CoordRefSystems
using BenchmarkTools
latlon = LatLon(45, 90)
winkel = convert(WinkelTripel, latlon)
@btime convert($LatLon, $winkel)
1.491 μs (10 allocations: 192 bytes) # main
6.356 μs (144 allocations: 2.88 KiB) # PR
You can see that the ForwardDiff.jl version in the PR is ~4x slower. The underlying functions fx and fy are here:
Trigonometric functions.
what is sincα?
oh I see
defined right above
Wait you're talking about AD in the nonlinear solve not of the nonlinear solve?
Yes, the AD is in the functions fx and fy inside the nonlinear solve.
I was assuming that this should be instantaneous given the "simplicity" of these trigonometric functions.
so where is your forwarddiff code?
These are functions
My guess is you did something odd to handle the multiple returns.
I think this kind of scalar, branch-free straight line code is the best-case performance scenario for Zygote. So it's not crazy that it'd be faster than ForwardDiff.
In this PR I shared a few messages ago: https://github.com/JuliaEarth/CoordRefSystems.jl/pull/199
The PR literally replaces Zygote by ForwardDiff, nothing else.
yeah this kind of case is not so bad for Zygote, though either should do fine
you shouldn't be getting so many allocs with forwarddiff though
but for this kind of case, AD inside the nonlinear solve for a scalar output, Zygote should just optimize out all allocs, which is usually what would kill it
So Zygote should be fine, and should almost even match Enzyme here without some Reactant tricks.
The only other thing to try really is just avoiding the AD with something like an ITP and seeing how that does.
So the moral of the story is Zygote.jl is still recommended even in this scalar case with N=2 and M=1
in this case, yes, because it can compile away a bunch of stuff so its normal issues don't come up here.
there are cases for which that is not true
it's somewhat code dependent
Zygote sucks at optimizing code with arrays and falls off a cliff any time there's a branch, but yes this is one of the few niches it's perf-competitive in.
Hence the demos Mike and others used to do where they showed it constant-folding all the way to the correct gradient
These heuristics to pick an AD backend are super hard. Every time we dive into it, we have to unlearn something we were told.
The original issue of this thread, where Zygote.jl returns nothing, still bothers me. That is really annoying.
I think it's better to have a hard zero? It's annoying when AD just treats structural zeros as 0.0, because then it's harder to debug.
For your case you could just x === nothing ? 0.0 : x
A lot of inputs Zygote accepts are not conducive to having natural zeros. Structs with arbitrary type constraints, for example.
though almost certainly, if you get that nothing in your code, it's likely a bug and you should throw an error saying "you likely have a bug in your f"
One challenge ChainRules and later Mooncake ran into is that some types can't even be reliably represented by structural zeroes! Self-referential structs being a big culprit
It is not a bug in f. It is common to have formulas that only depend on a subset of the arguments in this context.
yeah, but that's the general case; it doesn't hold in this specific case
in this specific case, if you get nothing, that means f is not a function of the parameter
that means you can just remove it from the rootfind
that tells you that you can optimize it more!
I think in an alternate world where ChainRules matured a little earlier, Zygote could've used ZeroTangent and NoTangent instead of nothing.
That is a good point. Maybe refactoring the algorithm with a branch that handles nothing is not that bad. In any case, I wish we had Enzyme.jl's behavior here: it always returns 0.0 for a zero gradient.
Regardless though, this code should want the nothing or whatever structural zero, because then it should just branch down to doing a scalar rootfind and double its speed.
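Something along these lines (a hypothetical sketch, not the CoordRefSystems code; fx / fy stand in for the two residual functions):

using Zygote

# one Newton step for the 2x2 system fx(x, y) = 0, fy(x, y) = 0,
# exploiting Zygote's structural zeros when a residual ignores one variable
function newton_step(fx, fy, x, y)
    rx, ry = fx(x, y), fy(x, y)
    dfxdx, dfxdy = gradient(fx, x, y)
    dfydx, dfydy = gradient(fy, x, y)
    if dfxdy === nothing && dfydx === nothing
        # decoupled: two independent scalar Newton updates, roughly half the work
        return x - rx / dfxdx, y - ry / dfydy
    else
        # full 2x2 Newton update, mapping nothing back to 0.0
        J = [something(dfxdx, 0.0) something(dfxdy, 0.0);
             something(dfydx, 0.0) something(dfydy, 0.0)]
        d = J \ [rx, ry]
        return x - d[1], y - d[2]
    end
end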
This code should also be compatible with Enzyme?
It is. It is just that we are trying to keep it native Julia as much as possible, at least for now. Maybe we will consider Enzyme.jl as the only exception.
Exploiting the structural zero with Zygote would still beat Enzyme here though
The full stack is native Julia, which facilitates deployment in exotic platforms.
Chopping out the fy gradient could be like half of the compute, so I'd just exploit the nothing and call it a day.
The full stack is native Julia, which facilitates deployment in exotic platforms.
Like what?
Christopher Rackauckas said:
Chopping out the fy gradient could be like half of the compute, so I'd just exploit the nothing and call it a day.
Yes, it sounds reasonable.
The other thing you could potentially do is use fastmath approximations to the trig functions in the gradient context.
Or run this as a mixed precision and just do the gradient in 32-bit
Christopher Rackauckas said:
The full stack is native Julia, which facilitates deployment in exotic platforms.
Like what?
We are investigating some heterogeneous cluster setups. I understand that external binary dependencies may support a subset of the platforms that Julia supports.
So we avoid external binary deps as much as possible. What is the situation with Enzyme.jl? Does it support all platforms that Julia does because it is LLVM-based?
like how exotic though, ARMv7/8? Or like, embedded type chips?
Julia doesn't even support all LLVM supported platforms because of runtime things
Christopher Rackauckas said:
like how exotic though, ARMv7/8? Or like, embedded type chips?
Nothing specific at the moment. We are just trying to save ourselves from build issues that we can't address easily.
Christopher Rackauckas said:
Julia doesn't even support all LLVM supported platforms because of runtime things
So adding Enzyme.jl as a dependency shouldn't reduce the list of supported platforms, right?
Premature optimization can be the root of all evil.
I mean, it might be easier to get Julia to kick something out for like a TI C600 without Enzyme, but the chances that will ever be in a cluster are zero.
In this case, I see it as precaution. If we can stick to a native Julia app, why not? :smile:
If Enzyme.jl is indeed the best thing to adopt, and the benefits outweigh the downsides, we will go for it.
I mean, I see eVTOLs and satellites deploying to ARMv8 these days. I would be surprised if your case is actually all that exotic unless it's for a microsat