why is Optim.jl so slow? · helpdesk (published)

I have a research code which does multivariate optimization which runs for ~30s for each data point in Python. The code has been too slow to be analyzed, so I ported it to Julia, only to find that it takes ~160s (I used @time optimize(...), precompilation time was probably included?) in Julia using Optim.jl!

So I tried doing a benchmark comparing Optim.jl with scipy.optimize using the simplest function, and found that Optim.jl is much slower. Any idea on how to speed up Optim.jl? My Julia version is 1.5.3, my Python version is 3.8.6. I'm on NixOS and have had trouble upgrading to the latest version of the OS, so can't try 1.6rc.

using Optim

f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
@time optimize(f, [0.0, 0.0], LBFGS())

$ time julia bench.jl
  2.475465 seconds (4.23 M allocations: 217.432 MiB, 4.38% gc time)
julia bench.jl  10.81s user 0.35s system 99% cpu 11.176 total

import time
from scipy import optimize

f = lambda x: (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2
tic = time.time()
optimize.minimize(f, [0.0, 0.0], method='L-BFGS-B')
print('elapsed', time.time() - tic)

$ time python bench.py
elapsed 0.0051517486572265625
python bench.py  0.86s user 0.88s system 205% cpu 0.843 total

Andrés Riedemann (Feb 21 2021 at 01:26):

optimize(f, [0.0,0.0],LBFGS(),autodiff=:forward)

Rein Zustand (Feb 21 2021 at 01:31):

   3.208170 seconds (6.03 M allocations: 308.185 MiB, 4.39% gc time)
julia bench.jl  11.42s user 0.33s system 99% cpu 11.767 total

Andrés Riedemann (Feb 21 2021 at 01:35):

Rein Zustand (Feb 21 2021 at 01:36):

Andrés Riedemann (Feb 21 2021 at 01:36):

Andrés Riedemann (Feb 21 2021 at 01:37):

With autodiff=forward, you are creating the exact derivative via AD, so the evaluation of the gradient is faster

Rein Zustand (Feb 21 2021 at 01:41):

Optim.ConjugateGradient
ERROR: LoadError: MethodError: no method matching Float64(::ForwardDiff.Dual{ForwardDiff.Tag{var"#_calc_U#16"{Array{Any,1},Array{Any,1},Int64,Int64,Array{Any,1},Array{Any,1},Array{Float64,1},Int64,Array{Float64,1}},Float64},Float64,10})
Closest candidates are:
  Float64(::Real, ::RoundingMode) where T<:AbstractFloat at rounding.jl:200
  Float64(::T) where T<:Number at boot.jl:716
  Float64(::Float32) at float.jl:255

        result = Optim.optimize(fn, lower, upper, xs0, Optim.Fminbox(inner_optimizer()), autodiff=:forward)

Andrés Riedemann (Feb 21 2021 at 01:42):

(What is the size of your input on the real problem? If it's too big, maybe using a reverse mode AD could speed up things significantly )

Rein Zustand (Feb 21 2021 at 01:43):

Rein Zustand (Feb 21 2021 at 01:44):

Andrés Riedemann (Feb 21 2021 at 01:44):

On the problem found, that happens because you have somewhere in your function the expression a=Float64(b)

Rein Zustand (Feb 21 2021 at 01:45):

Andrés Riedemann (Feb 21 2021 at 01:45):

Optim doesn't have automatic reverse mode included, but at that size I think Forward is fine

Andrés Riedemann (Feb 21 2021 at 01:46):

Rein Zustand (Feb 21 2021 at 01:47):

The only place where the text Float64 exists at all is in a mutable struct definition, e.g.

mutable struct Firm <: AbstractFirm
    Eg::Float64
    E_greens::Array{Float64,1}
    Eb::Float64
    E_browns::Array{Float64,1}

Rein Zustand (Feb 21 2021 at 01:49):

(What a shame that I have to go back to Python again after convincing my collaborator to have 2-3 days to reimplementing in Julia)

Andrés Riedemann (Feb 21 2021 at 01:50):

Do you mind putting the optimization problem here? , I cannot help much more without a working examole

Rein Zustand (Feb 21 2021 at 01:52):

Thanks a lot for that! I'm not allowed to share the code at this point until the paper has been published, but I can send you the code via PM.

Andrés Riedemann (Feb 21 2021 at 01:53):

Andrés Riedemann (Feb 21 2021 at 01:55):

If you are running an optimization problem on each point, and your points are not related, you could try multithreading the calculation

Rein Zustand (Feb 21 2021 at 01:57):

I will post relevant details in this topic so others can benefit once we have pinpointed the problem. Unfortunately to optimize xs[i], it depends on all the values from xs[1] to xs[i-1], because it is a time evolution.

Rein Zustand (Feb 21 2021 at 02:03):

Though my concern still stands. Optim.jl is much slower than scipy.optimize even on the simplest example.

Júlio Hoffimann (Feb 21 2021 at 02:04):

Make sure that you are benchmarking things properly and not counting precompilation time. After that, try to share some minimum working example in both languages so that people can try to debug and help further.

Rein Zustand (Feb 21 2021 at 02:04):

Júlio Hoffimann (Feb 21 2021 at 02:04):

Júlio Hoffimann (Feb 21 2021 at 02:05):

Also, consider Julia v1.6-rc1, it is really flying compared to the current Julia v1.5

Júlio Hoffimann (Feb 21 2021 at 02:06):

To time the runtime of the optimize call from Optim.jl you can start a Julia session and include( "MWE.jl") to introduce the definitions. After that you can using BenchmarkTools and @btime optimize(...)

Rein Zustand (Feb 21 2021 at 02:07):

Rein Zustand (Feb 21 2021 at 02:24):

  79.084 μs (699 allocations: 32.27 KiB)
julia bench.jl  81.97s user 3.43s system 100% cpu 1:25.28 total

Rein Zustand (Feb 21 2021 at 03:01):

Júlio Hoffimann (Feb 21 2021 at 04:16):

You can probably decrease it a bit more with a few code changes. I don't understand how you get a microsecond in Julia and a millisecond in Python and at the end your Python version is twice as fast? You said 64s vs 30s at the end right ?

Rein Zustand (Feb 21 2021 at 04:38):

The millisecond vs microsecond is for the example code, which is already fully explained. The 64s vs 30s is the real code one.

Christopher Rackauckas (Feb 21 2021 at 11:22):

Christopher Rackauckas (Feb 21 2021 at 11:23):

Rein Zustand (Feb 21 2021 at 11:34):

Huh, TIL. I specified the model params as global const, and my functions read from these consts directly instead of receiving them from function arguments.

Rein Zustand (Feb 21 2021 at 11:35):

I assume Andrés Riedemann 's suggestion to use autodiff=:forward is to enable AD?

Christopher Rackauckas (Feb 21 2021 at 11:55):

Christopher Rackauckas (Feb 21 2021 at 11:56):

Rein Zustand (Feb 21 2021 at 12:22):

Optim.ConjugateGradient
elapsed: 74.16639494895935
 76.332368 seconds (885.22 M allocations: 25.710 GiB, 6.79% gc time)

Christopher Rackauckas (Feb 21 2021 at 12:23):

Rein Zustand (Feb 21 2021 at 12:26):

Rein Zustand (Feb 21 2021 at 12:28):

 1.026235 seconds (1.75 M allocations: 91.610 MiB, 1.96% gc time)

Christopher Rackauckas (Feb 21 2021 at 12:28):

Rein Zustand (Feb 21 2021 at 12:31):

ERROR: LoadError: LoadError: UndefVarError: @code_warntype not defined

Christopher Rackauckas (Feb 21 2021 at 12:32):

it seems like you need a lot more help than that. You should learn a bit about the language first

Christopher Rackauckas (Feb 21 2021 at 12:32):

Christopher Rackauckas (Feb 21 2021 at 12:33):

Kwaku Oskin (Feb 21 2021 at 12:34):

As a side note, it would be interesting to have a macro, something like @error_if_allocate which you can add to your mission critical function and it wouldn't run, until you fix it.

Christopher Rackauckas (Feb 21 2021 at 12:34):

Christopher Rackauckas (Feb 21 2021 at 12:35):

Rein Zustand (Feb 21 2021 at 12:49):

Rein Zustand (Feb 22 2021 at 12:20):

Quick update. Andrés Riedemann uncovered the reason why method=:forward autodiff causes error. It is because I use a mutable struct inside my objective functions, where:

Florian Große (Feb 24 2021 at 23:48):

Rein Zustand (Feb 25 2021 at 02:15):

@Florian Große not yet. My concern is that without optimizations such as autodiff or 0-allocation, the Julia version shouldn't be slower than the Python version, since the Python version is also doing wasteful allocations and not using autodiff. I tried to profile using StatProfilerHTML.jl, but the line-by-line time spent report is not complete. Not all the relevant lines are time. I wish there is something like Python's https://github.com/pyutils/line_profiler.

Florian Große (Feb 25 2021 at 04:30):

Agreed. Once you have the paper published and can share the code, it would probably good to open this discussion on Discourse again. If it requires a few tricks to make it run fast, it's probably good to have it there as an example. Either that or you find room for improvement in a julia package, which is also fine.

Nils (Feb 25 2021 at 09:10):

And maybe also #math-optimization, I don't think @Patrick Kofod Mogensen is more active over there I believe

Patrick Kofod Mogensen (Feb 25 2021 at 21:20):

Yeah, hard to say much with the given information. First, I saw you used conjugate gradient descent in the only timing of the real problem I saw. That is really only going to be good (relatively speaking) on very large problems. However, 99% of the times I get these questions, the problem is in the objective function evaluation as @Christopher Rackauckas also tried to guide the conversation towards :) You did not provide a side-by-side timing of your runtime for your objective from python and from julia. I'd like to see that before we can conclude this is related to Optim.jl at all. Next, I'd ask you to print the progress, number of objective function evaluations, number of gradient function evaluations etc. I'm sure we'd be able to match the speed you see in Python, but if you're relatively new to Julia, 2-3 days to get optimal code is not a lot. Hope you get your paper through review, and I'm of course sad that Optim.jl lost a citation over this ;)

Christopher Rackauckas (Feb 25 2021 at 22:07):

Yeah I could've been more direct haha. I was trying to slowly guide him to finding out that he likely needed to optimize the objective function that he wrote. But yes, in more direct terms, unless you show your model is non-allocating and type-stable, there is zero evidence to suggest Optim.jl has an issue and so right now the onus is on the modeler.

Christopher Rackauckas (Feb 25 2021 at 22:07):

The speed of the optimization package itself has relatively little effect on the speed of optimization, other than in the number of steps it takes to get to the optimum. The speed is largely based on the way the user writes the objective (except in the limit of tiny f). A good optimization package takes less steps, which there hasn't been any discussion of "Optim.jl takes 2-3x as many steps", which would be what points to Optim. Pure timings don't mean that.

Rein Zustand (Feb 26 2021 at 01:22):

Update: the Julia version is now 2x faster! One major problem is that I defined a named function inside my objective function, to deduplicate some steps, and because the function uses variables local to the objective function (without being passed as arguments). Note: I also do this function-definition-inside-obj-fn in the Python version without much penalty?? Once I move out the function and pass in the local vars as arguments, there are a lot fewer allocations. The Python version takes 5 min 25 s. The Julia version now takes 2 min 12 s (sorry can't publish the code yet). I'm sure there is still room for improvement, but this is already a strong case to convince my collaborators to switch to Julia.

Rein Zustand (Feb 26 2021 at 01:25):

Note: moving out this function to outside my objective function resulted in 4.5x speedup relative to the previous iteration.

Rein Zustand (Feb 26 2021 at 01:42):

Optim.ConjugateGradient
elapsed: 74.16639494895935
 76.332368 seconds (885.22 M allocations: 25.710 GiB, 6.79% gc time)

Optim.ConjugateGradient
elapsed: 15.73147702217102
 17.780150 seconds (97.40 M allocations: 9.558 GiB, 6.34% gc time)

Optim.ConjugateGradient
elapsed: 12.357237100601196
  12.357 s (92734943 allocations: 9.33 GiB)

Patrick Kofod Mogensen (Feb 26 2021 at 07:42):

That sounds great. I of course have to ask: have you checked that the objective values/solution(minimizer) are comparable? There's always a chance that Optim stopped early for some reason. I still have a feeling you could get more performance out of your objective function, but without knowing your exact code it's hard to say. There are indeed a few pitfalls in Julia, and you need to know a little bit (Chris' links would be a great place to start) to get the most out of Julia. I think the function definition that closes over some variables might be a known issue that can sometimes surprise even me and others who are seasoned Julia users (I think you might be hitting this issue specifically https://github.com/JuliaLang/julia/issues/15276 )

Henrique Ferrolho (Feb 26 2021 at 09:03):

@Rein Zustand, do you mean that your previous approach was using a closure? I.e., something like below?

function f(x)
    a = rand(100)
    b = rand(100)

    function aux()
        a .* b .+ x
    end

    s = aux()
    t = aux()

    s .* t
end

function aux(a, b, c)
    a .* b .+ c
end

function g(x)
    a = rand(100)
    b = rand(100)

    s = aux(a, b, x)
    t = aux(a, b, x)

    s .* t
end

Rein Zustand (Feb 26 2021 at 09:09):

Henrique Ferrolho (Feb 26 2021 at 09:11):

That is odd. I don't think it should make a difference in time or allocations. These are the results I get on my machine when I benchmark the snippets above:

julia> using BenchmarkTools

julia> @btime f(x);
  456.736 ns (5 allocations: 4.38 KiB)

julia> @btime g(x);
  452.822 ns (5 allocations: 4.38 KiB)

Rein Zustand (Feb 26 2021 at 09:16):

Henrique Ferrolho (Feb 26 2021 at 09:17):

Perhaps something else changed when you switched from one approach to the other? What I am trying to say is that the performance improvement you observe should not be due to the use of a closure. I'd be interested to know what the key change was, then.

Rein Zustand (Feb 26 2021 at 09:27):

! Gotcha. I just checked my commit. I did turn a model parameter defined in a global setting into a const, together with the change that moved out the function to be outside of the objective commit, all within a single commit. My bad.

I tried reverting by putting the function back inside my objective function, while keeping the variable a const, and I got the same allocation. I am very sorry for the false alarm.

Henrique Ferrolho (Feb 26 2021 at 09:37):

No need to apologise. So, the improvement was actually thanks to declaring a global variable as const. Did I get that right?

Rein Zustand (Feb 26 2021 at 09:40):

Yes, I have isolated the cause to be indeed setting the const. Reverting it back to without const slows down the btime by ~4x.

Henrique Ferrolho (Feb 26 2021 at 09:44):

Great! In general, it is best to avoid global variables. But if you do want to have a global variable, it is critical to make it const, otherwise the compiler assumes that its value and type can change at any point, and thus it is not able to specialise on one given type. This is mentioned in Performance Tips > Avoid global variables. (The entire Performance Tips is gold. Do read it, if you haven't. :smiley:)

Rein Zustand (Feb 26 2021 at 10:07):

For that Performance tips article. It recommends @time instead of @btime. I suppose because BenchmarkTools.jl is an external tool, and so the official manual can't recommend it?

Kwaku Oskin (Feb 26 2021 at 10:11):

Rein Zustand (Mar 02 2021 at 05:28):

I hit a new roadblock with Optim's conjugate gradient. The parameters for my objective function are bounded by 0 <= x <= 1. While Optim.ConjugateGradient is fast, sometimes I observe a discontinuous spike in my result. I observed similar spikes too in scipy, but at least in scipy, the spikes consistently disappear if I use L-BFGS-B. There is a Optim.LBFGS in Optim, but it is 7x slower than Optim.ConjugateGradient (so, in effect, Optim is 3.5x slower than scipy).

By using Optim.ConjugateGradient I'm making my comparison closer to apple-to-apple with the scipy version. I have pre-allocated my mutable structs to reduce allocations, and this still doesn't improve the time spent significantly.

Unfortunately, I'm short of time to investigate the cause further, and am forced to use the scipy version to continue my research. :oh_no:

Rein Zustand (Mar 02 2021 at 05:33):

s/By using Optim.ConjugateGradient/By using Optim.LBFGS/. (I realized that zulip-archive doesn't handle message edits). test

Mason Protter (Mar 02 2021 at 05:57):

Rein Zustand (Mar 02 2021 at 09:54):

My allocations are not exactly zero, but they are about 33k now (used to be 6M). But then, the Python version is doing wasteful allocations. I'd expect the Julia version to be at worst comparable to the Python version (because I'm doing equivalent wasteful allocations in Python), and then progressively gets better when I reduce allocations. Is it wrong to think this way?

Kwaku Oskin (Mar 02 2021 at 10:03):

If for example you are slicing your array, you are making it's copy (and python uses analogue of Julia views).
So, if you have something like

v1 = rand(1_000_000_000)
v2 = v1[1:500_000_000]

you are doing maybe 2 or 3 allocations, but you are also copying half of your computer memory to a new place and as a result computations are slow.

Also, if there are some issues with type stability, you also waste precious time on dynamical dispatch.

v = Any[1, 2, 3]
c = v[1] + 1

compiler spend 0.01 ns to make actual addition and 10ns to figure out the type of the v[1] variable (and also it has to pack c back to Any). So if this addition is a part of tight loop you can get huge slowdown. Since python usually vectorize such operation, it avoids this problem. I think in this case you can compare vectorization to function barriers, i.e. vector has rather bad type as an input, but inside the function it knows that all elements have some good type and do not waste time trying to figure out type of each element individually.

Rein Zustand (Mar 02 2021 at 10:20):

Thank you for the elaboration. I have checked the allocations using --track-allocation=user and found no such copying. For the type stability issue, my @code_warntype output is mostly green and yellow (for the loop variable). I have to note though that in my case, I do regular loops in the Python version because the operation can't be vectorized (it's a time evolution), so surely Python must be doing regular object operations.

Kwaku Oskin (Mar 02 2021 at 10:29):

Oh, those were meant only as examples. There can be other reasons why reasonably small number of allocations can have a large impact on the performance.

Rein Zustand (Mar 02 2021 at 10:52):

Rein Zustand (Mar 02 2021 at 11:36):

I'm down to only 1 allocation, which is due to ~~@assert length(xs) == length(Ts)~~ (hold up, it's not it, still investigating). Still about as slow. By process of elimination, the remaining uncertainty is in the LBFGS implementation of Optim vs scipy.optimize.

Stream: helpdesk (published)

Topic: why is Optim.jl so slow?

Rein Zustand (Feb 21 2021 at 01:05):

Andrés Riedemann (Feb 21 2021 at 01:26):

Rein Zustand (Feb 21 2021 at 01:31):

Andrés Riedemann (Feb 21 2021 at 01:35):

Rein Zustand (Feb 21 2021 at 01:36):

Andrés Riedemann (Feb 21 2021 at 01:36):

Andrés Riedemann (Feb 21 2021 at 01:37):

Rein Zustand (Feb 21 2021 at 01:41):

Andrés Riedemann (Feb 21 2021 at 01:42):

Rein Zustand (Feb 21 2021 at 01:43):

Rein Zustand (Feb 21 2021 at 01:44):

Andrés Riedemann (Feb 21 2021 at 01:44):

Rein Zustand (Feb 21 2021 at 01:45):

Andrés Riedemann (Feb 21 2021 at 01:45):

Andrés Riedemann (Feb 21 2021 at 01:46):

Andrés Riedemann (Feb 21 2021 at 01:46):

Rein Zustand (Feb 21 2021 at 01:47):

Rein Zustand (Feb 21 2021 at 01:49):

Andrés Riedemann (Feb 21 2021 at 01:50):

Rein Zustand (Feb 21 2021 at 01:52):

Andrés Riedemann (Feb 21 2021 at 01:53):

Andrés Riedemann (Feb 21 2021 at 01:55):

Rein Zustand (Feb 21 2021 at 01:57):

Rein Zustand (Feb 21 2021 at 02:03):

Júlio Hoffimann (Feb 21 2021 at 02:04):

Rein Zustand (Feb 21 2021 at 02:04):

Júlio Hoffimann (Feb 21 2021 at 02:04):

Júlio Hoffimann (Feb 21 2021 at 02:05):

Júlio Hoffimann (Feb 21 2021 at 02:06):

Rein Zustand (Feb 21 2021 at 02:07):

Rein Zustand (Feb 21 2021 at 02:24):

Rein Zustand (Feb 21 2021 at 03:01):

Júlio Hoffimann (Feb 21 2021 at 04:16):

Rein Zustand (Feb 21 2021 at 04:38):

Christopher Rackauckas (Feb 21 2021 at 11:22):

Christopher Rackauckas (Feb 21 2021 at 11:23):

Rein Zustand (Feb 21 2021 at 11:34):

Rein Zustand (Feb 21 2021 at 11:35):

Christopher Rackauckas (Feb 21 2021 at 11:55):

Christopher Rackauckas (Feb 21 2021 at 11:56):

Rein Zustand (Feb 21 2021 at 12:22):

Christopher Rackauckas (Feb 21 2021 at 12:23):

Christopher Rackauckas (Feb 21 2021 at 12:23):

Rein Zustand (Feb 21 2021 at 12:26):

Rein Zustand (Feb 21 2021 at 12:28):

Christopher Rackauckas (Feb 21 2021 at 12:28):

Christopher Rackauckas (Feb 21 2021 at 12:28):

Rein Zustand (Feb 21 2021 at 12:31):

Christopher Rackauckas (Feb 21 2021 at 12:32):

Christopher Rackauckas (Feb 21 2021 at 12:32):

Christopher Rackauckas (Feb 21 2021 at 12:32):

Christopher Rackauckas (Feb 21 2021 at 12:32):

Christopher Rackauckas (Feb 21 2021 at 12:33):

Christopher Rackauckas (Feb 21 2021 at 12:33):

Christopher Rackauckas (Feb 21 2021 at 12:33):

Christopher Rackauckas (Feb 21 2021 at 12:33):

Christopher Rackauckas (Feb 21 2021 at 12:33):

Kwaku Oskin (Feb 21 2021 at 12:34):

Christopher Rackauckas (Feb 21 2021 at 12:34):

Christopher Rackauckas (Feb 21 2021 at 12:35):

Rein Zustand (Feb 21 2021 at 12:49):

Rein Zustand (Feb 22 2021 at 12:20):

Florian Große (Feb 24 2021 at 23:48):

Rein Zustand (Feb 25 2021 at 02:15):

Florian Große (Feb 25 2021 at 04:30):

Nils (Feb 25 2021 at 09:10):

Patrick Kofod Mogensen (Feb 25 2021 at 21:20):

Christopher Rackauckas (Feb 25 2021 at 22:07):

Christopher Rackauckas (Feb 25 2021 at 22:07):

Rein Zustand (Feb 26 2021 at 01:22):

Rein Zustand (Feb 26 2021 at 01:25):

Rein Zustand (Feb 26 2021 at 01:42):

Patrick Kofod Mogensen (Feb 26 2021 at 07:42):

Henrique Ferrolho (Feb 26 2021 at 09:03):

Rein Zustand (Feb 26 2021 at 09:09):

Henrique Ferrolho (Feb 26 2021 at 09:11):

Rein Zustand (Feb 26 2021 at 09:16):

Henrique Ferrolho (Feb 26 2021 at 09:17):