Stream: helpdesk (published)

Topic: why is Optim.jl so slow?


view this post on Zulip Rein Zustand (Feb 21 2021 at 01:05):

I have a research code which does multivariate optimization which runs for ~30s for each data point in Python. The code has been too slow to be analyzed, so I ported it to Julia, only to find that it takes ~160s (I used @time optimize(...), precompilation time was probably included?) in Julia using Optim.jl!

So I tried doing a benchmark comparing Optim.jl with scipy.optimize using the simplest function, and found that Optim.jl is much slower. Any idea on how to speed up Optim.jl? My Julia version is 1.5.3, my Python version is 3.8.6. I'm on NixOS and have had trouble upgrading to the latest version of the OS, so can't try 1.6rc.

My Julia code:

using Optim

f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
@time optimize(f, [0.0, 0.0], LBFGS())
$ time julia bench.jl
  2.475465 seconds (4.23 M allocations: 217.432 MiB, 4.38% gc time)
julia bench.jl  10.81s user 0.35s system 99% cpu 11.176 total

My Python code:

import time
from scipy import optimize

f = lambda x: (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2
tic = time.time()
optimize.minimize(f, [0.0, 0.0], method='L-BFGS-B')
print('elapsed', time.time() - tic)
$ time python bench.py
elapsed 0.0051517486572265625
python bench.py  0.86s user 0.88s system 205% cpu 0.843 total

view this post on Zulip Andrés Riedemann (Feb 21 2021 at 01:26):

Mmm, I don't remember the optim defaults on derivatives, but, can you try

optimize(f, [0.0,0.0],LBFGS(),autodiff=:forward)

?

view this post on Zulip Rein Zustand (Feb 21 2021 at 01:31):

With autodiff=:forward I got

   3.208170 seconds (6.03 M allocations: 308.185 MiB, 4.39% gc time)
julia bench.jl  11.42s user 0.33s system 99% cpu 11.767 total

view this post on Zulip Andrés Riedemann (Feb 21 2021 at 01:35):

And what about the real problem?

view this post on Zulip Rein Zustand (Feb 21 2021 at 01:36):

OK, I will try on my real problem.

view this post on Zulip Andrés Riedemann (Feb 21 2021 at 01:36):

That benchmark is measuring mainly the startup time of optim

view this post on Zulip Andrés Riedemann (Feb 21 2021 at 01:37):

With autodiff=forward, you are creating the exact derivative via AD, so the evaluation of the gradient is faster

view this post on Zulip Rein Zustand (Feb 21 2021 at 01:41):

I got this error:

Optim.ConjugateGradient
ERROR: LoadError: MethodError: no method matching Float64(::ForwardDiff.Dual{ForwardDiff.Tag{var"#_calc_U#16"{Array{Any,1},Array{Any,1},Int64,Int64,Array{Any,1},Array{Any,1},Array{Float64,1},Int64,Array{Float64,1}},Float64},Float64,10})
Closest candidates are:
  Float64(::Real, ::RoundingMode) where T<:AbstractFloat at rounding.jl:200
  Float64(::T) where T<:Number at boot.jl:716
  Float64(::Float32) at float.jl:255

This is the line that calls the function:

        result = Optim.optimize(fn, lower, upper, xs0, Optim.Fminbox(inner_optimizer()), autodiff=:forward)

view this post on Zulip Andrés Riedemann (Feb 21 2021 at 01:42):

(What is the size of your input on the real problem? If it's too big, maybe using a reverse mode AD could speed up things significantly )

view this post on Zulip Rein Zustand (Feb 21 2021 at 01:43):

The size of xs0 is 30.

view this post on Zulip Rein Zustand (Feb 21 2021 at 01:44):

There is no option for :reverse, I only saw :finite and :forward.

view this post on Zulip Andrés Riedemann (Feb 21 2021 at 01:44):

On the problem found, that happens because you have somewhere in your function the expression a=Float64(b)

view this post on Zulip Rein Zustand (Feb 21 2021 at 01:45):

I grepped for the pattern Float64(, didn't find any.

view this post on Zulip Andrés Riedemann (Feb 21 2021 at 01:45):

Optim doesn't have automatic reverse mode included, but at that size I think Forward is fine

view this post on Zulip Andrés Riedemann (Feb 21 2021 at 01:46):

or maybe in the function signature?

view this post on Zulip Andrés Riedemann (Feb 21 2021 at 01:46):

f(x::Float64)

view this post on Zulip Rein Zustand (Feb 21 2021 at 01:47):

The only place where the text Float64 exists at all is in a mutable struct definition, e.g.

mutable struct Firm <: AbstractFirm
    Eg::Float64
    E_greens::Array{Float64,1}
    Eb::Float64
    E_browns::Array{Float64,1}

view this post on Zulip Rein Zustand (Feb 21 2021 at 01:49):

(What a shame that I have to go back to Python again after convincing my collaborator to have 2-3 days to reimplementing in Julia)

view this post on Zulip Andrés Riedemann (Feb 21 2021 at 01:50):

Do you mind putting the optimization problem here? , I cannot help much more without a working examole

view this post on Zulip Rein Zustand (Feb 21 2021 at 01:52):

Thanks a lot for that! I'm not allowed to share the code at this point until the paper has been published, but I can send you the code via PM.

view this post on Zulip Andrés Riedemann (Feb 21 2021 at 01:53):

Ok

view this post on Zulip Andrés Riedemann (Feb 21 2021 at 01:55):

If you are running an optimization problem on each point, and your points are not related, you could try multithreading the calculation

view this post on Zulip Rein Zustand (Feb 21 2021 at 01:57):

I will post relevant details in this topic so others can benefit once we have pinpointed the problem. Unfortunately to optimize xs[i], it depends on all the values from xs[1] to xs[i-1], because it is a time evolution.

view this post on Zulip Rein Zustand (Feb 21 2021 at 02:03):

Though my concern still stands. Optim.jl is much slower than scipy.optimize even on the simplest example.

view this post on Zulip Júlio Hoffimann (Feb 21 2021 at 02:04):

Make sure that you are benchmarking things properly and not counting precompilation time. After that, try to share some minimum working example in both languages so that people can try to debug and help further.

view this post on Zulip Rein Zustand (Feb 21 2021 at 02:04):

OK, nvm it's a precompilation problem.

view this post on Zulip Júlio Hoffimann (Feb 21 2021 at 02:04):

Yes, as usual :smile:

view this post on Zulip Júlio Hoffimann (Feb 21 2021 at 02:05):

Also, consider Julia v1.6-rc1, it is really flying compared to the current Julia v1.5

view this post on Zulip Júlio Hoffimann (Feb 21 2021 at 02:06):

To time the runtime of the optimize call from Optim.jl you can start a Julia session and include( "MWE.jl") to introduce the definitions. After that you can using BenchmarkTools and @btime optimize(...)

view this post on Zulip Rein Zustand (Feb 21 2021 at 02:07):

OK, I will use @btime to avoid precompilation time.

view this post on Zulip Rein Zustand (Feb 21 2021 at 02:24):

With btime.

  79.084 μs (699 allocations: 32.27 KiB)
julia bench.jl  81.97s user 3.43s system 100% cpu 1:25.28 total

view this post on Zulip Rein Zustand (Feb 21 2021 at 03:01):

@Andrés Riedemann has provided feedback in PM (appreciate it a lot):

Note that the Scipy version is 30s.

view this post on Zulip Júlio Hoffimann (Feb 21 2021 at 04:16):

You can probably decrease it a bit more with a few code changes. I don't understand how you get a microsecond in Julia and a millisecond in Python and at the end your Python version is twice as fast? You said 64s vs 30s at the end right ?

view this post on Zulip Rein Zustand (Feb 21 2021 at 04:38):

The millisecond vs microsecond is for the example code, which is already fully explained. The 64s vs 30s is the real code one.

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 11:22):

Did you make the real code fully non-allocating?

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 11:23):

and AD

view this post on Zulip Rein Zustand (Feb 21 2021 at 11:34):

Christopher Rackauckas said:

Did you make the real code fully non-allocating?

Huh, TIL. I specified the model params as global const, and my functions read from these consts directly instead of receiving them from function arguments.

view this post on Zulip Rein Zustand (Feb 21 2021 at 11:35):

Christopher Rackauckas said:

and AD

I assume Andrés Riedemann 's suggestion to use autodiff=:forward is to enable AD?

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 11:55):

I assume Andrés Riedemann 's suggestion to use autodiff=:forward is to enable AD?

yes

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 11:56):

so if you @time your objective function you get 0 allocs?

view this post on Zulip Rein Zustand (Feb 21 2021 at 12:22):

Nope, currently not yet lol

Optim.ConjugateGradient
elapsed: 74.16639494895935
 76.332368 seconds (885.22 M allocations: 25.710 GiB, 6.79% gc time)

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:23):

no

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:23):

that's not your objective function

view this post on Zulip Rein Zustand (Feb 21 2021 at 12:26):

Oh, ok let me try again.

view this post on Zulip Rein Zustand (Feb 21 2021 at 12:28):

 1.026235 seconds (1.75 M allocations: 91.610 MiB, 1.96% gc time)

This is if I simply run my objective function on the initial condition.

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:28):

yeah so that's generating a ton. That's not good.

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:28):

What if you @code_warntype it?

view this post on Zulip Rein Zustand (Feb 21 2021 at 12:31):

ERROR: LoadError: LoadError: UndefVarError: @code_warntype not defined

I assume that @code_warntype can only be used in the interpreter?

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:32):

it seems like you need a lot more help than that. You should learn a bit about the language first

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:32):

here's a tutorial that walks through a few things

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:32):

https://mitmath.github.io/18337/lecture2/optimizing

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:32):

or as a lecture

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:33):

https://www.youtube.com/watch?v=M2i7sSRcSIw&feature=youtu.be

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:33):

https://www.youtube.com/watch?v=10_Ukm9wr9g&feature=youtu.be

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:33):

I assume that @code_warntype can only be used in the interpreter?

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:33):

no

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:33):

using InteractiveUtils or just do Main.@code_warntype as a quick check

view this post on Zulip Kwaku Oskin (Feb 21 2021 at 12:34):

As a side note, it would be interesting to have a macro, something like @error_if_allocate which you can add to your mission critical function and it wouldn't run, until you fix it.

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:34):

If you want a more direct code optimization and profiling lecture, there's https://www.youtube.com/watch?v=h-xVBD2Pk9o&feature=youtu.be

view this post on Zulip Christopher Rackauckas (Feb 21 2021 at 12:35):

As a side note, it would be interesting to have a macro, something like @error_if_allocate which you can add to your mission critical function and it wouldn't run, until you fix it.

Sure, just do @assert @allocated(y=f(x)) == 0

view this post on Zulip Rein Zustand (Feb 21 2021 at 12:49):

@Christopher Rackauckas thank you for the pointers. I have about 1 or 2 days left to speed up the Julia code. I hope I will have enough time to go through the lecture and https://www.youtube.com/watch?v=h-xVBD2Pk9o&feature=youtu.be.

view this post on Zulip Rein Zustand (Feb 22 2021 at 12:20):

Quick update. Andrés Riedemann uncovered the reason why method=:forward autodiff causes error. It is because I use a mutable struct inside my objective functions, where:

  1. Its fields are restricted to Float64's. This causes error because ForwardDiff wants to pass a Dual number containing information about the value and its derivatives. The fields need to be of a Real type.
  2. An array push! update also breaks autodiff, possibly due to the typing as well.

view this post on Zulip Florian Große (Feb 24 2021 at 23:48):

Did you achieve the performance you wanted?

view this post on Zulip Rein Zustand (Feb 25 2021 at 02:15):

@Florian Große not yet. My concern is that without optimizations such as autodiff or 0-allocation, the Julia version shouldn't be slower than the Python version, since the Python version is also doing wasteful allocations and not using autodiff. I tried to profile using StatProfilerHTML.jl, but the line-by-line time spent report is not complete. Not all the relevant lines are time. I wish there is something like Python's https://github.com/pyutils/line_profiler.

view this post on Zulip Florian Große (Feb 25 2021 at 04:30):

Agreed. Once you have the paper published and can share the code, it would probably good to open this discussion on Discourse again. If it requires a few tricks to make it run fast, it's probably good to have it there as an example. Either that or you find room for improvement in a julia package, which is also fine.

view this post on Zulip Nils (Feb 25 2021 at 09:10):

And maybe also #math-optimization, I don't think @Patrick Kofod Mogensen is more active over there I believe

view this post on Zulip Patrick Kofod Mogensen (Feb 25 2021 at 21:20):

Yeah, hard to say much with the given information. First, I saw you used conjugate gradient descent in the only timing of the real problem I saw. That is really only going to be good (relatively speaking) on very large problems. However, 99% of the times I get these questions, the problem is in the objective function evaluation as @Christopher Rackauckas also tried to guide the conversation towards :) You did not provide a side-by-side timing of your runtime for your objective from python and from julia. I'd like to see that before we can conclude this is related to Optim.jl at all. Next, I'd ask you to print the progress, number of objective function evaluations, number of gradient function evaluations etc. I'm sure we'd be able to match the speed you see in Python, but if you're relatively new to Julia, 2-3 days to get optimal code is not a lot. Hope you get your paper through review, and I'm of course sad that Optim.jl lost a citation over this ;)

view this post on Zulip Christopher Rackauckas (Feb 25 2021 at 22:07):

Yeah I could've been more direct haha. I was trying to slowly guide him to finding out that he likely needed to optimize the objective function that he wrote. But yes, in more direct terms, unless you show your model is non-allocating and type-stable, there is zero evidence to suggest Optim.jl has an issue and so right now the onus is on the modeler.

view this post on Zulip Christopher Rackauckas (Feb 25 2021 at 22:07):

The speed of the optimization package itself has relatively little effect on the speed of optimization, other than in the number of steps it takes to get to the optimum. The speed is largely based on the way the user writes the objective (except in the limit of tiny f). A good optimization package takes less steps, which there hasn't been any discussion of "Optim.jl takes 2-3x as many steps", which would be what points to Optim. Pure timings don't mean that.

view this post on Zulip Rein Zustand (Feb 26 2021 at 01:22):

Update: the Julia version is now 2x faster! One major problem is that I defined a named function inside my objective function, to deduplicate some steps, and because the function uses variables local to the objective function (without being passed as arguments). Note: I also do this function-definition-inside-obj-fn in the Python version without much penalty?? Once I move out the function and pass in the local vars as arguments, there are a lot fewer allocations. The Python version takes 5 min 25 s. The Julia version now takes 2 min 12 s (sorry can't publish the code yet). I'm sure there is still room for improvement, but this is already a strong case to convince my collaborators to switch to Julia.

view this post on Zulip Rein Zustand (Feb 26 2021 at 01:25):

Note: moving out this function to outside my objective function resulted in 4.5x speedup relative to the previous iteration.

view this post on Zulip Rein Zustand (Feb 26 2021 at 01:42):

Before (@time, taken from my previous post)

Optim.ConjugateGradient
elapsed: 74.16639494895935
 76.332368 seconds (885.22 M allocations: 25.710 GiB, 6.79% gc time)

After (@time, including precompilation)

Optim.ConjugateGradient
elapsed: 15.73147702217102
 17.780150 seconds (97.40 M allocations: 9.558 GiB, 6.34% gc time)

After (@btime)

Optim.ConjugateGradient
elapsed: 12.357237100601196
  12.357 s (92734943 allocations: 9.33 GiB)

Python version is ~30s.

view this post on Zulip Patrick Kofod Mogensen (Feb 26 2021 at 07:42):

That sounds great. I of course have to ask: have you checked that the objective values/solution(minimizer) are comparable? There's always a chance that Optim stopped early for some reason. I still have a feeling you could get more performance out of your objective function, but without knowing your exact code it's hard to say. There are indeed a few pitfalls in Julia, and you need to know a little bit (Chris' links would be a great place to start) to get the most out of Julia. I think the function definition that closes over some variables might be a known issue that can sometimes surprise even me and others who are seasoned Julia users (I think you might be hitting this issue specifically https://github.com/JuliaLang/julia/issues/15276 )

view this post on Zulip Henrique Ferrolho (Feb 26 2021 at 09:03):

@Rein Zustand, do you mean that your previous approach was using a closure? I.e., something like below?

function f(x)
    a = rand(100)
    b = rand(100)

    function aux()
        a .* b .+ x
    end

    s = aux()
    t = aux()

    s .* t
end

And now you have changed it to something like this?

function aux(a, b, c)
    a .* b .+ c
end

function g(x)
    a = rand(100)
    b = rand(100)

    s = aux(a, b, x)
    t = aux(a, b, x)

    s .* t
end

view this post on Zulip Rein Zustand (Feb 26 2021 at 09:09):

@Henrique Ferrolho yes, exactly.

view this post on Zulip Henrique Ferrolho (Feb 26 2021 at 09:11):

That is odd. I don't think it should make a difference in time or allocations. These are the results I get on my machine when I benchmark the snippets above:

julia> using BenchmarkTools

julia> @btime f(x);
  456.736 ns (5 allocations: 4.38 KiB)

julia> @btime g(x);
  452.822 ns (5 allocations: 4.38 KiB)

view this post on Zulip Rein Zustand (Feb 26 2021 at 09:16):

I will try to provide a snippet that reproduces my situation more closely.

view this post on Zulip Henrique Ferrolho (Feb 26 2021 at 09:17):

Perhaps something else changed when you switched from one approach to the other? What I am trying to say is that the performance improvement you observe should not be due to the use of a closure. I'd be interested to know what the key change was, then.

view this post on Zulip Rein Zustand (Feb 26 2021 at 09:27):

! Gotcha. I just checked my commit. I did turn a model parameter defined in a global setting into a const, together with the change that moved out the function to be outside of the objective commit, all within a single commit. My bad.

I tried reverting by putting the function back inside my objective function, while keeping the variable a const, and I got the same allocation. I am very sorry for the false alarm.

view this post on Zulip Henrique Ferrolho (Feb 26 2021 at 09:37):

No need to apologise. So, the improvement was actually thanks to declaring a global variable as const. Did I get that right?

view this post on Zulip Rein Zustand (Feb 26 2021 at 09:40):

Yes, I have isolated the cause to be indeed setting the const. Reverting it back to without const slows down the btime by ~4x.

view this post on Zulip Henrique Ferrolho (Feb 26 2021 at 09:44):

Great! In general, it is best to avoid global variables. But if you do want to have a global variable, it is critical to make it const, otherwise the compiler assumes that its value and type can change at any point, and thus it is not able to specialise on one given type. This is mentioned in Performance Tips > Avoid global variables. (The entire Performance Tips is gold. Do read it, if you haven't. :smiley:)

view this post on Zulip Rein Zustand (Feb 26 2021 at 10:07):

For that Performance tips article. It recommends @time instead of @btime. I suppose because BenchmarkTools.jl is an external tool, and so the official manual can't recommend it?

view this post on Zulip Kwaku Oskin (Feb 26 2021 at 10:11):

Hmmm... At the end of the section https://docs.julialang.org/en/v1/manual/performance-tips/#Measure-performance-with-@time-and-pay-attention-to-memory-allocation there is a note "For more serious benchmarking, consider the BenchmarkTools.jl package which among other things evaluates the function multiple times in order to reduce noise."

view this post on Zulip Rein Zustand (Mar 02 2021 at 05:28):

I hit a new roadblock with Optim's conjugate gradient. The parameters for my objective function are bounded by 0 <= x <= 1. While Optim.ConjugateGradient is fast, sometimes I observe a discontinuous spike in my result. I observed similar spikes too in scipy, but at least in scipy, the spikes consistently disappear if I use L-BFGS-B. There is a Optim.LBFGS in Optim, but it is 7x slower than Optim.ConjugateGradient (so, in effect, Optim is 3.5x slower than scipy).

By using Optim.ConjugateGradient I'm making my comparison closer to apple-to-apple with the scipy version. I have pre-allocated my mutable structs to reduce allocations, and this still doesn't improve the time spent significantly.

Unfortunately, I'm short of time to investigate the cause further, and am forced to use the scipy version to continue my research. :oh_no:

view this post on Zulip Rein Zustand (Mar 02 2021 at 05:33):

s/By using Optim.ConjugateGradient/By using Optim.LBFGS/. (I realized that zulip-archive doesn't handle message edits). test

view this post on Zulip Mason Protter (Mar 02 2021 at 05:57):

Are your allocations down to zero?

view this post on Zulip Mason Protter (Mar 02 2021 at 05:57):

I.e. are you allocating new arrays in each iteration?

view this post on Zulip Rein Zustand (Mar 02 2021 at 09:54):

My allocations are not exactly zero, but they are about 33k now (used to be 6M). But then, the Python version is doing wasteful allocations. I'd expect the Julia version to be at worst comparable to the Python version (because I'm doing equivalent wasteful allocations in Python), and then progressively gets better when I reduce allocations. Is it wrong to think this way?

Another degree of freedom to consider is that the default params of the LBFGS in each params might differ as Patrick Kofod Mogensen stated in https://julialang.zulipchat.com/#narrow/stream/274208-helpdesk-%28published%29/topic/why.20is.20Optim.2Ejl.20so.20slow.3F/near/227892581.

view this post on Zulip Kwaku Oskin (Mar 02 2021 at 10:03):

It's not only about number of allocation, but about there "quality".

If for example you are slicing your array, you are making it's copy (and python uses analogue of Julia views).
So, if you have something like

v1 = rand(1_000_000_000)
v2 = v1[1:500_000_000]

you are doing maybe 2 or 3 allocations, but you are also copying half of your computer memory to a new place and as a result computations are slow.

Also, if there are some issues with type stability, you also waste precious time on dynamical dispatch.

If the following example

v = Any[1, 2, 3]
c = v[1] + 1

compiler spend 0.01 ns to make actual addition and 10ns to figure out the type of the v[1] variable (and also it has to pack c back to Any). So if this addition is a part of tight loop you can get huge slowdown. Since python usually vectorize such operation, it avoids this problem. I think in this case you can compare vectorization to function barriers, i.e. vector has rather bad type as an input, but inside the function it knows that all elements have some good type and do not waste time trying to figure out type of each element individually.

view this post on Zulip Rein Zustand (Mar 02 2021 at 10:20):

Thank you for the elaboration. I have checked the allocations using --track-allocation=user and found no such copying. For the type stability issue, my @code_warntype output is mostly green and yellow (for the loop variable). I have to note though that in my case, I do regular loops in the Python version because the operation can't be vectorized (it's a time evolution), so surely Python must be doing regular object operations.

view this post on Zulip Kwaku Oskin (Mar 02 2021 at 10:29):

Oh, those were meant only as examples. There can be other reasons why reasonably small number of allocations can have a large impact on the performance.

view this post on Zulip Rein Zustand (Mar 02 2021 at 10:52):

s/mostly green and yellow/mostly green, but the rest is yellow/

view this post on Zulip Rein Zustand (Mar 02 2021 at 11:36):

I'm down to only 1 allocation, which is due to @assert length(xs) == length(Ts) (hold up, it's not it, still investigating). Still about as slow. By process of elimination, the remaining uncertainty is in the LBFGS implementation of Optim vs scipy.optimize.


Last updated: Nov 22 2024 at 04:41 UTC