I have a research code which does multivariate optimization which runs for ~30s for each data point in Python. The code has been too slow to be analyzed, so I ported it to Julia, only to find that it takes ~160s (I used @time optimize(...)
, precompilation time was probably included?) in Julia using Optim.jl!
So I tried doing a benchmark comparing Optim.jl with scipy.optimize using the simplest function, and found that Optim.jl is much slower. Any idea on how to speed up Optim.jl? My Julia version is 1.5.3, my Python version is 3.8.6. I'm on NixOS and have had trouble upgrading to the latest version of the OS, so can't try 1.6rc.
My Julia code:
using Optim
f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
@time optimize(f, [0.0, 0.0], LBFGS())
$ time julia bench.jl
2.475465 seconds (4.23 M allocations: 217.432 MiB, 4.38% gc time)
julia bench.jl 10.81s user 0.35s system 99% cpu 11.176 total
My Python code:
import time
from scipy import optimize
f = lambda x: (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2
tic = time.time()
optimize.minimize(f, [0.0, 0.0], method='L-BFGS-B')
print('elapsed', time.time() - tic)
$ time python bench.py
elapsed 0.0051517486572265625
python bench.py 0.86s user 0.88s system 205% cpu 0.843 total
Mmm, I don't remember the optim defaults on derivatives, but, can you try
optimize(f, [0.0,0.0],LBFGS(),autodiff=:forward)
?
With autodiff=:forward
I got
3.208170 seconds (6.03 M allocations: 308.185 MiB, 4.39% gc time)
julia bench.jl 11.42s user 0.33s system 99% cpu 11.767 total
And what about the real problem?
OK, I will try on my real problem.
That benchmark is measuring mainly the startup time of optim
With autodiff=forward, you are creating the exact derivative via AD, so the evaluation of the gradient is faster
I got this error:
Optim.ConjugateGradient
ERROR: LoadError: MethodError: no method matching Float64(::ForwardDiff.Dual{ForwardDiff.Tag{var"#_calc_U#16"{Array{Any,1},Array{Any,1},Int64,Int64,Array{Any,1},Array{Any,1},Array{Float64,1},Int64,Array{Float64,1}},Float64},Float64,10})
Closest candidates are:
Float64(::Real, ::RoundingMode) where T<:AbstractFloat at rounding.jl:200
Float64(::T) where T<:Number at boot.jl:716
Float64(::Float32) at float.jl:255
This is the line that calls the function:
result = Optim.optimize(fn, lower, upper, xs0, Optim.Fminbox(inner_optimizer()), autodiff=:forward)
(What is the size of your input on the real problem? If it's too big, maybe using a reverse mode AD could speed up things significantly )
The size of xs0
is 30.
There is no option for :reverse
, I only saw :finite
and :forward
.
On the problem found, that happens because you have somewhere in your function the expression a=Float64(b)
I grepped for the pattern Float64(
, didn't find any.
Optim doesn't have automatic reverse mode included, but at that size I think Forward is fine
or maybe in the function signature?
f(x::Float64)
The only place where the text Float64
exists at all is in a mutable struct definition, e.g.
mutable struct Firm <: AbstractFirm
Eg::Float64
E_greens::Array{Float64,1}
Eb::Float64
E_browns::Array{Float64,1}
(What a shame that I have to go back to Python again after convincing my collaborator to have 2-3 days to reimplementing in Julia)
Do you mind putting the optimization problem here? , I cannot help much more without a working examole
Thanks a lot for that! I'm not allowed to share the code at this point until the paper has been published, but I can send you the code via PM.
Ok
If you are running an optimization problem on each point, and your points are not related, you could try multithreading the calculation
I will post relevant details in this topic so others can benefit once we have pinpointed the problem. Unfortunately to optimize xs[i], it depends on all the values from xs[1] to xs[i-1], because it is a time evolution.
Though my concern still stands. Optim.jl is much slower than scipy.optimize even on the simplest example.
Make sure that you are benchmarking things properly and not counting precompilation time. After that, try to share some minimum working example in both languages so that people can try to debug and help further.
OK, nvm it's a precompilation problem.
Yes, as usual :smile:
Also, consider Julia v1.6-rc1, it is really flying compared to the current Julia v1.5
To time the runtime of the optimize call from Optim.jl you can start a Julia session and include( "MWE.jl")
to introduce the definitions. After that you can using BenchmarkTools
and @btime optimize(...)
OK, I will use @btime
to avoid precompilation time.
With btime.
79.084 μs (699 allocations: 32.27 KiB)
julia bench.jl 81.97s user 3.43s system 100% cpu 1:25.28 total
@Andrés Riedemann has provided feedback in PM (appreciate it a lot):
const
. Specifying const
shaved down the time to ~76s (originally ~160s)= []
. After specifying them to Float64[]
or Array{Float64,1}[]
, it's down to 64s.Note that the Scipy version is 30s.
You can probably decrease it a bit more with a few code changes. I don't understand how you get a microsecond in Julia and a millisecond in Python and at the end your Python version is twice as fast? You said 64s vs 30s at the end right ?
The millisecond vs microsecond is for the example code, which is already fully explained. The 64s vs 30s is the real code one.
Did you make the real code fully non-allocating?
and AD
Christopher Rackauckas said:
Did you make the real code fully non-allocating?
Huh, TIL. I specified the model params as global const
, and my functions read from these consts directly instead of receiving them from function arguments.
Christopher Rackauckas said:
and AD
I assume Andrés Riedemann 's suggestion to use autodiff=:forward
is to enable AD?
I assume Andrés Riedemann 's suggestion to use autodiff=:forward is to enable AD?
yes
so if you @time
your objective function you get 0 allocs?
Nope, currently not yet lol
Optim.ConjugateGradient
elapsed: 74.16639494895935
76.332368 seconds (885.22 M allocations: 25.710 GiB, 6.79% gc time)
no
that's not your objective function
Oh, ok let me try again.
1.026235 seconds (1.75 M allocations: 91.610 MiB, 1.96% gc time)
This is if I simply run my objective function on the initial condition.
yeah so that's generating a ton. That's not good.
What if you @code_warntype
it?
ERROR: LoadError: LoadError: UndefVarError: @code_warntype not defined
I assume that @code_warntype
can only be used in the interpreter?
it seems like you need a lot more help than that. You should learn a bit about the language first
here's a tutorial that walks through a few things
https://mitmath.github.io/18337/lecture2/optimizing
or as a lecture
https://www.youtube.com/watch?v=M2i7sSRcSIw&feature=youtu.be
https://www.youtube.com/watch?v=10_Ukm9wr9g&feature=youtu.be
I assume that @code_warntype can only be used in the interpreter?
no
using InteractiveUtils
or just do Main.@code_warntype
as a quick check
As a side note, it would be interesting to have a macro, something like @error_if_allocate
which you can add to your mission critical function and it wouldn't run, until you fix it.
If you want a more direct code optimization and profiling lecture, there's https://www.youtube.com/watch?v=h-xVBD2Pk9o&feature=youtu.be
As a side note, it would be interesting to have a macro, something like @error_if_allocate which you can add to your mission critical function and it wouldn't run, until you fix it.
Sure, just do @assert @allocated(y=f(x)) == 0
@Christopher Rackauckas thank you for the pointers. I have about 1 or 2 days left to speed up the Julia code. I hope I will have enough time to go through the lecture and https://www.youtube.com/watch?v=h-xVBD2Pk9o&feature=youtu.be.
Quick update. Andrés Riedemann uncovered the reason why method=:forward
autodiff causes error. It is because I use a mutable struct inside my objective functions, where:
Float64
's. This causes error because ForwardDiff
wants to pass a Dual number containing information about the value and its derivatives. The fields need to be of a Real
type.push!
update also breaks autodiff, possibly due to the typing as well.Did you achieve the performance you wanted?
@Florian Große not yet. My concern is that without optimizations such as autodiff or 0-allocation, the Julia version shouldn't be slower than the Python version, since the Python version is also doing wasteful allocations and not using autodiff. I tried to profile using StatProfilerHTML.jl, but the line-by-line time spent report is not complete. Not all the relevant lines are time. I wish there is something like Python's https://github.com/pyutils/line_profiler.
Agreed. Once you have the paper published and can share the code, it would probably good to open this discussion on Discourse again. If it requires a few tricks to make it run fast, it's probably good to have it there as an example. Either that or you find room for improvement in a julia package, which is also fine.
And maybe also #math-optimization, I don't think @Patrick Kofod Mogensen is more active over there I believe
Yeah, hard to say much with the given information. First, I saw you used conjugate gradient descent in the only timing of the real problem I saw. That is really only going to be good (relatively speaking) on very large problems. However, 99% of the times I get these questions, the problem is in the objective function evaluation as @Christopher Rackauckas also tried to guide the conversation towards :) You did not provide a side-by-side timing of your runtime for your objective from python and from julia. I'd like to see that before we can conclude this is related to Optim.jl at all. Next, I'd ask you to print the progress, number of objective function evaluations, number of gradient function evaluations etc. I'm sure we'd be able to match the speed you see in Python, but if you're relatively new to Julia, 2-3 days to get optimal code is not a lot. Hope you get your paper through review, and I'm of course sad that Optim.jl lost a citation over this ;)
Yeah I could've been more direct haha. I was trying to slowly guide him to finding out that he likely needed to optimize the objective function that he wrote. But yes, in more direct terms, unless you show your model is non-allocating and type-stable, there is zero evidence to suggest Optim.jl has an issue and so right now the onus is on the modeler.
The speed of the optimization package itself has relatively little effect on the speed of optimization, other than in the number of steps it takes to get to the optimum. The speed is largely based on the way the user writes the objective (except in the limit of tiny f
). A good optimization package takes less steps, which there hasn't been any discussion of "Optim.jl takes 2-3x as many steps", which would be what points to Optim. Pure timings don't mean that.
Update: the Julia version is now 2x faster! One major problem is that I defined a named function inside my objective function, to deduplicate some steps, and because the function uses variables local to the objective function (without being passed as arguments). Note: I also do this function-definition-inside-obj-fn in the Python version without much penalty?? Once I move out the function and pass in the local vars as arguments, there are a lot fewer allocations. The Python version takes 5 min 25 s. The Julia version now takes 2 min 12 s (sorry can't publish the code yet). I'm sure there is still room for improvement, but this is already a strong case to convince my collaborators to switch to Julia.
Note: moving out this function to outside my objective function resulted in 4.5x speedup relative to the previous iteration.
Before (@time, taken from my previous post)
Optim.ConjugateGradient
elapsed: 74.16639494895935
76.332368 seconds (885.22 M allocations: 25.710 GiB, 6.79% gc time)
After (@time, including precompilation)
Optim.ConjugateGradient
elapsed: 15.73147702217102
17.780150 seconds (97.40 M allocations: 9.558 GiB, 6.34% gc time)
After (@btime)
Optim.ConjugateGradient
elapsed: 12.357237100601196
12.357 s (92734943 allocations: 9.33 GiB)
Python version is ~30s.
That sounds great. I of course have to ask: have you checked that the objective values/solution(minimizer) are comparable? There's always a chance that Optim stopped early for some reason. I still have a feeling you could get more performance out of your objective function, but without knowing your exact code it's hard to say. There are indeed a few pitfalls in Julia, and you need to know a little bit (Chris' links would be a great place to start) to get the most out of Julia. I think the function definition that closes over some variables might be a known issue that can sometimes surprise even me and others who are seasoned Julia users (I think you might be hitting this issue specifically https://github.com/JuliaLang/julia/issues/15276 )
@Rein Zustand, do you mean that your previous approach was using a closure? I.e., something like below?
function f(x)
a = rand(100)
b = rand(100)
function aux()
a .* b .+ x
end
s = aux()
t = aux()
s .* t
end
And now you have changed it to something like this?
function aux(a, b, c)
a .* b .+ c
end
function g(x)
a = rand(100)
b = rand(100)
s = aux(a, b, x)
t = aux(a, b, x)
s .* t
end
@Henrique Ferrolho yes, exactly.
That is odd. I don't think it should make a difference in time or allocations. These are the results I get on my machine when I benchmark the snippets above:
julia> using BenchmarkTools
julia> @btime f(x);
456.736 ns (5 allocations: 4.38 KiB)
julia> @btime g(x);
452.822 ns (5 allocations: 4.38 KiB)
I will try to provide a snippet that reproduces my situation more closely.
Perhaps something else changed when you switched from one approach to the other? What I am trying to say is that the performance improvement you observe should not be due to the use of a closure. I'd be interested to know what the key change was, then.
! Gotcha. I just checked my commit. I did turn a model parameter defined in a global setting into a const, together with the change that moved out the function to be outside of the objective commit, all within a single commit. My bad.
I tried reverting by putting the function back inside my objective function, while keeping the variable a const, and I got the same allocation. I am very sorry for the false alarm.
No need to apologise. So, the improvement was actually thanks to declaring a global variable as const
. Did I get that right?
Yes, I have isolated the cause to be indeed setting the const. Reverting it back to without const slows down the btime by ~4x.
Great! In general, it is best to avoid global variables. But if you do want to have a global variable, it is critical to make it const
, otherwise the compiler assumes that its value and type can change at any point, and thus it is not able to specialise on one given type. This is mentioned in Performance Tips > Avoid global variables. (The entire Performance Tips is gold. Do read it, if you haven't. :smiley:)
For that Performance tips article. It recommends @time
instead of @btime
. I suppose because BenchmarkTools.jl is an external tool, and so the official manual can't recommend it?
Hmmm... At the end of the section https://docs.julialang.org/en/v1/manual/performance-tips/#Measure-performance-with-@time-and-pay-attention-to-memory-allocation there is a note "For more serious benchmarking, consider the BenchmarkTools.jl package which among other things evaluates the function multiple times in order to reduce noise."
I hit a new roadblock with Optim's conjugate gradient. The parameters for my objective function are bounded by 0 <= x <= 1. While Optim.ConjugateGradient
is fast, sometimes I observe a discontinuous spike in my result. I observed similar spikes too in scipy, but at least in scipy, the spikes consistently disappear if I use L-BFGS-B
. There is a Optim.LBFGS
in Optim
, but it is 7x slower than Optim.ConjugateGradient
(so, in effect, Optim is 3.5x slower than scipy).
By using Optim.ConjugateGradient
I'm making my comparison closer to apple-to-apple with the scipy version. I have pre-allocated my mutable structs to reduce allocations, and this still doesn't improve the time spent significantly.
Unfortunately, I'm short of time to investigate the cause further, and am forced to use the scipy version to continue my research. :oh_no:
s/By using Optim.ConjugateGradient
/By using Optim.LBFGS
/. (I realized that zulip-archive doesn't handle message edits). test
Are your allocations down to zero?
I.e. are you allocating new arrays in each iteration?
My allocations are not exactly zero, but they are about 33k now (used to be 6M). But then, the Python version is doing wasteful allocations. I'd expect the Julia version to be at worst comparable to the Python version (because I'm doing equivalent wasteful allocations in Python), and then progressively gets better when I reduce allocations. Is it wrong to think this way?
Another degree of freedom to consider is that the default params of the LBFGS in each params might differ as Patrick Kofod Mogensen stated in https://julialang.zulipchat.com/#narrow/stream/274208-helpdesk-%28published%29/topic/why.20is.20Optim.2Ejl.20so.20slow.3F/near/227892581.
It's not only about number of allocation, but about there "quality".
If for example you are slicing your array, you are making it's copy (and python uses analogue of Julia views).
So, if you have something like
v1 = rand(1_000_000_000)
v2 = v1[1:500_000_000]
you are doing maybe 2 or 3 allocations, but you are also copying half of your computer memory to a new place and as a result computations are slow.
Also, if there are some issues with type stability, you also waste precious time on dynamical dispatch.
If the following example
v = Any[1, 2, 3]
c = v[1] + 1
compiler spend 0.01 ns to make actual addition and 10ns to figure out the type of the v[1]
variable (and also it has to pack c
back to Any
). So if this addition is a part of tight loop you can get huge slowdown. Since python usually vectorize such operation, it avoids this problem. I think in this case you can compare vectorization to function barriers, i.e. vector has rather bad type as an input, but inside the function it knows that all elements have some good type and do not waste time trying to figure out type of each element individually.
Thank you for the elaboration. I have checked the allocations using --track-allocation=user
and found no such copying. For the type stability issue, my @code_warntype
output is mostly green and yellow (for the loop variable). I have to note though that in my case, I do regular loops in the Python version because the operation can't be vectorized (it's a time evolution), so surely Python must be doing regular object
operations.
Oh, those were meant only as examples. There can be other reasons why reasonably small number of allocations can have a large impact on the performance.
s/mostly green and yellow/mostly green, but the rest is yellow/
I'm down to only 1 allocation, which is due to (hold up, it's not it, still investigating). Still about as slow. By process of elimination, the remaining uncertainty is in the LBFGS implementation of @assert length(xs) == length(Ts)
Optim
vs scipy.optimize
.
Last updated: Nov 22 2024 at 04:41 UTC