In some code that I'm writing, I have to calculate a distance matrix comparing 60k images. This results in a 60k by 60k matrix, which is too large to store in memory. I was wondering what would be an appropriate way to handle this types of situations. Is there a datatype which I should save the data to? Also, this matrix will be passed on to another function, which will perform dimensionality reduction... Any ideas on how to make this work?
Could you instead of a matrix use a lazy approach, e.g., just use a function of the indices? Otherwise, in your distance matrix, which is important to you? the large distances or the sort ones? If you are looking for similar images, maybe you are more interested in the inverse of the distance, and then you could discard all the values below a certain threshold and store only the large inverse distances as a sparse matrix?
Can you use a dataframe?
You may want to look into a spatially chunked array format that transparently spills to disk. https://github.com/JuliaIO/HDF5.jl and https://github.com/meggart/Zarr.jl would be good candidates, perhaps https://github.com/JuliaGeo/NetCDF.jl as well.
There's also https://docs.julialang.org/en/v1/stdlib/Mmap/#Mmap.mmap and https://docs.julialang.org/en/v1/stdlib/SharedArrays/. Might be worth trying one of these first.
A DataFrame surely wouldn’t help, they’re all in memory right?
Here's something you can do, which is essentially what @Benoit Pasquier suggests I think:
julia> using LinearMaps, Arpack
julia> @time let N = 10_000
i1 = sin.(range(0, 2π, length=N))
i2 = exp.(range(0, 2π, length=N))
function lazy_distance_kernel(v)
@assert size(v) == (N,)
map(1:N) do i
sum(j -> (i1[i] - i2[j]) * v[j], 1:N)
end
end
function lazy_distance_kernel_tr(v)
@assert size(v) == (N,)
map(1:N) do i
sum(j -> (i1[j] - i2[i]) * v[j], 1:N)
end
end
M = LinearMap(lazy_distance_kernel, lazy_distance_kernel_tr, N, N)
U, S, V = svds(M; nsv=4)[1]
S
end
7.158159 seconds (2.43 M allocations: 150.013 MiB, 0.58% gc time, 17.93% compilation time)
4-element Vector{Float64}:
1.510995769332172e6
5843.092821965402
0.0
0.0
So here, I defined a lazy version of a distance matrix using LinearMaps.jl, and then just stuck it in Arpack.svds
and got the singular value decomposition of the matrix truncated to 4 singular values.
This is able to act without ever actually instantiating the big matrix, so it's at least memory efficient.
And of course, if we want to get better performance, we should do things in-place and make use of LoopVectorization.jl:
julia> @time let N = 10_000
i1 = sin.(range(0, 2π, length=N))
i2 = exp.(range(0, 2π, length=N))
function lazy_distance_kernel(u, v)
@assert size(v) == (N,)
@tturbo for i ∈ 1:N
ui = 0.0
for j ∈ 1:N
ui += (i1[i] - i2[j]) * v[j]
end
u[i] = ui
end
end
function lazy_distance_kernel_tr(u, v)
@assert size(v) == (N,)
@tturbo for i ∈ 1:N
ui = 0.0
for j ∈ 1:N
ui += (i1[j] - i2[i]) * v[j]
end
u[i] = ui
end
end
M = LinearMap(lazy_distance_kernel, lazy_distance_kernel_tr, N, N, ismutating=true)
U, S, V = svds(M; nsv=4)[1]
S
end
1.146122 seconds (2.07 M allocations: 122.852 MiB, 3.97% gc time, 82.02% compilation time)
4-element Vector{Float64}:
1.5109957693321719e6
5843.092821965383
0.0
0.0
Here it is on the actual problem size of 60k x 60k:
julia> @time let N = 60_000
i1 = sin.(range(0, 2π, length=N))
i2 = exp.(range(0, 2π, length=N))
function lazy_distance_kernel(u, v)
@assert size(v) == (N,)
@tturbo for i ∈ 1:N
ui = 0.0
for j ∈ 1:N
ui += (i1[i] - i2[j]) * v[j]
end
u[i] = ui
end
end
function lazy_distance_kernel_tr(u, v)
@assert size(v) == (N,)
@tturbo for i ∈ 1:N
ui = 0.0
for j ∈ 1:N
ui += (i1[j] - i2[i]) * v[j]
end
u[i] = ui
end
end
M = LinearMap(lazy_distance_kernel, lazy_distance_kernel_tr, N, N, ismutating=true)
U, S, V = svds(M; nsv=4)[1]
S
end
6.174817 seconds (2.07 M allocations: 140.818 MiB, 0.69% gc time, 15.12% compilation time)
4-element Vector{Float64}:
9.063979155608427e6
35059.355544173464
0.0
0.0
Note that 15% of that is compilation overhead (your first run may be worse)
Most importantly though, the memory used here never gets anywhere near being problematically large. Does that help at all @Davi Sales Barreira?
Awesome, thank!
Last updated: Nov 22 2024 at 04:41 UTC