Stream: helpdesk (published)

Topic: Working with Huge Matrices


view this post on Zulip Davi Sales Barreira (May 24 2021 at 23:57):

In some code that I'm writing, I have to calculate a distance matrix comparing 60k images. This results in a 60k by 60k matrix, which is too large to store in memory. I was wondering what would be an appropriate way to handle this types of situations. Is there a datatype which I should save the data to? Also, this matrix will be passed on to another function, which will perform dimensionality reduction... Any ideas on how to make this work?

view this post on Zulip Benoit Pasquier (May 25 2021 at 00:48):

Could you instead of a matrix use a lazy approach, e.g., just use a function of the indices? Otherwise, in your distance matrix, which is important to you? the large distances or the sort ones? If you are looking for similar images, maybe you are more interested in the inverse of the distance, and then you could discard all the values below a certain threshold and store only the large inverse distances as a sparse matrix?

view this post on Zulip brett knoss (Jun 01 2021 at 22:05):

Can you use a dataframe?

view this post on Zulip Brian Chen (Jun 01 2021 at 22:08):

You may want to look into a spatially chunked array format that transparently spills to disk. https://github.com/JuliaIO/HDF5.jl and https://github.com/meggart/Zarr.jl would be good candidates, perhaps https://github.com/JuliaGeo/NetCDF.jl as well.

view this post on Zulip Brian Chen (Jun 01 2021 at 22:15):

There's also https://docs.julialang.org/en/v1/stdlib/Mmap/#Mmap.mmap and https://docs.julialang.org/en/v1/stdlib/SharedArrays/. Might be worth trying one of these first.

view this post on Zulip Mason Protter (Jun 01 2021 at 22:20):

A DataFrame surely wouldn’t help, they’re all in memory right?

view this post on Zulip Mason Protter (Jun 01 2021 at 23:12):

Here's something you can do, which is essentially what @Benoit Pasquier suggests I think:

julia> using LinearMaps, Arpack

julia> @time let N = 10_000
           i1 = sin.(range(0, 2π, length=N))
           i2 = exp.(range(0, 2π, length=N))
           function lazy_distance_kernel(v)
               @assert size(v) == (N,)
               map(1:N) do i
                   sum(j -> (i1[i] - i2[j]) * v[j], 1:N)
               end
           end
           function lazy_distance_kernel_tr(v)
               @assert size(v) == (N,)
               map(1:N) do i
                  sum(j -> (i1[j] - i2[i]) * v[j], 1:N)
               end
           end
           M = LinearMap(lazy_distance_kernel, lazy_distance_kernel_tr, N, N)
           U, S, V = svds(M; nsv=4)[1]

           S
       end
  7.158159 seconds (2.43 M allocations: 150.013 MiB, 0.58% gc time, 17.93% compilation time)
4-element Vector{Float64}:
    1.510995769332172e6
 5843.092821965402
    0.0
    0.0

So here, I defined a lazy version of a distance matrix using LinearMaps.jl, and then just stuck it in Arpack.svds and got the singular value decomposition of the matrix truncated to 4 singular values.

view this post on Zulip Mason Protter (Jun 01 2021 at 23:13):

This is able to act without ever actually instantiating the big matrix, so it's at least memory efficient.

view this post on Zulip Mason Protter (Jun 01 2021 at 23:24):

And of course, if we want to get better performance, we should do things in-place and make use of LoopVectorization.jl:

julia> @time let N = 10_000
           i1 = sin.(range(0, 2π, length=N))
           i2 = exp.(range(0, 2π, length=N))
           function lazy_distance_kernel(u, v)
               @assert size(v) == (N,)
               @tturbo for i  1:N
                   ui = 0.0
                   for j  1:N
                       ui += (i1[i] - i2[j]) * v[j]
                   end
                   u[i] = ui
               end
           end
           function lazy_distance_kernel_tr(u, v)
               @assert size(v) == (N,)
               @tturbo for i  1:N
                   ui = 0.0
                   for j  1:N
                       ui += (i1[j] - i2[i]) * v[j]
                   end
                   u[i] = ui
               end
           end
           M = LinearMap(lazy_distance_kernel, lazy_distance_kernel_tr, N, N, ismutating=true)
           U, S, V = svds(M; nsv=4)[1]

           S
       end
  1.146122 seconds (2.07 M allocations: 122.852 MiB, 3.97% gc time, 82.02% compilation time)
4-element Vector{Float64}:
    1.5109957693321719e6
 5843.092821965383
    0.0
    0.0

view this post on Zulip Mason Protter (Jun 01 2021 at 23:24):

Here it is on the actual problem size of 60k x 60k:

julia> @time let N = 60_000
           i1 = sin.(range(0, 2π, length=N))
           i2 = exp.(range(0, 2π, length=N))
           function lazy_distance_kernel(u, v)
               @assert size(v) == (N,)
               @tturbo for i  1:N
                   ui = 0.0
                   for j  1:N
                       ui += (i1[i] - i2[j]) * v[j]
                   end
                   u[i] = ui
               end
           end
           function lazy_distance_kernel_tr(u, v)
               @assert size(v) == (N,)
               @tturbo for i  1:N
                   ui = 0.0
                   for j  1:N
                       ui += (i1[j] - i2[i]) * v[j]
                   end
                   u[i] = ui
               end
           end
           M = LinearMap(lazy_distance_kernel, lazy_distance_kernel_tr, N, N, ismutating=true)
           U, S, V = svds(M; nsv=4)[1]

           S
       end
  6.174817 seconds (2.07 M allocations: 140.818 MiB, 0.69% gc time, 15.12% compilation time)
4-element Vector{Float64}:
     9.063979155608427e6
 35059.355544173464
     0.0
     0.0

view this post on Zulip Mason Protter (Jun 01 2021 at 23:25):

Note that 15% of that is compilation overhead (your first run may be worse)

view this post on Zulip Mason Protter (Jun 01 2021 at 23:27):

Most importantly though, the memory used here never gets anywhere near being problematically large. Does that help at all @Davi Sales Barreira?

view this post on Zulip Davi Sales Barreira (Jun 01 2021 at 23:31):

Awesome, thank!


Last updated: Nov 22 2024 at 04:41 UTC