dataframe to matrix · helpdesk (published)

Mason Protter (Mar 31 2022 at 21:55):

julia> df = map(Iterators.product('a':'c', 1:3)) do (x, y)
           (;x, y, z=rand())
       end[:] |> DataFrame
9×3 DataFrame
 Row │ x     y      z
     │ Char  Int64  Float64
─────┼────────────────────────
   1 │ a         1  0.74997
   2 │ b         1  0.828512
   3 │ c         1  0.840793
   4 │ a         2  0.97316
   5 │ b         2  0.329049
   6 │ c         2  0.963159
   7 │ a         3  0.263909
   8 │ b         3  0.0101475
   9 │ c         3  0.162562

Is there a natural way to disaggregate this back into a matrix? That is, I want one dimension of the array to correspond to the unique values of the x column, another dimension to correspond to the unique values of the y column, and the entries of the matrix to be the corresponding z entries in the dataframe.

Mason Protter (Mar 31 2022 at 21:58):

julia> map(Iterators.product(unique(df.x), unique(df.y))) do (x, y)
           row = only(filter(row -> row.x == x && row.y == y, df))
           row.z
       end
3×3 Matrix{Float64}:
 0.71842   0.992283   0.102177
 0.901056  0.481119   0.835783
 0.921512  0.0621665  0.822107

But this is very inefficient, and I suspect there's a smart way to do this that I'm not seeing.

Michael Abbott (Mar 31 2022 at 22:27):

julia> using DataFrames, AxisKeys

julia> df = map(Iterators.product('a':'c', 1:3)) do (x, y)
                  (;x, y, z=rand())
              end[:] |> DataFrame
9×3 DataFrame
 Row │ x     y      z
     │ Char  Int64  Float64
─────┼────────────────────────
   1 │ a         1  0.972967
   2 │ b         1  0.255974
   3 │ c         1  0.0945194
   4 │ a         2  0.621327
   5 │ b         2  0.0908171
   6 │ c         2  0.763769
   7 │ a         3  0.342196
   8 │ b         3  0.187913
   9 │ c         3  0.972685

julia> wrapdims(df, :z, :x, :y)
2-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   x ∈ 3-element Vector{Char}
→   y ∈ 3-element Vector{Int64}
And data, 3×3 Matrix{Float64}:
         (1)          (2)          (3)
  ('a')    0.972967     0.621327     0.342196
  ('b')    0.255974     0.0908171    0.187913
  ('c')    0.0945194    0.763769     0.972685

But I am far from an expert at these things. I don't know if this is efficient or not.

jar (Mar 31 2022 at 22:29):

julia> (df = map(Iterators.product('a':'c', 1:3)) do (x, y)
                  (;x, y)
              end[:] |> DataFrame); df.z = 1:9;df
9×3 DataFrame
 Row │ x     y      z
     │ Char  Int64  Int64
─────┼────────────────────
   1 │ a         1      1
   2 │ b         1      2
   3 │ c         1      3
   4 │ a         2      4
   5 │ b         2      5
   6 │ c         2      6
   7 │ a         3      7
   8 │ b         3      8
   9 │ c         3      9

julia> select(unstack(df, :x,:y,:z), Not(:x))
3×3 DataFrame
 Row │ 1       2       3
     │ Int64?  Int64?  Int64?
─────┼────────────────────────
   1 │      1       4       7
   2 │      2       5       8
   3 │      3       6       9

Mason Protter (Mar 31 2022 at 22:41):

Nice, thanks guys. AxisKeys is a very slick solution. It's about an order of magnitude slower than @jar's solution, but also an order of magnitude faster than my attempt with filter.

Mason Protter (Mar 31 2022 at 22:42):

My problem with jar's solution, though, is that it's not a dense matrix at the end, but rather a collection of vectors.

Mason Protter (Mar 31 2022 at 22:42):

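let arr = Matrix{Float64}(undef, length(unique(df.x)), length(unique(df.y)))
    # One group per unique x; assumes each group lists its z values in the same y order.
    # When timing with BenchmarkTools' @btime, interpolate the global as $df.
    gdf = groupby(df, :x)
    for (i, sdf) ∈ enumerate(gdf)
        arr[i, :] .= sdf.z   # fill row i with that group's z values
    end
    arr
end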
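
The groupby loop above relies on the rows of each group arriving in a consistent y order. A minimal sketch of an order-independent variant, assuming plain column vectors such as df.x, df.y and df.z as input, looks up each row's position instead. The helper name to_matrix is hypothetical and not from the thread, and this is one possible approach rather than a benchmarked one.

```julia
# Hypothetical helper (not from the thread): scatter long-format columns
# into a dense matrix using index lookups instead of relying on row order.
function to_matrix(xs, ys, zs)
    xkeys = unique(xs)    # row labels, in order of first appearance
    ykeys = unique(ys)    # column labels, in order of first appearance
    xidx = Dict(k => i for (i, k) in enumerate(xkeys))
    yidx = Dict(k => j for (j, k) in enumerate(ykeys))
    # Pairs absent from the input stay NaN; duplicated pairs keep the last value.
    arr = fill(NaN, length(xkeys), length(ykeys))
    for (x, y, z) in zip(xs, ys, zs)
        arr[xidx[x], yidx[y]] = z
    end
    return arr, xkeys, ykeys
end

# Plain vectors standing in for df.x, df.y, df.z, laid out as in jar's example
xs = repeat(['a', 'b', 'c'], 3)
ys = repeat(1:3, inner = 3)
zs = collect(1.0:9.0)
arr, xkeys, ykeys = to_matrix(xs, ys, zs)
```

With a DataFrame in hand, the call would be to_matrix(df.x, df.y, df.z). Returning the key vectors alongside the matrix keeps track of which row and column each label landed on.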

Michael Abbott (Mar 31 2022 at 22:44):

How does something like Matrix{Float64}(unstack(df, :x,:y,:z)[:, 2:end]) perform?
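
For reference, a small self-contained sketch of that conversion, assuming DataFrames is installed and using the deterministic z values from jar's example rather than rand(). It only illustrates the shape of the result and says nothing about performance.

```julia
using DataFrames

# Same layout as jar's example, with Float64 values for z
df = DataFrame(x = repeat(['a', 'b', 'c'], 3),
               y = repeat(1:3, inner = 3),
               z = collect(1.0:9.0))

# Wide table: the x column first, then one column per unique y value
wide = unstack(df, :x, :y, :z)

# Drop the x column and convert; this throws if any (x, y) combination
# is absent, because unstack fills those cells with missing.
mat = Matrix{Float64}(wide[:, 2:end])
```

The row labels remain available as wide.x, so the correspondence between matrix rows and x values is not lost.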
