Stream: helpdesk (published)

Topic: dataframe to matrix


view this post on Zulip Mason Protter (Mar 31 2022 at 21:55):

Suppose I have a dataframe like this:

julia> map(Iterators.product('a':'c', 1:3)) do (x, y)
           (;x, y, z=rand())
       end[:] |> DataFrame
9×3 DataFrame
 Row  x     y      z
      Char  Int64  Float64
─────┼────────────────────────
   1  a         1  0.74997
   2  b         1  0.828512
   3  c         1  0.840793
   4  a         2  0.97316
   5  b         2  0.329049
   6  c         2  0.963159
   7  a         3  0.263909
   8  b         3  0.0101475
   9  c         3  0.162562

Is there a natural way to kinda disaggregate this back into a matrix? I.e. I want one dimension of the array to correspond to the unique values of the x column, another dimension to correspond to the unique values of the y column, and then entries of the matrix would be the corresponding entries in the dataframe

view this post on Zulip Mason Protter (Mar 31 2022 at 21:58):

I could do this like so:

julia> map(Iterators.product(unique(df.x), unique(df.y))) do (x, y)
           row = only(filter(row -> row.x == x && row.y == y, df))
           row.z
       end
3×3 Matrix{Float64}:
 0.71842   0.992283   0.102177
 0.901056  0.481119   0.835783
 0.921512  0.0621665  0.822107

but this is very inefficient, and I suspect there's a smart way to do this that I'm not seeing

view this post on Zulip Michael Abbott (Mar 31 2022 at 22:27):

Here is one way:

julia> using DataFrames, AxisKeys

julia> df = map(Iterators.product('a':'c', 1:3)) do (x, y)
                  (;x, y, z=rand())
              end[:] |> DataFrame
9×3 DataFrame
 Row  x     y      z
      Char  Int64  Float64
─────┼────────────────────────
   1  a         1  0.972967
   2  b         1  0.255974
   3  c         1  0.0945194
   4  a         2  0.621327
   5  b         2  0.0908171
   6  c         2  0.763769
   7  a         3  0.342196
   8  b         3  0.187913
   9  c         3  0.972685

julia> wrapdims(df, :z, :x, :y)
2-dimensional KeyedArray(NamedDimsArray(...)) with keys:
   x  3-element Vector{Char}
   y  3-element Vector{Int64}
And data, 3×3 Matrix{Float64}:
         (1)          (2)          (3)
  ('a')    0.972967     0.621327     0.342196
  ('b')    0.255974     0.0908171    0.187913
  ('c')    0.0945194    0.763769     0.972685

But am far from an expert at these things. I don't know if this is efficient or not.

view this post on Zulip jar (Mar 31 2022 at 22:29):

julia> (df = map(Iterators.product('a':'c', 1:3)) do (x, y)
                  (;x, y)
              end[:] |> DataFrame); df.z = 1:9;df
9×3 DataFrame
 Row  x     y      z
      Char  Int64  Int64
─────┼────────────────────
   1  a         1      1
   2  b         1      2
   3  c         1      3
   4  a         2      4
   5  b         2      5
   6  c         2      6
   7  a         3      7
   8  b         3      8
   9  c         3      9

julia> select(unstack(df, :x,:y,:z), Not(:x))
3×3 DataFrame
 Row  1       2       3
      Int64?  Int64?  Int64?
─────┼────────────────────────
   1       1       4       7
   2       2       5       8
   3       3       6       9

view this post on Zulip Mason Protter (Mar 31 2022 at 22:41):

Nice, thanks guys. AxisKeys is a very slick solution. It's about an order of magnitude slower than @jar's solution, but also an order of magnitude faster than my attempt with filter.

view this post on Zulip Mason Protter (Mar 31 2022 at 22:42):

My problem with Jar's solution though is that it's not a dense matrix at the end, but rather a collection of vectors

view this post on Zulip Mason Protter (Mar 31 2022 at 22:42):

The fastest method I've fouind so far is

let arr = Matrix{Float64}(undef, length(unique($df.x)), length(unique($df.y)))
    gdf = groupby($df, :x)
    for (i, sdf)  enumerate(gdf)
        arr[i, :] .= sdf.z
    end
    arr
end

but it's ugly and not very generic

view this post on Zulip Michael Abbott (Mar 31 2022 at 22:44):

How does something like Matrix{Float64}(unstack(df, :x,:y,:z)[:, 2:end]) perform?

view this post on Zulip Mason Protter (Mar 31 2022 at 22:47):

Seems about as fast as Jar's solution

view this post on Zulip Mason Protter (Mar 31 2022 at 23:02):

Thanks guys!


Last updated: Nov 22 2024 at 04:41 UTC