Suppose I have a dataframe like this:
julia> map(Iterators.product('a':'c', 1:3)) do (x, y)
(;x, y, z=rand())
end[:] |> DataFrame
9×3 DataFrame
Row │ x y z
│ Char Int64 Float64
─────┼────────────────────────
1 │ a 1 0.74997
2 │ b 1 0.828512
3 │ c 1 0.840793
4 │ a 2 0.97316
5 │ b 2 0.329049
6 │ c 2 0.963159
7 │ a 3 0.263909
8 │ b 3 0.0101475
9 │ c 3 0.162562
Is there a natural way to kinda disaggregate this back into a matrix? I.e. I want one dimension of the array to correspond to the unique values of the x
column, another dimension to correspond to the unique values of the y
column, and then entries of the matrix would be the corresponding entries in the dataframe
I could do this like so:
julia> map(Iterators.product(unique(df.x), unique(df.y))) do (x, y)
row = only(filter(row -> row.x == x && row.y == y, df))
row.z
end
3×3 Matrix{Float64}:
0.71842 0.992283 0.102177
0.901056 0.481119 0.835783
0.921512 0.0621665 0.822107
but this is very inefficient, and I suspect there's a smart way to do this that I'm not seeing
Here is one way:
julia> using DataFrames, AxisKeys
julia> df = map(Iterators.product('a':'c', 1:3)) do (x, y)
(;x, y, z=rand())
end[:] |> DataFrame
9×3 DataFrame
Row │ x y z
│ Char Int64 Float64
─────┼────────────────────────
1 │ a 1 0.972967
2 │ b 1 0.255974
3 │ c 1 0.0945194
4 │ a 2 0.621327
5 │ b 2 0.0908171
6 │ c 2 0.763769
7 │ a 3 0.342196
8 │ b 3 0.187913
9 │ c 3 0.972685
julia> wrapdims(df, :z, :x, :y)
2-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓ x ∈ 3-element Vector{Char}
→ y ∈ 3-element Vector{Int64}
And data, 3×3 Matrix{Float64}:
(1) (2) (3)
('a') 0.972967 0.621327 0.342196
('b') 0.255974 0.0908171 0.187913
('c') 0.0945194 0.763769 0.972685
But am far from an expert at these things. I don't know if this is efficient or not.
julia> (df = map(Iterators.product('a':'c', 1:3)) do (x, y)
(;x, y)
end[:] |> DataFrame); df.z = 1:9;df
9×3 DataFrame
Row │ x y z
│ Char Int64 Int64
─────┼────────────────────
1 │ a 1 1
2 │ b 1 2
3 │ c 1 3
4 │ a 2 4
5 │ b 2 5
6 │ c 2 6
7 │ a 3 7
8 │ b 3 8
9 │ c 3 9
julia> select(unstack(df, :x,:y,:z), Not(:x))
3×3 DataFrame
Row │ 1 2 3
│ Int64? Int64? Int64?
─────┼────────────────────────
1 │ 1 4 7
2 │ 2 5 8
3 │ 3 6 9
Nice, thanks guys. AxisKeys is a very slick solution. It's about an order of magnitude slower than @jar's solution, but also an order of magnitude faster than my attempt with filter
.
My problem with Jar's solution though is that it's not a dense matrix at the end, but rather a collection of vectors
The fastest method I've fouind so far is
let arr = Matrix{Float64}(undef, length(unique($df.x)), length(unique($df.y)))
gdf = groupby($df, :x)
for (i, sdf) ∈ enumerate(gdf)
arr[i, :] .= sdf.z
end
arr
end
but it's ugly and not very generic
How does something like Matrix{Float64}(unstack(df, :x,:y,:z)[:, 2:end])
perform?
Seems about as fast as Jar's solution
Thanks guys!
Last updated: Dec 28 2024 at 04:38 UTC