Suppose I have this dataframe:
julia> df = DataFrame(x = [1, 2, 3, 1, 2, 3, 1, 2], y = [:foo, :foo, :foo, :bar, :bar, :bar, :baz, :baz], z = rand(8))
8×3 DataFrame
Row │ x y z
│ Int64 Symbol Float64
─────┼─────────────────────────
1 │ 1 foo 0.467047
2 │ 2 foo 0.63518
3 │ 3 foo 0.310892
4 │ 1 bar 0.696665
5 │ 2 bar 0.206687
6 │ 3 bar 0.998533
7 │ 1 baz 0.642528
8 │ 2 baz 0.721937
What I want is to reduce this to just have the entries with maxmimal x
for each y
. E.g. I want the output to be
Row │ x y z
│ Int64 Symbol Float64
─────┼─────────────────────────
1 │ 3 foo 0.310892
2 │ 3 bar 0.998533
3 │ 2 baz 0.721937
I feel like this should be a rather basic operation, but I can't figure out the right way to do it.
The closest I can figure out would be
julia> groupby(df, :x)[end]
2×3 SubDataFrame
Row │ x y z
│ Int64 Symbol Float64
─────┼─────────────────────────
1 │ 3 foo 0.310892
2 │ 3 bar 0.998533
but because there is no entry for :baz
with x=3
, it doesn't get included, so that's not quite there.
I like SQL approach (so called window functions, which in DataFrames are just groupby
with combine
)
combine(groupby(df, :y), x -> last(x, 1))
For it to work properly, you need to sort over :x
at the beginning, of course.
Very nice, thank you. I had been having some trouble getting my head around combine
I have very simple mental image: groupby
splits dataframe into a vector of dataframes and combine
applies function f
to each dataframe in this vector and combines them together.
You can think of combine as combine(gdf, f) = vcat(map(f, gdf))
I do not think that it is made this way, but it's easier to understand.
Makes sense
Note that you can index the SubDataFrame
s in your GroupedDataFrame
just like normal DataFrame
s, so a more general solution which doesn't require sorting is:
julia> combine(groupby(df, :y), sdf -> sdf[argmax(sdf.x), :])
3×3 DataFrame
Row │ y x z
│ Symbol Int64 Float64
─────┼─────────────────────────
1 │ foo 3 0.215512
2 │ bar 3 0.896303
3 │ baz 2 0.97978
(this will of course return more than one row for y
groups in which the maximum value of x
occurs multiple times)
Nice, I like that. Thanks Nils and Andrey!
Mason Protter has marked this topic as resolved.
Nils said:
(this will of course return more than one row for
y
groups in which the maximum value ofx
occurs multiple times)
That's not true, right? Since argmax
always returns a single index. E.g. modifying the example a bit to be degenerate,
julia> df = DataFrame(x = [1, 2, 3, 1, 2, 3, 1, 2], y = [:foo, :foo, :foo, :bar, :bar, :bar, :baz, :baz], z = ones(8))
8×3 DataFrame
Row │ x y z
│ Int64 Symbol Float64
─────┼────────────────────────
1 │ 1 foo 1.0
2 │ 2 foo 1.0
3 │ 3 foo 1.0
4 │ 1 bar 1.0
5 │ 2 bar 1.0
6 │ 3 bar 1.0
7 │ 1 baz 1.0
8 │ 2 baz 1.0
julia> combine(groupby(df, :y), sdf -> sdf[argmax(sdf.x), :])
3×3 DataFrame
Row │ y x z
│ Symbol Int64 Float64
─────┼────────────────────────
1 │ foo 3 1.0
2 │ bar 3 1.0
3 │ baz 2 1.0
I think you're correct Eric, but I don't think your example demonstrates it
For a given y
group, there's no repeat x
values in your example
Rather I think the demonstration would be
julia> df = DataFrame(x = [1, 2, 3, 3, 1, 2, 3, 1, 2], y = [:foo, :foo, :foo, :foo, :bar, :bar, :bar, :baz, :baz], z = rand(9))
9×3 DataFrame
Row │ x y z
│ Int64 Symbol Float64
─────┼──────────────────────────
1 │ 1 foo 0.288488
2 │ 2 foo 0.722006
3 │ 3 foo 0.654092
4 │ 3 foo 0.262445
5 │ 1 bar 0.932314
6 │ 2 bar 0.0627638
7 │ 3 bar 0.856708
8 │ 1 baz 0.45854
9 │ 2 baz 0.233986
julia> combine(groupby(df, :y), sdf -> sdf[argmax(sdf.x), :])
3×3 DataFrame
Row │ y x z
│ Symbol Int64 Float64
─────┼─────────────────────────
1 │ foo 3 0.654092
2 │ bar 3 0.856708
3 │ baz 2 0.233986
Ah oops, thought we were maximizing z
, not x
Ah yes of course, it would have been combine(groupby(df, :y), sdf -> sdf[sdf.x .== argmax(sdf.x), :])
to return multiple maxima. Migh actually be the better way to write this to guard against duplicates silently vanishing
Last updated: Dec 28 2024 at 04:38 UTC