✔ get rows of dataframes which are maximal in some value · helpdesk (published)

julia> df = DataFrame(x = [1, 2, 3, 1, 2, 3, 1, 2], y = [:foo, :foo, :foo, :bar, :bar, :bar, :baz, :baz], z = rand(8))
8×3 DataFrame
 Row │ x      y       z
     │ Int64  Symbol  Float64
─────┼─────────────────────────
   1 │     1  foo     0.467047
   2 │     2  foo     0.63518
   3 │     3  foo     0.310892
   4 │     1  bar     0.696665
   5 │     2  bar     0.206687
   6 │     3  bar     0.998533
   7 │     1  baz     0.642528
   8 │     2  baz     0.721937

What I want is to reduce this to just have the entries with maxmimal x for each y. E.g. I want the output to be

 Row │ x      y       z
     │ Int64  Symbol  Float64
─────┼─────────────────────────
   1 │     3  foo     0.310892
   2 │     3  bar     0.998533
   3 │     2  baz     0.721937

I feel like this should be a rather basic operation, but I can't figure out the right way to do it.

Mason Protter (Jan 24 2022 at 18:20):

julia> groupby(df, :x)[end]
2×3 SubDataFrame
 Row │ x      y       z
     │ Int64  Symbol  Float64
─────┼─────────────────────────
   1 │     3  foo     0.310892
   2 │     3  bar     0.998533

but because there is no entry for :baz with x=3, it doesn't get included, so that's not quite there.

Kwaku Oskin (Jan 24 2022 at 18:26):

I like SQL approach (so called window functions, which in DataFrames are just groupby with combine)

combine(groupby(df, :y), x -> last(x, 1))

Kwaku Oskin (Jan 24 2022 at 18:28):

For it to work properly, you need to sort over :x at the beginning, of course.

Mason Protter (Jan 24 2022 at 18:34):

Very nice, thank you. I had been having some trouble getting my head around combine

Kwaku Oskin (Jan 24 2022 at 18:39):

I have very simple mental image: groupby splits dataframe into a vector of dataframes and combine applies function f to each dataframe in this vector and combines them together.

You can think of combine as combine(gdf, f) = vcat(map(f, gdf)) I do not think that it is made this way, but it's easier to understand.

Mason Protter (Jan 24 2022 at 18:47):

Nils (Jan 25 2022 at 10:47):

Note that you can index the SubDataFrames in your GroupedDataFrame just like normal DataFrames, so a more general solution which doesn't require sorting is:

julia> combine(groupby(df, :y), sdf -> sdf[argmax(sdf.x), :])
3×3 DataFrame
 Row │ y       x      z
     │ Symbol  Int64  Float64
─────┼─────────────────────────
   1 │ foo         3  0.215512
   2 │ bar         3  0.896303
   3 │ baz         2  0.97978

Nils (Jan 25 2022 at 10:48):

(this will of course return more than one row for y groups in which the maximum value of x occurs multiple times)

Mason Protter (Jan 26 2022 at 00:12):

Notification Bot (Jan 26 2022 at 00:12):

Eric Hanson (Jan 28 2022 at 00:20):

That's not true, right? Since argmax always returns a single index. E.g. modifying the example a bit to be degenerate,

julia> df = DataFrame(x = [1, 2, 3, 1, 2, 3, 1, 2], y = [:foo, :foo, :foo, :bar, :bar, :bar, :baz, :baz], z = ones(8))
8×3 DataFrame
 Row │ x      y       z
     │ Int64  Symbol  Float64
─────┼────────────────────────
   1 │     1  foo         1.0
   2 │     2  foo         1.0
   3 │     3  foo         1.0
   4 │     1  bar         1.0
   5 │     2  bar         1.0
   6 │     3  bar         1.0
   7 │     1  baz         1.0
   8 │     2  baz         1.0

julia> combine(groupby(df, :y), sdf -> sdf[argmax(sdf.x), :])
3×3 DataFrame
 Row │ y       x      z
     │ Symbol  Int64  Float64
─────┼────────────────────────
   1 │ foo         3      1.0
   2 │ bar         3      1.0
   3 │ baz         2      1.0

Mason Protter (Jan 28 2022 at 00:46):

Mason Protter (Jan 28 2022 at 00:47):

Mason Protter (Jan 28 2022 at 00:48):

julia> df = DataFrame(x = [1, 2, 3, 3, 1, 2, 3, 1, 2], y = [:foo, :foo, :foo, :foo, :bar, :bar, :bar, :baz, :baz], z = rand(9))
9×3 DataFrame
 Row │ x      y       z
     │ Int64  Symbol  Float64
─────┼──────────────────────────
   1 │     1  foo     0.288488
   2 │     2  foo     0.722006
   3 │     3  foo     0.654092
   4 │     3  foo     0.262445
   5 │     1  bar     0.932314
   6 │     2  bar     0.0627638
   7 │     3  bar     0.856708
   8 │     1  baz     0.45854
   9 │     2  baz     0.233986

julia> combine(groupby(df, :y), sdf -> sdf[argmax(sdf.x), :])
3×3 DataFrame
 Row │ y       x      z
     │ Symbol  Int64  Float64
─────┼─────────────────────────
   1 │ foo         3  0.654092
   2 │ bar         3  0.856708
   3 │ baz         2  0.233986

Eric Hanson (Jan 28 2022 at 00:56):

Nils (Jan 28 2022 at 12:34):

Ah yes of course, it would have been combine(groupby(df, :y), sdf -> sdf[sdf.x .== argmax(sdf.x), :]) to return multiple maxima. Migh actually be the better way to write this to guard against duplicates silently vanishing

Stream: helpdesk (published)

Topic: ✔ get rows of dataframes which are maximal in some value

Mason Protter (Jan 24 2022 at 18:20):

Mason Protter (Jan 24 2022 at 18:20):

Kwaku Oskin (Jan 24 2022 at 18:26):

Kwaku Oskin (Jan 24 2022 at 18:28):

Mason Protter (Jan 24 2022 at 18:34):

Kwaku Oskin (Jan 24 2022 at 18:39):

Mason Protter (Jan 24 2022 at 18:47):

Nils (Jan 25 2022 at 10:47):

Nils (Jan 25 2022 at 10:48):

Mason Protter (Jan 26 2022 at 00:12):

Notification Bot (Jan 26 2022 at 00:12):

Eric Hanson (Jan 28 2022 at 00:20):

Mason Protter (Jan 28 2022 at 00:46):

Mason Protter (Jan 28 2022 at 00:47):

Mason Protter (Jan 28 2022 at 00:48):

Eric Hanson (Jan 28 2022 at 00:56):

Nils (Jan 28 2022 at 12:34):