Stream: helpdesk (published)

Topic: Strange bug with DataFrames.jl `unique` function


Davi Sales Barreira (Jul 26 2022 at 11:05):

So, I'm facing a very strange bug with DataFrames.jl. I don't know what might be causing it. I have a dataframe with 778 columns and 31k rows. Once I call `unique(df)`, everything freezes and the CPU usage goes through the roof. I'm using Julia 1.7.2 and DataFrames v1.3.2.

Nils (Jul 27 2022 at 12:23):

Not sure that's a bug - just a lot of work for `unique` to perform? What happens if you do `combine(groupby(df, names(df)), nrow)`?
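For reference, a minimal sketch of the grouping call suggested above, run on a small made-up table (the column names `a` and `b` and their values are hypothetical, not the data from the thread). Grouping on every column yields one group per distinct row, so the result should have as many rows as `unique(df)` returns, plus an `nrow` column with the count of each distinct row.

```julia
using DataFrames

# Hypothetical toy data; the table in the thread has 778 columns and about 31k rows.
df = DataFrame(a = [1, 1, 2, 2, 3], b = ["x", "x", "y", "z", "z"])

# Group on all columns and count the rows in each group.
# Each group corresponds to one distinct row of df.
counts = combine(groupby(df, names(df)), nrow)

# Deduplicate directly for comparison.
deduped = unique(df)

# Both approaches should agree on the number of distinct rows.
println(nrow(counts) == nrow(deduped))
```

Wrapping each call in `@time` on the real data would show whether the grouping route finishes where `unique` appears to hang; the thread does not report the outcome.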

Expanding Man (Jul 27 2022 at 14:35):

`unique` is not a cheap operation. Note also that you might be hitting a hashing catastrophe; I've been burned by that a couple of times. What I mean is that the default hashing algorithm gets expensive for large or deeply nested objects. Do you have any objects in your dataframe that might be very expensive to hash? (I've sometimes run into that problem with DataFrames themselves, which tend to be quite expensive to hash.)
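One rough way to check for this is to time `hash` on a sample value from each column. The sketch below uses only Base Julia and two made-up values (`small` and `nested` are hypothetical stand-ins for a cheap cell and an expensive one); it assumes that deduplication has to hash the contents of each cell, which is how the comment above reads.

```julia
# Hypothetical cell values: a plain integer and a nested container.
small = 1
nested = [rand(1000) for _ in 1:1000]   # a vector of 1000 vectors

# Call hash once first so that compilation time is not included below.
hash(small); hash(nested)

# Compare the cost of hashing each value.
@time hash(small)
@time hash(nested)

# Containers hash by content, so an equal copy gives the same hash.
same_hash = hash(nested) == hash(deepcopy(nested))
```

If a column holds values like `nested`, hashing every row can dominate the runtime, and it may be worth excluding that column from the comparison or replacing it with a cheaper key.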

Nils (Jul 29 2022 at 13:27):

I'd imagine this could easily go wrong with plenty of `String` values in the data and the ensuing GC pressure, so maybe InlineStrings can help.
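A minimal sketch of that idea, assuming the string columns hold short values (at most 15 bytes here, so `String15` fits). The table and its column names are hypothetical, and whether this actually resolves the freeze is not confirmed in the thread.

```julia
using DataFrames, InlineStrings

# Hypothetical toy data with a String column and two duplicated rows.
df = DataFrame(id = [1, 2, 1, 2], label = ["ab", "cd", "ab", "cd"])

# Convert the column to a fixed-width inline string type.
# String15 stores up to 15 bytes inline and is an isbits type,
# so the values are not separately heap-allocated.
df.label = String15.(df.label)

# Deduplicate as before; only the column's element type has changed.
deduped = unique(df)
```

Longer values need a wider type such as `String31` or `String63`, and converting a value that does not fit the chosen type throws an error.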


Last updated: Nov 22 2024 at 04:41 UTC