So, I'm facing a very strange bug with DataFrames.jl, and I don't know what might be causing it. I have a dataframe with 778 columns and 31k rows. As soon as I call unique(df), everything freezes and the CPU usage goes through the roof. I'm using Julia 1.7.2 and DataFrames v1.3.2.
Not sure that's a bug - just a lot of work for unique to perform? What happens if you do combine(groupby(df, names(df)), nrow)?
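For reference, here is a minimal sketch of that comparison on a tiny made-up table (the column names a and b and their values are placeholders, not the original data). Grouping on every column and counting rows should yield one group per distinct row, so the group count can be checked against what unique returns:

```julia
using DataFrames

# Small synthetic stand-in for the real table (the original has 778 columns and 31k rows).
df = DataFrame(a = [1, 1, 2, 2, 3], b = ["x", "x", "y", "z", "z"])

# Group on every column and count the rows in each group;
# each group corresponds to one distinct row of df.
counts = combine(groupby(df, names(df)), nrow)

# Deduplicate directly for comparison.
deduped = unique(df)

# Both approaches should report the same number of distinct rows.
@assert nrow(counts) == nrow(deduped)
```

If the groupby version finishes quickly on the real data while unique does not, that would point at unique specifically rather than at the data being expensive to compare.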
unique is not a cheap operation. Note also that you might be hitting a hashing catastrophe; I've been burned by that a couple of times. What I mean by that is that the default hashing algorithm gets expensive for large or deeply nested objects. Do you have any objects in your dataframe that might be very expensive to hash? (I've sometimes run into that problem with DataFrames themselves, which tend to be quite expensive to hash.)
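One rough way to look for such columns, sketched on a made-up table (the columns id, name and payload are hypothetical), is to list the element type of each column and flag the ones whose type is not concrete. This is only a heuristic: a concretely typed column of large arrays can still be expensive to hash, so timing hash on a suspect column is a useful follow-up.

```julia
using DataFrames

# Hypothetical table: two plain columns and one column holding arbitrary objects.
df = DataFrame(id = 1:3,
               name = ["a", "b", "c"],
               payload = Any[[1, 2], Dict(:k => 1), "s"])

# Print each column name with its element type.
for n in names(df)
    println(rpad(n, 10), eltype(df[!, n]))
end

# Flag columns whose element type (ignoring Missing) is not concrete.
suspect = [n for n in names(df)
           if !isconcretetype(nonmissingtype(eltype(df[!, n])))]

# Time hashing each flagged column; the first call also includes compilation time.
for n in suspect
    t = @elapsed hash(df[!, n])
    println(n, ": ", t, " s")
end
```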
I'd imagine this could easily go wrong with plenty of String values in the data and the ensuing GC pressure, so maybe InlineStrings can help.
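A minimal sketch of that idea, assuming the InlineStrings.jl package is installed and that every string in the column fits in 15 bytes (the table and the city column are made up for illustration). Whether this actually speeds up unique on the original data is untested here:

```julia
using DataFrames, InlineStrings

# Hypothetical table with a short-string column and one duplicated row.
df = DataFrame(id = [1, 2, 1, 3], city = ["Oslo", "Lima", "Oslo", "Rome"])

# Convert the String column to fixed-width String15 values (at most 15 bytes each).
# These are stored inline in the column rather than as separately allocated Strings.
df.city = String15.(df.city)

# The column now holds an isbits type, which the GC does not track per element.
@assert eltype(df.city) == String15
@assert isbitstype(String15)

deduped = unique(df)
```

Strings longer than 15 bytes would need a wider type such as String31 or String63, or would have to stay as ordinary String values.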
Last updated: Nov 06 2024 at 04:40 UTC