I have been trying out statistical analysis with Julia and have had difficulties working with missing values. For example, I have a DataFrame in which some columns have missing values and some don't. I don't want to remove every row that contains a missing value, only to omit the missing entries when actually analysing the column that has them.
What is the intended way to deal with the missing values? For example, if I am doing histograms with StatsPlots.jl then, as expected, I get an error when trying to plot a vector with missing values. If I use the skipmissing function on the data before plotting, I get the error "type Base.SkipMissing{Vector{Union{Missing, Float64}}} is not supported". So my solution is to use collect(skipmissing(...)), which works but doesn't feel right. Is there anything better? I have the same problem with HypothesisTests.jl.
I've looked at the issues for StatsBase.jl and some others and see that there is a bigger issue with the skipmissing function. My question is, is there a better workaround than collect(skipmissing(...))?
You want dropmissing rather than skipmissing.
If I understand correctly, then this only works with a DataFrame, not with a vector, and the output is a new DataFrame (or it overwrites the original one). Different columns have missing values in different rows, and I only wish to remove the missing values from the columns that I am currently working with. This means that I would have to make a new DataFrame for each column that has missing values, so that I don't remove too much data. Maybe I am using DataFrames inefficiently, but the solution I was thinking of is something that I could apply to vectors as I extract the columns from the DataFrame.
You can specify which columns using dropmissing(df, [:a, :b, :c])
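A minimal sketch of the column-restricted form, assuming DataFrames.jl is installed. The table and its column names are made up for illustration, and only one column is passed here instead of three:

```julia
using DataFrames

# Hypothetical table: :a and :b have missings in different rows.
df = DataFrame(a = [1.0, missing, 3.0, 4.0],
               b = [missing, 2.0, 3.0, 4.0])

# Drop only the rows where :a is missing; missings in :b are left alone.
only_a = dropmissing(df, :a)

nrow(only_a)        # 3 rows remain; df itself still has 4
eltype(only_a.a)    # Float64, since dropmissing narrows the listed columns by default
```

Because the result is an ordinary DataFrame, a clean vector for plotting can be pulled out directly with something like dropmissing(df, :a).a, though that still allocates a temporary table.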
I would prefer if I could apply the function to vectors, but I guess I'll make do with dropmissing and collect(skipmissing(...)). Thanks for your help!
There's also filter(!ismissing, x)
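For comparison, a small sketch of how this differs from collect(skipmissing(x)) on a plain vector, using only Base functions and made-up data. Both drop the missing entries, but as far as I know filter keeps the original element type, which may matter for functions that reject Union{Missing, ...} vectors:

```julia
x = [1.5, missing, 2.5]   # made-up data for illustration

# filter removes the missing entries but preserves the element type of x.
kept = filter(!ismissing, x)
eltype(kept)        # Union{Missing, Float64}

# collect(skipmissing(x)) also narrows the element type.
narrowed = collect(skipmissing(x))
eltype(narrowed)    # Float64
```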
You can also have a look at the Missings.jl package.
collect(skipmissing(x)) is the right way of doing it in StatsPlots. skipmissing returns an iterator for efficiency, so you don't have to allocate in reduction operations like sum(skipmissing(x)), but that won't work for plotting. Another thing you might consider is replacing missings with a sort of "sentinel value" that shows up as a separate category in the histogram by doing coalesce.(x, val).
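The three options above can be sketched on a plain vector using only Base functions. The data and the sentinel value are assumptions chosen for illustration:

```julia
# Hypothetical column with gaps; the values are made up.
x = [1.5, missing, 2.5, missing, 4.0]

# skipmissing wraps x lazily: no new array is allocated, which suits reductions.
itr = skipmissing(x)
total = sum(itr)                  # 8.0

# Functions that need a real array (e.g. plotting) get a materialized copy.
# The element type narrows from Union{Missing, Float64} to Float64.
clean = collect(itr)              # [1.5, 2.5, 4.0]

# Alternative: keep every entry but swap missings for a sentinel value.
sentinel = -1.0                   # assumed sentinel; pick one outside the data range
filled = coalesce.(x, sentinel)   # [1.5, -1.0, 2.5, -1.0, 4.0]
```

The clean vector is what you would hand to a histogram call, while filled keeps the original length so the missing entries stay visible as their own bin.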
Thanks for the tips everybody!
Moorits Muru has marked this topic as resolved.
Last updated: Nov 06 2024 at 04:40 UTC