Stream: helpdesk (published)

Topic: Doing statistics with missing values


Moorits Muru (Feb 26 2022 at 12:21):

I have been trying out statistical analysis with Julia and have had difficulties working with missing values. For example, I have a DataFrame with several columns; some have missing values, some don't. I don't want to remove all the rows with missing values, only omit them when actually analysing the column that has them.

What is the intended way to deal with the missing values? For example, if I am doing histograms with StatsPlots.jl then, as expected, I get an error when trying to plot a vector with missing values. If I use the skipmissing function on the data before plotting, I get the error type Base.SkipMissing{Vector{Union{Missing, Float64}}} is not supported. So my solution is to use collect(skipmissing(...)), which works but doesn't feel right. Is there anything better? I have the same problem with HypothesisTests.jl.

I've looked at the issues for StatsBase.jl and some others and see that there is a bigger issue with the skipmissing function. My question is, is there a better workaround than collect(skipmissing(...))?

jar (Feb 26 2022 at 18:35):

You want dropmissing rather than skipmissing

Moorits Muru (Feb 27 2022 at 12:35):

If I understand correctly, then this only works with a DataFrame, not with a vector, and the output is a new dataframe (or overwriting the original one). I have columns with different rows that have missing values and only wish to remove the missing values from the columns that I am currently working with. This means that I have to make new DataFrames for each different column that has missing values, so I wouldn't remove too much data. Maybe I am using DataFrames inefficiently, but the solution I was thinking of is something that I could apply to vectors as I extract the columns from the DataFrame.

jar (Feb 27 2022 at 18:35):

You can specify which columns using dropmissing(df, [:a, :b, :c])
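A minimal sketch of the column-restricted form, assuming DataFrames.jl is installed. The frame `df` and its values are made up for illustration. It also shows that a single column can be cleaned as a plain vector without building a new DataFrame:

```julia
using DataFrames

# Hypothetical data: missing values sit in different rows of :a and :b.
df = DataFrame(a = [1.0, missing, 3.0],
               b = [missing, 2.0, 4.0],
               c = [10, 20, 30])

# Drop only the rows where :a is missing; :b is left untouched.
only_a = dropmissing(df, :a)

# Drop rows where any column is missing (loses more data).
all_clean = dropmissing(df)

# Work on one column as a vector, leaving df itself unchanged.
a_values = collect(skipmissing(df.a))
```

Here `only_a` keeps two rows and still has a missing entry in `b`, while `all_clean` keeps only the last row.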

Moorits Muru (Feb 27 2022 at 22:23):

I would prefer if I could apply the function to vectors, but I guess I'll make do with dropmissing and collect(skipmissing(...)). Thanks for your help!

jar (Feb 27 2022 at 22:24):

There's also filter(!ismissing, x)
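One difference between the two vector-level options is worth a quick sketch (the vector `x` is made up for illustration): both remove the missing entries, but `filter` keeps the original element type, whereas `collect(skipmissing(x))` narrows it. Some functions only accept the narrowed type.

```julia
x = [1.0, missing, 3.0, missing, 5.0]   # Vector{Union{Missing, Float64}}

a = collect(skipmissing(x))   # values without missing, element type Float64
b = filter(!ismissing, x)     # same values, element type still allows missing

# If a function insists on Vector{Float64}, b can be narrowed explicitly.
b_narrow = convert(Vector{Float64}, b)
```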

Felix Cremer (Feb 28 2022 at 08:05):

You can also have a look at the Missings.jl package.

Nils (Feb 28 2022 at 10:13):

collect(skipmissing(x)) is the right way of doing it in StatsPlots. skipmissing returns a lazy iterator for efficiency, so reductions like sum(skipmissing(x)) don't have to allocate a new array, but plotting needs an actual vector, so the iterator won't work there. Another thing you might consider is replacing missings with a sort of "sentinel value" that shows up as a separate category in the histogram by doing coalesce.(x, val).
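A short sketch of the three patterns mentioned here, using only Base and the Statistics standard library. The data and the sentinel value -1.0 are arbitrary choices for illustration:

```julia
using Statistics

x = [1.0, missing, 2.0, missing, 6.0]

# Reductions propagate missing unless it is skipped.
raw_total = sum(x)                 # missing
total = sum(skipmissing(x))        # no intermediate array is allocated
avg = mean(skipmissing(x))

# Materialise a plain vector for functions that need one (e.g. plotting).
values_only = collect(skipmissing(x))

# Replace missing with a sentinel instead of dropping it.
filled = coalesce.(x, -1.0)
```

The sentinel should lie outside the range of the real data, otherwise it cannot be told apart from genuine observations in the histogram.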


Last updated: Oct 02 2023 at 04:34 UTC