✔ Doing statistics with missing values · helpdesk (published)


Moorits Muru (Feb 26 2022 at 12:21):

I have been trying out statistical analysis with Julia and have had difficulties working with missing values. For example, I have a DataFrame in which some columns have missing values and some don't. I don't want to remove every row that contains a missing value, only to omit the missing entries when actually analysing the column that has them.

What is the intended way to deal with missing values? For example, if I make histograms with StatsPlots.jl then, as expected, I get an error when trying to plot a vector with missing values. If I apply the skipmissing function to the data before plotting, I get the error "type Base.SkipMissing{Vector{Union{Missing, Float64}}} is not supported". So my solution is to use collect(skipmissing(...)), which works but doesn't feel right. Is there anything better? I have the same problem with HypothesisTests.jl.

I've looked at the issues for StatsBase.jl and some others and see that there is a bigger issue with the skipmissing function. My question is, is there a better workaround than collect(skipmissing(...))?

jar (Feb 26 2022 at 18:35):

You want dropmissing rather than skipmissing

Moorits Muru (Feb 27 2022 at 12:35):

If I understand correctly, this only works with a DataFrame, not with a vector, and the output is a new DataFrame (or it overwrites the original one). In my data the missing values are in different rows for different columns, and I only wish to remove them from the columns that I am currently working with. This means that I would have to make a new DataFrame for each column that has missing values, so that I don't remove too much data. Maybe I am using DataFrames inefficiently, but the solution I was thinking of is something that I could apply to vectors as I extract the columns from the DataFrame.

jar (Feb 27 2022 at 18:35):

You can specify which columns using dropmissing(df, [:a, :b, :c])
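A minimal sketch of the column-restricted form, assuming DataFrames.jl is installed. The column names and values are made up for illustration; only rows with a missing value in the listed columns are dropped, and the original DataFrame is not modified.

```julia
using DataFrames

# Hypothetical data: missing values sit in different rows of different columns.
df = DataFrame(a = [1.0, missing, 3.0, 4.0],
               b = [missing, 2.0, 3.0, 4.0],
               c = [1, 2, 3, 4])

# Drop only the rows where :a is missing; missing values in :b are kept.
df_a = dropmissing(df, :a)

# Drop the rows where either :a or :b is missing.
df_ab = dropmissing(df, [:a, :b])

# df itself is unchanged; dropmissing! would modify it in place instead.
```

By default the listed columns in the result no longer allow missing values, so df_a.a is a plain Vector{Float64} here.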

Moorits Muru (Feb 27 2022 at 22:23):

I would prefer if I could apply the function to vectors, but I guess I'll make do with dropmissing and collect(skipmissing(...)). Thanks for your help!

jar (Feb 27 2022 at 22:24):

There's also filter(!ismissing, x)
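A small sketch of how this compares with collect(skipmissing(x)), using only Base Julia and a made-up vector. Both give the same values, but filter keeps the element type of the input, so the result can still be typed as allowing missing values, which may matter for functions that check the element type.

```julia
# Hypothetical input vector with two missing entries.
x = [1.5, missing, 2.5, missing, 4.0]

# filter keeps the element type of x: Union{Missing, Float64}.
kept = filter(!ismissing, x)

# collect(skipmissing(x)) narrows the element type to Float64.
collected = collect(skipmissing(x))

# Same values either way; only the element type differs.
same_values = kept == collected
```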

Felix Cremer (Feb 28 2022 at 08:05):

You can also have a look at the Missings.jl package.

Nils (Feb 28 2022 at 10:13):

collect(skipmissing(x)) is the right way of doing it in StatsPlots. skipmissing returns a lazy iterator for efficiency, so reduction operations like sum(skipmissing(x)) don't have to allocate a new array, but an iterator won't work for plotting. Another thing you might consider is replacing the missings with a "sentinel value" that shows up as a separate category in the histogram, by doing coalesce.(x, val).
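The three options mentioned here can be sketched in Base Julia as follows. The vector and the sentinel value -1.0 are arbitrary choices for illustration, not a recommendation for any particular dataset.

```julia
# Hypothetical input vector with two missing entries.
x = [1.0, missing, 3.0, missing, 5.0]

# A reduction can consume the lazy iterator directly, with no intermediate array.
total = sum(skipmissing(x))

# Code that needs a real array gets a materialized copy without the missings.
clean = collect(skipmissing(x))

# Alternatively, keep every position and replace missings with a sentinel value.
sentinel = -1.0   # arbitrary choice for this sketch
filled = coalesce.(x, sentinel)
```

With the sentinel approach the length of the vector is preserved, so the result still lines up row by row with the other columns.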

Moorits Muru (Feb 28 2022 at 20:41):

Thanks for the tips everybody!

Notification Bot (Feb 28 2022 at 20:41):

Moorits Muru has marked this topic as resolved.

Last updated: Aug 14 2025 at 04:51 UTC