Replacing 'nothing' with NaN · helpdesk (published)

Stream: helpdesk (published)

Topic: Replacing 'nothing' with NaN

qu bit (Mar 01 2021 at 11:32):

Hello:

I applied tryparse (via broadcast) and coalesce
to my dataframe. Now the missing cells read 'nothing', so
the column no longer has a single datatype.

I would like to convert the cells that read 'nothing' to 'NaN'.

How might I achieve this?

Jeffrey Sarnoff (Mar 01 2021 at 13:04):

A very similar question came up on Slack last night.

using DataFrames
a = [1.0, nothing, 3.0]
b = [nothing, 20.0, 30.0]
df = DataFrame(a = a, b = b)
# ok, now df is your data frame with `nothings`
# and you want them to be NaNs

function nan_all_nothings(x)
  x[isnothing.(x)] .= NaN
  return
end

nan_all_nothings(df.a)
nan_all_nothings(df.b)

julia> df
3×2 DataFrame
 Row │ a       b
     │ Union…  Union…
─────┼────────────────
   1 │ 1.0     NaN
   2 │ NaN     20.0
   3 │ 3.0     30.0
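
The `Union…` header in the output above suggests the columns keep their `Union{Nothing, Float64}` element type after the in-place assignment. A minimal sketch of a non-mutating alternative, assuming every non-`nothing` entry is already a `Float64`; the helper name `nothing_to_nan` is made up for this illustration and is not from the thread. It uses Base's `something` to build a fresh `Float64` vector for each column:

```julia
using DataFrames

df = DataFrame(a = [1.0, nothing, 3.0], b = [nothing, 20.0, 30.0])

# Hypothetical helper: returns a new Vector{Float64} with NaN where `nothing` was.
nothing_to_nan(col) = Float64[something(x, NaN) for x in col]

# Replace each column with its converted copy.
for name in names(df)
    df[!, name] = nothing_to_nan(df[!, name])
end

eltype(df.a)  # Float64
```

To restrict this to a subset of columns, the loop could iterate over a slice of `names(df)` instead.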

qu bit (Mar 01 2021 at 16:19):

Good day Jeffrey,

This worked! I would like to streamline the
code by applying the function you
created to each column via eachcol(df). I attempted:

nan_for_nothings(df[!,7:11])

But I am getting an index error. Any additional
tips?

Thanks again,

Nils (Mar 01 2021 at 17:14):

I guess this is related to your other question around NaNs as well. When working with DataFrames that have missing observations, you should really be using missing instead of nothing or NaN. That's what missing is for, and why passmissing and skipmissing exist.

qu bit (Mar 01 2021 at 18:05):

Hi Nils,

I was able to implement:

for col in eachcol(ED4)
    replace!(col, NaN => 0)
end

This approach helped to address the NaN
fill issue I was having when I applied the
describe method to the dataframe.

Nils (Mar 01 2021 at 20:07):

Yes, that works, but again missings would be more natural: they can be used with coalesce, which is built for exactly this use case.
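
A small sketch of what this looks like on a single vector, using only Base functions; the sample values are made up for illustration:

```julia
col = [1.0, missing, 3.0]        # Vector{Union{Missing, Float64}}

# coalesce returns its first non-missing argument;
# broadcasting applies it element by element.
filled = coalesce.(col, 0.0)

# skipmissing lets a reduction ignore missing entries without replacing them.
total = sum(skipmissing(col))
```

Here `filled` holds a default where the value was missing, while `total` is computed from the present values only.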

Jeffrey Sarnoff (Mar 01 2021 at 20:28):

@qu bit I agree with @Nils , use missing

nothing_is_missing(x) = x[isnothing.(x)] .= missing

The dataframe's columns of interest need to allow values of type Missing.
Ask someone who knows about manipulating DataFrames.

map(nothing_is_missing, eachcol(df[!, 7:11]));

qu bit (Mar 01 2021 at 21:20):

Thank you Jeffrey. I attempted to apply your principle to converting
the same column set to Float64 using:

map(convert(DataFrame{Float64,1}, eachcol(ED4[!, 7:11])))

I get the following error:
TypeError: in Type{...} expression, expected UnionAll, got Type{DataFrames.DataFrame}

If I use Array{Float64,1} instead, the error message is:
MethodError: Cannot convert an object of type DataFrames.DataFrameColumns{DataFrames.DataFrame} to an object of type Array{Float64,1}

Any suggestions?
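
Both errors seem to come from passing the whole `eachcol(...)` iterator to `convert`, which expects one value to convert at a time (and `DataFrame` takes no type parameters). A minimal sketch of converting column by column, assuming every entry is already numeric; the small frame below is a made-up stand-in for `ED4`, whose contents are not shown in the thread:

```julia
using DataFrames

# Stand-in data; the real ED4 and its columns 7:11 are not available here.
df = DataFrame(x = Any[1.0, 2.0], y = Union{Nothing, Float64}[3.0, 4.0])

for name in names(df)
    # convert handles one vector at a time and throws if an entry
    # (for example a leftover `nothing`) cannot become a Float64.
    df[!, name] = convert(Vector{Float64}, df[!, name])
end
```

For a column range such as 7:11, the loop could run over `names(df)[7:11]` instead of all names.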

Jeffrey Sarnoff (Mar 01 2021 at 23:52):

The dataframe's columns of interest need to allow values of type Missing.
Ask someone who knows about manipulating DataFrames.

qu bit (Mar 01 2021 at 23:55):

Hi Jeffrey,

I was able to solve it. The 'read' method in the CSV module has a
parameter called missingstrings, which I set to '---'. Problem and coding
averted!
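
A sketch of this approach, under the assumption that CSV.jl is installed and that the singular keyword `missingstring` is accepted by the installed version (the thread uses the spelling `missingstrings`; which spelling applies depends on the CSV.jl release). The sample data is made up:

```julia
using CSV, DataFrames

# Made-up sample; "---" marks an absent value, as in the thread.
data = """
a,b
1.0,---
---,20.0
3.0,30.0
"""

# Cells equal to "---" are read as `missing`, so the columns can
# still be parsed as numbers rather than strings.
df = CSV.read(IOBuffer(data), DataFrame; missingstring = "---")
```

Reading the sentinel as `missing` at load time avoids the later tryparse and replacement steps altogether.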

Jeffrey Sarnoff (Mar 02 2021 at 00:22):

great -- meanwhile you can improve this (if it helps)

julia> using DataFrames

julia> a = [1.0, nothing, 3.0, 4.0];

julia> b = [1.0, nothing, 3.0, nothing];

julia> c = [1.0, 2.0, nothing, 4.0];

julia> df = DataFrame(a=a, b=b, c=c);

julia> nothing_is_missing(x) = map(y->(isnothing(y) ? missing : y), x);

julia> df2 = similar(df);

julia> for colidx in 1:size(df)[2]
          df2[!, colidx] = nothing_is_missing(df[!, colidx])
       end;

julia> df
4×3 DataFrame
 Row │ a       b       c
     │ Union…  Union…  Union…
─────┼────────────────────────
   1 │ 1.0     1.0     1.0
   2 │                 2.0
   3 │ 3.0     3.0
   4 │ 4.0             4.0

julia> df2
4×3 DataFrame
 Row │ a          b          c
     │ Float64?   Float64?   Float64?
─────┼─────────────────────────────────
   1 │       1.0        1.0        1.0
   2 │ missing    missing          2.0
   3 │       3.0        3.0  missing
   4 │       4.0  missing          4.0

Last updated: Oct 02 2023 at 04:34 UTC
