Stream: helpdesk (published)

Topic: Replacing 'nothing' with NaN


QuBit (Mar 01 2021 at 11:32):

Hello:

I applied tryparse (via broadcasting) and coalesce
to my dataframe. Now the missing cells read 'nothing', so
the column no longer has a concrete datatype.

I would like to convert the cells that read 'nothing' to 'NaN'.

How might I achieve this?

Jeffrey Sarnoff (Mar 01 2021 at 13:04):

A very similar question came up on Slack last night.

using DataFrames
a = [1.0, nothing, 3.0]
b = [nothing, 20.0, 30.0]
df = DataFrame(a = a, b = b)
# ok, now df is your data frame with `nothings`
# and you want them to be NaNs

function nan_all_nothings(x)
  x[isnothing.(x)] .= NaN
  return
end

nan_all_nothings(df.a)
nan_all_nothings(df.b)

julia> df
3×2 DataFrame
 Row │ a       b
     │ Union…  Union…
─────┼────────────────
   1 │ 1.0     NaN
   2 │ NaN     20.0
   3 │ 3.0     30.0
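
For context, the nothing values in the question come from tryparse, which returns nothing when a string cannot be parsed. A minimal sketch of substituting a fallback at that step, assuming the raw column is a vector of strings (the name raw and its contents are made up for illustration):

```julia
# Hypothetical raw string column; "---" stands in for an unparsable cell.
raw = ["1.0", "---", "3.0"]

# Unparsable entries become `nothing`.
parsed = tryparse.(Float64, raw)

# `something` returns its first argument that is not `nothing`,
# so the second argument acts as the fallback value.
with_nan     = something.(parsed, NaN)
with_missing = something.(parsed, missing)
```

Doing the substitution here means the data frame never holds nothing in the first place.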

QuBit (Mar 01 2021 at 16:19):

Good day Jeffrey,

This worked! I would like to streamline the
code by applying the function you created
to eachcol(df). I attempted:

nan_for_nothings(df[!,7:11])

But I am getting an index error. Any additional
tips?

Thanks again,
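
One likely cause of the error is that the function expects a single column vector, while df[!,7:11] is a whole data frame (the call also uses the name nan_for_nothings rather than nan_all_nothings). A sketch of applying the function column by column, using a small stand-in data frame and columns 1:2 in place of 7:11:

```julia
using DataFrames

# Stand-in for the real data; the third column is left untouched.
df = DataFrame(a = [1.0, nothing, 3.0],
               b = [nothing, 20.0, 30.0],
               c = ["x", "y", "z"])

function nan_all_nothings(x)
    x[isnothing.(x)] .= NaN
    return
end

# df[!, 1:2] shares its column vectors with df, so the in-place
# update is visible in df itself.
foreach(nan_all_nothings, eachcol(df[!, 1:2]))
```

foreach is used instead of map because the function is called only for its side effect.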

Nils (Mar 01 2021 at 17:14):

I guess this is related to your other question around NaNs as well: when working with DataFrames that have missing observations, you should really be using missing instead of nothing or NaN. That's what missing is for, and why passmissing and skipmissing exist.
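
A short sketch of the two helpers mentioned here, assuming the Missings package is installed (passmissing lives there; skipmissing is in Base). The vector x is made up for illustration:

```julia
using Missings     # provides passmissing
using Statistics   # provides mean

x = [1.0, missing, 3.0]

# skipmissing iterates over the non-missing values only.
avg = mean(skipmissing(x))

# passmissing wraps a function so that it returns missing
# when given missing, instead of throwing an error.
safe_sqrt = passmissing(sqrt)
roots = safe_sqrt.(x)
```

Without skipmissing, mean(x) would itself be missing.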

QuBit (Mar 01 2021 at 18:05):

Hi Nils,

I was able to implement:

for col in eachcol(ED4)
    replace!(col, NaN => 0)
end

This approach helped to address the NaN
fill issue I was having when I applied the
describe method to the dataframe.

Nils (Mar 01 2021 at 20:07):

Yes, that works, but again missing values would be more natural: they can be used with coalesce, which is built for exactly this use case.
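
As an illustration of the coalesce approach, here is a minimal sketch on a made-up column; coalesce returns its first argument that is not missing:

```julia
col = [1.0, missing, 3.0]

# Broadcast coalesce over the column, using 0.0 as the fill value.
filled = coalesce.(col, 0.0)
```

Note that coalesce only looks for missing; it does not treat nothing or NaN as a gap to fill.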

Jeffrey Sarnoff (Mar 01 2021 at 20:28):

@QuBit I agree with @Nils: use missing.

nothing_is_missing(x) = x[isnothing.(x)] .= missing

The dataframe's columns of interest need to allow values of type Missing.
Ask someone who knows how to manipulate DataFrames.

map(nothing_is_missing, eachcol(df[!, 7:11]));
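
On the point that the columns must allow Missing: one way to arrange that, sketched here as an assumption rather than as the method intended above, is allowmissing! from DataFrames, which widens the element type of the columns before the in-place assignment. The data frame is a small stand-in:

```julia
using DataFrames

df = DataFrame(a = [1.0, nothing, 3.0], b = [nothing, 20.0, 30.0])

# Widen every column so that it can also hold missing.
# A column selector such as 7:11 can be passed as a second argument.
allowmissing!(df)

# Now the in-place replacement no longer fails on conversion.
for col in eachcol(df)
    col[isnothing.(col)] .= missing
end
```

Without the allowmissing! step, assigning missing into a column whose element type is Union{Nothing, Float64} raises a conversion error.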

QuBit (Mar 01 2021 at 21:20):

Thank you Jeffrey. I attempted to apply your principle to converting
the same column set to Float64 using:

map(convert(DataFrame{Float64,1}, eachcol(ED4[!, 7:11])))

I get the following error:
TypeError: in Type{...} expression, expected UnionAll, got Type{DataFrames.DataFrame}

If I use Array{Float64,1} instead, the error message is:
MethodError: Cannot convert an object of type DataFrames.DataFrameColumns{DataFrames.DataFrame} to an object of type Array{Float64,1}

Any suggestions?
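
Both errors seem to stem from calling convert once on the whole eachcol iterator instead of once per column (and DataFrame takes no type parameters). A sketch of a per-column conversion, assuming the columns no longer contain nothing or missing; the data frame and the range 1:2 are stand-ins for ED4 and 7:11:

```julia
using DataFrames

# Columns typed as Union{Nothing, Float64} but holding only floats.
df = DataFrame(a = Union{Nothing, Float64}[1.0, NaN, 3.0],
               b = Union{Nothing, Float64}[NaN, 20.0, 30.0])

# Convert each selected column on its own and store the result back.
for name in names(df, 1:2)
    df[!, name] = convert(Vector{Float64}, df[!, name])
end
```

The conversion throws if any nothing or missing is still present, so those need to be replaced first.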

Jeffrey Sarnoff (Mar 01 2021 at 23:52):

The dataframe's columns of interest need to allow values of type Missing.
Ask someone who knows how to manipulate DataFrames.

QuBit (Mar 01 2021 at 23:55):

Hi Jeffrey,

I was able to solve it. The 'read' method in the CSV module has a
parameter called missingstrings, which I set to '---'. Problem and
coding averted!
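
A sketch of what that can look like, with made-up inline data. One assumption to flag: the keyword is spelled missingstrings in the post, while newer CSV.jl releases use missingstring, which is what the code below uses; check which one your installed version accepts.

```julia
using CSV, DataFrames

# Made-up data in which "---" marks an absent value.
data = """
id,value
1,2.5
2,---
3,4.0
"""

# Assumption: a CSV.jl release that accepts the `missingstring` keyword.
df = CSV.read(IOBuffer(data), DataFrame; missingstring = "---")
```

With the marker declared at read time, the column is parsed as numbers with missing, so no tryparse step is needed afterwards.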

Jeffrey Sarnoff (Mar 02 2021 at 00:22):

Great. Meanwhile, you can improve on this (if it helps):

julia> using DataFrames

julia> a = [1.0, nothing, 3.0, 4.0];

julia> b = [1.0, nothing, 3.0, nothing];

julia> c = [1.0, 2.0, nothing, 4.0];

julia> df = DataFrame(a=a, b=b, c=c);

julia> nothing_is_missing(x) = map(y->(isnothing(y) ? missing : y), x);

julia> df2 = similar(df);

julia> for colidx in 1:ncol(df)
          df2[!, colidx] = nothing_is_missing(df[!, colidx])
       end;

julia> df
4×3 DataFrame
 Row │ a       b       c
     │ Union…  Union…  Union…
─────┼────────────────────────
   1 │ 1.0     1.0     1.0
   2 │                 2.0
   3 │ 3.0     3.0
   4 │ 4.0             4.0

julia> df2
4×3 DataFrame
 Row │ a          b          c
     │ Float64?   Float64?   Float64?
─────┼─────────────────────────────────
   1 │       1.0        1.0        1.0
   2 │ missing    missing          2.0
   3 │       3.0        3.0  missing
   4 │       4.0  missing          4.0
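
The column loop in this post can also be written with mapcols from DataFrames, which applies a function to every column and returns a new data frame. A sketch reusing the same nothing_is_missing definition and the same data:

```julia
using DataFrames

df = DataFrame(a = [1.0, nothing, 3.0, 4.0],
               b = [1.0, nothing, 3.0, nothing],
               c = [1.0, 2.0, nothing, 4.0])

nothing_is_missing(x) = map(y -> (isnothing(y) ? missing : y), x)

# Builds a new data frame; df itself is left unchanged.
df2 = mapcols(nothing_is_missing, df)
```

This avoids preallocating df2 with similar and indexing columns by position.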

Last updated: Nov 22 2024 at 04:41 UTC