Stream: helpdesk (published)

Topic: Normalize DataFrames?


QuBit (Jul 12 2021 at 22:17):

Hello Everyone,

Might there be a more practical and workable
approach to normalizing all the elements of
a dataframe?

x = CookieMonster[1:15,:]
y = CookieMonster[16:16,:]

So far I have found two approaches, and both
result in an error.

1.

 X = (x .- mean(x, dims = 2)) ./ std(x, dims = 2)

Error as: 'no method matching mean'

2.

function normalize(input_df::DataFrame, cols::Array{Int64})
    norm_df = input_df
    for i in cols
        norm_df[i] = (input_df[i] - minimum(input_df[i])) /
            (maximum(input_df[i]) - minimum(input_df[i]))
    end
    norm_df
end

Error as: 'no method matching normalize'

The normalize() method has been around since 2015,
but I have not seen any updates to it.

I am programming in Pluto.

Any tips?

Felix Kastner (Jul 13 2021 at 05:53):

Are you using LinearAlgebra which is needed for mean?

Daniel Karrasch (Jul 13 2021 at 06:06):

I think mean is from Statistics, normalize may be in LinearAlgebra.
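For reference, a minimal sketch of where each function lives, using only the standard libraries. The vector `v` is made-up illustration data. Note that `normalize` from LinearAlgebra rescales to unit norm, which is a different operation from subtracting the mean and dividing by the standard deviation.

```julia
using Statistics      # provides mean and std
using LinearAlgebra   # provides norm and normalize

v = [3.0, 4.0]        # made-up example data

m = mean(v)           # arithmetic mean of the elements
s = std(v)            # sample standard deviation (n - 1 in the denominator)
u = normalize(v)      # v scaled to unit Euclidean norm, not a z-score
```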

Felix Kastner (Jul 13 2021 at 06:24):

Oh, you're right. Sry

QuBit (Jul 13 2021 at 14:32):

Daniel Karrasch

Thank you very much. When I qualify the call
as la.normalize(x), I get the following
error:

MethodError: no method matching normalize(::DataFrames.DataFrame)

Daniel Karrasch (Jul 13 2021 at 14:38):

Honestly, I'm not familiar with the specifics of the DataFrames package. I think you should consult their documentation to see what manipulation methods they have. normalize is defined in the stdlib LinearAlgebra for some generic types, of which DataFrame apparently is not a subtype. So it's no surprise that LinearAlgebra doesn't define a normalize method for DataFrames. Adding that method would be the task of DataFrames.jl.
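Since LinearAlgebra has no `normalize` method for a `DataFrame`, one option is a small helper that works column by column. The following is only a sketch that revises the min-max function from the first post. The name `minmax_scale` and the sample data are hypothetical, and the sketch assumes numeric columns whose maximum differs from their minimum (otherwise the division yields `NaN`).

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): min-max scale the
# given columns to the range [0, 1] and return a new DataFrame.
function minmax_scale(input_df::AbstractDataFrame, cols)
    norm_df = copy(input_df)            # copy, so the input is left untouched
    for c in cols
        col = input_df[!, c]            # df[!, c] accesses a column without copying
        lo, hi = extrema(col)
        norm_df[!, c] = (col .- lo) ./ (hi - lo)   # broadcasting dots are required
    end
    return norm_df
end

df = DataFrame(a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 40.0])  # made-up data
scaled = minmax_scale(df, [:a, :b])
```

Compared with the original function, this version copies the input instead of aliasing it, indexes columns with `df[!, c]` rather than `df[c]`, and broadcasts the arithmetic. A distinct name also avoids a clash with `LinearAlgebra.normalize` when both are loaded.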

QuBit (Jul 13 2021 at 14:48):

Daniel Karrasch

 X = (x .- mean(x, dims = 2)) ./ std(x, dims = 2)

Is what I am attempting to achieve from the FluxML
example HERE

When I qualify the methods, I get the same
error. As you might suspect, normalizing a DF is not
really the best practice. However, perhaps what they
presented there was pseudocode, because I do not
think it can work, even when I attempted nesting the
eachrow() method.

QuBit (Jul 13 2021 at 14:54):

Daniel Karrasch

Okay -- I attempted something like this:

x2 = (x .- mean(Array(x), dims = 2)) ./ std(Array(x), dims = 2)

What do you think?
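A point worth checking in the expression above is the `dims` argument, which decides the direction of the normalization. The sketch below uses a plain `Matrix` with made-up values to show the difference. Which direction is appropriate depends on whether observations are stored as rows or as columns, and that is an assumption to verify against the actual data.

```julia
using Statistics

# Made-up data: 3 observations (rows) by 2 variables (columns).
A = [1.0 10.0;
     2.0 20.0;
     3.0 30.0]

# dims = 1 reduces over rows: one mean/std per column (1×2 results).
Z_cols = (A .- mean(A, dims = 1)) ./ std(A, dims = 1)

# dims = 2 reduces over columns: one mean/std per row (3×1 results).
Z_rows = (A .- mean(A, dims = 2)) ./ std(A, dims = 2)
```

With observations in rows, as is typical for a DataFrame, `dims = 1` standardizes each variable, while `dims = 2` standardizes each observation across its variables.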

Daniel Karrasch (Jul 13 2021 at 14:57):

That might work, but I wonder if there are more efficient ways that don't allocate the extra array.
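One possible way to avoid the intermediate array is to update each column in place. This is a sketch under stated assumptions: the data is made up, every column holds floating-point values (integer columns would fail on the in-place assignment), and the normalization is per column, which differs from the `dims = 2` expression above.

```julia
using DataFrames, Statistics

df = DataFrame(a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 30.0])  # made-up data

# eachcol yields the stored column vectors, so .= writes into df itself
# and no intermediate Matrix is built.
for col in eachcol(df)
    col .= (col .- mean(col)) ./ std(col)
end
```

Here `mean(col)` and `std(col)` are ordinary calls, so they are evaluated once before the fused broadcast overwrites the column.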

QuBit (Jul 13 2021 at 15:01):

Daniel Karrasch

So far I am seeing some methods in C++ and Java.

I am seeing tips on Stack Overflow HERE

std::vector<int> vi;  // if the number of int-s are dynamic
std::array<int, 50> ai; // if the number of int-s are fixed

Daniel Karrasch (Jul 13 2021 at 15:04):

Is there a problem for a DataFrame x when you call mean(x, dims=2) or std(x, dims=2)? Do you need this Array(x)?

QuBit (Jul 13 2021 at 15:07):

Daniel Karrasch

Yes -- without the Array() method, I get the error that:

  no method matches mean(DataFrames.DataFrame)

Daniel Karrasch (Jul 13 2021 at 15:48):

Aha, that may be because DataFrames commonly hold non-numeric data? What does DataFrames.jl recommend for operations like that? That should be a common problem.
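When a DataFrame mixes numeric and non-numeric columns, one approach is to pick out the numeric columns first and transform only those. The sketch below is an illustration, not necessarily what DataFrames.jl recommends. The data and the helper `zscore` are made up for the example.

```julia
using DataFrames, Statistics

# Made-up data with one non-numeric column.
df = DataFrame(name = ["a", "b", "c"], x = [1.0, 2.0, 3.0], y = [2.0, 4.0, 6.0])

zscore(v) = (v .- mean(v)) ./ std(v)   # hypothetical helper, defined here

numeric_cols = names(df, Real)         # names of columns whose element type is <: Real
result = transform(df, numeric_cols .=> zscore; renamecols = false)
```

The `name` column passes through unchanged, and `renamecols = false` keeps the original column names instead of appending the function name.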

QuBit (Jul 13 2021 at 15:53):

Daniel Karrasch

I am not seeing anything specific to this error message HERE

Andrey Oskin (Jul 13 2021 at 16:21):

The recommended way to work with such operations is described here: https://bkamins.github.io/julialang/2021/07/09/multicol.html


Last updated: Oct 02 2023 at 04:34 UTC