Stream: helpdesk (published)

Topic: DataFrames Transformation


view this post on Zulip Kwaku Oskin (Feb 16 2021 at 17:05):

:confused: but there is no iloc or iat function in DataFrames...

view this post on Zulip Kwaku Oskin (Feb 16 2021 at 17:09):

Hmmm.,.

Can you start from the beginning and explain what you have and what you are trying to achieve?

May be small example of the csv file that you are trying to parse? How should result look like after all transformations? Is usage of Pandas is mandatory or it is ok to use only vanilla DataFrames?

view this post on Zulip Expanding Man (Feb 16 2021 at 17:19):

I'm not at all sure what you are trying to do here, but it sounds to me a bit like you are looking for IndexedTables.jl. DataFrames are essentially a named tuples of like sized vectors, column values and indices are deliberately kept separate, although the library does have stack and unstack functions

view this post on Zulip Expanding Man (Feb 16 2021 at 17:20):

You may also want to read here

view this post on Zulip Expanding Man (Feb 16 2021 at 17:21):

well, so does dataframes, so I think the first thing you should do is read the above documentation section. However, if using column values as indices is something you want to do very often, then yes, IndexedTables.jl may be a better choice.

view this post on Zulip Expanding Man (Feb 16 2021 at 17:38):

only if the thing you want to stack or unstack is an ID column

view this post on Zulip QuBit (Feb 16 2021 at 17:45):

(deleted)

view this post on Zulip Expanding Man (Feb 16 2021 at 17:51):

Is this what you're trying to do?

df = DataFrame(Time1=[now() + Second(i) for i  1:5],
               Time2=[now() + Hour(i) for i  1:5],
               Time3=[now() + Day(i) for i  1:5]
              )

stack(df, [:Time1, :Time2, :Time3])

which gives

15×2 DataFrame
 Row  variable  value
      String    DateTime
─────┼───────────────────────────────────
   1  Time1     2021-02-16T12:50:57.435
   2  Time1     2021-02-16T12:50:58.435
   3  Time1     2021-02-16T12:50:59.435
   4  Time1     2021-02-16T12:51:00.435
   5  Time1     2021-02-16T12:51:01.435
   6  Time2     2021-02-16T13:50:56.458
   7  Time2     2021-02-16T14:50:56.458
   8  Time2     2021-02-16T15:50:56.458
   9  Time2     2021-02-16T16:50:56.458
  10  Time2     2021-02-16T17:50:56.458
  11  Time3     2021-02-17T12:50:56.483
  12  Time3     2021-02-18T12:50:56.483
  13  Time3     2021-02-19T12:50:56.483
  14  Time3     2021-02-20T12:50:56.483
  15  Time3     2021-02-21T12:50:56.483

view this post on Zulip QuBit (Feb 16 2021 at 18:05):

Expanding Man
Sorry for the confusion. I was listing Time1 and Time2 as labels. I was not
attempting to have them represent timestamps, but this is useful in other
cases for sure.

view this post on Zulip Expanding Man (Feb 16 2021 at 18:11):

Regardless, was a complete stacking aggregation what you were trying to do? Otherwise you can leave some columns unstacked as an extra column, e.g.

stack(df, [:Time1, :Time2], [:Time3])

which gives

10×3 DataFrame
 Row  Time3                    variable  value
      DateTime                 String    DateTime
─────┼────────────────────────────────────────────────────────────
   1  2021-02-17T13:07:03.609  Time1     2021-02-16T13:07:04.561
   2  2021-02-18T13:07:03.609  Time1     2021-02-16T13:07:05.561
   3  2021-02-19T13:07:03.609  Time1     2021-02-16T13:07:06.561
   4  2021-02-20T13:07:03.609  Time1     2021-02-16T13:07:07.561
   5  2021-02-21T13:07:03.609  Time1     2021-02-16T13:07:08.561
   6  2021-02-17T13:07:03.609  Time2     2021-02-16T14:07:03.585
   7  2021-02-18T13:07:03.609  Time2     2021-02-16T15:07:03.585
   8  2021-02-19T13:07:03.609  Time2     2021-02-16T16:07:03.585
   9  2021-02-20T13:07:03.609  Time2     2021-02-16T17:07:03.585
  10  2021-02-21T13:07:03.609  Time2     2021-02-16T18:07:03.585

That just about covers the DataFrames.jl stacking functionality. If that doesn't cover your use case, than I'm afraid I just don't know what you mean by stacking.

As for IndexedTables.jl, this doesn't offer any benefit in terms of stacking, but on the other hand, DataFrames.jl is really not designed to be indexed by value, so, if for example you wanted to index the above by :Time3 value, DataFrames.jl doesn't really give you any nice ways of doing that, though you can of course just do Dict(df.Time3 .=> 1:size(df,1)) to create a lookup table (at least in teh case of unique keys). IndexedTables.jl on the other hand is designed with this type of indexing in mind. IndexedTables.jl doesn't see nearly as much development or use as DataFrames.jl, so you probably want to stick with DataFrames, but again, if what you're really interested in is indexing it may not be the best choice.

view this post on Zulip QuBit (Feb 16 2021 at 18:14):

Expanding Man
As for IndexedTables.jl, this doesn't offer any benefit in terms of stacking, but on the other hand, DataFrames.jl is really not designed to be indexed by value, so, if for example you wanted to index the above by :Time3 value, DataFrames.jl doesn't really give you any nice ways of doing that, though you can of course just do Dict(df.Time3 .=> 1:size(df,1)) to create a lookup table (at least in the case of unique keys). IndexedTables.jl on the other hand is designed with this type of indexing in mind. IndexedTables.jl doesn't see nearly as much development or use as DataFrames.jl, so you probably want to stick with DataFrames, but again, if what you're really interested in is indexing it may not be the best choice.

I appreciate this. I will experiment a little more with the stack method. I also know of a LAJuliaUtils unregistered library that has an explicit 'pivot' function that could perhaps help me simply flip columns for row values. I had trouble implementing it before however.

view this post on Zulip Expanding Man (Feb 16 2021 at 18:19):

One way or another, DataFrames.jl should cover all of the standard relational database operations such as stacking and pivots. As a rule of thumb, you probably want to use DataFrames.jl for all tables except

view this post on Zulip QuBit (Feb 16 2021 at 18:35):

Thank you for providing this explanation.

The original transformation method I provided was
able to switch the header and row 1 positions. However,
the issue is that the entire dataframe is shifted down one
unit. If there was a way to configure the code so that the
transformed R1 values were the headers and the rest of
the dataframe was shifted up one unit it would be the pivot
I was attempting to create.

view this post on Zulip Kwaku Oskin (Feb 16 2021 at 19:00):

It would be slightly easier if you can give example of the dataframe or at least part of csv file.

So, I am unsure, whether it is the structure that you have, but maybe you want something like this?

julia> df = DataFrame(:col => ["S1", "S2", "S3"], :T1 => [1, 2, 3], :T2 => [4, 5, 6], :T3 => [7,
 8, 9])
3×4 DataFrame
 Row  col     T1     T2     T3
      String  Int64  Int64  Int64
─────┼─────────────────────────────
   1  S1          1      4      7
   2  S2          2      5      8
   3  S3          3      6      9

julia> df1 = stack(df, 2:4);
julia> df2 = unstack(df1, :variable, :col, :value)
3×4 DataFrame
 Row  variable  S1      S2      S3
      String    Int64?  Int64?  Int64?
─────┼──────────────────────────────────
   1  T1             1       2       3
   2  T2             4       5       6
   3  T3             7       8       9

view this post on Zulip QuBit (Feb 16 2021 at 21:44):

Expanding Man said:

only if the thing you want to stack or unstack is an ID column

A fellow colleague suggested, I use CSV.read(.... transpose=true); this solved the problem.

view this post on Zulip QuBit (Feb 16 2021 at 21:45):

Andrey Oskin said:

It would be slightly easier if you can give example of the dataframe or at least part of csv file.

So, I am unsure, whether it is the structure that you have, but maybe you want something like this?

julia> df = DataFrame(:col => ["S1", "S2", "S3"], :T1 => [1, 2, 3], :T2 => [4, 5, 6], :T3 => [7,
 8, 9])
3×4 DataFrame
 Row  col     T1     T2     T3
      String  Int64  Int64  Int64
─────┼─────────────────────────────
   1  S1          1      4      7
   2  S2          2      5      8
   3  S3          3      6      9

julia> df1 = stack(df, 2:4);
julia> df2 = unstack(df1, :variable, :col, :value)
3×4 DataFrame
 Row  variable  S1      S2      S3
      String    Int64?  Int64?  Int64?
─────┼──────────────────────────────────
   1  T1             1       2       3
   2  T2             4       5       6
   3  T3             7       8       9

Hi Andrey, Thank you for your explanation here. A colleague mentioned that I should use CSV.read() method setting the '
transpose' parameter to 'true'

view this post on Zulip Florian Große (Feb 18 2021 at 08:32):

So that's what you want, a transpose :D

view this post on Zulip Florian Große (Feb 18 2021 at 08:33):

Didn't came through clearly I suppose


Last updated: Nov 06 2024 at 04:40 UTC