DataFrames Transformation · helpdesk (published)

Stream: helpdesk (published)

Topic: DataFrames Transformation

Andrey Oskin (Feb 16 2021 at 17:05):

:confused: but there is no iloc or iat function in DataFrames...

Andrey Oskin (Feb 16 2021 at 17:09):

Hmmm.,.

Can you start from the beginning and explain what you have and what you are trying to achieve?

May be small example of the csv file that you are trying to parse? How should result look like after all transformations? Is usage of Pandas is mandatory or it is ok to use only vanilla DataFrames?

Expanding Man (Feb 16 2021 at 17:19):

I'm not at all sure what you are trying to do here, but it sounds to me a bit like you are looking for IndexedTables.jl. DataFrames are essentially a named tuples of like sized vectors, column values and indices are deliberately kept separate, although the library does have stack and unstack functions

Expanding Man (Feb 16 2021 at 17:20):

You may also want to read here

Expanding Man (Feb 16 2021 at 17:21):

well, so does dataframes, so I think the first thing you should do is read the above documentation section. However, if using column values as indices is something you want to do very often, then yes, IndexedTables.jl may be a better choice.

Expanding Man (Feb 16 2021 at 17:38):

only if the thing you want to stack or unstack is an ID column

qu bit (Feb 16 2021 at 17:45):

(deleted)

Expanding Man (Feb 16 2021 at 17:51):

Is this what you're trying to do?

df = DataFrame(Time1=[now() + Second(i) for i ∈ 1:5],
               Time2=[now() + Hour(i) for i ∈ 1:5],
               Time3=[now() + Day(i) for i ∈ 1:5]
              )

stack(df, [:Time1, :Time2, :Time3])

which gives

15×2 DataFrame
 Row │ variable  value
     │ String    DateTime
─────┼───────────────────────────────────
   1 │ Time1     2021-02-16T12:50:57.435
   2 │ Time1     2021-02-16T12:50:58.435
   3 │ Time1     2021-02-16T12:50:59.435
   4 │ Time1     2021-02-16T12:51:00.435
   5 │ Time1     2021-02-16T12:51:01.435
   6 │ Time2     2021-02-16T13:50:56.458
   7 │ Time2     2021-02-16T14:50:56.458
   8 │ Time2     2021-02-16T15:50:56.458
   9 │ Time2     2021-02-16T16:50:56.458
  10 │ Time2     2021-02-16T17:50:56.458
  11 │ Time3     2021-02-17T12:50:56.483
  12 │ Time3     2021-02-18T12:50:56.483
  13 │ Time3     2021-02-19T12:50:56.483
  14 │ Time3     2021-02-20T12:50:56.483
  15 │ Time3     2021-02-21T12:50:56.483

qu bit (Feb 16 2021 at 18:05):

Expanding Man
Sorry for the confusion. I was listing Time1 and Time2 as labels. I was not
attempting to have them represent timestamps, but this is useful in other
cases for sure.

Expanding Man (Feb 16 2021 at 18:11):

Regardless, was a complete stacking aggregation what you were trying to do? Otherwise you can leave some columns unstacked as an extra column, e.g.

stack(df, [:Time1, :Time2], [:Time3])

which gives

10×3 DataFrame
 Row │ Time3                    variable  value
     │ DateTime                 String    DateTime
─────┼────────────────────────────────────────────────────────────
   1 │ 2021-02-17T13:07:03.609  Time1     2021-02-16T13:07:04.561
   2 │ 2021-02-18T13:07:03.609  Time1     2021-02-16T13:07:05.561
   3 │ 2021-02-19T13:07:03.609  Time1     2021-02-16T13:07:06.561
   4 │ 2021-02-20T13:07:03.609  Time1     2021-02-16T13:07:07.561
   5 │ 2021-02-21T13:07:03.609  Time1     2021-02-16T13:07:08.561
   6 │ 2021-02-17T13:07:03.609  Time2     2021-02-16T14:07:03.585
   7 │ 2021-02-18T13:07:03.609  Time2     2021-02-16T15:07:03.585
   8 │ 2021-02-19T13:07:03.609  Time2     2021-02-16T16:07:03.585
   9 │ 2021-02-20T13:07:03.609  Time2     2021-02-16T17:07:03.585
  10 │ 2021-02-21T13:07:03.609  Time2     2021-02-16T18:07:03.585

That just about covers the DataFrames.jl stacking functionality. If that doesn't cover your use case, than I'm afraid I just don't know what you mean by stacking.

As for IndexedTables.jl, this doesn't offer any benefit in terms of stacking, but on the other hand, DataFrames.jl is really not designed to be indexed by value, so, if for example you wanted to index the above by :Time3 value, DataFrames.jl doesn't really give you any nice ways of doing that, though you can of course just do Dict(df.Time3 .=> 1:size(df,1)) to create a lookup table (at least in teh case of unique keys). IndexedTables.jl on the other hand is designed with this type of indexing in mind. IndexedTables.jl doesn't see nearly as much development or use as DataFrames.jl, so you probably want to stick with DataFrames, but again, if what you're really interested in is indexing it may not be the best choice.

qu bit (Feb 16 2021 at 18:14):

Expanding Man
As for IndexedTables.jl, this doesn't offer any benefit in terms of stacking, but on the other hand, DataFrames.jl is really not designed to be indexed by value, so, if for example you wanted to index the above by :Time3 value, DataFrames.jl doesn't really give you any nice ways of doing that, though you can of course just do Dict(df.Time3 .=> 1:size(df,1)) to create a lookup table (at least in the case of unique keys). IndexedTables.jl on the other hand is designed with this type of indexing in mind. IndexedTables.jl doesn't see nearly as much development or use as DataFrames.jl, so you probably want to stick with DataFrames, but again, if what you're really interested in is indexing it may not be the best choice.

I appreciate this. I will experiment a little more with the stack method. I also know of a LAJuliaUtils unregistered library that has an explicit 'pivot' function that could perhaps help me simply flip columns for row values. I had trouble implementing it before however.

Expanding Man (Feb 16 2021 at 18:19):

One way or another, DataFrames.jl should cover all of the standard relational database operations such as stacking and pivots. As a rule of thumb, you probably want to use DataFrames.jl for all tables except

If you are mostly interested in looking up rows by value (e.g. the use cases of IndexedTables.jl)
If good row-wise performance is crucial. That is, DataFrames.jl is very inefficient at iterating over a DataFrame by row, because it does not hold any type information. Usually it is quite easy to replace row-wise operations with column-wise operations, so the situations in which this is prohibitive are likely few, but it can happen.
You need huge distributed tables. Unfortunately, I don't think there's any mature, well-maintained package in the ecosystem for this right now.

qu bit (Feb 16 2021 at 18:35):

Thank you for providing this explanation.

The original transformation method I provided was
able to switch the header and row 1 positions. However,
the issue is that the entire dataframe is shifted down one
unit. If there was a way to configure the code so that the
transformed R1 values were the headers and the rest of
the dataframe was shifted up one unit it would be the pivot
I was attempting to create.

Andrey Oskin (Feb 16 2021 at 19:00):

It would be slightly easier if you can give example of the dataframe or at least part of csv file.

So, I am unsure, whether it is the structure that you have, but maybe you want something like this?

julia> df = DataFrame(:col => ["S1", "S2", "S3"], :T1 => [1, 2, 3], :T2 => [4, 5, 6], :T3 => [7,
 8, 9])
3×4 DataFrame
 Row │ col     T1     T2     T3
     │ String  Int64  Int64  Int64
─────┼─────────────────────────────
   1 │ S1          1      4      7
   2 │ S2          2      5      8
   3 │ S3          3      6      9

julia> df1 = stack(df, 2:4);
julia> df2 = unstack(df1, :variable, :col, :value)
3×4 DataFrame
 Row │ variable  S1      S2      S3
     │ String    Int64?  Int64?  Int64?
─────┼──────────────────────────────────
   1 │ T1             1       2       3
   2 │ T2             4       5       6
   3 │ T3             7       8       9

qu bit (Feb 16 2021 at 21:44):

Expanding Man said:

only if the thing you want to stack or unstack is an ID column

A fellow colleague suggested, I use CSV.read(.... transpose=true); this solved the problem.

qu bit (Feb 16 2021 at 21:45):

Andrey Oskin said:

It would be slightly easier if you can give example of the dataframe or at least part of csv file.

So, I am unsure, whether it is the structure that you have, but maybe you want something like this?

julia> df = DataFrame(:col => ["S1", "S2", "S3"], :T1 => [1, 2, 3], :T2 => [4, 5, 6], :T3 => [7,
 8, 9])
3×4 DataFrame
 Row │ col     T1     T2     T3
     │ String  Int64  Int64  Int64
─────┼─────────────────────────────
   1 │ S1          1      4      7
   2 │ S2          2      5      8
   3 │ S3          3      6      9

julia> df1 = stack(df, 2:4);
julia> df2 = unstack(df1, :variable, :col, :value)
3×4 DataFrame
 Row │ variable  S1      S2      S3
     │ String    Int64?  Int64?  Int64?
─────┼──────────────────────────────────
   1 │ T1             1       2       3
   2 │ T2             4       5       6
   3 │ T3             7       8       9

Hi Andrey, Thank you for your explanation here. A colleague mentioned that I should use CSV.read() method setting the '
transpose' parameter to 'true'

Florian Große (Feb 18 2021 at 08:32):

So that's what you want, a transpose :D

Florian Große (Feb 18 2021 at 08:33):

Didn't came through clearly I suppose

Last updated: Oct 02 2023 at 04:34 UTC