Index Matching + List Comprehension · helpdesk (published)

Stream: helpdesk (published)

Topic: Index Matching + List Comprehension

qu bit (Mar 08 2021 at 08:31):

Hello Coders:

I have two tables. Table 1:

T1.A = [1,2,3,4,5]
T1.B = ["Fri", "Sat", "Wed", "Fri", "Mon"]

And Table 2:

T2.A = ["Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"]
T2.B = [0.25, 0.35, 0.50, 0.15, 0.30, 0.75, 0.65]

How can I create a new column in T1 as T1.C that
is populated with T2.B values where T1.B matches?

The general Query.jl approach might be:

T1.C = @from i in T1 begin
              @join j in T2 on i.a T1.B  equals startswith(T2.A, i, n=1:3)
              @select {T1.A,T1.B,T2.A,T2.B}
              @collect Dataframe

Is there a way to execute this using DataFrames
and or some higher-order function?

Thank you,

Andrey Oskin (Mar 08 2021 at 10:03):

Best way to proceed is to build temporary column and join on it.

DataFrames documentation do not have information about joining with random conditions. And it is considered bad practice anyway, since it generates nested loops.

qu bit (Mar 08 2021 at 14:55):

Thank you for this @Andrey Oskin .
Did you examine the Query.jl SQL-LINQ
code I provided?

Andrey Oskin (Mar 08 2021 at 15:01):

Nope, but I suppose they went cross-join + filter road. It's not much you can do, if you have no information about keys.

Andrey Oskin (Mar 08 2021 at 15:02):

Standard way to join is sorting over keys with the following double cursor iteration.

Andrey Oskin (Mar 08 2021 at 15:02):

So it is O(n log(n))

Andrey Oskin (Mar 08 2021 at 15:03):

If you do not know anything in advance, then you have to take all possible pairs and it is O(n^2) operations (assuming that dataframes has roughly the same size).

Andrey Oskin (Mar 08 2021 at 15:04):

So it is either speed or convenience.

Andrey Oskin (Mar 08 2021 at 15:17):

So, you can do something like this

using DataFrames
using Chain

T1 = DataFrame("A" => [1,2,3,4,5], "B" => ["Fri", "Sat", "Wed", "Fri", "Mon"])
T2 = DataFrame("A" => ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"], "B" => [0.25, 0.35, 0.50, 0.15, 0.30, 0.75, 0.65])

@chain T2 begin
    transform(:A => (y -> (x -> x[1:3]).(y)) => :A3)
    innerjoin(T1, on = [:A3 => :B], makeunique = true)
    select(["A_1", "A3", "A", "B"])
    rename("A_1" => "T1A", "A3" => "T1B", "A" => "T2A", "B" => "T2B")
end

5×4 DataFrame
 Row │ T1A    T1B     T2A        T2B
     │ Int64  String  String     Float64
─────┼───────────────────────────────────
   1 │     5  Mon     Monday        0.25
   2 │     3  Wed     Wednesday     0.5
   3 │     1  Fri     Friday        0.3
   4 │     4  Fri     Friday        0.3
   5 │     2  Sat     Saturday      0.75

qu bit (Mar 08 2021 at 18:24):

Andrey Oskin said:

Nope, but I suppose they went cross-join + filter road. It's not much you can do, if you have no information about keys.

Yes -- this was my code. However, the startswith() method does not work
in its current placement. I am glad to some level it could convey the issue.

qu bit (Mar 08 2021 at 18:28):

Yes -- this is excellent @Andrey Oskin .
Thank you.

@Andrey Oskin
In this example, what does the @transform line do?

And how did A mutate to A_1?

Last updated: Oct 02 2023 at 04:34 UTC