Stream: helpdesk (published)

Topic: DataFrame subtraction


view this post on Zulip QuBit (Mar 01 2021 at 16:50):

Hello:

How does one subtract columnar values between two dataframes
when the data frames have different dimensions.

For example,
DF1_dim = Base.OneTo(48)
DF2_dim=Base.OneTo(58)

Ideally, where there are no overlapping values, I would like to fill
the resultant DF's cell with 'NaN'.

Any suggestions?

view this post on Zulip Nils (Mar 01 2021 at 17:09):

You'd need to pad the shorter dataframe with NaN at the bottom

view this post on Zulip QuBit (Mar 01 2021 at 17:16):

Nils
Hi Nils,

Could you provide a short example of this?
Thank you,

view this post on Zulip Nils (Mar 01 2021 at 17:17):

julia> using DataFrames

julia> df1 = DataFrame(a = 1:5, b = rand(5))
5×2 DataFrame
 Row  a      b
      Int64  Float64
─────┼─────────────────
   1      1  0.6121
   2      2  0.807316
   3      3  0.827804
   4      4  0.782406
   5      5  0.228108

julia> df2 = DataFrame(a = 1:3, b = rand(3))
3×2 DataFrame
 Row  a      b
      Int64  Float64
─────┼─────────────────
   1      1  0.211348
   2      2  0.645856
   3      3  0.337383

julia> df1 .- df2
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 5 and 3")
Stacktrace:
 [1] _bcs1
   @ .\broadcast.jl:501 [inlined]
 [2] _bcs(shape::Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, newshape::Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}})
   @ Base.Broadcast .\broadcast.jl:495
 [3] broadcast_shape
   @ .\broadcast.jl:489 [inlined]
 [4] combine_axes
   @ .\broadcast.jl:484 [inlined]
 [5] instantiate
   @ .\broadcast.jl:266 [inlined]
 [6] materialize(bc::Base.Broadcast.Broadcasted{DataFrames.DataFrameStyle, Nothing, typeof(-), Tuple{DataFrame, DataFrame}})
   @ Base.Broadcast .\broadcast.jl:883
 [7] top-level scope
   @ REPL[8]:1

julia> df1 .- vcat(df2, DataFrame(a = [missing for _  1:2], b = [missing for _  1:2]))
5×2 DataFrame
 Row  a        b
      Int64?   Float64?
─────┼─────────────────────────
   1        0        0.400752
   2        0        0.16146
   3        0        0.490422
   4  missing  missing
   5  missing  missing

view this post on Zulip QuBit (Mar 01 2021 at 17:37):

Nils
Hi Nils,

This would work, but the error I am returning is:
MethodError: no method matching -(::String, ::String)

Which I think has to do with the fact my df has more
than two fields, some are Int64, String, Float64.

Any work around you can think of, almost there.

view this post on Zulip Nils (Mar 01 2021 at 20:09):

Well the workaround is either to parse your strings into numbers, or subtract column-by-column, only using the columns which actually hold numbers

view this post on Zulip QuBit (Mar 01 2021 at 22:13):

Hi Nils,

I am using:

DF1[!,:Total] .- vcat(DF2[!,:Total], DataFrame(Total= [missing for _  1:10]))

And am returning:

DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 58 and 49")

Any suggestions?

view this post on Zulip Nils (Mar 02 2021 at 09:08):

If you're just subtracting two columns it's just

julia> df1.a .- [df2.a; [missing for _  1:nrow(df1)-nrow(df2)]]
5-element Vector{Union{Missing, Int64}}:
 0
 0
 0
  missing
  missing

using my example above

view this post on Zulip QuBit (Mar 02 2021 at 10:48):

Nils [said]
Hi Nils,

The computation time is about 3 times as fast for:

begin
select(DF1, :Total) .- vcat(select(DF2,:Total), DataFrame(Total = [missing for _  1:10]))
end

Compared to:

 df1.a .- [df2.a; [missing for _  1:nrow(df1)-nrow(df2)]]

However, the second approach is a lot cleaner.

If I may, what is your reasoning for using:
missing for _ ∈

Thank you,

view this post on Zulip Nils (Mar 02 2021 at 13:01):

How did you benchmark this? I can't see the first being faster at all, given that you're constructing an additional dataframe. The missing for _ ∈ in ... syntax just means "repeat missing for this many times", the underscore is often used to signify a loop variable that isn't actually used in the loop

view this post on Zulip Nils (Mar 02 2021 at 13:45):

Yes, that's what I'm referring to. The cell timings in Pluto aren't really meant to perform microbenchmarks like this, if you care about this make sure you wrap the operations you're benchmarking in functions so as not to benchmark in global scope (might be less relevant in Pluto as cells are local scopes as well from what I understand) and use @btime from BenchmarkTools:

julia> using DataFrames, BenchmarkTools

julia> df1 = DataFrame(a = rand(1_000_000)); df2 = DataFrame(a = rand(800_000));

julia> diff1(d1, d2) = d1.a .- [df2.a; [missing for _  1:nrow(d1)-nrow(d2)]]
diff1 (generic function with 1 method)

julia> diff2(d1, d2) = select(d1, :a) .- vcat(select(d2, :a), DataFrame(a = [missing for _  1:nrow(d1)-nrow(d2)]))
diff2 (generic function with 1 method)

julia> @btime diff1($df1, $df2);
  12.853 ms (22 allocations: 24.80 MiB)

julia> @btime diff2($df1, $df2);
  24.141 ms (193 allocations: 38.54 MiB)

view this post on Zulip Nils (Mar 02 2021 at 14:06):

They represent the runtime of executing that cell, but that includes compilation time, as well as just random noise on your system, so it's not a reliable estimate of runtime of some code


Last updated: Nov 06 2024 at 04:40 UTC