DataFrame subtraction · helpdesk (published)

How does one subtract columnar values between two dataframes
when the data frames have different dimensions.

Ideally, where there are no overlapping values, I would like to fill
the resultant DF's cell with 'NaN'.

Nils (Mar 01 2021 at 17:09):

QuBit (Mar 01 2021 at 17:16):

Nils (Mar 01 2021 at 17:17):

julia> using DataFrames

julia> df1 = DataFrame(a = 1:5, b = rand(5))
5×2 DataFrame
 Row │ a      b
     │ Int64  Float64
─────┼─────────────────
   1 │     1  0.6121
   2 │     2  0.807316
   3 │     3  0.827804
   4 │     4  0.782406
   5 │     5  0.228108

julia> df2 = DataFrame(a = 1:3, b = rand(3))
3×2 DataFrame
 Row │ a      b
     │ Int64  Float64
─────┼─────────────────
   1 │     1  0.211348
   2 │     2  0.645856
   3 │     3  0.337383

julia> df1 .- df2
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 5 and 3")
Stacktrace:
 [1] _bcs1
   @ .\broadcast.jl:501 [inlined]
 [2] _bcs(shape::Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, newshape::Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}})
   @ Base.Broadcast .\broadcast.jl:495
 [3] broadcast_shape
   @ .\broadcast.jl:489 [inlined]
 [4] combine_axes
   @ .\broadcast.jl:484 [inlined]
 [5] instantiate
   @ .\broadcast.jl:266 [inlined]
 [6] materialize(bc::Base.Broadcast.Broadcasted{DataFrames.DataFrameStyle, Nothing, typeof(-), Tuple{DataFrame, DataFrame}})
   @ Base.Broadcast .\broadcast.jl:883
 [7] top-level scope
   @ REPL[8]:1

julia> df1 .- vcat(df2, DataFrame(a = [missing for _ ∈ 1:2], b = [missing for _ ∈ 1:2]))
5×2 DataFrame
 Row │ a        b
     │ Int64?   Float64?
─────┼─────────────────────────
   1 │       0        0.400752
   2 │       0        0.16146
   3 │       0        0.490422
   4 │ missing  missing
   5 │ missing  missing

QuBit (Mar 01 2021 at 17:37):

This would work, but the error I am returning is:
MethodError: no method matching -(::String, ::String)

Which I think has to do with the fact my df has more
than two fields, some are Int64, String, Float64.

Nils (Mar 01 2021 at 20:09):

Well the workaround is either to parse your strings into numbers, or subtract column-by-column, only using the columns which actually hold numbers

QuBit (Mar 01 2021 at 22:13):

DF1[!,:Total] .- vcat(DF2[!,:Total], DataFrame(Total= [missing for _ ∈ 1:10]))

DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 58 and 49")

Nils (Mar 02 2021 at 09:08):

julia> df1.a .- [df2.a; [missing for _ ∈ 1:nrow(df1)-nrow(df2)]]
5-element Vector{Union{Missing, Int64}}:
 0
 0
 0
  missing
  missing

QuBit (Mar 02 2021 at 10:48):

begin
select(DF1, :Total) .- vcat(select(DF2,:Total), DataFrame(Total = [missing for _ ∈ 1:10]))
end

 df1.a .- [df2.a; [missing for _ ∈ 1:nrow(df1)-nrow(df2)]]

Nils (Mar 02 2021 at 13:01):

How did you benchmark this? I can't see the first being faster at all, given that you're constructing an additional dataframe. The missing for _ ∈ in ... syntax just means "repeat missing for this many times", the underscore is often used to signify a loop variable that isn't actually used in the loop

Nils (Mar 02 2021 at 13:45):

Yes, that's what I'm referring to. The cell timings in Pluto aren't really meant to perform microbenchmarks like this, if you care about this make sure you wrap the operations you're benchmarking in functions so as not to benchmark in global scope (might be less relevant in Pluto as cells are local scopes as well from what I understand) and use @btime from BenchmarkTools:

julia> using DataFrames, BenchmarkTools

julia> df1 = DataFrame(a = rand(1_000_000)); df2 = DataFrame(a = rand(800_000));

julia> diff1(d1, d2) = d1.a .- [df2.a; [missing for _ ∈ 1:nrow(d1)-nrow(d2)]]
diff1 (generic function with 1 method)

julia> diff2(d1, d2) = select(d1, :a) .- vcat(select(d2, :a), DataFrame(a = [missing for _ ∈ 1:nrow(d1)-nrow(d2)]))
diff2 (generic function with 1 method)

julia> @btime diff1($df1, $df2);
  12.853 ms (22 allocations: 24.80 MiB)

julia> @btime diff2($df1, $df2);
  24.141 ms (193 allocations: 38.54 MiB)

Nils (Mar 02 2021 at 14:06):

They represent the runtime of executing that cell, but that includes compilation time, as well as just random noise on your system, so it's not a reliable estimate of runtime of some code

Stream: helpdesk (published)

Topic: DataFrame subtraction

QuBit (Mar 01 2021 at 16:50):