Hello:
How does one subtract columnar values between two dataframes
when the data frames have different dimensions.
For example,
DF1_dim = Base.OneTo(48)
DF2_dim=Base.OneTo(58)
Ideally, where there are no overlapping values, I would like to fill
the resultant DF's cell with 'NaN'.
Any suggestions?
You'd need to pad the shorter dataframe with NaN
at the bottom
Nils
Hi Nils,
Could you provide a short example of this?
Thank you,
julia> using DataFrames
julia> df1 = DataFrame(a = 1:5, b = rand(5))
5×2 DataFrame
Row │ a b
│ Int64 Float64
─────┼─────────────────
1 │ 1 0.6121
2 │ 2 0.807316
3 │ 3 0.827804
4 │ 4 0.782406
5 │ 5 0.228108
julia> df2 = DataFrame(a = 1:3, b = rand(3))
3×2 DataFrame
Row │ a b
│ Int64 Float64
─────┼─────────────────
1 │ 1 0.211348
2 │ 2 0.645856
3 │ 3 0.337383
julia> df1 .- df2
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 5 and 3")
Stacktrace:
[1] _bcs1
@ .\broadcast.jl:501 [inlined]
[2] _bcs(shape::Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, newshape::Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}})
@ Base.Broadcast .\broadcast.jl:495
[3] broadcast_shape
@ .\broadcast.jl:489 [inlined]
[4] combine_axes
@ .\broadcast.jl:484 [inlined]
[5] instantiate
@ .\broadcast.jl:266 [inlined]
[6] materialize(bc::Base.Broadcast.Broadcasted{DataFrames.DataFrameStyle, Nothing, typeof(-), Tuple{DataFrame, DataFrame}})
@ Base.Broadcast .\broadcast.jl:883
[7] top-level scope
@ REPL[8]:1
julia> df1 .- vcat(df2, DataFrame(a = [missing for _ ∈ 1:2], b = [missing for _ ∈ 1:2]))
5×2 DataFrame
Row │ a b
│ Int64? Float64?
─────┼─────────────────────────
1 │ 0 0.400752
2 │ 0 0.16146
3 │ 0 0.490422
4 │ missing missing
5 │ missing missing
Nils
Hi Nils,
This would work, but the error I am returning is:
MethodError: no method matching -(::String, ::String)
Which I think has to do with the fact my df has more
than two fields, some are Int64, String, Float64.
Any work around you can think of, almost there.
Well the workaround is either to parse your strings into numbers, or subtract column-by-column, only using the columns which actually hold numbers
Hi Nils,
I am using:
DF1[!,:Total] .- vcat(DF2[!,:Total], DataFrame(Total= [missing for _ ∈ 1:10]))
And am returning:
DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 58 and 49")
Any suggestions?
If you're just subtracting two columns it's just
julia> df1.a .- [df2.a; [missing for _ ∈ 1:nrow(df1)-nrow(df2)]]
5-element Vector{Union{Missing, Int64}}:
0
0
0
missing
missing
using my example above
Nils [said]
Hi Nils,
The computation time is about 3 times as fast for:
begin
select(DF1, :Total) .- vcat(select(DF2,:Total), DataFrame(Total = [missing for _ ∈ 1:10]))
end
Compared to:
df1.a .- [df2.a; [missing for _ ∈ 1:nrow(df1)-nrow(df2)]]
However, the second approach is a lot cleaner.
If I may, what is your reasoning for using:
missing for _ ∈
Thank you,
How did you benchmark this? I can't see the first being faster at all, given that you're constructing an additional dataframe. The missing for _ ∈ in ...
syntax just means "repeat missing
for this many times", the underscore is often used to signify a loop variable that isn't actually used in the loop
Yes, that's what I'm referring to. The cell timings in Pluto aren't really meant to perform microbenchmarks like this, if you care about this make sure you wrap the operations you're benchmarking in functions so as not to benchmark in global scope (might be less relevant in Pluto as cells are local scopes as well from what I understand) and use @btime
from BenchmarkTools
:
julia> using DataFrames, BenchmarkTools
julia> df1 = DataFrame(a = rand(1_000_000)); df2 = DataFrame(a = rand(800_000));
julia> diff1(d1, d2) = d1.a .- [df2.a; [missing for _ ∈ 1:nrow(d1)-nrow(d2)]]
diff1 (generic function with 1 method)
julia> diff2(d1, d2) = select(d1, :a) .- vcat(select(d2, :a), DataFrame(a = [missing for _ ∈ 1:nrow(d1)-nrow(d2)]))
diff2 (generic function with 1 method)
julia> @btime diff1($df1, $df2);
12.853 ms (22 allocations: 24.80 MiB)
julia> @btime diff2($df1, $df2);
24.141 ms (193 allocations: 38.54 MiB)
They represent the runtime of executing that cell, but that includes compilation time, as well as just random noise on your system, so it's not a reliable estimate of runtime of some code
Last updated: Dec 28 2024 at 04:38 UTC