Let's say we have an Array{Int64,1} with N = 100 values.
I would like to create a new array holding a sub-sample
that pulls q values from every 5-number subset.
Example:
using StatsBase
sample(0:100, q, replace=false)
q is a random number, so for each subset of 5 numbers
q can differ.
Any suggestions out there?
Can you make it more explicit?
For simplicity, I use 3 instead of 5 in what follows.
So you are saying that we have an array [1, 2, 3, 4]
and all of its 3-length subsets
[1, 2, 3]
[1, 2, 4]
[1, 3, 2]
[1, 4, 2]
[1, 2, 4]
[1, 4, 2]
[2, 1, 3]
# etc
And you want to have a sample of length q. So if q = 2, then
[[1, 2, 3], [2, 1, 4]]
is a valid sample. Am I correct?
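A minimal sketch of that reading, assuming the subsets are ordered and that repeated draws are acceptable. The helper name random_subsets is hypothetical, and it uses shuffle from the Random standard library rather than StatsBase.sample:

```julia
using Random

# Hypothetical helper: draw `q` random ordered subsets of length `k` from `v`.
# Each draw is independent, so the same subset may appear more than once.
function random_subsets(rng::AbstractRNG, v::AbstractVector, k::Integer, q::Integer)
    return [shuffle(rng, v)[1:k] for _ in 1:q]
end

rng = MersenneTwister(42)
subsets = random_subsets(rng, [1, 2, 3, 4], 3, 2)  # two 3-element subsets of 1:4
```

Shuffling the whole array for every draw is wasteful for large inputs, but it keeps the sketch short and guarantees no repeated element within a subset.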
Hi Andrey,
Yes, you were correct. However, I
only mentioned an array to keep
things simple.
I am working with a dataframe and
want to replace the sampled values
with 'missing', then apply the
sample back to the original
dataframe.
Any ideas?
Sorry, it is hard to say, since the structure of Data
is unknown. And since you already have something working, what are you trying to achieve? Make it faster and/or shorter?
Can you give an example in terms of example data? Like
"
As an input I have
data = [[1, 2], [3, 4]]
As an output I want
[1, 2, 3, 4]
"
It would be easier to help you then. This way one can reproduce your situation and give some meaningful advice; otherwise we can only try to guess.
Hi Andrey,
the original data =
ID | Location | Value1 | Value2
1 | Chad | 2.4 | 5.4
2 | Chile | 5.3 | 0.6
3 | Mexico | 3.3 | 9.4
4 | Anguilla | 3.5 | 7.1
5 | Bulgaria| 5.7 | 8.2
6 | Slovania | 6.7 | 3.9
the sampled data =
ID | Location | Value1 | Value2
1 | Chad | 2.4 | 5.4
4 | Anguilla | 3.5 | 7.1
6 | Slovania | 6.7 | 3.9
I want to replace the values in the columns in cols (:Value1, :Value2)
with 'missing' for the rows in the 'sampled data', then apply
this dataframe back to the original dataframe, with the
new values at the same ID positions.
desired data frame=
ID | Location | Value1 | Value2
1 | Chad | missing | missing
2 | Chile | 5.3 | 0.6
3 | Mexico | 3.3 | 9.4
4 | Anguilla | missing | missing
5 | Bulgaria| 5.7 | 8.2
6 | Slovania | missing | missing
Is this a better explanation?
Well, you can do the following
using DataFrames
df = DataFrame(:ID => [1, 2, 3, 4, 5, 6],
               :Location => ["Chad", "Chile", "Mexico", "Anguilla", "Bulgaria", "Slovania"],
               :Value1 => [2.4, 5.3, 3.3, 3.5, 5.7, 6.7],
               :Value2 => [5.4, 0.6, 9.4, 7.1, 8.2, 3.9])
smp = [1, 4, 6]
df.Value1 = ifelse.(in.(df.ID, Ref(smp)), missing, df.Value1)
df.Value2 = ifelse.(in.(df.ID, Ref(smp)), missing, df.Value2)
Then df looks exactly as you want
julia> df
6×4 DataFrame
 Row │ ID     Location  Value1     Value2
     │ Int64  String    Float64?   Float64?
─────┼───────────────────────────────────────
   1 │     1  Chad      missing    missing
   2 │     2  Chile           5.3        0.6
   3 │     3  Mexico          3.3        9.4
   4 │     4  Anguilla  missing    missing
   5 │     5  Bulgaria        5.7        8.2
   6 │     6  Slovania  missing    missing
You can do it for an arbitrary number of columns with
for column in [:Value1, :Value2]
    df[!, column] = ifelse.(in.(df.ID, Ref(smp)), missing, df[!, column])
end
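As an alternative to rebuilding each column, the same result can probably be reached by assigning into the selected rows in place. This is only a sketch, assuming the DataFrames functions allowmissing! and broadcast assignment on a row/column selection; the names rows and cols are made up for the example:

```julia
using DataFrames

df = DataFrame(ID = [1, 2, 3, 4, 5, 6],
               Value1 = [2.4, 5.3, 3.3, 3.5, 5.7, 6.7],
               Value2 = [5.4, 0.6, 9.4, 7.1, 8.2, 3.9])
smp = [1, 4, 6]
cols = [:Value1, :Value2]

allowmissing!(df, cols)       # widen the Float64 columns so they can hold missing
rows = in.(df.ID, Ref(smp))   # Boolean mask marking the sampled IDs
df[rows, cols] .= missing     # overwrite only the selected cells, in place
```

The ifelse version builds new columns, while this one mutates the existing ones, so which is preferable depends on whether the original values must stay untouched.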
Excellent, Andrey! Thank you very much.
Can I replace smp with Data[:, :Id] instead
of listing out the indices?
Thanks again!
Of course, it requires the indices of the sampled data, but they are easily extracted from the sampled data.
Ha, I was too late to write my comment :-)
Well, it's better to use DataSample.ID
or DataSample[!, :ID]
instead of DataSample[:, :ID]
No one needs extra allocations, no matter how tiny they are.
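A small sketch of the difference, assuming a sampled dataframe called DataSample with a column named ID as in the example above:

```julia
using DataFrames

DataSample = DataFrame(ID = [1, 4, 6])

no_copy_a = DataSample.ID        # the stored column itself
no_copy_b = DataSample[!, :ID]   # also the stored column, no allocation
copied    = DataSample[:, :ID]   # a freshly allocated copy of the column

same_object = no_copy_a === no_copy_b   # both refer to the stored vector
is_copy     = copied !== no_copy_a      # the `:` form returned a new vector
```

Since the IDs are only read here, the copy made by the `:` form buys nothing.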
Testing this out on my DataFrame now.
Excellent coding: clean, and appreciated.
Hi Andrey:
Implemented:
begin
    Slice = DataMod[!, :]
    SMP = DataSample[:, :Id]
    Slice[!, 4:8] = ifelse.(in.(Slice.Id, Ref(SMP)), missing, Slice[!, 4:8])
end
I suppose I could have used DataMod instead of Slice,
but I wanted to keep the former dataframe as a backup.
Could you point me to the resource that describes
the Ref() method?
Thanks
The Ref trick is a common one: use it when you want to broadcast, but one of the arguments is an array that should be treated as a single element.
https://discourse.julialang.org/t/how-to-broadcast-over-only-certain-function-arguments/19274
https://discourse.julialang.org/t/marking-types-as-scalar-for-broadcasting-ref-vs-tuple/29105
https://stackoverflow.com/questions/51993802/what-is-the-connection-between-refs-and-broadcasting-in-julia
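A short sketch of what the wrapper changes, using only Base Julia. The variable names are illustrative:

```julia
ids = [1, 2, 3, 4, 5, 6]
smp = [1, 4, 6]

# Ref(smp) makes broadcasting treat `smp` as one value, so each id is
# tested against the whole collection.
mask = in.(ids, Ref(smp))

# Wrapping the array in a one-element tuple has the same effect.
mask_tuple = in.(ids, (smp,))

# Without a wrapper, broadcasting tries to pair the 6 ids with the 3
# sample values element by element and fails on the length mismatch.
failed = try
    in.(ids, smp)
    false
catch err
    err isa DimensionMismatch
end
```

Here mask marks the positions of IDs 1, 4 and 6, which is exactly the condition passed to ifelse in the dataframe example.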
Understood. In our test case here we could have written
Ref([1,4,6]) directly, but passing a single variable as the
element is more legible for larger test cases.
Last updated: Nov 06 2024 at 04:40 UTC