Random Sampling of a column by elemental group · helpdesk (published)

qu bit (Mar 06 2021 at 09:48):

Let's say we have an Array{Int64,1} with N = 100 values.

I would like to create a new array holding a sub-sample
that pulls q values from every 5-number subset.

example:

using StatsBase
sample(1:100, q, replace=false)

q is a random number, so for each subset of 5 numbers,
q can differ.

Any suggestions out there?
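
One possible reading of the question, sketched under assumptions: split the vector into consecutive blocks of 5 and draw a random number q of distinct values from each block. The helper name `sample_blocks` and the choice of drawing q uniformly between 0 and the block length are assumptions made for illustration, and the sketch uses the `Random` standard library rather than StatsBase.

```julia
using Random

# Hypothetical helper: for each consecutive block of `blocksize` values,
# draw a random count q (0 to the block length) and pick q distinct values.
function sample_blocks(rng::AbstractRNG, v::AbstractVector, blocksize::Integer)
    out = eltype(v)[]
    for block in Iterators.partition(v, blocksize)
        q = rand(rng, 0:length(block))               # q may differ per block
        picked = shuffle(rng, collect(block))[1:q]   # sampling without replacement
        append!(out, picked)
    end
    return out
end

rng = MersenneTwister(42)
v = collect(1:100)
sub = sample_blocks(rng, v, 5)
```

Because each block is shuffled independently, no value can be drawn twice, which mirrors `replace=false` in the StatsBase call.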

Andrey Oskin (Mar 06 2021 at 10:10):

Can you make it more explicit?

For simplicity, I use 3 instead of 5 in what follows.

So, you are saying that we have an array [1, 2, 3, 4] and we consider all of its 3-length subsets:

[1, 2, 3]
[1, 2, 4]
[1, 3, 2]
[1, 4, 2]
[1, 3, 4]
[1, 4, 3]
[2, 1, 3]
# etc

And you want a sample of length q. So if q = 2, then

[[1, 2, 3], [2, 1, 4]]

is a valid sample. Am I correct?

qu bit (Mar 06 2021 at 11:24):

Hi Andrey,

Yes, you are correct. However, I
only mentioned an array to keep
things simple.

I am working with a DataFrame and
want to replace the sampled values
with `missing`, then apply the
sample back to the original
DataFrame.

Any ideas?

Andrey Oskin (Mar 06 2021 at 11:31):

Sorry, it is hard to say, since the structure of the data is unknown. And since you already have something working, what are you trying to achieve? Make it faster and/or shorter?

Can you give an example in terms of example data? Like

"
As an input I have

data = [[1, 2], [3, 4]]

As an output I want

[1, 2, 3, 4]
"

It would be easier to help you then. That way one can reproduce your situation and give some meaningful advice; otherwise we can only try to guess.

qu bit (Mar 06 2021 at 11:47):

Hi Andrey,

the original data =
ID | Location | Value1 | Value2
1  | Chad     | 2.4    | 5.4
2  | Chile    | 5.3    | 0.6
3  | Mexico   | 3.3    | 9.4
4  | Anguilla | 3.5    | 7.1
5  | Bulgaria | 5.7    | 8.2
6  | Slovania | 6.7    | 3.9

the sampled data =
ID | Location | Value1 | Value2
1  | Chad     | 2.4    | 5.4
4  | Anguilla | 3.5    | 7.1
6  | Slovania | 6.7    | 3.9

I want to replace the values in the columns :Value1 and :Value2
with `missing` for the rows in the 'sampled data', then write
those rows back into the original dataframe so the new values
land at the same ID positions.

Is this a better explanation?

Andrey Oskin (Mar 06 2021 at 13:14):

Well, you can do the following

using DataFrames

df = DataFrame(:ID => [1, 2, 3, 4, 5, 6],
               :Location => ["Chad", "Chile", "Mexico", "Anguilla", "Bulgaria", "Slovania"],
               :Value1 => [2.4, 5.3, 3.3, 3.5, 5.7, 6.7],
               :Value2 => [5.4, 0.6, 9.4, 7.1, 8.2, 3.9])

smp = [1, 4, 6]
df.Value1 = ifelse.(in.(df.ID, Ref(smp)), missing, df.Value1)
df.Value2 = ifelse.(in.(df.ID, Ref(smp)), missing, df.Value2)

Then df looks exactly as you want

julia> df
6×4 DataFrame
 Row │ ID     Location  Value1     Value2
     │ Int64  String    Float64?   Float64?
─────┼───────────────────────────────────────
   1 │     1  Chad      missing    missing
   2 │     2  Chile           5.3        0.6
   3 │     3  Mexico          3.3        9.4
   4 │     4  Anguilla  missing    missing
   5 │     5  Bulgaria        5.7        8.2
   6 │     6  Slovania  missing    missing
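
The `Float64?` column types in the output can be reproduced with plain vectors. This is a minimal sketch with hypothetical variable names (`ids`, `vals`, `mask`, `masked`) that assumes nothing beyond Base Julia: the broadcast `ifelse` mixes `missing` with `Float64` values, so the result's element type widens to `Union{Missing, Float64}`.

```julia
ids  = [1, 2, 3, 4, 5, 6]
vals = [2.4, 5.3, 3.3, 3.5, 5.7, 6.7]
smp  = [1, 4, 6]

# true where the ID belongs to the sample
mask = in.(ids, Ref(smp))

# sampled positions become missing, the rest keep their value
masked = ifelse.(mask, missing, vals)

# the element type now admits missing values
eltype(masked) == Union{Missing, Float64}
```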

Andrey Oskin (Mar 06 2021 at 13:17):

You can do it for an arbitrary number of columns with

for column in [:Value1, :Value2]
    df[!, column] = ifelse.(in.(df.ID, Ref(smp)), missing, df[!, column])
end

qu bit (Mar 06 2021 at 13:19):

Excellent, Andrey! Thank you very much.
Can I replace smp with Data[:, :Id] instead
of listing out the indices?

Thanks again!

Andrey Oskin (Mar 06 2021 at 13:20):

Of course, it requires the indices of the sampled data, but they are easily extracted from the sampled data.

Andrey Oskin (Mar 06 2021 at 13:20):

Ha, I was too late to write my comment :-)

Andrey Oskin (Mar 06 2021 at 13:21):

Well, it's better to use DataSample.Id or DataSample[!, :Id] instead of DataSample[:, :Id]

Andrey Oskin (Mar 06 2021 at 13:21):

No one needs extra allocations, no matter how tiny they are.
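
The difference between the two indexing forms can be checked directly. This is a small sketch on a made-up three-row data frame (the variable names are illustrative only): `df.Id` and `df[!, :Id]` return the stored column itself, while `df[:, :Id]` allocates a copy.

```julia
using DataFrames

df = DataFrame(Id = [1, 4, 6], Value1 = [2.4, 3.5, 6.7])

no_copy_a = df.Id          # the stored column itself
no_copy_b = df[!, :Id]     # also the stored column, no allocation
copied    = df[:, :Id]     # a freshly allocated copy

no_copy_a === no_copy_b    # same object
copied === no_copy_a       # different object with equal contents
```

For a read-only membership test like `in.(df.ID, Ref(smp))` the copy is harmless but unnecessary, which is the point made above.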

qu bit (Mar 06 2021 at 13:23):

Testing this out on my DataFrame now.
Excellent coding. Clean, appreciated.

qu bit (Mar 06 2021 at 13:48):

Hi Andrey:

Implemented:

begin
    Slice = DataMod[!, :]
    SMP = DataSample[:, :Id]
    Slice[!, 4:8] = ifelse.(in.(Slice.Id, Ref(SMP)), missing, Slice[!, 4:8])
end

I suppose I could have used DataMod instead of Slice,
but wanted to keep the former dataframe for
back-up

Could you point me to the resource that describes
the Ref() method?

Thanks

Andrey Oskin (Mar 06 2021 at 14:32):

The Ref trick is a common idiom for when you want to use broadcasting but one of the arguments is an array that should be treated as a single element.

https://discourse.julialang.org/t/how-to-broadcast-over-only-certain-function-arguments/19274
https://discourse.julialang.org/t/marking-types-as-scalar-for-broadcasting-ref-vs-tuple/29105
https://stackoverflow.com/questions/51993802/what-is-the-connection-between-refs-and-broadcasting-in-julia
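
As a rough illustration of the idiom described above, the following sketch uses only Base Julia and the IDs from this thread. The variable names are illustrative. It compares `Ref` with two equivalent spellings and shows what happens when the wrapper is left out.

```julia
ids = [1, 2, 3, 4, 5, 6]
smp = [1, 4, 6]

# Ref(smp) shields smp from broadcasting: each id is tested against all of smp.
with_ref = in.(ids, Ref(smp))

# Wrapping smp in a one-element tuple has the same effect.
with_tuple = in.(ids, (smp,))

# in(smp) builds a one-argument function, so only ids is broadcast over.
curried = in(smp).(ids)

# Without a wrapper both arrays are broadcast together; their lengths
# (6 and 3) do not match, so the call throws a DimensionMismatch.
failed = try
    in.(ids, smp)
    false
catch err
    err isa DimensionMismatch
end
```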

qu bit (Mar 06 2021 at 14:44):

Understood. In our test case here we could have written
Ref([1, 4, 6]) directly, but passing a single variable to Ref()
is more legible for larger test cases.

Last updated: Oct 02 2023 at 04:34 UTC