Is there a standard method (e.g. a hypothesis test) to determine whether two sets of sequences could have been generated by the same process?
Here is a toy example which I think captures the problem I'm having:
function gensequences_processtype1(N, m, possible_values)
    seqs = []
    for _ in 1:N
        push!(seqs, fill(rand(possible_values), m)) # All values are the same!
    end
    return seqs
end

function gensequences_processtype2(N, m, possible_values)
    seqs = []
    for _ in 1:N
        push!(seqs, rand(possible_values, m))
    end
    return seqs
end
where typically N >> m. Would it be possible to, e.g., rank the following against targetseq?
julia> targetseq = gensequences_processtype2(1000, 10, [1, 1, 2, 10, 40]);
julia> perfectmatch = gensequences_processtype2(10000, 100, [1, 1, 2, 10, 40]); # All sequences generated by the exact same process
julia> decentmatch = gensequences_processtype2(10000, 100, [1, 3, 10, 40]); # right processtype but wrong values
julia> lessdecentmatchmaybe = gensequences_processtype1(100, 10, [1, 1, 2, 10, 40]); # values are right, but this is processtype 1
I guess the order between the last two is not that important. I'm just showing why I think aggregating the sequences is risky: pooling all the values together would hide whether a set came from processtype 1 or processtype 2.
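To make the aggregation concern concrete, here is a minimal sketch (the helper names `pooled_frequencies` and `adjacent_equal_rate` are made up for illustration, and the sample sizes are arbitrary). It suggests that the pooled value frequencies of the two process types look alike, while a within-sequence statistic separates them:

```julia
using Random

# Hypothetical helper: relative frequency of each value after pooling all sequences.
function pooled_frequencies(seqs)
    counts = Dict{Int,Int}()
    total = 0
    for s in seqs, v in s
        counts[v] = get(counts, v, 0) + 1
        total += 1
    end
    return Dict(k => c / total for (k, c) in counts)
end

# Hypothetical helper: fraction of adjacent within-sequence pairs that are equal.
function adjacent_equal_rate(seqs)
    equal = 0
    pairs = 0
    for s in seqs, i in 1:length(s)-1
        equal += s[i] == s[i+1]
        pairs += 1
    end
    return equal / pairs
end

Random.seed!(1)
possible_values = [1, 1, 2, 10, 40]
type1 = [fill(rand(possible_values), 10) for _ in 1:2000]  # mirrors processtype 1
type2 = [rand(possible_values, 10) for _ in 1:2000]        # mirrors processtype 2

# Pooled frequencies are close for both types (the value 1 is near 0.4 in each) ...
freq1 = pooled_frequencies(type1)
freq2 = pooled_frequencies(type2)

# ... but the within-sequence structure differs: exactly 1.0 for type 1,
# and much lower for type 2, where positions are drawn independently.
rate1 = adjacent_equal_rate(type1)
rate2 = adjacent_equal_rate(type2)
```

Any comparison that only looks at `freq1` and `freq2` would therefore treat the two sets as nearly identical, which is exactly the risk described above.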
I can think of a few ad-hoc ways to do this, but I'm curious whether there is some named method (Name1-Name2 style) that I could simply refer to instead.
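For reference, one of the ad-hoc approaches I have in mind looks roughly like the sketch below: a two-sample permutation test on a per-sequence summary statistic. The names `seq_stat` and `permutation_pvalue` are hypothetical, and the choice of statistic (fraction of equal adjacent pairs) is an assumption; it is length-normalised so that sets with different m can be compared, but it only detects differences that show up in that one statistic.

```julia
using Random, Statistics

# Hypothetical per-sequence summary: fraction of adjacent pairs that are equal.
seq_stat(s) = count(i -> s[i] == s[i+1], 1:length(s)-1) / (length(s) - 1)

# Two-sample permutation test on the difference in mean seq_stat between sets a and b.
# Returns an approximate p-value for the null "both sets come from the same process".
function permutation_pvalue(a, b; nperm=2000, rng=Random.default_rng())
    xa, xb = map(seq_stat, a), map(seq_stat, b)
    observed = abs(mean(xa) - mean(xb))
    pooled = vcat(xa, xb)
    na = length(xa)
    hits = 0
    for _ in 1:nperm
        shuffle!(rng, pooled)  # random relabelling of the per-sequence statistics
        d = abs(mean(view(pooled, 1:na)) - mean(view(pooled, na+1:length(pooled))))
        hits += d >= observed
    end
    return (hits + 1) / (nperm + 1)  # add-one correction keeps the p-value above 0
end

rng = MersenneTwister(42)
vals = [1, 1, 2, 10, 40]
target    = [rand(rng, vals, 10) for _ in 1:500]        # processtype 2
same_proc = [rand(rng, vals, 10) for _ in 1:500]        # processtype 2 again
other     = [fill(rand(rng, vals), 10) for _ in 1:500]  # processtype 1

p_same  = permutation_pvalue(target, same_proc; rng=rng)
p_other = permutation_pvalue(target, other; rng=rng)    # expected to be very small
```

Note that this statistic alone would not catch a change in `possible_values` that leaves the adjacent-equality rate unchanged, so in practice it would have to be combined with a comparison of the value frequencies.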
Last updated: Nov 06 2024 at 04:40 UTC