Method to compare sets of sequences · helpdesk (published)

DrChainsaw (Feb 24 2023 at 15:26):

Is there some standard method (e.g. a hypothesis test) to determine whether two sets of sequences could have been generated by the same process?

Here is a toy example which I think captures the problem I'm having:

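# Process type 1: each sequence repeats one randomly drawn value m times.
function gensequences_processtype1(N, m, possible_values)
    seqs = Vector{eltype(possible_values)}[]
    for _ in 1:N
        push!(seqs, fill(rand(possible_values), m)) # All values are the same!
    end
    return seqs
end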

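# Process type 2: each of the m values in a sequence is drawn independently.
function gensequences_processtype2(N, m, possible_values)
    seqs = Vector{eltype(possible_values)}[]
    for _ in 1:N
        push!(seqs, rand(possible_values, m))
    end
    return seqs
end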

Typically N >> m. Would it be possible to, e.g., rank the following sets by how well they match targetseq?

julia> targetseq = gensequences_processtype2(1000, 10, [1, 1, 2, 10, 40]);

julia> perfectmatch = gensequences_processtype2(10000, 100, [1, 1, 2, 10, 40]); # All sequences generated by the exact same process

julia> decentmatch = gensequences_processtype2(10000, 100, [1, 3, 10, 40]); # right processtype but wrong values

julia> lessdecentmatchmaybe = gensequences_processtype1(100, 10, [1, 1, 2, 10, 40]); # values are right, but this is processtype 1

I guess the order between the last two is not that important. They are just there to show why I think that aggregating the sequences is risky: both processtype1 and processtype2 may be present, and pooling all values hides the difference between them.
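To make the aggregation concern concrete, here is a minimal sketch. The helper names (`gen_type1`, `gen_type2`, `pooled_freqs`, `adjacent_equal_fraction`) are made up for this illustration and are not a standard method; the generators are just compact versions of the two toy processes above, with an explicit RNG added as an assumption for reproducibility.

```julia
using Random

# Compact versions of the two toy generators (hypothetical names).
gen_type1(rng, N, m, vals) = [fill(rand(rng, vals), m) for _ in 1:N]
gen_type2(rng, N, m, vals) = [rand(rng, vals, m) for _ in 1:N]

# Relative frequency of each value after pooling all sequences together.
function pooled_freqs(seqs)
    counts = Dict{Int,Int}()
    for s in seqs, x in s
        counts[x] = get(counts, x, 0) + 1
    end
    total = sum(values(counts))
    return Dict(k => v / total for (k, v) in counts)
end

# Fraction of adjacent pairs inside each sequence that hold equal values.
function adjacent_equal_fraction(seqs)
    equal = sum(count(i -> s[i] == s[i+1], 1:length(s)-1) for s in seqs)
    total = sum(length(s) - 1 for s in seqs)
    return equal / total
end

rng = MersenneTwister(42)
vals = [1, 1, 2, 10, 40]
type1 = gen_type1(rng, 1000, 10, vals)
type2 = gen_type2(rng, 1000, 10, vals)

# Pooled frequencies come out similar for both process types ...
println(pooled_freqs(type1))
println(pooled_freqs(type2))

# ... while a within-sequence statistic separates them:
# exactly 1.0 for type 1, and 0.28 in expectation for type 2 with these values.
println(adjacent_equal_fraction(type1))
println(adjacent_equal_fraction(type2))
```

The pooled frequencies only reflect how often each value occurs overall, so any comparison based on them alone cannot tell the two process types apart; some statistic computed per sequence is needed for that.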

I can think of a few ad hoc ways to do this, but I'm curious whether there is some named ("Name1-Name2" type of) method that I can simply refer to instead.

Last updated: Oct 02 2023 at 04:34 UTC