Stream: helpdesk (published)

Topic: Sortition with diversity requirements


Colin Caine (Jul 20 2021 at 06:31):

I'd like to take a sample of size N from a population such that the sample meets certain diversity requirements (e.g. at least X women, at least Y religious people). I can't use stratified sampling naively because the groups overlap and the size of the overlap is unknown. Also, X + Y + ... > N, so some sampled people must belong to multiple groups or the requirements cannot be met.

This sample is for appointments to a political council by sortition, so as much as possible the algorithm should be understandable and at least arguably fair.

Any ideas?

Existing (problematic) ideas:

1) Construct a kind of artificial stratified sample by identifying all of the non-overlapping groups (e.g. gay religious women, straight religious men, etc.) and their current sizes, then calculating what weight each should have to meet my population-level requirements (probably by adjusting the weights iteratively).

2) Randomly select candidates until the diversity requirements are met; then, while the sample is too large, randomly remove candidates who are not in any protected group. This is not guaranteed to succeed, because the draw might not keep enough people who are in overlapping groups, but we can just retry a few times.
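Idea 2 might be sketched like this in Python. The representation is my own invention for illustration: each person is a dict with a `"traits"` set, and the requirements are a dict mapping a group name to its minimum count.

```python
import random

def meets_quotas(sample, quotas):
    """True if the sample satisfies every minimum-count requirement."""
    return all(sum(1 for p in sample if g in p["traits"]) >= q
               for g, q in quotas.items())

def sample_with_pruning(population, quotas, n, max_tries=100):
    """Idea 2: grow a random sample until it has at least n people and
    every quota is met, then randomly prune people who are in no
    protected group.  Pruning can fail to reach size n, so retry."""
    for _ in range(max_tries):
        pool = list(population)
        random.shuffle(pool)
        sample = []
        for person in pool:
            sample.append(person)
            if len(sample) >= n and meets_quotas(sample, quotas):
                break
        # Candidates for removal: people who contribute to no quota.
        unprotected = [p for p in sample if not (p["traits"] & set(quotas))]
        random.shuffle(unprotected)
        while len(sample) > n and unprotected:
            sample.remove(unprotected.pop())
        if len(sample) == n and meets_quotas(sample, quotas):
            return sample
    return None  # no valid sample found within max_tries
```

The retry loop covers the non-termination problem mentioned above: a single attempt can end with too many quota-holders to prune down to n.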

Siva Swaminathan (Jul 21 2021 at 00:15):

You might be able to come up with a genetic algorithm / Monte Carlo sampler that's a better version of option 2. E.g.: "score" the current state on all the desiderata and assign each current set member a probability of being replaced by someone outside the set. You could also weight the members outside the set suitably to increase the probability of the desired scores being met. Hopefully such an iterative (Gibbs-sampling-like?) process converges in a reasonable time.
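A toy version of this suggestion might look as follows. The scoring rule (total quota shortfall) and the acceptance rule (accept any swap that doesn't make the shortfall worse) are one arbitrary choice among many, and the person/quota representation is illustrative, not from the thread.

```python
import random

def deficit(sample, quotas):
    """Total shortfall across all quotas; 0 means all requirements met."""
    return sum(max(0, q - sum(1 for p in sample if g in p["traits"]))
               for g, q in quotas.items())

def mcmc_sample(population, quotas, n, steps=10_000):
    """Start from a uniform size-n sample and repeatedly propose swapping
    a random member for a random outsider, accepting whenever the quota
    shortfall does not increase."""
    sample = random.sample(population, n)
    for _ in range(steps):
        if deficit(sample, quotas) == 0:
            return sample
        i = random.randrange(n)
        outsider = random.choice([p for p in population if p not in sample])
        candidate = sample[:i] + [outsider] + sample[i + 1:]
        if deficit(candidate, quotas) <= deficit(sample, quotas):
            sample = candidate
    return sample if deficit(sample, quotas) == 0 else None
```

Note that, unlike plain rejection sampling, this chain does not obviously produce a uniform distribution over the valid sets, which matters for the fairness argument later in the thread.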

Colin Caine (Jul 21 2021 at 18:49):

In the end I implemented two approaches. One just samples N people, checks whether the sample meets the requirements, and if not, throws it out and redraws. The other was (2) from above.

(2) was much faster, typically taking only one or two attempts on my dataset.

The pick-N, throw-out-and-try-again approach worked as well, but needed quite a few more attempts. Assuming it can find an answer on the complete dataset, I will use this one because it's easier to prove that it is fair (each of the valid sets has an equal chance of being selected).
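The rejection approach is simple to state: every size-N subset is drawn with equal probability, and conditioning on "meets the requirements" preserves that uniformity over the valid sets. A sketch, using the same illustrative representation as before (people as dicts with a `"traits"` set, quotas as minimum counts):

```python
import random

def rejection_sample(population, quotas, n, max_tries=10_000):
    """Draw uniform size-n samples until one satisfies every quota.
    Because every size-n subset is equally likely to be drawn, the
    returned set is uniform over all valid sets."""
    for _ in range(max_tries):
        sample = random.sample(population, n)
        if all(sum(1 for p in sample if g in p["traits"]) >= q
               for g, q in quotas.items()):
            return sample
    return None  # quotas may be infeasible or very rarely satisfied
```

The cost is the retry count: the expected number of attempts is the reciprocal of the probability that a uniform draw happens to satisfy all the quotas, which is why this needed more attempts than approach (2).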

Colin Caine (Jul 21 2021 at 19:01):

I think you'd get the same probability distribution if you do a less naive search:

I think that's probably a simple enough algorithm to perform manually and would hopefully converge fast enough that you wouldn't get too bored doing it.


Last updated: Oct 02 2023 at 04:34 UTC