I'd like to take a sample of size N from a population that meets certain diversity requirements (e.g. at least X women, at least Y religious). I can't use stratified sampling naively because the groups overlap and the size of the overlap is unknown. Also X + Y + ... > N, so some sampled people must be in multiple groups or the requirements cannot be met.
This sample is for appointments to a political council by sortition, so as much as possible the algorithm should be understandable and at least arguably fair.
Any ideas?
Existing (problematic) ideas:
1) Make a kind of artificial stratified sample set by identifying all of the non-overlapping groups (e.g. gay religious women, straight religious men, etc) and their current size, then calculating what weights they should have to meet my population-level requirements (probably iteratively adjusting them)
2) Randomly select candidates until diversity requirements are met then, while the sample size is too large, randomly remove candidates that are not in a protected group. This is not guaranteed to terminate because we might not keep enough people who are in overlapping groups, but we can just try a few times.
You might be able to come up with a genetic algorithm / Monte Carlo sampler that’s a better version of option 2. Eg: “Score” the current state on all the desiderata and assign current set members probability of being replaced by someone outside the set. You could also weight the members outside the set suitably to increase probability of desirable scores being met. Hopefully such an iterative (Gibbs Sampling?) process converges in a reasonable time.
In the end I implemented two approaches. One of them just samples N people, checks if sample meets the requirements, and if not, throws them out. And the other was (2) from above.
2) was much faster, typically only taking one or two attempts on my dataset.
The pick N, throw out and try again approach worked as well, but needed quite a few more attempts. Assuming it can find an answer on the complete dataset, I will use this one because it's easier to prove that it is fair (each of the valid sets has an equal chance of being found).
I think you'd get the same probability distribution if you do a less naive search:
I think that's probably a simple enough algorithm to perform manually and would hopefully converge fast enough that you wouldn't get too bored doing it.
Thanks to @Bogumił Kamiński for suggesting just sampling N people and then checking if they meet the requirements. I was concerned that this would not work for my set of requirements, but it was obviously worthwhile to check!
Colin Caine has marked this topic as resolved.
Thanks for coming back and documenting the solution you ended up with and the reasons why! It was interesting to read up on it
Last updated: Dec 28 2024 at 04:38 UTC