Stream: helpdesk (published)

Topic: Fitting conditional probability distributions


DrChainsaw (Jul 02 2023 at 10:05):

Suppose I have data on user purchases and want to model both the time between purchases and the amount spent. Note that this is a completely made-up use case, so there is no need to discuss whether it is meaningful to do for this use case.

The simplest approach, I guess, is to just fit distributions for these quantities individually and sample from them (e.g. sample the time to the next purchase from one distribution and the amount from another). The problem is that for the real use case I think it matters a lot what each "user" does individually. If a minority of the users make a majority of the purchases, the simplistic model will probably give incorrect predictions (given that the "matters a lot what each user does individually" assumption is true).
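In code, the baseline I mean would be something like this minimal sketch (the vectors dts and amts are hypothetical data, and the Exponential/LogNormal shapes are just for illustration):

    using Distributions

    # hypothetical data: dts = times between purchases, amts = purchase amounts
    dt_dist  = fit(Exponential, dts)   # MLE fit, assuming an Exponential shape
    amt_dist = fit(LogNormal, amts)    # MLE fit, assuming a LogNormal shape

    # one new event, ignoring all per-user and temporal structure
    next_dt, next_amt = rand(dt_dist), rand(amt_dist)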

One way is to just separate the data into large and small users and fit the distributions for each class, as well as having a probability that a "user" is in each class. Since the real case has some unknown number of classes, I suppose this quickly becomes quite cumbersome to do manually. Is there some less arbitrary and "rabbit hole-y" way to deal with this, or is the answer just "clustering" unless there is some better domain-knowledge way to split the data? I have a suspicion that things like this are why data scientist is a real job these days...
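The two-class version is basically a mixture, which Distributions.jl can express directly. A sketch with made-up numbers, assuming the per-class fits already exist:

    using Distributions

    # hypothetical per-class amount distributions and class probability
    amt_small = LogNormal(2.0, 0.5)    # "small" users
    amt_large = LogNormal(4.0, 1.0)    # "large" users
    p_large   = 0.1                    # probability that a user is "large"

    mix  = MixtureModel([amt_small, amt_large], [1 - p_large, p_large])
    amts = rand(mix, 1000)             # sampling picks a class, then an amount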

Simone Carlo Surace (Sep 13 2023 at 20:54):

What is the model going to be used for? Predicting the next purchase time and amount for each customer, or for new customers? I think if you have enough features for individual customers, you could try to train a model to predict the targets from those features, and then potentially use it to generalize to new customers. If you have just times and amounts, you may still arrange them as a time series based on different groupings and try to predict from the history. Yeah, this type of thing can be a full-time job…

DrChainsaw (Sep 14 2023 at 10:30):

What is the model going to be used for?

It will be used in a simulation. Basically, I want a model which generates sequences of events which could just as well have been drawn from the data set itself. One way to accomplish this is of course to not have a model at all and instead just randomly select sequences from the dataset and feed them into the simulation. The size of the dataset and the general mechanics of making it available to other users of the simulation tools make this unattractive, though. It would also be nice to have some knobs to turn (e.g. what if this were x% worse/better?). I realize the user-purchase analogy kinda breaks down here, so it was perhaps a bad choice of example.

For the real use case there isn't really much in terms of features except the ones I have. For now I use a (manual) hierarchical grouping approach (e.g. user => [whale => [shopping spree, ..., inactive], regular => [shopping spree, ...], ...], where each node is a set of distributions used to determine the next level and the leaves generate events), and it seems to work OK w.r.t. how accurate the simulator's predictions are. I kept refining the groupings until it looked like the IID assumption was not too strong at each level.
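A minimal sketch of the kind of hierarchy I mean (all names, probabilities, and distributions here are invented, not the real model):

    using Distributions

    # leaves: (time-to-next-purchase, amount) distributions per behaviour state
    leaf = Dict(
        :spree    => (Exponential(0.2), LogNormal(3.0, 0.5)),
        :sporadic => (Exponential(5.0), LogNormal(2.0, 1.0)),
    )
    # inner nodes: probabilities that pick the next level down
    user_probs  = [0.05, 0.95]                                        # P(whale), P(regular)
    state_probs = Dict(:whale => [0.6, 0.4], :regular => [0.1, 0.9])  # P(spree), P(sporadic)

    function sample_event()
        user  = (:whale, :regular)[rand(Categorical(user_probs))]
        state = (:spree, :sporadic)[rand(Categorical(state_probs[user]))]
        dt_dist, amt_dist = leaf[state]
        return (rand(dt_dist), rand(amt_dist))
    end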

I suspect this approach is basically a poor man's ad-hoc MCMC. I did also try out Turing.jl to fit a hidden Markov model, but unfortunately there seems to be an issue with gradients for the Dirichlet distribution. I know there are other packages for this, but this was a free-time learning thing, so trying them out will have to wait until I finish or get bored of bg3 :).
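For reference, the kind of HMM I tried is close to the one in the Turing.jl tutorial. A hedged sketch, where everything (priors, Gaussian emissions, sampler settings) is a placeholder and y stands for a hypothetical observed series:

    using Turing

    # y = observed per-event values (hypothetical); K = number of hidden states
    @model function hmm(y, K)
        N = length(y)
        T = Vector{Vector}(undef, K)   # per-state transition probabilities
        m = Vector(undef, K)           # per-state emission means
        for k in 1:K
            T[k] ~ Dirichlet(ones(K))
            m[k] ~ Normal(0, 10)
        end
        s = tzeros(Int, N)             # latent state sequence
        s[1] ~ Categorical(K)
        y[1] ~ Normal(m[s[1]], 1)
        for t in 2:N
            s[t] ~ Categorical(vec(T[s[t - 1]]))
            y[t] ~ Normal(m[s[t]], 1)
        end
    end

    # Gibbs: HMC for the continuous parameters, particle Gibbs for the discrete states
    chain = sample(hmm(y, 3), Gibbs(HMC(0.01, 10, :m, :T), PG(20, :s)), 500)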

Robbie Rosati (Sep 14 2023 at 14:27):

Maybe I'm missing something, but couldn't you just histogram the data in 2D, fit a 2D distribution to it, and use that to inform your new samples? Then any correlations, multimodality, etc. are taken into account. In the physics MCMC world, the kinds of things we use for this are Gaussian mixture models, kernel density estimation, and normalizing flows (although, IIRC, normalizing flows don't actually give you a probability distribution, just new samples with the right statistics).
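For the KDE option, sampling from a Gaussian KDE amounts to a smoothed bootstrap: resample data points and jitter them by the kernel bandwidth. A sketch, assuming hypothetical vectors dts and amts and Silverman's rule-of-thumb bandwidths:

    using Statistics

    # Silverman's rule-of-thumb bandwidth for a 1-D Gaussian kernel
    silverman(x) = 1.06 * std(x) * length(x)^(-1/5)

    # draw n samples from the product-kernel Gaussian KDE of (dts, amts)
    function sample_kde(dts, amts, n)
        i = rand(1:length(dts), n)     # resample data points, then jitter
        return (dts[i]  .+ silverman(dts)  .* randn(n),
                amts[i] .+ silverman(amts) .* randn(n))
    end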

DrChainsaw (Sep 14 2023 at 15:17):

Maybe I could. However, if one axis is the amount purchased and the other axis is the time until the next purchase, it seems that the temporal correlation between different purchase events would be lost. Or did you have some other way in mind?

At the risk of taking the user-shopping example too far: a whale on a shopping spree makes multiple purchases within a short period of time, then goes back to not doing much for some period, and then maybe has another shopping spree, or just makes a sporadic purchase, or enters some other class (I kinda struggle to come up with good names for classes here, but hopefully you get the point). The statistics for these types of events also look different for other types of users.

The reason I brought up MCMC is that the above seems like it could be modelled by a Markov chain, since there are states and transitions between states. I'm not sure the process I'm trying to model really has the Markov property, but from looking at the data it certainly seems like a weaker assumption than IID.
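To make that concrete, here is a small generative sketch of the state-and-transition idea (a Markov chain over behaviour states with per-state event distributions; all the numbers are invented):

    using Distributions

    # states: 1 = spree, 2 = quiet; P[i, :] = transition probabilities from state i
    P = [0.7 0.3;
         0.2 0.8]
    dt_dists  = (Exponential(0.2), Exponential(10.0))   # time to next purchase
    amt_dists = (LogNormal(3.0, 0.5), LogNormal(2.0, 1.0))

    function generate(n; s = 1)
        events = Vector{Tuple{Float64,Float64}}(undef, n)
        for i in 1:n
            events[i] = (rand(dt_dists[s]), rand(amt_dists[s]))
            s = rand(Categorical(P[s, :]))   # transition to the next state
        end
        return events
    end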

Robbie Rosati (Sep 14 2023 at 19:32):

Ah I see, so you'd like to keep track of / categorize individual types of users as well?

So just to fully flesh out my suggestion, I was thinking that your data might look like this:
display.png
where I made some fake data with three sub-populations.
So if I understand correctly(?), the bottom-left plot fully describes the data you want to fit. The purchase interval and the purchase amount are not independent at all, and the PDF is multi-modal (i.e. there are sub-populations).

(code here if you're curious)

Then generating more data that matches this plot, and even trying to infer the sub-populations, can be done with the techniques I mentioned earlier. E.g. GaussianMixtures.jl should essentially be able to recover the μ and covariance matrix of each component, or fit non-Gaussian distributions with sums of Gaussians. GMMs also weight the components, so you don't need to know exactly how many there are to start with (superfluous components can end up with negligible weight). I think you can also try to categorize which Gaussian component a particular data point comes from.
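A hedged sketch of that workflow (diagonal covariances for simplicity; the N×2 data matrix X and the choice of 3 components are assumptions):

    using GaussianMixtures, Distributions, LinearAlgebra

    # X: hypothetical N×2 matrix, columns = (purchase interval, purchase amount)
    gmm = GMM(3, X; kind=:diag)        # k-means init + EM, 3 diagonal components

    # rebuild a Distributions.jl mixture from the fitted parameters and sample it
    μ, Σ, w = means(gmm), covars(gmm), weights(gmm)
    mix = MixtureModel([MvNormal(μ[k, :], Diagonal(Σ[k, :])) for k in 1:3], w)
    newdata = rand(mix, 1000)          # 2×1000 matrix of new (interval, amount) pairs

    # posterior component memberships, i.e. which Gaussian each point came from
    post, _ = gmmposterior(gmm, X)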

DrChainsaw (Sep 15 2023 at 08:01):

Thanks for fleshing out the suggestion. I understand the purpose of the GMM, but for my modelling purpose I'm still worried that sampling from the mixture model will miss important temporal correlations in the real data.

Here are two handcrafted examples which I believe show what I mean. Real sequences are often much longer, but I think this gets the point across. The goal: given a couple of thousand such sequences, create a model such that, when it is asked to generate 1000 new sequences, it should not be possible to tell whether those 1000 generated sequences came from the training set or not.
image.png (scatter plots of the two handcrafted sequences, Seq 1 and Seq 2)

My understanding is that if I fit a GMM to the above data and just sample from it, it is unlikely to create sequences which look like the above. The data simply does not look IID.

Code for generating the example data (I basically have no reason to believe that the real data is generated like this):

    using Random, Distributions, Plots

    rng = Xoshiro(1)
    # Seq 1: a burst of frequent purchases sandwiched between sparser periods
    t1 = sort(cumsum(vcat(rand(rng, Normal(5, 1), 40), rand(rng, Normal(0.2, 0.1), 100), rand(rng, Normal(10, 20), 30), rand(rng, Normal(20, 30), 10))))
    v1 = abs.(vcat(rand(rng, Normal(10, 10), 40), rand(rng, Normal(30, 30), 100), rand(rng, Normal(25, 20), 40)))
    p1 = plot(t1, v1; seriestype=:scatter, label=nothing, title="Seq 1")
    # Seq 2: a short sequence with very different timing statistics
    t2 = sort(cumsum(abs.(vcat(rand(rng, Normal(100, 100), 5), rand(rng, Normal(1, .1), 3), rand(rng, Normal(50, 20), 2)))))
    v2 = abs.(vcat(rand(rng, Normal(15, 5), 5), rand(rng, Normal(10, 5), 3), rand(rng, Normal(5, 2), 2)))
    p2 = plot(t2, v2; seriestype=:scatter, label=nothing, title="Seq 2")
    plot(p1, p2; layout=(2, 1))

Last updated: Oct 02 2023 at 04:34 UTC