Hi,
I need a data structure for representing log-weighted observations. I was originally thinking of using a PriorityQueue from DataStructures.jl. But this stores an array of pairs and mutates this as it goes, maintaining heap ordering at each insert. In high dimensions, this is a lot of copying, so it will get very slow.
Also, this data structure stores each key twice: once in the array of pairs, and again in an Index::Dict{K,Int}
field. For large collections, this seems like a waste of memory.
Finally, lots of functions will depend only on the values (the log-weights), so iterating through them needs to be fast.
I was thinking of adapting some ideas from @Andy Ferris 's Dictionaries.jl:
Indices
withvalues
field changed to <:AbstractArray
(with a type parameter). This way indices could be stored in a StructArray
or TupleVector
, which give better efficiency for many operations.lookup
and one for find
. This would allow reordering without changing the original array.<: AbstractDictionary
that's very similar to Dictionary
, but maintains heap ordering similarly to DataStructures.PriorityQueue
logtotal
field, maybe using the new LogSumExp
in OnlineStatsI know the ordering on indices is internal, but this would use it in a similar way to what Dictionary
does.
What do you think? Or maybe there's an easier way to do this?
I think I found a better way to do this:
https://github.com/cscherrer/PermutedArrays.jl
I think the idea of making the internals of Indices
more abstract/replaceable is OK. I backed out of it because there's no real trait for resize!
on arrays, and I wanted to get at least something out that worked and was easy to use. Certainly something you could implement via some copy-paste of the Dictionaries.jl code. If someone figures out how to make it robust we can probably upstream it.
Sort-ordered dictionaries are an area of interest for me (though I'm probably more thinking about fully-sorted dictionaries rather than heaps at the moment).
Maintaining summary stats for data structures is another really cool idea. For invertible, associative, commutative operations I think that this can be done easily enough with a wrapper that intercepts the mutations (assuming you don't want to track the deltas manually); for more general associative reductions it depends somewhat on the data structure (e.g. you can do this with a tree).
If PermutedArrays can do everything you want already, using that sounds like a great idea.
Thanks for the feedback :smile:
If Indices
can be made to take another type parameter, say replacing values::Vector{I}
with values::V
, this could be something like
struct MyDictionary{I,T,V,F,S} <: AbstractDictionary{I, T}
inds::Indices{I,V}
# keys.data === inds.values
keys::PermutedVector{V}
# This could be abstracted too, but no need to do everything at once
values::Vector{T}
stats::S # e.g., OnlineStats.LogSumExp
update_stats!::F # e.g., OnlineStats.fit!
end
I had convinced myself this is important for my JuliaCon talk, but I think that's just because it sounds fun. So I should work on slides. I'll think about this some more in the background, and maybe we can talk some more in a week or two?
With JuliaCon submitted (:sweat_smile: ), I have a little catch-up to do. Adding this to my to-do list, hope to get back to it soon :smile:
Last updated: Nov 06 2024 at 04:40 UTC