What's the current state of labelled arrays? I recall there were many approaches in early Julia, has something turned out to be the favourite?
To give some background, I'm considering transitioning my SynthControl.jl package over to use some sort of named arrays. Essentially the package solves a bunch of problems of the following form:
We have a set of outcomes Y::Matrix{Float64} which is size N by T, where N is the number of observed units and T the time periods during which they are observed. Now one (or more, but let's go with one for simplicity) of these units is "treated" at some point 1 < T0 < T and we want to find out how the outcome was affected by this treatment. The idea is to find a weighted combination of the other N-1 units that closely approximates the evolution of the outcome of the treated unit before T0. So essentially we do
minimize(w -> sum(abs2, Y[i, 1:T0] - Y[Not(i), 1:T0]' * w))
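To make the shapes in that objective concrete, here is a minimal sketch using only Base and the LinearAlgebra standard library. The toy numbers, the names `donors`, `X`, `y` and `loss`, and the use of an unconstrained least-squares solve are all assumptions for illustration; synthetic control estimators normally also constrain the weights (non-negative, summing to one), which this sketch leaves out, and it is not how SynthControl.jl itself solves the problem.

```julia
using LinearAlgebra

# Toy panel: N = 4 units (rows) observed over T = 6 periods (columns).
Y = [1.0 2.0 3.0 4.0 5.0 6.0;
     2.0 1.0 2.0 1.0 2.0 1.0;
     0.0 1.0 0.0 1.0 0.0 1.0;
     3.0 3.0 4.0 4.0 5.0 5.0]
i, T0 = 1, 4                              # treated unit, last period used for fitting

donors = setdiff(axes(Y, 1), i)           # Base-only stand-in for Not(i)
X = permutedims(Y[donors, 1:T0])          # T0 × (N-1): one column per donor unit
y = Y[i, 1:T0]                            # outcomes of the treated unit, length T0

loss(w) = sum(abs2, y - X * w)            # same objective, with shapes that match
w = X \ y                                 # unconstrained least-squares minimiser

synthetic = permutedims(Y[donors, :]) * w # weighted donor combination, all T periods
```

Note that the donor block has to be transposed (or built as T0 × (N-1)) before multiplying by `w`, since `w` has one entry per donor unit.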
Now every row in Y represents an observed unit, e.g. a US state, and every column a time, e.g. 1981. Likewise the w vector is a weight for every non-treated unit, so it would be nice to have, instead of
julia> s.treatment_panel.Y
39×31 Matrix{Float64}:
89.8 95.4 101.1 102.9 108.2 (...)
julia> s.w
38-element Vector{Float64}:
0.0
0.0
0.014810770450243859
0.10908962424043207
0.0
0.0
(...)
To have something like
julia> s.treatment_panel.Y
39×31 SomeMatrix{Float64}:
/ 1980 1981 1982 1983 1984
Alabama | 89.8 95.4 101.1 102.9 108.2 (...)
Alaska |
julia> s.w
38-element Vector{Float64}:
Alabama | 0.0
Alaska | 0.0
Arkansas | 0.014810770450243859
(...)
Ideally with as little overhead as possible for things like Y*w. Any suggestions?
NamedArrays seems pretty good:
julia> y_test = NamedArray(s.treatment_panel.Y, (s.treatment_panel.is, s.treatment_panel.ts), ("State", "Year"))
39×31 Named Matrix{Float64}
State ╲ Year │ 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 … 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
─────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 89.8 95.4 101.1 102.9 108.2 111.7 116.2 117.1 123.0 121.4 123.2 119.6 119.1 … 112.1 105.6 108.6 107.9 109.1 108.5 107.1 102.6 101.4 104.9 106.2 100.7 96.2
2 │ 100.3 104.1 103.9 108.0 109.7 114.8 119.1 122.6 127.3 126.5 131.8 128.7 127.4 121.5 118.3 113.1 116.8 126.0 113.8 108.8 113.0 110.7 108.7 109.5 104.8 99.4
3 │ 123.0 121.0 123.5 124.4 126.7 127.1 128.0 126.4 126.1 121.9 120.2 118.6 115.4 90.1 82.4 77.8 68.7 67.5 63.4 58.6 56.4 54.5 53.8 52.3 47.2 41.6
(...)
julia> w_test = NamedArray(s.w, filter(!=(3), s.treatment_panel.is), "State")
38-element Named Vector{Float64}
State │
───────┼──────────
1 │ 0.0
2 │ 0.0
4 │ 0.0148108
5 │ 0.10909
(...)
julia> y_test[Not(3), :]'*s.w
31-element Named Vector{Float64}
Year │
──────┼────────
1970 │ 117.424
1971 │ 119.823
1972 │ 124.646
1973 │ 124.367
Any known downsides to this?
AxisKeys.jl is my favorite
Yeah, AxisKeys has the most lightweight data structure of all the "keyed array" packages I've seen.
AxisKeys also distinguishes between selecting from the axes vs from the axiskeys, which I think is important for a clean interface.
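Why that distinction matters shows up as soon as the keys are themselves integers, as with the state ids and years in this thread. The sketch below is a hypothetical Base-only toy type (`ToyKeyedVector` is an invented name, not the AxisKeys implementation): square brackets select by position along `axes`, while calling the array selects by key.

```julia
# Hypothetical toy type for illustration only; AxisKeys.jl is implemented differently.
struct ToyKeyedVector{T,K} <: AbstractVector{T}
    data::Vector{T}
    keys::Vector{K}
end

Base.size(v::ToyKeyedVector) = size(v.data)

# Square brackets: ordinary positional indexing along axes(v, 1).
Base.getindex(v::ToyKeyedVector, i::Int) = v.data[i]

# Round brackets: look the element up by its key instead.
(v::ToyKeyedVector)(key) = v.data[findfirst(isequal(key), v.keys)]

# Weights for the non-treated units, keyed by state id (unit 3 is the treated one).
w = ToyKeyedVector([0.0, 0.0, 0.0148108, 0.10909], [1, 2, 4, 5])

w[3]   # third stored weight, which belongs to state id 4
w(4)   # weight of state id 4, the same element
w(5)   # weight of state id 5, stored at position 4
```

With a single indexing syntax, `w[4]` could not say whether it means position 4 or state id 4; keeping the two kinds of selection separate removes that ambiguity.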
Thanks, I'll try it out.
Sadly we never got the ecosystem to consolidate on one.
AxisKeys.jl was my attempt, trying to be fairly lightweight and make few assumptions. It could still be much simpler (e.g. I think double-wrapping with NamedDims.jl ends up more complex than putting all the info in one struct). The ideal for me is something as natural & inevitable as Base's NamedTuple. But I don't use it much in the end & should probably hand over maintenance somehow.
DimensionalData.jl is probably the most actively developed package, and the largest; I think it builds in many things instead of farming them out. It's aimed at spatial data, with special meanings for X, Y. In my ideal world all of this could be built on top of some minimal NamedTuple-esque package... but this is unlikely to happen.
AxisArrays.jl is older, and seemed abandoned for a bit (when the above two were written), has many undocumented features. But is still in use, e.g. I think by the Images.jl ecosystem. It worked hard to be low-overhead, and some of this hard work was aimed at Julia <1 & could be simplified.
NamedArrays.jl is also older. It's generally much more mutable, and I think doesn't try so hard to be type-stable etc. Got the best name though!
I know https://github.com/JuliaDataCubes/YAXArrays.jl exists as well. I don't know much about it, but it does seem actively developed.
Oh right I forgot that one. It's built on top of DimensionalData.jl I think.
There's also at least one such thing built into JuMP.jl e.g. here https://jump.dev/JuMP.jl/stable/manual/containers/
I use LabelledArrays.jl in places where I would normally use a named tuple but the interface requires an array.
Last updated: Nov 06 2024 at 04:40 UTC