Stream: helpdesk (published)

Topic: Labelled/named arrays


Nils (Feb 08 2024 at 14:41):

What's the current state of labelled arrays? I recall there were many approaches in early Julia, has something turned out to be the favourite?

To give some background, I'm considering transitioning my SynthControl.jl package over to use some sort of named arrays. Essentially the package solves a bunch of problems of the following form:

We have a set of outcomes Y::Matrix{Float64} which is size N by T, where N is the number of observed units and T the time periods during which they are observed. Now one (or more, but let's go with one for simplicity) of these units is "treated" at some point 1 < T0 < T and we want to find out how the outcome was affected by this treatment. The idea is to find a weighted combination of the other N-1 units that closely approximates the evolution of the outcome of the treated unit before T0. So essentially we do

minimize(w -> sum(abs2, Y[i, 1:T0] - Y[Not(i), 1:T0]' * w))
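A minimal sketch of that fitting step, using only the LinearAlgebra standard library. It assumes plain unconstrained least squares (any constraints the package places on w are ignored), and the toy panel, `donors`, `X`, `y` and `loss` are made-up names for illustration, not SynthControl.jl's API. Note that the donor block has to be transposed so that it is T0 × (N-1) before multiplying by w.

```julia
using LinearAlgebra

# Hypothetical toy panel: 4 units (rows) observed over 6 periods (columns).
# Row 4 is constructed as 0.5*row 1 + 0.5*row 3, so the exact weights are known.
Y = [1.0 2.0 3.0 4.0 5.0 6.0;
     2.0 1.0 2.0 1.0 2.0 1.0;
     0.0 1.0 1.0 2.0 2.0 3.0;
     0.5 1.5 2.0 3.0 3.5 4.5]

i, T0 = 4, 4                          # treated unit and last pre-treatment period (assumed)
donors = setdiff(axes(Y, 1), i)       # indices of the N-1 untreated units

X = permutedims(Y[donors, 1:T0])      # T0 × (N-1): one column per donor unit
y = Y[i, 1:T0]                        # pre-treatment outcomes of the treated unit

loss(w) = sum(abs2, y - X * w)        # the objective from the post
w = X \ y                             # unconstrained least-squares minimiser of loss

# Synthetic outcome path over all periods, built from the donor units.
synthetic = permutedims(Y[donors, :]) * w
```

With this toy data the fit is exact, so `w` comes out as approximately `[0.5, 0.0, 0.5]`. Each entry of `w` lines up with an entry of `donors`, which is exactly the bookkeeping that labelled arrays would take over.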

Now every row in Y represents an observed unit, e.g. a US state, and every column a time, e.g. 1981. Likewise the w vector is a weight for every non-treated unit, so it would be nice to have, instead of

julia> s.treatment_panel.Y
39×31 Matrix{Float64}:
  89.8   95.4  101.1  102.9  108.2  (...)

julia> s.w
38-element Vector{Float64}:
 0.0
 0.0
 0.014810770450243859
 0.10908962424043207
 0.0
 0.0
(...)

to have something like

julia> s.treatment_panel.Y
39×31 SomeMatrix{Float64}:
                  /  1980  1981  1982  1983  1984
Alabama |  89.8   95.4  101.1  102.9  108.2  (...)
Alaska      |

julia> s.w
38-element Vector{Float64}:
Alabama |  0.0
Alaska |  0.0
Arkansas | 0.014810770450243859
(...)

Ideally with as little overhead as possible for things like Y*w. Any suggestions?

Nils (Feb 08 2024 at 14:50):

NamedArrays seems pretty good:

julia> y_test = NamedArray(s.treatment_panel.Y, (s.treatment_panel.is, s.treatment_panel.ts), ("State", "Year"))
39×31 Named Matrix{Float64}
State  Year   1970   1971   1972   1973   1974   1975   1976   1977   1978   1979   1980   1981   1982     1988   1989   1990   1991   1992   1993   1994   1995   1996   1997   1998   1999   2000
─────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1              89.8   95.4  101.1  102.9  108.2  111.7  116.2  117.1  123.0  121.4  123.2  119.6  119.1    112.1  105.6  108.6  107.9  109.1  108.5  107.1  102.6  101.4  104.9  106.2  100.7   96.2
2             100.3  104.1  103.9  108.0  109.7  114.8  119.1  122.6  127.3  126.5  131.8  128.7  127.4     121.5  118.3  113.1  116.8  126.0  113.8  108.8  113.0  110.7  108.7  109.5  104.8   99.4
3             123.0  121.0  123.5  124.4  126.7  127.1  128.0  126.4  126.1  121.9  120.2  118.6  115.4      90.1   82.4   77.8   68.7   67.5   63.4   58.6   56.4   54.5   53.8   52.3   47.2   41.6
(...)

julia> w_test = NamedArray(s.w, filter(!=(3), s.treatment_panel.is), "State")
38-element Named Vector{Float64}
State  
───────┼──────────
1             0.0
2             0.0
4       0.0148108
5         0.10909
(...)

julia> y_test[Not(3), :]'*s.w
31-element Named Vector{Float64}
Year  
──────┼────────
1970   117.424
1971   119.823
1972   124.646
1973   124.367

Any known downsides to this?

jar (Feb 08 2024 at 19:49):

AxisKeys.jl is my favorite

aplavin (Feb 09 2024 at 02:02):

Yeah, AxisKeys has the most lightweight data structure of all the "keyed arrays" packages I've seen.

jar (Feb 09 2024 at 04:24):

AxisKeys also distinguishes between selecting from the axes vs from the axiskeys, which I think is important for a clean interface.
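A small sketch of that distinction, assuming AxisKeys.jl is installed. The data and the state and year labels are made up for illustration. Square brackets index by position along the axes, while round brackets look up by key, which matters as soon as the keys are themselves integers such as years.

```julia
using AxisKeys  # third-party package, assumed to be installed

# Hypothetical toy data: 3 states × 2 years, with integer years as keys.
Y = KeyedArray([89.8 95.4; 100.3 104.1; 123.0 121.0];
               state = ["Alabama", "Alaska", "Arkansas"],
               year  = [1980, 1981])

by_position = Y[2, 1]             # square brackets: row 2, column 1
by_key      = Y("Alaska", 1980)   # round brackets: look up by key
col_1981    = Y(year = 1981)      # keyword lookup by dimension name

state_keys = axiskeys(Y, 1)       # the labels along dimension 1
state_axis = axes(Y, 1)           # the ordinary integer axis along dimension 1
```

Here `Y[2, 1]` and `Y("Alaska", 1980)` refer to the same element, but there is no ambiguity about whether `1980` means a position or a label.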

Nils (Feb 09 2024 at 09:23):

Thanks, I'll try it out.

Michael Abbott (Feb 09 2024 at 19:39):

Sadly we never got the ecosystem to consolidate on one.

AxisKeys.jl was my attempt, trying to be fairly lightweight and make few assumptions. It could still be much simpler (e.g. I think double-wrapping with NamedDims.jl ends up more complex than putting all the info in one struct). The ideal for me is something as natural & inevitable as Base's NamedTuple. But I don't use it much in the end & should probably hand over maintenance somehow.

DimensionalData.jl is probably the most actively developed package, and the largest; I think it builds in many things instead of farming them out. It's aimed at spatial data, with special meanings for X, Y. In my ideal world all of this could be built on top of some minimal NamedTuple-esque package... but this is unlikely to happen.

AxisArrays.jl is older, seemed abandoned for a bit (when the above two were written), and has many undocumented features. But it is still in use, e.g. I think by the Images.jl ecosystem. It worked hard to be low-overhead, and some of this hard work was aimed at Julia <1 & could be simplified.

NamedArrays.jl is also older. It's generally much more mutable, and I think doesn't try so hard to be type-stable etc. Got the best name though!

Andy Dienes (Feb 09 2024 at 22:45):

I know https://github.com/JuliaDataCubes/YAXArrays.jl exists as well. I don't know much about it, but it does seem actively developed.

jar (Feb 09 2024 at 23:00):

YAXArrays.jl

Michael Abbott (Feb 09 2024 at 23:54):

Oh right I forgot that one. It's built on top of DimensionalData.jl I think.

There's also at least one such thing built into JuMP.jl e.g. here https://jump.dev/JuMP.jl/stable/manual/containers/

Alec (Feb 10 2024 at 05:49):

I use LabelledArrays.jl in places where I would normally use a named tuple but the interface requires an array.


Last updated: Nov 06 2024 at 04:40 UTC