Friends, I'm trying to adapt some code I have in order to work using Tables.jl. I'm having some trouble figuring out how one can add columns to a table in Tables.jl. I couldn't find it in the documentation...
I don't think there is a general mechanism to do that.
The Tables.jl interface is great for IO and interop/conversions, but actual data manipulations are often most convenient when focusing on specific types.
Would be nice if more tables supported @insert tbl.col = ..., this mechanism is perfectly extensible. For now, columntable (namedtuple-of-vectors), StructArrays, and DictArrays do that.
And to be perfectly general for tables without property access to columns, @insert Tables.columns(tbl).col = ... could be added as well.
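For the columntable case mentioned above, a minimal sketch using only Accessors.jl might look like this. It assumes the table is a plain NamedTuple of vectors; the variable names are illustrative only. Note that @insert returns a new object and leaves the original untouched.

```julia
using Accessors

# A columntable: a NamedTuple whose fields are equal-length column vectors.
tbl = (a = [1, 3], b = [2, 4])

# @insert builds a new NamedTuple with the extra field; `tbl` itself is not mutated.
tbl2 = @insert tbl.c = ["5", "6"]

# The existing columns are reused, not copied.
@assert tbl2.a === tbl.a
```

Because the result shares the existing column vectors, adding a column this way should cost little beyond constructing the new NamedTuple.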
if more tables supported
This just goes in Accessors extensions, we don't need to bother Bogumil about it, right?
I don't think the Tables package itself should (or could) be involved in the @insert tbl.col = ... interface implementation. So, we don't need to bother Bogumil or other maintainers of Tables.jl in general :)
The implementation for adding columns is fundamentally different for different table types. It should live either in Accessors extensions, or in the package defining the table type. The latter is more natural: the glue code is more likely to change due to changes in the table type than due to changes in Accessors.
StructArrays support lives in Accessors: I added it, and put it there just because review responses are much faster for Accessors, so it was faster and easier than putting it into StructArrays. DictArrays supports Accessors natively, without any extensions.
I found this direction interesting to experiment with, and added more Tables support to AccessorsExtra. It can modify Tables.columns(tbl) for supported types efficiently, and can modify rowtable(tbl) and columntable(tbl) for any table type (that supports Tables.materializer). The latter is as efficient as the manual sequence of columntable + add namedtuple entry + materialize back, which is basically free for typical columnar tables.
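The manual sequence referred to above (columntable, add a namedtuple entry, materialize back) can be sketched with Tables.jl alone. The helper name with_column is hypothetical and not part of Tables.jl or AccessorsExtra; this is only an illustration of the three steps, not how AccessorsExtra implements them.

```julia
using Tables

# Hypothetical helper (not a Tables.jl function): returns a new table with one extra column.
function with_column(tbl, name::Symbol, col::AbstractVector)
    cols = Tables.columntable(tbl)                       # 1. NamedTuple of column vectors
    newcols = merge(cols, NamedTuple{(name,)}((col,)))   # 2. add (or overwrite) one entry
    return Tables.materializer(tbl)(newcols)             # 3. rebuild a table like the input
end

tbl = (a = [1, 3], b = [2, 4])            # a columntable used as the input here
tbl2 = with_column(tbl, :c, ["5", "6"])
```

For a table type that defines Tables.materializer, step 3 should give back that same kind of table; for types that do not, Tables.jl falls back to a columntable.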
Examples:
julia> using Tables, AccessorsExtra, StructArrays, TypedTables
# create a few tables of different types;
julia> tbl_r = [(a=1,b=2),(a=3,b=4)]
2-element Vector{@NamedTuple{a::Int64, b::Int64}}:
(a = 1, b = 2)
(a = 3, b = 4)
julia> tbl_sa = StructArray(tbl_r)
2-element StructArray(::Vector{Int64}, ::Vector{Int64}) with eltype @NamedTuple{a::Int64, b::Int64}:
(a = 1, b = 2)
(a = 3, b = 4)
julia> tbl_tt = Table(tbl_r)
Table with 2 columns and 2 rows:
a b
┌─────
1 │ 1 2
2 │ 3 4
# explicitly supported types: add a new column by modifying columns()
julia> @insert Tables.columns(tbl_r).c = ["5", "6"]
2-element Vector{@NamedTuple{a::Int64, b::Int64, c::String}}:
(a = 1, b = 2, c = "5")
(a = 3, b = 4, c = "6")
julia> @insert Tables.columns(tbl_sa).c = ["5", "6"]
2-element StructArray(::Vector{Int64}, ::Vector{Int64}, ::Vector{String}) with eltype @NamedTuple{a::Int64, b::Int64, c::String}:
(a = 1, b = 2, c = "5")
(a = 3, b = 4, c = "6")
# any type, no special support needed: add column by modifying columntable()
julia> @insert Tables.columntable(tbl_tt).c = ["5", "6"]
Table with 3 columns and 2 rows:
a b c
┌────────
1 │ 1 2 5
2 │ 3 4 6
# or add a row by modifying rowtable()
julia> @insert last(Tables.rowtable(tbl_tt)) = (a=5, b=6)
Table with 2 columns and 3 rows:
a b
┌─────
1 │ 1 2
2 │ 3 4
3 │ 5 6
# deletion also works, of course:
julia> @delete last(Tables.rowtable(tbl_tt))
Table with 2 columns and 1 row:
a b
┌─────
1 │ 1 2
If this happens to be actually useful, it can be upstreamed to Accessors directly. And specific table types can add more efficient overloads for modify(rowtable).
I personally just use StructArrays almost all the time, but it was fun to make these generalizations :)
This looks good. It supports DataFrames too, right?
Seems like that, yeah:
julia> using DataFrames
julia> tbl = [(a=1,b=2),(a=3,b=4)];
julia> df = DataFrame(tbl)
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 3 4
julia> @delete last(rowtable(df))
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
julia> @insert columntable(df).c = ["5", "6"]
2×3 DataFrame
Row │ a b c
│ Int64 Int64 String
─────┼──────────────────────
1 │ 1 2 5
2 │ 3 4 6
As I said, any table that supports Tables.materializer.
I wrote a lightweight package here: https://github.com/pdeffebach/ColumnFrames.jl which defines a table type and implements the Tables interface along with the NamedTuple interface (the two overlap; I overloaded some Tables functions for convenience). It adds a few functions related to mutability, though. It's essentially a mutable named tuple where all the columns are vectors of the same length. I never finished writing tests or publishing it, though.
Peter Deffebach said:
which defines a table type and implements the Tables interface along with the NamedTuple interface (the two overlap; I overloaded some Tables functions for convenience). It adds a few functions related to mutability, though. It's essentially a mutable named tuple where all the columns are vectors of the same length. I never finished writing tests or publishing it, though.
For what it's worth, this sounds like basically what TypedTables.jl already does, except that it has an interface pretending the Table is a vector of named tuples rather than the other way around.
Oh yeah, the other important thing is that the column types aren't encoded in the type. That's like the key feature lol. So this way you can have 1000s of columns and not worry about compilation issues.
Ah. That's equivalent to the FlexTable type from TypedTables, I believe.
julia> ft = FlexTable(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])
FlexTable with 2 columns and 3 rows:
name age
┌─────────────
1 │ Alice 25
2 │ Bob 42
3 │ Charlie 37
julia> ft.sex = [:F, :M, :M];
julia> ft
FlexTable with 3 columns and 3 rows:
name age sex
┌──────────────────
1 │ Alice 25 F
2 │ Bob 42 M
3 │ Charlie 37 M
Hmmm, it looks like FlexTable still has a NamedTuple on the inside, so you could still potentially create a named tuple with 10,000 elements. Whereas my version was backed by a plain vector of vectors.
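To illustrate the "plain vector of vectors" idea, here is a rough sketch of a table type whose column names and element types are not part of its Julia type. WideTable and addcolumn! are hypothetical names made up for this example; this is not the actual ColumnFrames.jl code, just one way such a type could implement the column-access side of the Tables.jl interface.

```julia
using Tables

# Hypothetical type: names and column types live in ordinary fields, not in type parameters,
# so adding columns never creates a new concrete type to compile for.
struct WideTable
    names::Vector{Symbol}
    columns::Vector{AbstractVector}
    lookup::Dict{Symbol,Int}
end

function WideTable(names::Vector{Symbol}, columns::Vector{<:AbstractVector})
    length(names) == length(columns) || throw(ArgumentError("need one name per column"))
    n = isempty(columns) ? 0 : length(first(columns))
    all(c -> length(c) == n, columns) || throw(DimensionMismatch("columns differ in length"))
    lookup = Dict(nm => i for (i, nm) in enumerate(names))
    return WideTable(copy(names), Vector{AbstractVector}(columns), lookup)
end

# Column-access part of the Tables.jl interface.
Tables.istable(::Type{WideTable}) = true
Tables.columnaccess(::Type{WideTable}) = true
Tables.columns(t::WideTable) = t
Tables.columnnames(t::WideTable) = t.names
Tables.getcolumn(t::WideTable, i::Int) = t.columns[i]
Tables.getcolumn(t::WideTable, nm::Symbol) = t.columns[t.lookup[nm]]

# Hypothetical mutating helper: appends a column in place.
function addcolumn!(t::WideTable, name::Symbol, col::AbstractVector)
    haskey(t.lookup, name) && throw(ArgumentError("column $name already exists"))
    isempty(t.columns) || length(col) == length(first(t.columns)) ||
        throw(DimensionMismatch("new column has the wrong length"))
    push!(t.names, name)
    push!(t.columns, col)
    t.lookup[name] = length(t.columns)
    return t
end

wt = WideTable([:a, :b], [[1, 3], [2, 4]])
addcolumn!(wt, :c, ["5", "6"])
```

The trade-off in this sketch is that column access is not type-stable, so per-row work on such a table would typically need a function barrier around each column.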
I made DictArrays.jl some time ago, think "like structarrays, but with dictionaries instead of namedtuples". Supports Tables and collection interface, fast both for very wide tables and for functions like map and filter.
I ended up barely using them: I never encounter tables with more than 100-200 columns, and for those it's still fine to wait for StructArrays compilation.
Still, it's a direct demonstration that Tables + collection operations + Accessors for column modification are possible at once, and without sacrificing performance.
I'm currently working with a 10,000-column dataset in Stata. The Julia toolchain just isn't there to do all the stuff I want. But yeah, it's hard to see how something like these packages fits in. If you have 10,000 columns, you probably don't want something quick with a lightweight API. Just use DataFrames.
Last updated: Nov 06 2024 at 04:40 UTC