Stream: helpdesk (published)

Topic: Inserting columns in Tables.jl


Davi Sales Barreira (Feb 05 2024 at 16:53):

Friends, I'm trying to adapt some code I have in order to work using Tables.jl. I'm having some trouble figuring out how one can add columns to a table in Tables.jl. I couldn't find it in the documentation...

aplavin (Feb 05 2024 at 17:51):

Don't think there is a general mechanism to do that.
The Tables.jl interface is great for IO and interop/conversions, but actual data manipulations are often most convenient when focusing on specific types.

aplavin (Feb 05 2024 at 17:54):

Would be nice if more tables supported @insert tbl.col = ...; this mechanism is perfectly extensible. For now, columntable (namedtuple-of-vectors), StructArrays, and DictArrays do that.
And to be perfectly general for tables without property access to columns, @insert Tables.columns(tbl).col = ... could be added as well.
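
For the columntable case mentioned above, a minimal sketch of what @insert does on a plain namedtuple-of-vectors. It assumes only that Accessors is installed; the names cols and cols_c are made up for illustration:

```julia
using Accessors

# A column table in the Tables.jl sense: a NamedTuple of equal-length vectors.
cols = (a = [1, 3], b = [2, 4])

# @insert does not mutate `cols`; it returns a new NamedTuple with the extra field.
cols_c = @insert cols.c = ["5", "6"]

@assert keys(cols_c) == (:a, :b, :c)
@assert !haskey(cols, :c)      # the original table is unchanged
@assert cols_c.a === cols.a    # existing column vectors are reused, not copied
```

Since the result is a new NamedTuple that shares the old column vectors, adding a column this way should cost very little regardless of the number of rows.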

jar (Feb 05 2024 at 20:16):

if more tables supported

This just goes in Accessors extensions, we don't need to bother Bogumil about it, right?

aplavin (Feb 05 2024 at 22:20):

I don't think the Tables package itself should (or could) be involved in implementing the @insert tbl.col = ... interface. So, we don't need to bother Bogumil or other maintainers of Tables.jl in general :)

The implementation for adding columns is fundamentally different for different table types. It should live either in Accessors extensions, or in the package defining the table type. The latter is more natural: the glue code is more likely to change due to changes in the table type than due to changes in Accessors.
StructArrays support lives in Accessors: I added it, and put it there just because review responses are much faster for Accessors, so it was faster and easier than putting it into StructArrays. DictArrays support Accessors natively, without any extensions.

aplavin (Feb 20 2024 at 19:46):

I found this direction interesting to experiment with, and added more Tables support to AccessorsExtra. It can modify Tables.columns(tbl) for supported types efficiently, and can modify rowtable(tbl) and columntable(tbl) for any table type (that supports Tables.materializer). The latter is as efficient as the manual sequence of columntable + add namedtuple entry + materialize back, which is basically free for typical columnar tables.

Examples:

julia> using Tables, AccessorsExtra, StructArrays, TypedTables

# create a few tables of different types;
julia> tbl_r = [(a=1,b=2),(a=3,b=4)]
2-element Vector{@NamedTuple{a::Int64, b::Int64}}:
 (a = 1, b = 2)
 (a = 3, b = 4)

julia> tbl_sa = StructArray(tbl_r)
2-element StructArray(::Vector{Int64}, ::Vector{Int64}) with eltype @NamedTuple{a::Int64, b::Int64}:
 (a = 1, b = 2)
 (a = 3, b = 4)

julia> tbl_tt = Table(tbl_r)
Table with 2 columns and 2 rows:
     a  b
   ┌─────
 1 │ 1  2
 2 │ 3  4

# explicitly supported types: add a new column by modifying columns()
julia> @insert Tables.columns(tbl_r).c = ["5", "6"]
2-element Vector{@NamedTuple{a::Int64, b::Int64, c::String}}:
 (a = 1, b = 2, c = "5")
 (a = 3, b = 4, c = "6")

julia> @insert Tables.columns(tbl_sa).c = ["5", "6"]
2-element StructArray(::Vector{Int64}, ::Vector{Int64}, ::Vector{String}) with eltype @NamedTuple{a::Int64, b::Int64, c::String}:
 (a = 1, b = 2, c = "5")
 (a = 3, b = 4, c = "6")

# any type, no special support needed: add column by modifying columntable()
julia> @insert Tables.columntable(tbl_tt).c = ["5", "6"]
Table with 3 columns and 2 rows:
     a  b  c
   ┌────────
 1 │ 1  2  5
 2 │ 3  4  6

# or add a row by modifying rowtable()
julia> @insert last(Tables.rowtable(tbl_tt)) = (a=5, b=6)
Table with 2 columns and 3 rows:
     a  b
   ┌─────
 1 │ 1  2
 2 │ 3  4
 3 │ 5  6

# deletion also works, of course:
julia> @delete last(Tables.rowtable(tbl_tt))
Table with 2 columns and 1 row:
     a  b
   ┌─────
 1 │ 1  2
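
The "manual sequence" mentioned at the top of this message (columntable + add namedtuple entry + materialize back) can be sketched with Tables.jl alone. This is only an illustration: addcolumn is a hypothetical helper, not something Tables.jl or AccessorsExtra provides, and the type of the returned table depends on what Tables.materializer gives for the input type.

```julia
using Tables

# Hypothetical helper (not part of Tables.jl): return a new table with one extra column.
function addcolumn(tbl, name::Symbol, col::AbstractVector)
    cols = Tables.columntable(tbl)    # NamedTuple of column vectors
    haskey(cols, name) && throw(ArgumentError("column $name already exists"))
    all(c -> length(c) == length(col), cols) ||
        throw(DimensionMismatch("new column has the wrong number of rows"))
    newcols = merge(cols, NamedTuple{(name,)}((col,)))   # add the namedtuple entry
    return Tables.materializer(tbl)(newcols)             # materialize back
end

tbl = [(a = 1, b = 2), (a = 3, b = 4)]
tbl_c = addcolumn(tbl, :c, ["5", "6"])

# Compare through columntable so the check does not depend on the output table type.
@assert Tables.columntable(tbl_c) == (a = [1, 3], b = [2, 4], c = ["5", "6"])
```

For a table that is already columnar, columntable should mostly just wrap the existing vectors, which is why this route is described above as basically free.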

aplavin (Feb 20 2024 at 19:48):

If this happens to be actually useful, it can be upstreamed to Accessors directly. And specific table types can add more efficient overloads for modify(rowtable).
I personally just use StructArrays almost all the time, but it was fun to make these generalizations :)

jar (Feb 20 2024 at 20:05):

This looks good. It supports DataFrames too, right?

aplavin (Feb 20 2024 at 20:35):

Seems like it, yeah:

julia> using DataFrames

julia> tbl = [(a=1,b=2),(a=3,b=4)];

julia> df = DataFrame(tbl)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      4

julia> @delete last(rowtable(df))
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> @insert columntable(df).c = ["5", "6"]
2×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  String
─────┼──────────────────────
   1 │     1      2  5
   2 │     3      4  6

As I said, any table that supports Tables.materializer.

Peter Deffebach (Sep 10 2024 at 16:04):

I wrote a lightweight package here: https://github.com/pdeffebach/ColumnFrames.jl

which defines a table type and implements the Tables interface along with the NamedTuple interface (the two overlap, so I overloaded some Tables stuff for convenience). It adds a few functions related to mutability though. It's essentially a mutable named tuple where all the columns are vectors of the same length. I never finished writing tests or publishing it, though.

Adam non-jedi Beckmeyer (Sep 10 2024 at 16:55):

Peter Deffebach said:

which defines a table type and implements the Tables interface along with the NamedTuple interface (the two overlap, so I overloaded some Tables stuff for convenience). It adds a few functions related to mutability though. It's essentially a mutable named tuple where all the columns are vectors of the same length. I never finished writing tests or publishing it, though.

For what it's worth, this sounds like basically what TypedTables.jl already does except that it has an interface pretending the Table is a vector of named tuples rather than the other way around.

Peter Deffebach (Sep 10 2024 at 17:20):

Oh yeah, the other important thing is that the column types aren't encoded in the type. That's like the key feature lol. So this way you can have 1000s of columns and not worry about compilation issues.

Adam non-jedi Beckmeyer (Sep 10 2024 at 20:41):

Ah. That's equivalent to the FlexTable type from TypedTables I believe.

julia> ft = FlexTable(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])
FlexTable with 2 columns and 3 rows:
     name     age
   ┌─────────────
 1 │ Alice    25
 2 │ Bob      42
 3 │ Charlie  37

julia> ft.sex = [:F, :M, :M];

julia> ft
FlexTable with 3 columns and 3 rows:
     name     age  sex
   ┌──────────────────
 1 │ Alice    25   F
 2 │ Bob      42   M
 3 │ Charlie  37   M

Peter Deffebach (Sep 10 2024 at 21:01):

Hmmm, it looks like FlexTable still has a NamedTuple on the inside, so you could still potentially create a named tuple with 10,000 elements. Whereas my version was backed by a plain vector of vectors.
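
To make the "plain vector of vectors" idea concrete, here is a rough sketch of such a table with the column-access part of the Tables.jl interface. It is not the ColumnFrames.jl implementation; VecTable and addcolumn! are hypothetical names, and only Tables.jl is assumed to be installed.

```julia
using Tables

# Hypothetical table type: neither column names nor column element types
# appear in the type parameters, so very wide tables do not create huge types.
struct VecTable
    names::Vector{Symbol}
    columns::Vector{AbstractVector}
end

# Column-access part of the Tables.jl interface.
Tables.istable(::Type{VecTable}) = true
Tables.columnaccess(::Type{VecTable}) = true
Tables.columns(t::VecTable) = t
Tables.columnnames(t::VecTable) = t.names
Tables.getcolumn(t::VecTable, i::Int) = t.columns[i]
function Tables.getcolumn(t::VecTable, nm::Symbol)
    i = findfirst(==(nm), t.names)
    i === nothing && throw(ArgumentError("no column named $nm"))
    return t.columns[i]
end

# Adding a column mutates the table in place: just two push! calls.
function addcolumn!(t::VecTable, nm::Symbol, col::AbstractVector)
    nm in t.names && throw(ArgumentError("column $nm already exists"))
    isempty(t.columns) || length(col) == length(first(t.columns)) ||
        throw(DimensionMismatch("new column has the wrong number of rows"))
    push!(t.names, nm)
    push!(t.columns, col)
    return t
end

t = VecTable([:a, :b], [[1, 3], [2, 4]])
addcolumn!(t, :c, ["5", "6"])
@assert Tables.columnnames(t) == [:a, :b, :c]
@assert Tables.getcolumn(t, :c) == ["5", "6"]
```

The trade-off in this sketch is that column access is not type-stable, so row-wise code working directly on such a table would typically need function barriers or a conversion to a typed table first.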

aplavin (Sep 10 2024 at 21:52):

I made DictArrays.jl some time ago; think "like StructArrays, but with dictionaries instead of namedtuples". It supports the Tables and collection interfaces, and is fast both for very wide tables and for functions like map and filter.
I ended up barely using them: I never encounter tables with more than 100-200 columns, and for those it's still fine to wait for StructArrays compilation.

Still, it's a direct demonstration that Tables + collection operations + Accessors for column modification are possible at once, and without sacrificing performance.

Peter Deffebach (Sep 10 2024 at 22:28):

I'm currently working with a 10,000-column dataset in Stata. The Julia toolchain just isn't there to do all the stuff I want. But yeah, it's hard to see how something like these packages fits in. If you have 10,000 columns, you probably don't want something quick with a lightweight API. Just use DataFrames.


Last updated: Nov 22 2024 at 04:41 UTC