What is the best way to read a CSV into a StructArray (I'm still on the fence whether I want to use DataFrames)?
If you use CSV.jl the output is already satisfying the Tables.jl interface, so it's basically a dataframe-like object already.
But here's one way to collect it into a StructArray:
julia> using StructArrays, CSV
shell> cat foo.csv
col1,col2,col3,col4,col5,col6,col7,col8
,1,1.0,1,one,2019-01-01,2019-01-01T00:00:00,true
,2,2.0,2,two,2019-01-02,2019-01-02T00:00:00,false
,3,3.0,3.14,three,2019-01-03,2019-01-03T00:00:00,true
julia> tab = StructArray(NamedTuple.(CSV.File("foo.csv")))
3-element StructArray(::Vector{Missing}, ::Vector{Int64}, ::Vector{Float64}, ::Vector{Float64}, ::Vector{String}, ::Vector{Dates.Date}, ::Vector{Dates.DateTime}, ::Vector{Bool}) with eltype NamedTuple{(:col1, :col2, :col3, :col4, :col5, :col6, :col7, :col8), Tuple{Missing, Int64, Float64, Float64, String, Dates.Date, Dates.DateTime, Bool}}:
(col1 = missing, col2 = 1, col3 = 1.0, col4 = 1.0, col5 = "one", col6 = Dates.Date("2019-01-01"), col7 = Dates.DateTime("2019-01-01T00:00:00"), col8 = 1)
(col1 = missing, col2 = 2, col3 = 2.0, col4 = 2.0, col5 = "two", col6 = Dates.Date("2019-01-02"), col7 = Dates.DateTime("2019-01-02T00:00:00"), col8 = 0)
(col1 = missing, col2 = 3, col3 = 3.0, col4 = 3.14, col5 = "three", col6 = Dates.Date("2019-01-03"), col7 = Dates.DateTime("2019-01-03T00:00:00"), col8 = 1)
julia> tab.col1
3-element Vector{Missing}:
missing
missing
missing
julia> tab.col2
3-element Vector{Int64}:
1
2
3
julia> tab.col7
3-element Vector{Dates.DateTime}:
2019-01-01T00:00:00
2019-01-02T00:00:00
2019-01-03T00:00:00
julia> tab.col8
3-element Vector{Bool}:
1
0
1
That is cool. Thanks Mason :pray:
I kind of like StructArrays because sa[1]
gives me a custom type, which is what I kind of wanted (but not so critical I guess).
That's already the case with CSV:
julia> tab[1]
(col1 = missing, col2 = 1, col3 = 1.0, col4 = 1.0, col5 = "one", col6 = Dates.Date("2019-01-01"), col7 = Dates.DateTime("2019-01-01T00:00:00"), col8 = true)
julia> CSV.File("foo.csv")[1]
CSV.Row: (col1 = missing, col2 = 1, col3 = 1.0, col4 = 1.0, col5 = "one", col6 = Dates.Date("2019-01-01"), col7 = Dates.DateTime("2019-01-01T00:00:00"), col8 = true)
Or is your point that you can use any type you want with a structarray?
I mean, the CSV is a list of Widget
s and I want sa[1]
to be a Widget
.
Gotcha
Then you should be able to do
StructArray(Widget.(CSV.File("foo.csv")))
However, this is not optimal for performance due to the temporary array.
So you may want to make a custom constructor that's more efficient
Yeah, I was hoping for some clever way to avoid temp stuff, but not the end of the world.
You could do something like this:
julia> function csv_to_SA(T, file::String)
csv = CSV.File(file)
SA = StructVector([T(csv[1])])
for row in csv[2:end]
push!(SA, T(row))
end
SA
end
julia> csv_to_SA(NamedTuple, "foo.csv")
3-element StructArray(::Vector{Missing}, ::Vector{Int64}, ::Vector{Float64}, ::Vector{Float64}, ::Vector{String}, ::Vector{Dates.Date}, ::Vector{Dates.DateTime}, ::Vector{Bool}) with eltype NamedTuple{(:col1, :col2, :col3, :col4, :col5, :col6, :col7, :col8), Tuple{Missing, Int64, Float64, Float64, String, Dates.Date, Dates.DateTime, Bool}}:
(col1 = missing, col2 = 1, col3 = 1.0, col4 = 1.0, col5 = "one", col6 = Dates.Date("2019-01-01"), col7 = Dates.DateTime("2019-01-01T00:00:00"), col8 = 1)
(col1 = missing, col2 = 2, col3 = 2.0, col4 = 2.0, col5 = "two", col6 = Dates.Date("2019-01-02"), col7 = Dates.DateTime("2019-01-02T00:00:00"), col8 = 0)
(col1 = missing, col2 = 3, col3 = 3.0, col4 = 3.14, col5 = "three", col6 = Dates.Date("2019-01-03"), col7 = Dates.DateTime("2019-01-03T00:00:00"), col8 = 1)
Thanks. Yeah. I am working on something like that now :blush:
Last updated: Dec 28 2024 at 04:38 UTC