Stream: helpdesk (published)

Topic: CSV -> StructArray


view this post on Zulip Eric Forgy (Feb 03 2021 at 02:03):

What is the best way to read a CSV into a StructArray (I'm still on the fence whether I want to use DataFrames)?

view this post on Zulip Mason Protter (Feb 03 2021 at 02:48):

If you use CSV.jl the output is already satisfying the Tables.jl interface, so it's basically a dataframe-like object already.

view this post on Zulip Mason Protter (Feb 03 2021 at 02:52):

But here's one way to collect it into a StructArray:

julia> using StructArrays, CSV

shell> cat foo.csv
col1,col2,col3,col4,col5,col6,col7,col8
,1,1.0,1,one,2019-01-01,2019-01-01T00:00:00,true
,2,2.0,2,two,2019-01-02,2019-01-02T00:00:00,false
,3,3.0,3.14,three,2019-01-03,2019-01-03T00:00:00,true

julia> tab = StructArray(NamedTuple.(CSV.File("foo.csv")))
3-element StructArray(::Vector{Missing}, ::Vector{Int64}, ::Vector{Float64}, ::Vector{Float64}, ::Vector{String}, ::Vector{Dates.Date}, ::Vector{Dates.DateTime}, ::Vector{Bool}) with eltype NamedTuple{(:col1, :col2, :col3, :col4, :col5, :col6, :col7, :col8), Tuple{Missing, Int64, Float64, Float64, String, Dates.Date, Dates.DateTime, Bool}}:
 (col1 = missing, col2 = 1, col3 = 1.0, col4 = 1.0, col5 = "one", col6 = Dates.Date("2019-01-01"), col7 = Dates.DateTime("2019-01-01T00:00:00"), col8 = 1)
 (col1 = missing, col2 = 2, col3 = 2.0, col4 = 2.0, col5 = "two", col6 = Dates.Date("2019-01-02"), col7 = Dates.DateTime("2019-01-02T00:00:00"), col8 = 0)
 (col1 = missing, col2 = 3, col3 = 3.0, col4 = 3.14, col5 = "three", col6 = Dates.Date("2019-01-03"), col7 = Dates.DateTime("2019-01-03T00:00:00"), col8 = 1)

julia> tab.col1
3-element Vector{Missing}:
 missing
 missing
 missing

julia> tab.col2
3-element Vector{Int64}:
 1
 2
 3

julia> tab.col7
3-element Vector{Dates.DateTime}:
 2019-01-01T00:00:00
 2019-01-02T00:00:00
 2019-01-03T00:00:00

julia> tab.col8
3-element Vector{Bool}:
 1
 0
 1

view this post on Zulip Eric Forgy (Feb 03 2021 at 02:56):

That is cool. Thanks Mason :pray:

I kind of like StructArrays because sa[1] gives me a custom type, which is what I kind of wanted (but not so critical I guess).

view this post on Zulip Mason Protter (Feb 03 2021 at 02:59):

That's already the case with CSV:

julia> tab[1]
(col1 = missing, col2 = 1, col3 = 1.0, col4 = 1.0, col5 = "one", col6 = Dates.Date("2019-01-01"), col7 = Dates.DateTime("2019-01-01T00:00:00"), col8 = true)

julia> CSV.File("foo.csv")[1]
CSV.Row: (col1 = missing, col2 = 1, col3 = 1.0, col4 = 1.0, col5 = "one", col6 = Dates.Date("2019-01-01"), col7 = Dates.DateTime("2019-01-01T00:00:00"), col8 = true)

view this post on Zulip Mason Protter (Feb 03 2021 at 03:00):

Or is your point that you can use any type you want with a structarray?

view this post on Zulip Eric Forgy (Feb 03 2021 at 03:00):

I mean, the CSV is a list of Widgets and I want sa[1] to be a Widget.

view this post on Zulip Mason Protter (Feb 03 2021 at 03:03):

Gotcha

view this post on Zulip Mason Protter (Feb 03 2021 at 03:04):

Then you should be able to do

StructArray(Widget.(CSV.File("foo.csv")))

However, this is not optimal for performance due to the temporary array.

view this post on Zulip Mason Protter (Feb 03 2021 at 03:04):

So you may want to make a custom constructor that's more efficient

view this post on Zulip Eric Forgy (Feb 03 2021 at 03:05):

Yeah, I was hoping for some clever way to avoid temp stuff, but not the end of the world.

view this post on Zulip Mason Protter (Feb 03 2021 at 03:26):

You could do something like this:

julia> function csv_to_SA(T, file::String)
           csv = CSV.File(file)
           SA = StructVector([T(csv[1])])
           for row in csv[2:end]
               push!(SA, T(row))
           end
           SA
       end

julia> csv_to_SA(NamedTuple, "foo.csv")
3-element StructArray(::Vector{Missing}, ::Vector{Int64}, ::Vector{Float64}, ::Vector{Float64}, ::Vector{String}, ::Vector{Dates.Date}, ::Vector{Dates.DateTime}, ::Vector{Bool}) with eltype NamedTuple{(:col1, :col2, :col3, :col4, :col5, :col6, :col7, :col8), Tuple{Missing, Int64, Float64, Float64, String, Dates.Date, Dates.DateTime, Bool}}:
 (col1 = missing, col2 = 1, col3 = 1.0, col4 = 1.0, col5 = "one", col6 = Dates.Date("2019-01-01"), col7 = Dates.DateTime("2019-01-01T00:00:00"), col8 = 1)
 (col1 = missing, col2 = 2, col3 = 2.0, col4 = 2.0, col5 = "two", col6 = Dates.Date("2019-01-02"), col7 = Dates.DateTime("2019-01-02T00:00:00"), col8 = 0)
 (col1 = missing, col2 = 3, col3 = 3.0, col4 = 3.14, col5 = "three", col6 = Dates.Date("2019-01-03"), col7 = Dates.DateTime("2019-01-03T00:00:00"), col8 = 1)

view this post on Zulip Eric Forgy (Feb 03 2021 at 03:28):

Thanks. Yeah. I am working on something like that now :blush:


Last updated: Nov 06 2024 at 04:40 UTC