Stream: helpdesk (published)

Topic: Creating out-of-memory dataframe from several csv files


Davi Sales Barreira (May 25 2022 at 13:53):

Friends, perhaps someone can give me a hand here. I have a ton of CSV files and I want to join them all into one large DataFrame.
The problem is consistency: the CSVs are supposed to all have the same columns, but some don't (slight variations in the naming used).
To deal with this, I was using:

    using CSV, DataFrames

    df = DataFrame()
    for f in readdir(datadir() * "/exp_raw", join = true)
        println(f)
        if endswith(f, ".csv")
            append!(df, CSV.read(f, DataFrame, delim = ";"), cols = :union)
        elseif endswith(f, ".zip")
            # custom helper that reads the CSV inside the zip archive
            append!(df, extraircsvzippado(f), cols = :union)
        end
    end

Hence, I joined the files into one dataframe as I parsed them, adding new columns whenever they didn't match.
The issue is that the dataframe is getting too large for the computer's memory.
Any ideas on how to deal with this?

Chad Scherrer (May 25 2022 at 14:03):

Sounds like you need an out-of-core representation. Not sure if DataFrames can do this. If not, maybe JuliaDB.jl?

Davi Sales Barreira (May 25 2022 at 14:05):

Chad Scherrer said:

Sounds like you need an out-of-core representation. Not sure if DataFrames can do this. If not, maybe JuliaDB.jl?

Thanks for the answer @Chad Scherrer . The only problem is the inconsistency in the files. As I've said, the number of columns grows each time there is a difference. That's why I was using append! with cols = :union.
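For context, cols = :union is what lets the frame grow: appending a table that has a new column adds that column and back-fills the earlier rows with missing. A tiny sketch of the behavior:

```julia
using DataFrames

df = DataFrame(a = [1, 2])
# the second table has an extra column b; :union adds it and
# fills the first two rows of b with missing
append!(df, DataFrame(a = [3], b = ["x"]); cols = :union)
```
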

Davi Sales Barreira (May 25 2022 at 14:06):

I'm wondering if Dagger.jl can help me here.

Davi Sales Barreira (May 25 2022 at 14:15):

After some thought, I think I can try pre-processing the CSVs, checking and fixing the columns, and then sending them to a database ;)
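That "fix it before" route could look something like the sketch below: normalize the column names of each CSV, then stream the rows into a database table instead of holding the whole frame in memory. SQLite.jl here is just one possible sink (not what Davi actually used), and the lowercase rename assumes the naming differences are only capitalization.

```julia
using CSV, DataFrames, SQLite

# Hypothetical helper: stream every CSV in `dir` into one SQLite table,
# lowercasing column names on the way in so that files differing only
# in capitalization land in the same columns.
function csvs_to_sqlite(dir, dbpath; table = "measurements", delim = ";")
    db = SQLite.DB(dbpath)
    for f in filter(endswith(".csv"), readdir(dir; join = true))
        df = CSV.read(f, DataFrame; delim = delim)
        rename!(df, lowercase.(names(df)))               # fix inconsistent casing
        SQLite.load!(df, db, table; ifnotexists = true)  # appends after the first file
    end
    return db
end
```

Only one file's worth of rows is in memory at a time, so the total size is bounded by the largest CSV rather than the sum.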

Chad Scherrer (May 25 2022 at 14:15):

The inconsistencies have to be solved, either before you join or after. I was thinking that with JuliaDB you could just let the columns grow, then fix up any weirdness after the fact. Since it doesn't try to put it all in memory, it seems like it would work.

Just saw your last comment, yes that's the "fix it before" :smile:

Nils (May 27 2022 at 14:15):

I've just been dealing with the same issue, and I think realistically there's no other way than just fixing the column names. FWIW, I ended up doing something like:

colnames = names.(CSV.read.(list_of_csv_files, DataFrame; limit = 1))

which I then converted to a DataFrame for an easy overview of all the names in all tables. In my case I only had a few hundred CSVs, and it turned out there were only a dozen or so different patterns of column names, so it wasn't too hard to fix, especially after doing [lowercase.(x) for x in colnames], as a lot of it turned out to be inconsistent capitalization...
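Nils's overview step can be packaged into a small sketch: read only the header row of each file (limit = 1 keeps it cheap) and see how many distinct name patterns survive once capitalization is normalized. The function name and the delimiter are placeholders.

```julia
using CSV, DataFrames

# Read just the headers of each CSV and return the unique column-name
# patterns after lowercasing, to gauge how much fixing is needed.
function column_patterns(files; delim = ";")
    colnames = [names(CSV.read(f, DataFrame; delim = delim, limit = 1)) for f in files]
    return unique([lowercase.(cols) for cols in colnames])
end
```
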

Davi Sales Barreira (May 27 2022 at 16:42):

Yeah, that's how I'm proceeding also.


Last updated: Oct 02 2023 at 04:34 UTC