Stream: helpdesk (published)

Topic: Creating out-of-memory dataframe from several csv files


Davi Sales Barreira (May 25 2022 at 13:53):

Friends, perhaps someone can give me a hand here. I have a ton of CSV files and I want to join them all into one large DataFrame.
The problem is consistency: the CSVs are supposed to all have the same columns, but some don't (slight variations in the naming used).
To deal with this, I was using:

    using CSV, DataFrames

    df = DataFrame()
    for f in readdir(datadir() * "/exp_raw", join = true)
        println(f)
        if endswith(f, ".csv")
            append!(df, CSV.read(f, DataFrame, delim = ";"), cols = :union)
        elseif endswith(f, ".zip")
            # custom helper that reads the CSV inside the zip archive
            append!(df, extraircsvzippado(f), cols = :union)
        end
    end

Hence, I joined the files into one dataframe as I parsed them, adding new columns whenever they didn't match.
The issue is that the dataframe is getting too large for the computer's memory.
Any ideas on how to deal with this?

Chad Scherrer (May 25 2022 at 14:03):

Sounds like you need an out-of-core representation. Not sure if DataFrames can do this. If not, maybe JuliaDB.jl?

Davi Sales Barreira (May 25 2022 at 14:05):

Chad Scherrer said:

Sounds like you need an out-of-core representation. Not sure if DataFrames can do this. If not, maybe JuliaDB.jl?

Thanks for the answer @Chad Scherrer . The only problem is the inconsistency in the files. As I've said, the number of columns grows each time there is a difference. That's why I was using append! with cols = :union.
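For context, cols = :union is what lets the frame grow: appending a table that has a new column adds that column and back-fills the earlier rows with missing. A tiny sketch of the behavior:

```julia
using DataFrames

df = DataFrame(a = [1, 2])
# the second table has an extra column b; :union adds it and
# fills the first two rows of b with missing
append!(df, DataFrame(a = [3], b = ["x"]); cols = :union)
```
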

Davi Sales Barreira (May 25 2022 at 14:06):

I'm wondering if Dagger.jl can help me here.

Davi Sales Barreira (May 25 2022 at 14:15):

After some thought, I think I can try pre-processing the CSVs, checking and fixing the columns, and then sending them to a database ;)
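That "fix it before" route could look something like the sketch below: normalize the column names of each CSV, then stream the rows into a database table instead of holding the whole frame in memory. SQLite.jl here is just one possible sink (not what Davi actually used), and the lowercase rename assumes the naming differences are only capitalization.

```julia
using CSV, DataFrames, SQLite

# Hypothetical helper: stream every CSV in `dir` into one SQLite table,
# lowercasing column names on the way in so that files differing only
# in capitalization land in the same columns.
function csvs_to_sqlite(dir, dbpath; table = "measurements", delim = ";")
    db = SQLite.DB(dbpath)
    for f in filter(endswith(".csv"), readdir(dir; join = true))
        df = CSV.read(f, DataFrame; delim = delim)
        rename!(df, lowercase.(names(df)))               # fix inconsistent casing
        SQLite.load!(df, db, table; ifnotexists = true)  # appends after the first file
    end
    return db
end
```

Only one file's worth of rows is in memory at a time, so the total size is bounded by the largest CSV rather than the sum.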

Chad Scherrer (May 25 2022 at 14:15):

The inconsistencies have to be solved, either before you join or after. I was thinking that with JuliaDB you could just let the columns grow, then fix up any weirdness after the fact. Since it doesn't try to put it all in memory, it seems like it would work.

Just saw your last comment, yes that's the "fix it before" :smile:

Nils (May 27 2022 at 14:15):

I've just been dealing with the same issue, and I think realistically there's no other way than just fixing the column names. FWIW, I ended up doing something like:

colnames = names.(CSV.read.(list_of_csv_files, DataFrame; limit = 1))

which I then converted to a DataFrame for an easy overview of all the names in all tables. In my case I only had a few hundred CSVs, and it turned out there were only a dozen or so different patterns of column names, so it wasn't too hard to fix, especially after doing [lowercase.(x) for x in colnames], as a lot of it turned out to be inconsistent capitalization...
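Nils's overview step can be packaged into a small sketch: read only the header row of each file (limit = 1 keeps it cheap) and see how many distinct name patterns survive once capitalization is normalized. The function name and the delimiter are placeholders.

```julia
using CSV, DataFrames

# Read just the headers of each CSV and return the unique column-name
# patterns after lowercasing, to gauge how much fixing is needed.
function column_patterns(files; delim = ";")
    colnames = [names(CSV.read(f, DataFrame; delim = delim, limit = 1)) for f in files]
    return unique([lowercase.(cols) for cols in colnames])
end
```
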

Davi Sales Barreira (May 27 2022 at 16:42):

Yeah, that's how I'm proceeding also.


Last updated: Oct 02 2023 at 04:34 UTC