Friends, perhaps someone can give me a hand here. I have a ton of CSV files and I want to join everything in one large Dataframe.
The problem I have is consistency, i.e. the CSVs are supposed to have all the same columns, but some don't (slight variation in the naming used).
In order to deal with this, I was using
df = DataFrame()
for f in readdir(datadir() * "/exp_raw", join = true)
println(f)
if contains(f, ".csv")
append!(df, CSV.read(f, DataFrame, delim = ";"), cols = :union)
elseif contains(f, ".zip")
append!(df, extraircsvzippado(f),
cols = :union)
end
end
Hence, I joined the files in one dataframe as I parsed the files, and added new columns if they didn't match.
The issue is that the dataframe is getting too large for the memory of the computer.
Any ideas on how to deal with this?
Sounds like you need an out-of-core representation. Not sure if DataFrames can do this. If not, maybe JuliaDB.jl?
Chad Scherrer said:
Sounds like you need an out-of-core representation. Not sure if DataFrames can do this. If not, maybe JuliaDB.jl?
Thanks for the answer @Chad Scherrer . The only problem is the inconsistency in the files. As I've said, the number of columns grows each time there is a difference. Hence why I was using the append!
with cols=:union
.
I'm wondering if Dagger.jl can help me here.
After some thought, I think perhaps I can try pre-processing the CSV by checking the columns and fixing it, in order to send it to a database ;)
The inconsistencies have to be solved, either before you join or after. I was thinking with JuliaDB you could just let them grow, then fix up any weirdness after the fact. Since it doesn't try to put it all in memory it seems like it would work.
Just saw your last comment, yes that's the "fix it before" :smile:
I've just been dealing with the same issue and I think realistically there's no other way then just fixing the column names. FWIW I ended up doing something like:
colnames = names.(CSV.read.(list_of_csv_files, DataFrame; limit = 1))
which I then converted to a DataFrame for an easy overview over all the names in all tables. In my case I only had a few hundred CSVs and it turned out there were only a dozen or so different patterns of column names, so it wasn't too hard to fix, especially after doing [lowercase.(x) for x in colnames]
as a lot of it turned out to be inconsistent capitalization...
Yeah, that's how I'm proceeding also.
Last updated: Dec 28 2024 at 04:38 UTC