So I have a dataframe with a bunch of benchmarking data, I want to save this dataframe (preferably as plaintext), together with some metadata with things like the machine the benchmarks were performed on, the git commit of the package when it got benchmarked, etc. What's the best way to go about this?
There's no obvious way to include metadata in a CSV file from what I can tell. Do I need to work with JSON or something here?
My go-to usually would just be to use JLD2 to store a DataFrame
and struct with my metadata, but I'd like to store it as plaintext and have the results not be tied to literal julia objects if possible.
Very imperfect but my recent solution to this problem was to smash together a TOML file as a header with the CSV at the bottom. I can re-ingest the CSV by just skipping the correct number of rows and separately I can read the meta-data by slicing off the CSV at the bottom and parsing the TOML file at the top. Not the best solution ever but has worked well enough I suppose.
For now I just manually tell the reader how many rows to skip, not sure what would be the easiest way to detect the "split point" automatically so I could write a function that handles both halves automatically for me.
JLD2 uses HDF5 files, and HDF5 files support arbitrary key-value metadata (attributes). What I'm less sure about is whether you can use JLD2 itself to read and write such metadata, or if you'd need https://github.com/JuliaIO/HDF5.jl
But HDF5 isn't human-readable right?
Mason Protter said:
But HDF5 isn't human-readable right?
Answering on behalf of Brian, that's correct. It's a binary format.
Another low-tech solution would be to put the data file and metadata file into a single tarball or other archive. I personally would prefer everything in one JSON or HDF5 file, but it's hard to beat the simplicity of this approach
Mason Protter said:
But HDF5 isn't human-readable right?
Correct. I interpreted your statement as being fine with storing the dataframe as binary, as long as the metadata is readable outside of Julia
I personally advocate the tarball approach. Trying to cram different data formats together into the same file in an ad hoc way rarely ends well.
Man, data sucks.
If the requirement is to have everything plaintext and readable, the JSON table approach may be the way to go. Less efficient, but no tarball required
I'll probably just go for a binary format, because the whole tarball thing sounds awful
Expanding Man said:
I personally advocate the tarball approach. Trying to cram different data formats together into the same file in an ad hoc way rarely ends well.
There is a legacy "format" at my employer that is a few thousand bytes of ascii with metadata fields (I don't recall the exact number) directly prepended to Float32 raw arrays. You have to just know the standard ascii size and split it at the right spot to ingest. It's horrible. I tried to advocate for a .zip archive custom format (or HDF5) but was shot down.
Parquet has limited metadata support, and I believe arrow does as well, though my feeling about your context ("a bunch of benchmarking data") is that you probably shouldn't be eager to start whipping out dedicated tabular formats.
I figured it would be better to not recommend Parquet in case you had thoughts :D
Yeah, I am probably notorious by now for being both the primary maintainer of parquet in the julia ecosystem and an individual who hates the parquet format.
I'm honestly tempted to just write a CSV with an extra column that has the first cell holding the metadata, and the rest of the column empty
I just had the same thought
all I want is CSV with a quick description. This is so stupid.
I wonder how many of these types of "need a quick table format with some other stuff" could be handled by a version of JSON or TOML that included a CSV field as one of the embeddable types
Mason Protter said:
This is so stupid.
The perpetual sentiment of "data scientists" everywhere. Just wait until you see SQL.
https://stackoverflow.com/a/33846112 TIL
Wonder if CSV.jl knows about this
You can tell CSV.jl about comment characters, I'm pretty sure.
I was just going to say that ways of appending data to csv's or concatenating csv's of different formats in the same file are not totally unheard of but I wouldn't necessarily recommend it.
Brian Chen said:
Wonder if CSV.jl knows about this
lolno
julia> using CSV, DataFrames
julia> file = tempname() * ".csv";
julia> open(file, "w+") do io
write(io, """
#publisher,W3C
#updated,2015-10-17T00:00:00Z
#name,sensor,temperature
#datatype,string,float
sensor,temperature
s-1,25.5
""")
end;
julia> CSV.read(file, DataFrame)
┌ Warning: thread = 1 warning: parsed expected 2 columns, but didn't reach end of line around data row: 2. Parsing extra columns and widening final columnset
└ @ CSV ~/.julia/packages/CSV/aoJqo/src/file.jl:587
┌ Warning: thread = 1 warning: only found 2 / 3 columns around data row: 4. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/aoJqo/src/file.jl:586
5×3 DataFrame
Row │ #publisher W3C Column3
│ String15 String31 String15?
─────┼───────────────────────────────────────────────
1 │ #updated 2015-10-17T00:00:00Z missing
2 │ #name sensor temperature
3 │ #datatype string float
4 │ sensor temperature missing
5 │ s-1 25.5 missing
As @Expanding Man said, at least it can be taught to ignore the comments
https://csv.juliadata.org/latest/examples.html#comment_example
But that still leaves having to parse them out into a usable form
yeah
Probably what we need, at least from a Julia ecosystem point of view, is utility functions for making it as simple and effortless as possible to do what you are trying to do with tarballs. Will still involve separate objects from an API point of view though (i.e. metadata as a separate object from table).
Expanding Man said:
Probably what we need, at least from a Julia ecosystem point of view, is utility functions for making it as simple and effortless as possible to do what you are trying to do with tarballs. Will still involve separate objects from an API point of view though (i.e. metadata as a separate object from table).
Depending on one's view of how "human readable" a tarball or zip archive is you're just reinventing HDF5 at that point, no?
It's possible using a zip instead would make this easier. https://github.com/JuliaIO/ZipArchives.jl for example lets you read out individual entries with a single call
I guess my next question would then become, if I use HDF5, what's the easiest way to store tabular data with heterogeneous columns?
I'm pretty tempted to just shove a DataFrame into JLD2 but I am kinda worried about the compatability of it if the future layout of DataFrames changes
I guess I could always go for a Vector{<:NamedTuple}
I'm only storing Int
, Float64
, and String
Jordan Cluts said:
Depending on one's view of how "human readable" a tarball or zip archive is you're just reinventing HDF5 at that point, no?
Not really. You could of course argue that HDF5 supports any format because you can stick a binary blob in it, but by design there is a natural isomorphism between a tarball and a file system directory, so you don't need any additional conventions on top of what already exists in the file system to identify files or file format metadata.
I guess I see what you're saying but I suppose this query started with "I have a dataframe and some metadata and I'd like to store and retrieve them" instead of "I have a folder of files of various kinds I'd like to keep together" and while it's easy to transition between those two states HDF5 handles the first rather directly while tarball/zip handle the latter rather directly. And in neither case can you directly inspect the result with a text editor.
(For a broad definition of the word "directly")
yeah, dealing with multiple files is exactly what I want to avoid
Mason Protter said:
I'm only storing
Int
,Float64
, andString
So in theory HDF5 eats this use case for breakfast, but I can only find very old an conflicting info on whether there's any integration between DF.jl and HDF5.jl
https://github.com/JuliaData/DataFrames.jl/issues/64
https://github.com/JuliaIO/HDF5.jl/issues/122
Would have to look into how JLD serializes dataframes
Honestly at this point, I think having everything in one JSON file might be the least bad approach as long as you don't have too much data
Brian Chen said:
Honestly at this point, I think having everything in one JSON file might be the least bad approach as long as you don't have too much data
I haven't tried this but a similar idea is simply to use TOML directly as well. It would be more verbose and less readable than a CSV but for "small" amounts of data could work. It supports a type called an Array of Tables which is basically identical to a vector of named tuples i.e. a Tables.jl compatible source (I don't actually know what TOML.jl converts this to but presumably getting it back into a DataFrame should be pretty easy).
edit: I went and checked and TOML.jl generates a dictionary containing a vector of dictionaries. It'd take a bit of finagling to convert the dictionaries all to named tuples and eventually into a DataFrame but definitely conceivable.
My point, more broadly, is that readers of tarballs already understand the concept of files and the usual metadata tricks (file extensions and magic bytes) for specifying their format. A reader of a tarball containing a bunch of separate data in different formats should then be able to understand it with almost no context, whereas with HDF5 you'd need to specify which blobs are files. Maybe there is already a standard for this, but my understanding was that it is not necessarily guaranteed that you are supposed to treat paths in HDF5 as paths in a file system.
DataFrames.jl has table-level metadata https://dataframes.juliadata.org/stable/lib/functions/#DataAPI.metadata
Arrow.jl does too https://arrow.apache.org/julia/stable/manual/#Custom-application-metadata
I haven't tried it but that's where I'd start.
The metadata doesn't get preserved when you write it to a CSV or other non-binary format
Metadata won't get preserved if you write it into a format that doesn't support metadata, such as CSV. Other human-readable formats may support metadata.
jar said:
Metadata won't get preserved if you write it into a format that doesn't support metadata, such as CSV. Other human-readable formats may support metadata.
Yes but it's frustrating that for "human-readable table formats" CSV is the clear victor except for its glaring lack of standard frontmatter/metadata support. And alternatives that handle this sort of key-value store well (JSON, TOML, etc.) in turn don't handle anything larger than very small tables gracefully or at all. Clearly there could be a universal meta-data CSV standard but we (society writ large, not Julia community) haven't managed to do so.
I don't think smarter text formats is the way to go. With either text or binary formats I'm going to need a program to display and edit a table. A simple program like cat
isn't going to cut it for either, it's gotta be at least somewhat smart. So instead of trying to make text formats smarter, I'd rather we make our viewers/editors for binary tables more usable.
I really don't think it's that unreasonable or weird to want some quick over-all metadata at the start of a CSV file
like literally just a string description field. Nothing clever
So you can look at it in a text editor?
jar said:
I don't think smarter text formats is the way to go. With either text or binary formats I'm going to need a program to display and edit a table. A simple program like
cat
isn't going to cut it for either, it's gotta be at least somewhat smart. So instead of trying to make text formats smarter, I'd rather we make our viewers/editors for binary tables more usable.
I agree with the sentiment completely. I tried that approach and the senior engineers (not computer science but other engineering fields) at my employer were a) intolerant of change and b) refused to use anything that can't be investigated as text since they don't trust any binary reader program to still work 150 years from now which apparently is a crucial requirement. As such I have been forced to be inventive with text formats and sympathize strongly with Mason about this.
jar said:
So you can look at it in a text editor?
in this case, yes, I just want someone to be able to look at it and see what it says. But in more advanced examples, if someone wanted to put some data in there and parse it themselves they should be free to do so
put some data in there and parse it themselves
what do these mean?
whatever they want it to mean
Anyway you can put #
comments at the top of a csv, many readers have an option to strip them
but not to read them
Fwiw, my lesson with data (at least on HPC clusters): either use different files (annoying but simple) or established binary formats (like e.g. HDF5).
I wouldn't use JLD2 for "long term" storage. Actually, I did in the past and had to manually convert everything to plain HDF5 because of incompatibilities and broken files.
As for the "human readable" debate, I kind of agree with @jar. Binary viewers, say for HDF5, are fine, as long as it is a established binary format with a standard.
Probably not the solution you want, but TableMetaDataTools.jl has some functions to help you save and read metadata. If would have to be a separate toml file, not in the .csv, but at least it automates attaching the metadata in a toml file to the julia object you want.
Last updated: Dec 28 2024 at 04:38 UTC