Saving tabular data and metadata together · helpdesk (published)

So I have a dataframe with a bunch of benchmarking data, I want to save this dataframe (preferably as plaintext), together with some metadata with things like the machine the benchmarks were performed on, the git commit of the package when it got benchmarked, etc. What's the best way to go about this?

There's no obvious way to include metadata in a CSV file from what I can tell. Do I need to work with JSON or something here?

Mason Protter (Feb 23 2024 at 17:31):

My go-to usually would just be to use JLD2 to store a DataFrame and struct with my metadata, but I'd like to store it as plaintext and have the results not be tied to literal julia objects if possible.

Jordan Cluts (Feb 23 2024 at 18:01):

Very imperfect but my recent solution to this problem was to smash together a TOML file as a header with the CSV at the bottom. I can re-ingest the CSV by just skipping the correct number of rows and separately I can read the meta-data by slicing off the CSV at the bottom and parsing the TOML file at the top. Not the best solution ever but has worked well enough I suppose.

For now I just manually tell the reader how many rows to skip, not sure what would be the easiest way to detect the "split point" automatically so I could write a function that handles both halves automatically for me.

Brian Chen (Feb 23 2024 at 18:02):

JLD2 uses HDF5 files, and HDF5 files support arbitrary key-value metadata (attributes). What I'm less sure about is whether you can use JLD2 itself to read and write such metadata, or if you'd need https://github.com/JuliaIO/HDF5.jl

Mason Protter (Feb 23 2024 at 18:05):

Jordan Cluts (Feb 23 2024 at 18:06):

Brian Chen (Feb 23 2024 at 18:06):

Another low-tech solution would be to put the data file and metadata file into a single tarball or other archive. I personally would prefer everything in one JSON or HDF5 file, but it's hard to beat the simplicity of this approach

Brian Chen (Feb 23 2024 at 18:07):

Correct. I interpreted your statement as being fine with storing the dataframe as binary, as long as the metadata is readable outside of Julia

Expanding Man (Feb 23 2024 at 18:07):

I personally advocate the tarball approach. Trying to cram different data formats together into the same file in an ad hoc way rarely ends well.

Mason Protter (Feb 23 2024 at 18:08):

Brian Chen (Feb 23 2024 at 18:08):

If the requirement is to have everything plaintext and readable, the JSON table approach may be the way to go. Less efficient, but no tarball required

Mason Protter (Feb 23 2024 at 18:10):

I'll probably just go for a binary format, because the whole tarball thing sounds awful

Jordan Cluts (Feb 23 2024 at 18:11):

There is a legacy "format" at my employer that is a few thousand bytes of ascii with metadata fields (I don't recall the exact number) directly prepended to Float32 raw arrays. You have to just know the standard ascii size and split it at the right spot to ingest. It's horrible. I tried to advocate for a .zip archive custom format (or HDF5) but was shot down.

Expanding Man (Feb 23 2024 at 18:11):

Parquet has limited metadata support, and I believe arrow does as well, though my feeling about your context ("a bunch of benchmarking data") is that you probably shouldn't be eager to start whipping out dedicated tabular formats.

Brian Chen (Feb 23 2024 at 18:12):

I figured it would be better to not recommend Parquet in case you had thoughts :D

Expanding Man (Feb 23 2024 at 18:12):

Yeah, I am probably notorious by now for being both the primary maintainer of parquet in the julia ecosystem and an individual who hates the parquet format.

Mason Protter (Feb 23 2024 at 18:13):

I'm honestly tempted to just write a CSV with an extra column that has the first cell holding the metadata, and the rest of the column empty

Brian Chen (Feb 23 2024 at 18:13):

Mason Protter (Feb 23 2024 at 18:13):

Jordan Cluts (Feb 23 2024 at 18:14):

I wonder how many of these types of "need a quick table format with some other stuff" could be handled by a version of JSON or TOML that included a CSV field as one of the embeddable types

Expanding Man (Feb 23 2024 at 18:14):

The perpetual sentiment of "data scientists" everywhere. Just wait until you see SQL.

Brian Chen (Feb 23 2024 at 18:14):

Brian Chen (Feb 23 2024 at 18:15):

Expanding Man (Feb 23 2024 at 18:15):

Expanding Man (Feb 23 2024 at 18:16):

I was just going to say that ways of appending data to csv's or concatenating csv's of different formats in the same file are not totally unheard of but I wouldn't necessarily recommend it.

Mason Protter (Feb 23 2024 at 18:18):

julia> using CSV, DataFrames

julia> file = tempname() * ".csv";

julia> open(file, "w+") do io
           write(io, """
       #publisher,W3C
       #updated,2015-10-17T00:00:00Z
       #name,sensor,temperature
       #datatype,string,float
       sensor,temperature
       s-1,25.5
       """)
       end;

julia> CSV.read(file, DataFrame)
┌ Warning: thread = 1 warning: parsed expected 2 columns, but didn't reach end of line around data row: 2. Parsing extra columns and widening final columnset
└ @ CSV ~/.julia/packages/CSV/aoJqo/src/file.jl:587
┌ Warning: thread = 1 warning: only found 2 / 3 columns around data row: 4. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/aoJqo/src/file.jl:586
5×3 DataFrame
 Row │ #publisher  W3C                   Column3
     │ String15    String31              String15?
─────┼───────────────────────────────────────────────
   1 │ #updated    2015-10-17T00:00:00Z  missing
   2 │ #name       sensor                temperature
   3 │ #datatype   string                float
   4 │ sensor      temperature           missing
   5 │ s-1         25.5                  missing

Brian Chen (Feb 23 2024 at 18:18):

Brian Chen (Feb 23 2024 at 18:19):

Mason Protter (Feb 23 2024 at 18:19):

Expanding Man (Feb 23 2024 at 18:20):

Probably what we need, at least from a Julia ecosystem point of view, is utility functions for making it as simple and effortless as possible to do what you are trying to do with tarballs. Will still involve separate objects from an API point of view though (i.e. metadata as a separate object from table).

Mason Protter (Feb 23 2024 at 18:23):

Jordan Cluts (Feb 23 2024 at 18:23):

Depending on one's view of how "human readable" a tarball or zip archive is you're just reinventing HDF5 at that point, no?

Brian Chen (Feb 23 2024 at 18:27):

It's possible using a zip instead would make this easier. https://github.com/JuliaIO/ZipArchives.jl for example lets you read out individual entries with a single call

Mason Protter (Feb 23 2024 at 18:28):

I guess my next question would then become, if I use HDF5, what's the easiest way to store tabular data with heterogeneous columns?

I'm pretty tempted to just shove a DataFrame into JLD2 but I am kinda worried about the compatability of it if the future layout of DataFrames changes

Mason Protter (Feb 23 2024 at 18:28):

Mason Protter (Feb 23 2024 at 18:29):

Expanding Man (Feb 23 2024 at 18:33):

Not really. You could of course argue that HDF5 supports any format because you can stick a binary blob in it, but by design there is a natural isomorphism between a tarball and a file system directory, so you don't need any additional conventions on top of what already exists in the file system to identify files or file format metadata.

Jordan Cluts (Feb 23 2024 at 18:38):

I guess I see what you're saying but I suppose this query started with "I have a dataframe and some metadata and I'd like to store and retrieve them" instead of "I have a folder of files of various kinds I'd like to keep together" and while it's easy to transition between those two states HDF5 handles the first rather directly while tarball/zip handle the latter rather directly. And in neither case can you directly inspect the result with a text editor.

Mason Protter (Feb 23 2024 at 18:40):

Brian Chen (Feb 23 2024 at 18:41):

So in theory HDF5 eats this use case for breakfast, but I can only find very old an conflicting info on whether there's any integration between DF.jl and HDF5.jl

Brian Chen (Feb 23 2024 at 18:41):

Brian Chen (Feb 23 2024 at 18:42):

Brian Chen (Feb 23 2024 at 18:43):

Honestly at this point, I think having everything in one JSON file might be the least bad approach as long as you don't have too much data

Jordan Cluts (Feb 23 2024 at 18:52):

I haven't tried this but a similar idea is simply to use TOML directly as well. It would be more verbose and less readable than a CSV but for "small" amounts of data could work. It supports a type called an Array of Tables which is basically identical to a vector of named tuples i.e. a Tables.jl compatible source (I don't actually know what TOML.jl converts this to but presumably getting it back into a DataFrame should be pretty easy).

edit: I went and checked and TOML.jl generates a dictionary containing a vector of dictionaries. It'd take a bit of finagling to convert the dictionaries all to named tuples and eventually into a DataFrame but definitely conceivable.

Expanding Man (Feb 23 2024 at 18:59):

My point, more broadly, is that readers of tarballs already understand the concept of files and the usual metadata tricks (file extensions and magic bytes) for specifying their format. A reader of a tarball containing a bunch of separate data in different formats should then be able to understand it with almost no context, whereas with HDF5 you'd need to specify which blobs are files. Maybe there is already a standard for this, but my understanding was that it is not necessarily guaranteed that you are supposed to treat paths in HDF5 as paths in a file system.

jar (Feb 23 2024 at 19:03):

Mason Protter (Feb 23 2024 at 19:04):

The metadata doesn't get preserved when you write it to a CSV or other non-binary format

jar (Feb 23 2024 at 19:09):

Metadata won't get preserved if you write it into a format that doesn't support metadata, such as CSV. Other human-readable formats may support metadata.

Jordan Cluts (Feb 23 2024 at 19:13):

Yes but it's frustrating that for "human-readable table formats" CSV is the clear victor except for its glaring lack of standard frontmatter/metadata support. And alternatives that handle this sort of key-value store well (JSON, TOML, etc.) in turn don't handle anything larger than very small tables gracefully or at all. Clearly there could be a universal meta-data CSV standard but we (society writ large, not Julia community) haven't managed to do so.

jar (Feb 23 2024 at 19:20):

I don't think smarter text formats is the way to go. With either text or binary formats I'm going to need a program to display and edit a table. A simple program like cat isn't going to cut it for either, it's gotta be at least somewhat smart. So instead of trying to make text formats smarter, I'd rather we make our viewers/editors for binary tables more usable.

Mason Protter (Feb 23 2024 at 19:21):

I really don't think it's that unreasonable or weird to want some quick over-all metadata at the start of a CSV file

Mason Protter (Feb 23 2024 at 19:21):

jar (Feb 23 2024 at 19:23):

Jordan Cluts (Feb 23 2024 at 19:24):

I agree with the sentiment completely. I tried that approach and the senior engineers (not computer science but other engineering fields) at my employer were a) intolerant of change and b) refused to use anything that can't be investigated as text since they don't trust any binary reader program to still work 150 years from now which apparently is a crucial requirement. As such I have been forced to be inventive with text formats and sympathize strongly with Mason about this.

Mason Protter (Feb 23 2024 at 19:26):

in this case, yes, I just want someone to be able to look at it and see what it says. But in more advanced examples, if someone wanted to put some data in there and parse it themselves they should be free to do so

jar (Feb 23 2024 at 19:27):

Mason Protter (Feb 23 2024 at 19:31):

jar (Feb 23 2024 at 19:36):

Anyway you can put # comments at the top of a csv, many readers have an option to strip them

Mason Protter (Feb 23 2024 at 19:39):

Carsten Bauer (Feb 25 2024 at 05:40):

Fwiw, my lesson with data (at least on HPC clusters): either use different files (annoying but simple) or established binary formats (like e.g. HDF5).

I wouldn't use JLD2 for "long term" storage. Actually, I did in the past and had to manually convert everything to plain HDF5 because of incompatibilities and broken files.

As for the "human readable" debate, I kind of agree with @jar. Binary viewers, say for HDF5, are fine, as long as it is a established binary format with a standard.

Peter Deffebach (Sep 10 2024 at 15:58):

Probably not the solution you want, but TableMetaDataTools.jl has some functions to help you save and read metadata. If would have to be a separate toml file, not in the .csv, but at least it automates attaching the metadata in a toml file to the julia object you want.

Stream: helpdesk (published)

Topic: Saving tabular data and metadata together

Mason Protter (Feb 23 2024 at 17:31):

Mason Protter (Feb 23 2024 at 17:31):

Jordan Cluts (Feb 23 2024 at 18:01):

Brian Chen (Feb 23 2024 at 18:02):

Mason Protter (Feb 23 2024 at 18:05):

Jordan Cluts (Feb 23 2024 at 18:06):

Brian Chen (Feb 23 2024 at 18:06):

Brian Chen (Feb 23 2024 at 18:07):

Expanding Man (Feb 23 2024 at 18:07):

Mason Protter (Feb 23 2024 at 18:08):

Brian Chen (Feb 23 2024 at 18:08):

Mason Protter (Feb 23 2024 at 18:10):

Jordan Cluts (Feb 23 2024 at 18:11):

Expanding Man (Feb 23 2024 at 18:11):

Brian Chen (Feb 23 2024 at 18:12):

Expanding Man (Feb 23 2024 at 18:12):

Mason Protter (Feb 23 2024 at 18:13):

Brian Chen (Feb 23 2024 at 18:13):

Mason Protter (Feb 23 2024 at 18:13):

Jordan Cluts (Feb 23 2024 at 18:14):

Expanding Man (Feb 23 2024 at 18:14):

Brian Chen (Feb 23 2024 at 18:14):

Brian Chen (Feb 23 2024 at 18:15):

Expanding Man (Feb 23 2024 at 18:15):

Expanding Man (Feb 23 2024 at 18:16):

Mason Protter (Feb 23 2024 at 18:18):

Mason Protter (Feb 23 2024 at 18:18):

Brian Chen (Feb 23 2024 at 18:18):

Brian Chen (Feb 23 2024 at 18:18):

Brian Chen (Feb 23 2024 at 18:19):

Mason Protter (Feb 23 2024 at 18:19):

Expanding Man (Feb 23 2024 at 18:20):

Mason Protter (Feb 23 2024 at 18:23):

Jordan Cluts (Feb 23 2024 at 18:23):

Brian Chen (Feb 23 2024 at 18:27):

Mason Protter (Feb 23 2024 at 18:28):

Mason Protter (Feb 23 2024 at 18:28):

Mason Protter (Feb 23 2024 at 18:29):

Expanding Man (Feb 23 2024 at 18:33):

Jordan Cluts (Feb 23 2024 at 18:38):

Mason Protter (Feb 23 2024 at 18:40):

Brian Chen (Feb 23 2024 at 18:41):

Brian Chen (Feb 23 2024 at 18:41):

Brian Chen (Feb 23 2024 at 18:42):

Brian Chen (Feb 23 2024 at 18:42):

Brian Chen (Feb 23 2024 at 18:43):

Jordan Cluts (Feb 23 2024 at 18:52):

Expanding Man (Feb 23 2024 at 18:59):

jar (Feb 23 2024 at 19:03):

Mason Protter (Feb 23 2024 at 19:04):

jar (Feb 23 2024 at 19:09):

Jordan Cluts (Feb 23 2024 at 19:13):

jar (Feb 23 2024 at 19:20):

Mason Protter (Feb 23 2024 at 19:21):

Mason Protter (Feb 23 2024 at 19:21):

jar (Feb 23 2024 at 19:23):

Jordan Cluts (Feb 23 2024 at 19:24):

Mason Protter (Feb 23 2024 at 19:26):

jar (Feb 23 2024 at 19:27):

Mason Protter (Feb 23 2024 at 19:31):

jar (Feb 23 2024 at 19:36):

Mason Protter (Feb 23 2024 at 19:39):

Carsten Bauer (Feb 25 2024 at 05:40):

Peter Deffebach (Sep 10 2024 at 15:58):