Stream: helpdesk (published)

Topic: Saving tabular data and metadata together


view this post on Zulip Mason Protter (Feb 23 2024 at 17:31):

So I have a dataframe with a bunch of benchmarking data, I want to save this dataframe (preferably as plaintext), together with some metadata with things like the machine the benchmarks were performed on, the git commit of the package when it got benchmarked, etc. What's the best way to go about this?

There's no obvious way to include metadata in a CSV file from what I can tell. Do I need to work with JSON or something here?

view this post on Zulip Mason Protter (Feb 23 2024 at 17:31):

My go-to usually would just be to use JLD2 to store a DataFrame and struct with my metadata, but I'd like to store it as plaintext and have the results not be tied to literal julia objects if possible.

view this post on Zulip Jordan Cluts (Feb 23 2024 at 18:01):

Very imperfect but my recent solution to this problem was to smash together a TOML file as a header with the CSV at the bottom. I can re-ingest the CSV by just skipping the correct number of rows and separately I can read the meta-data by slicing off the CSV at the bottom and parsing the TOML file at the top. Not the best solution ever but has worked well enough I suppose.

For now I just manually tell the reader how many rows to skip, not sure what would be the easiest way to detect the "split point" automatically so I could write a function that handles both halves automatically for me.

view this post on Zulip Brian Chen (Feb 23 2024 at 18:02):

JLD2 uses HDF5 files, and HDF5 files support arbitrary key-value metadata (attributes). What I'm less sure about is whether you can use JLD2 itself to read and write such metadata, or if you'd need https://github.com/JuliaIO/HDF5.jl

view this post on Zulip Mason Protter (Feb 23 2024 at 18:05):

But HDF5 isn't human-readable right?

view this post on Zulip Jordan Cluts (Feb 23 2024 at 18:06):

Mason Protter said:

But HDF5 isn't human-readable right?

Answering on behalf of Brian, that's correct. It's a binary format.

view this post on Zulip Brian Chen (Feb 23 2024 at 18:06):

Another low-tech solution would be to put the data file and metadata file into a single tarball or other archive. I personally would prefer everything in one JSON or HDF5 file, but it's hard to beat the simplicity of this approach

view this post on Zulip Brian Chen (Feb 23 2024 at 18:07):

Mason Protter said:

But HDF5 isn't human-readable right?

Correct. I interpreted your statement as being fine with storing the dataframe as binary, as long as the metadata is readable outside of Julia

view this post on Zulip Expanding Man (Feb 23 2024 at 18:07):

I personally advocate the tarball approach. Trying to cram different data formats together into the same file in an ad hoc way rarely ends well.

view this post on Zulip Mason Protter (Feb 23 2024 at 18:08):

Man, data sucks.

view this post on Zulip Brian Chen (Feb 23 2024 at 18:08):

If the requirement is to have everything plaintext and readable, the JSON table approach may be the way to go. Less efficient, but no tarball required

view this post on Zulip Mason Protter (Feb 23 2024 at 18:10):

I'll probably just go for a binary format, because the whole tarball thing sounds awful

view this post on Zulip Jordan Cluts (Feb 23 2024 at 18:11):

Expanding Man said:

I personally advocate the tarball approach. Trying to cram different data formats together into the same file in an ad hoc way rarely ends well.

There is a legacy "format" at my employer that is a few thousand bytes of ascii with metadata fields (I don't recall the exact number) directly prepended to Float32 raw arrays. You have to just know the standard ascii size and split it at the right spot to ingest. It's horrible. I tried to advocate for a .zip archive custom format (or HDF5) but was shot down.

view this post on Zulip Expanding Man (Feb 23 2024 at 18:11):

Parquet has limited metadata support, and I believe arrow does as well, though my feeling about your context ("a bunch of benchmarking data") is that you probably shouldn't be eager to start whipping out dedicated tabular formats.

view this post on Zulip Brian Chen (Feb 23 2024 at 18:12):

I figured it would be better to not recommend Parquet in case you had thoughts :D

view this post on Zulip Expanding Man (Feb 23 2024 at 18:12):

Yeah, I am probably notorious by now for being both the primary maintainer of parquet in the julia ecosystem and an individual who hates the parquet format.

view this post on Zulip Mason Protter (Feb 23 2024 at 18:13):

I'm honestly tempted to just write a CSV with an extra column that has the first cell holding the metadata, and the rest of the column empty

view this post on Zulip Brian Chen (Feb 23 2024 at 18:13):

I just had the same thought

view this post on Zulip Mason Protter (Feb 23 2024 at 18:13):

all I want is CSV with a quick description. This is so stupid.

view this post on Zulip Jordan Cluts (Feb 23 2024 at 18:14):

I wonder how many of these types of "need a quick table format with some other stuff" could be handled by a version of JSON or TOML that included a CSV field as one of the embeddable types

view this post on Zulip Expanding Man (Feb 23 2024 at 18:14):

Mason Protter said:

This is so stupid.

The perpetual sentiment of "data scientists" everywhere. Just wait until you see SQL.

view this post on Zulip Brian Chen (Feb 23 2024 at 18:14):

https://stackoverflow.com/a/33846112 TIL

view this post on Zulip Brian Chen (Feb 23 2024 at 18:15):

Wonder if CSV.jl knows about this

view this post on Zulip Expanding Man (Feb 23 2024 at 18:15):

You can tell CSV.jl about comment characters, I'm pretty sure.

view this post on Zulip Expanding Man (Feb 23 2024 at 18:16):

I was just going to say that ways of appending data to csv's or concatenating csv's of different formats in the same file are not totally unheard of but I wouldn't necessarily recommend it.

view this post on Zulip Mason Protter (Feb 23 2024 at 18:18):

Brian Chen said:

Wonder if CSV.jl knows about this

lolno

view this post on Zulip Mason Protter (Feb 23 2024 at 18:18):

julia> using CSV, DataFrames

julia> file = tempname() * ".csv";

julia> open(file, "w+") do io
           write(io, """
       #publisher,W3C
       #updated,2015-10-17T00:00:00Z
       #name,sensor,temperature
       #datatype,string,float
       sensor,temperature
       s-1,25.5
       """)
       end;

julia> CSV.read(file, DataFrame)
┌ Warning: thread = 1 warning: parsed expected 2 columns, but didn't reach end of line around data row: 2. Parsing extra columns and widening final columnset
└ @ CSV ~/.julia/packages/CSV/aoJqo/src/file.jl:587
┌ Warning: thread = 1 warning: only found 2 / 3 columns around data row: 4. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/aoJqo/src/file.jl:586
5×3 DataFrame
 Row │ #publisher  W3C                   Column3
     │ String15    String31              String15?
─────┼───────────────────────────────────────────────
   1 │ #updated    2015-10-17T00:00:00Z  missing
   2 │ #name       sensor                temperature
   3 │ #datatype   string                float
   4 │ sensor      temperature           missing
   5 │ s-1         25.5                  missing

view this post on Zulip Brian Chen (Feb 23 2024 at 18:18):

As @Expanding Man said, at least it can be taught to ignore the comments

view this post on Zulip Brian Chen (Feb 23 2024 at 18:18):

https://csv.juliadata.org/latest/examples.html#comment_example

view this post on Zulip Brian Chen (Feb 23 2024 at 18:19):

But that still leaves having to parse them out into a usable form

view this post on Zulip Mason Protter (Feb 23 2024 at 18:19):

yeah

view this post on Zulip Expanding Man (Feb 23 2024 at 18:20):

Probably what we need, at least from a Julia ecosystem point of view, is utility functions for making it as simple and effortless as possible to do what you are trying to do with tarballs. Will still involve separate objects from an API point of view though (i.e. metadata as a separate object from table).

view this post on Zulip Mason Protter (Feb 23 2024 at 18:23):

image.png

view this post on Zulip Jordan Cluts (Feb 23 2024 at 18:23):

Expanding Man said:

Probably what we need, at least from a Julia ecosystem point of view, is utility functions for making it as simple and effortless as possible to do what you are trying to do with tarballs. Will still involve separate objects from an API point of view though (i.e. metadata as a separate object from table).

Depending on one's view of how "human readable" a tarball or zip archive is you're just reinventing HDF5 at that point, no?

view this post on Zulip Brian Chen (Feb 23 2024 at 18:27):

It's possible using a zip instead would make this easier. https://github.com/JuliaIO/ZipArchives.jl for example lets you read out individual entries with a single call

view this post on Zulip Mason Protter (Feb 23 2024 at 18:28):

I guess my next question would then become, if I use HDF5, what's the easiest way to store tabular data with heterogeneous columns?

I'm pretty tempted to just shove a DataFrame into JLD2 but I am kinda worried about the compatability of it if the future layout of DataFrames changes

view this post on Zulip Mason Protter (Feb 23 2024 at 18:28):

I guess I could always go for a Vector{<:NamedTuple}

view this post on Zulip Mason Protter (Feb 23 2024 at 18:29):

I'm only storing Int, Float64, and String

view this post on Zulip Expanding Man (Feb 23 2024 at 18:33):

Jordan Cluts said:

Depending on one's view of how "human readable" a tarball or zip archive is you're just reinventing HDF5 at that point, no?

Not really. You could of course argue that HDF5 supports any format because you can stick a binary blob in it, but by design there is a natural isomorphism between a tarball and a file system directory, so you don't need any additional conventions on top of what already exists in the file system to identify files or file format metadata.

view this post on Zulip Jordan Cluts (Feb 23 2024 at 18:38):

I guess I see what you're saying but I suppose this query started with "I have a dataframe and some metadata and I'd like to store and retrieve them" instead of "I have a folder of files of various kinds I'd like to keep together" and while it's easy to transition between those two states HDF5 handles the first rather directly while tarball/zip handle the latter rather directly. And in neither case can you directly inspect the result with a text editor.

(For a broad definition of the word "directly")

view this post on Zulip Mason Protter (Feb 23 2024 at 18:40):

yeah, dealing with multiple files is exactly what I want to avoid

view this post on Zulip Brian Chen (Feb 23 2024 at 18:41):

Mason Protter said:

I'm only storing Int, Float64, and String

So in theory HDF5 eats this use case for breakfast, but I can only find very old an conflicting info on whether there's any integration between DF.jl and HDF5.jl

view this post on Zulip Brian Chen (Feb 23 2024 at 18:41):

https://github.com/JuliaData/DataFrames.jl/issues/64

view this post on Zulip Brian Chen (Feb 23 2024 at 18:42):

https://github.com/JuliaIO/HDF5.jl/issues/122

view this post on Zulip Brian Chen (Feb 23 2024 at 18:42):

Would have to look into how JLD serializes dataframes

view this post on Zulip Brian Chen (Feb 23 2024 at 18:43):

Honestly at this point, I think having everything in one JSON file might be the least bad approach as long as you don't have too much data

view this post on Zulip Jordan Cluts (Feb 23 2024 at 18:52):

Brian Chen said:

Honestly at this point, I think having everything in one JSON file might be the least bad approach as long as you don't have too much data

I haven't tried this but a similar idea is simply to use TOML directly as well. It would be more verbose and less readable than a CSV but for "small" amounts of data could work. It supports a type called an Array of Tables which is basically identical to a vector of named tuples i.e. a Tables.jl compatible source (I don't actually know what TOML.jl converts this to but presumably getting it back into a DataFrame should be pretty easy).

edit: I went and checked and TOML.jl generates a dictionary containing a vector of dictionaries. It'd take a bit of finagling to convert the dictionaries all to named tuples and eventually into a DataFrame but definitely conceivable.

view this post on Zulip Expanding Man (Feb 23 2024 at 18:59):

My point, more broadly, is that readers of tarballs already understand the concept of files and the usual metadata tricks (file extensions and magic bytes) for specifying their format. A reader of a tarball containing a bunch of separate data in different formats should then be able to understand it with almost no context, whereas with HDF5 you'd need to specify which blobs are files. Maybe there is already a standard for this, but my understanding was that it is not necessarily guaranteed that you are supposed to treat paths in HDF5 as paths in a file system.

view this post on Zulip jar (Feb 23 2024 at 19:03):

DataFrames.jl has table-level metadata https://dataframes.juliadata.org/stable/lib/functions/#DataAPI.metadata
Arrow.jl does too https://arrow.apache.org/julia/stable/manual/#Custom-application-metadata

I haven't tried it but that's where I'd start.

view this post on Zulip Mason Protter (Feb 23 2024 at 19:04):

The metadata doesn't get preserved when you write it to a CSV or other non-binary format

view this post on Zulip jar (Feb 23 2024 at 19:09):

Metadata won't get preserved if you write it into a format that doesn't support metadata, such as CSV. Other human-readable formats may support metadata.

view this post on Zulip Jordan Cluts (Feb 23 2024 at 19:13):

jar said:

Metadata won't get preserved if you write it into a format that doesn't support metadata, such as CSV. Other human-readable formats may support metadata.

Yes but it's frustrating that for "human-readable table formats" CSV is the clear victor except for its glaring lack of standard frontmatter/metadata support. And alternatives that handle this sort of key-value store well (JSON, TOML, etc.) in turn don't handle anything larger than very small tables gracefully or at all. Clearly there could be a universal meta-data CSV standard but we (society writ large, not Julia community) haven't managed to do so.

view this post on Zulip jar (Feb 23 2024 at 19:20):

I don't think smarter text formats is the way to go. With either text or binary formats I'm going to need a program to display and edit a table. A simple program like cat isn't going to cut it for either, it's gotta be at least somewhat smart. So instead of trying to make text formats smarter, I'd rather we make our viewers/editors for binary tables more usable.

view this post on Zulip Mason Protter (Feb 23 2024 at 19:21):

I really don't think it's that unreasonable or weird to want some quick over-all metadata at the start of a CSV file

view this post on Zulip Mason Protter (Feb 23 2024 at 19:21):

like literally just a string description field. Nothing clever

view this post on Zulip jar (Feb 23 2024 at 19:23):

So you can look at it in a text editor?

view this post on Zulip Jordan Cluts (Feb 23 2024 at 19:24):

jar said:

I don't think smarter text formats is the way to go. With either text or binary formats I'm going to need a program to display and edit a table. A simple program like cat isn't going to cut it for either, it's gotta be at least somewhat smart. So instead of trying to make text formats smarter, I'd rather we make our viewers/editors for binary tables more usable.

I agree with the sentiment completely. I tried that approach and the senior engineers (not computer science but other engineering fields) at my employer were a) intolerant of change and b) refused to use anything that can't be investigated as text since they don't trust any binary reader program to still work 150 years from now which apparently is a crucial requirement. As such I have been forced to be inventive with text formats and sympathize strongly with Mason about this.

view this post on Zulip Mason Protter (Feb 23 2024 at 19:26):

jar said:

So you can look at it in a text editor?

in this case, yes, I just want someone to be able to look at it and see what it says. But in more advanced examples, if someone wanted to put some data in there and parse it themselves they should be free to do so

view this post on Zulip jar (Feb 23 2024 at 19:27):

put some data in there and parse it themselves

what do these mean?

view this post on Zulip Mason Protter (Feb 23 2024 at 19:31):

whatever they want it to mean

view this post on Zulip jar (Feb 23 2024 at 19:36):

Anyway you can put # comments at the top of a csv, many readers have an option to strip them

view this post on Zulip Mason Protter (Feb 23 2024 at 19:39):

but not to read them

view this post on Zulip Carsten Bauer (Feb 25 2024 at 05:40):

Fwiw, my lesson with data (at least on HPC clusters): either use different files (annoying but simple) or established binary formats (like e.g. HDF5).

I wouldn't use JLD2 for "long term" storage. Actually, I did in the past and had to manually convert everything to plain HDF5 because of incompatibilities and broken files.

As for the "human readable" debate, I kind of agree with @jar. Binary viewers, say for HDF5, are fine, as long as it is a established binary format with a standard.

view this post on Zulip Peter Deffebach (Sep 10 2024 at 15:58):

Probably not the solution you want, but TableMetaDataTools.jl has some functions to help you save and read metadata. If would have to be a separate toml file, not in the .csv, but at least it automates attaching the metadata in a toml file to the julia object you want.


Last updated: Dec 28 2024 at 04:38 UTC