Hey, everyone. I'm doing some ETL in Julia, and things are a bit all over the place. I was wondering if there was any package out there to organize everything in a pipeline / workflow. I've seen some of this in Python (e.g. Luigi) and was wondering if there was anything similar in Julia. Any suggestions on code/organization are also welcome :)
ETL stands for...?
"extraction, transformation, loading"
I have to do a lot of this for my job. DataFrames.jl is wonderful, but I'm afraid there aren't any good options right now for out of memory stuff.
Most database stuff now implements DBInterface so that's been useful as well
@Júlio Hoffimann , as @Expanding Man said, it's pretty much the workflow where I take the raw data and do some transformations before training my model. Some tools let you draw the actual pipeline, which is good for documentation.
Thanks, @Expanding Man , I'll take a look.
@Davi Sales Barreira you mean tabular transformations? I think you are already aware of TableTransforms.jl, right?
Yeah, ETL usually comes before the types of transformations listed in TableTransforms.jl. I haven't actually used TableTransforms.jl, so perhaps it's adaptable to my case.
In my case, my original data is actually an html file, which I then put through a bunch of steps in order to get a table. So it's more the pipeline from really raw data to something somewhat structured, e.g. a table.
I mean, I have to use the html, and cross it with other datasets, and so on.
Can you give examples of things that e.g. Luigi could do for you that you'd like to be able to do? My understanding was that most of the usefulness of packages like that is for handling distributed tasks and handling out of memory stuff, though it doesn't really sound like that's what you're describing.
A lot of that could be done with Dagger.jl, though you might still have to do significant setup work if you are trying to run it on a cluster.
Actually there will be out of memory stuff. But I'm not there yet, cause I'm still getting the code to parse the data.
So, I have a bunch of html files in a S3 bucket, and json files. I also have some databases in Metabase.
I've downloaded a couple of examples locally, and I'm writing some code that parses the information, cleans, structures and generates some new tables, that will be used for modelling.
Now, over time new html/json files will be sent to the S3 bucket, and from time to time there might be errors in the parsing. So I want to structure a pipeline that lets me organize this whole shebang.
I haven't actually used Luigi. But I read some of the docs, and it seems to do exactly the sort of stuff I just described.
It looks like Dagger.jl might help. Would you say it's already mature enough for doing production stuff?
I've started working for a startup and I'm in charge of the DS stuff. So I'm trying to use Julia for the most part.
I'm still having trouble understanding what it is you want to do that does not involve just writing a bunch of functions in a module somewhere. If it's something you are planning on scaling up, it might save you time in the long run to start with Dagger.jl sooner rather than later; it scales down very well. Its code base is a little messy in my opinion, but it has been around for quite a while and is very actively maintained.
If your question is more about workflow: if I have a ton of tables I tend to pass them around in a big Dict of DataFrames along with some ancillary data (usually I create a struct for this). When things are messy enough, that struct has to get passed absolutely everywhere and winds up acting like a pseudo global state.
This approach has worked pretty well for me. Fortunately, Julia doesn't really penalize you for having tons of accessible references in a function, even if they are poorly typed: type stability only becomes a problem in a function that accesses values from an under-typed container. Any function that doesn't do that is fine. So even though the approach I described probably sounds super inefficient, it's not that bad.
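A minimal sketch of that pattern, using only Base Julia. The names `PipelineState`, `tables`, `meta`, `mean_price`, and `summarize` are made up for illustration, and plain NamedTuples of vectors stand in for DataFrames so the snippet runs without any packages. The outer function reads from the loosely typed Dict, then hands the column to a small kernel with a concrete argument type (a "function barrier"), so the hot loop itself is type stable.

```julia
# Hypothetical container; none of these names come from a package.
struct PipelineState
    tables::Dict{Symbol,Any}   # loosely typed on purpose: any table-like value
    meta::Dict{Symbol,Any}     # ancillary data (source, run date, ...)
end

# Kernel: Julia specializes this on the concrete vector type it receives,
# so the loop below is compiled for that type.
function mean_price(prices::AbstractVector{<:Real})
    total = 0.0
    for p in prices
        total += p
    end
    return total / length(prices)
end

# Outer function: the compiler cannot infer what `orders` is here, but the
# cost is a single dynamic dispatch before entering the typed kernel.
function summarize(state::PipelineState)
    orders = state.tables[:orders]
    return mean_price(orders.price)
end

# NamedTuples of vectors stand in for DataFrames in this sketch.
state = PipelineState(
    Dict{Symbol,Any}(:orders => (id = [1, 2, 3], price = [10.0, 20.0, 30.0])),
    Dict{Symbol,Any}(:source => "example"),
)

summarize(state)
```

With DataFrames.jl the shape would be the same: store each DataFrame under a key in `tables`, and do the heavy work in functions that take the columns as arguments.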
Again, parallelism is a different issue. I've messed around a bit with populating something like that using Dagger.jl but nothing at a very large scale. It might be nice if we had a slightly-higher-level wrapper to Dagger.jl that made this a little easier, but I don't know what that would look like. Like I said, I think the main use of the pipeline tools you were referring to like luigi (maybe dask?) has a lot more to do with managing tasks on a cluster.
One question that might help determine whether Dagger is sufficient for this use case is how much persistence/orchestration functionality you need for this ETL pipeline. Dagger has a web dashboard for tracking tasks, but to my knowledge it has no pipeline persistence like many Airflow-esque tools do. Think of a daemon that tracks when files are added to a bucket and can auto-feed them through the pipeline, and that can restart itself gracefully and log when it encounters errors. Yes, all of this can be done with or without Dagger, but you'd have to roll more of your own code for it.
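As a rough idea of the "roll your own" part, here is a sketch of just the error-handling piece in Base Julia: run one step over a batch of inputs, log failures, and keep going instead of aborting. `run_batch` and `parse_file` are hypothetical names, the file names are placeholders, and watching a bucket, persistence, and restarts are not covered here.

```julia
# Hypothetical helper: apply `step` to every input and collect failures
# instead of letting the first error stop the whole batch.
function run_batch(step, inputs)
    results = Dict{String,Any}()
    failures = Dict{String,Exception}()
    for name in inputs
        try
            results[name] = step(name)
        catch err
            # Log and remember the failure so the file can be retried later.
            @warn "step failed" file = name exception = err
            failures[name] = err
        end
    end
    return results, failures
end

# Stand-in for a real parser: rejects anything that is not an .html file.
parse_file(name) = endswith(name, ".html") ? uppercase(name) : error("unsupported file: $name")

results, failures = run_batch(parse_file, ["a.html", "b.json", "c.html"])
```

After the call, `results` holds the outputs for the two .html inputs and `failures` maps "b.json" to the exception it raised, which could be written to a log or a retry queue.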
Thanks again, @Expanding Man . Now I get your point. I have two separate issues. The first one is indeed running on a cluster, with tasks that scale, and Dagger.jl seems to do that for me. The second problem is actually just writing a bunch of functions, as you mentioned. So I was wondering how to "organize" the whole process of getting data from many locations and returning a clean table. Hence, it's more about organizing and documenting than actually coding. The tip on putting dataframes in a dict seems quite handy.
Opinions will differ, but for me a good bit of coding wisdom is not to be too eager to make things more complicated than they need to be. Running your process in a huge distributed environment is probably going to be complicated, and there's probably no way around using Dagger.jl for that, and even then you might find it lacking features. However, if your intention is to run on one machine, simply getting it organized should not require anything beyond what any programming language already offers you. For example, if you were doing this in Python instead, I suspect it would be a bad move to use something like Luigi or Dask just to write a few functions down.
Though admittedly sometimes it can be hard to tell when exactly you've arrived at the point of actually needing something more complicated.
Advice much appreciated!
Have never used it and I'm not sure it's ever really taken off, but a package that advertises itself as "clearing the pipeline jungle" is FeatureTransforms.jl, which was presented at last year's JuliaCon (blog post including the talk here)
Expanding Man said:
It might be nice if we had a slightly-higher-level wrapper to Dagger.jl that made this a little easier
Fwiw, I use FileTrees.jl a lot for data wrangling. It is in itself a quite nice abstraction of the "Dict of DataFrames" with some very handy utilities for slicing around and map-reducing.
It can make use of Dagger to do stuff in parallel as well, so you can set up a lazy pipeline of transformations and let Dagger.jl produce the end result. I have found that it is very often possible to use the same code to do eager local processing on a smaller subset and then fire away at the whole dataset on a cluster when you think the code is correct.
I can't vouch for any part of it being "production grade" though.
I recently came across Pipelines.jl (https://cihga39871.github.io/Pipelines.jl/stable/) which in turn mentions JobSchedulers.jl . I'm not sure how relevant it is to your use case, as this is not my wheelhouse and I don't completely understand where each of these options stand, but maybe it's helpful :shrug:
Last updated: Nov 06 2024 at 04:40 UTC