Stream: helpdesk (published)

Topic: Workflow / Pipeline for ETL in Julia


Davi Sales Barreira (Apr 20 2022 at 23:49):

Hey, everyone. I'm doing some ETL in Julia, and things are a bit all over the place. I was wondering if there is any package out there to organize everything into a pipeline / workflow. I've seen some of this stuff in Python (e.g. Luigi), and I was wondering if there is anything similar in Julia. Any suggestions on code organization are also welcome :)

Júlio Hoffimann (Apr 21 2022 at 16:30):

ETL stands for...?

Expanding Man (Apr 21 2022 at 16:32):

"extract, transform, load"

Expanding Man (Apr 21 2022 at 16:33):

I have to do a lot of this for my job. DataFrames.jl is wonderful, but I'm afraid there aren't any good options right now for out-of-memory stuff.

Expanding Man (Apr 21 2022 at 16:34):

Most database stuff now implements DBInterface, so that's been useful as well.

Davi Sales Barreira (Apr 21 2022 at 17:29):

@Júlio Hoffimann, as @Expanding Man said, it's pretty much the workflow where I take the raw data and do some transformations before training my model. There is some software that lets you draw the actual pipeline, which is good for documentation.

Davi Sales Barreira (Apr 21 2022 at 17:29):

Thanks, @Expanding Man, I'll take a look.

Júlio Hoffimann (Apr 21 2022 at 17:30):

@Davi Sales Barreira you mean tabular transformations? I think you are already aware of TableTransforms.jl, right?

Davi Sales Barreira (Apr 21 2022 at 17:34):

Yeah, ETL usually comes before the types of transformations listed in TableTransforms.jl. I haven't actually used TableTransforms.jl, so perhaps it's adaptable to my case.
In my case, the original data is actually an html file, which I then process in several steps to get a table. So it's more this pipeline from really raw data to somewhat structured data, e.g. a table.

Davi Sales Barreira (Apr 21 2022 at 17:35):

I mean, I have to use the html, cross-reference it with other datasets, and so on.

Expanding Man (Apr 21 2022 at 17:39):

Can you give examples of things that e.g. Luigi could do for you that you'd like to be able to do? My understanding was that most of the usefulness of packages like that is in handling distributed tasks and out-of-memory stuff, though it doesn't really sound like that's what you're describing.

Expanding Man (Apr 21 2022 at 17:40):

A lot of that could be done with Dagger.jl, though you might still have to do significant setup work if you are trying to run it on a cluster.

Davi Sales Barreira (Apr 21 2022 at 18:50):

Actually, there will be out-of-memory stuff. But I'm not there yet, because I'm still getting the code to parse the data.
So, I have a bunch of html and json files in an S3 bucket. I also have some databases in Metabase.
I've downloaded a couple of examples locally, and I'm writing some code that parses the information, cleans and structures it, and generates some new tables that will be used for modelling.
Now, over time new html/json files will be sent to the S3 bucket, and from time to time there might be errors in the parsing. So I want to structure a pipeline that lets me organize this whole shebang.

Davi Sales Barreira (Apr 21 2022 at 18:51):

I haven't actually used Luigi. But I read some of the docs, and it seems to do exactly the sort of stuff I just described.

Davi Sales Barreira (Apr 21 2022 at 18:57):

It looks like Dagger.jl might help. Would you say it's already mature enough for doing production stuff?

Davi Sales Barreira (Apr 21 2022 at 18:58):

I've started working for a startup and I'm in charge of the DS stuff. So I'm trying to use Julia for the most part.

Expanding Man (Apr 21 2022 at 19:02):

I'm still having trouble understanding what it is you want to do that does not involve just writing a bunch of functions in a module somewhere. If it's something you are planning on scaling up, it might save you time in the long run to start with Dagger.jl sooner rather than later; it scales down very well. Its code base is a little messy in my opinion, but it has been around for quite a while and is very actively maintained.

Expanding Man (Apr 21 2022 at 19:21):

If your question is more about workflow: if I have a ton of tables, I tend to pass them around in a big Dict of DataFrames along with some ancillary data (usually I create a struct for this). When things are messy enough, that struct has to get passed absolutely everywhere and winds up acting like a pseudo-global state.

This approach has worked pretty well for me. Fortunately, Julia doesn't really penalize you for having tons of accessible references in a function, even if they are poorly typed: type stability is only a problem when you access values from an under-typed container, and any function that doesn't do that is fine. So even though the approach I described probably sounds super inefficient, it's not that bad.

Again, parallelism is a different issue. I've messed around a bit with populating something like that using Dagger.jl, but nothing at a very large scale. It might be nice if we had a slightly-higher-level wrapper to Dagger.jl that made this a little easier, but I don't know what that would look like. Like I said, I think the main use of the pipeline tools you were referring to, like Luigi (maybe Dask?), has a lot more to do with managing tasks on a cluster.
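
A minimal sketch of the "struct holding a Dict of tables" pattern from this message, assuming plain vectors of named tuples stand in for DataFrames so that it runs with Base Julia only. `PipelineState`, `total_amount`, and the table contents are hypothetical names made up for illustration. Splitting the work into two methods (a function barrier) is one common way to keep the loosely typed lookup away from the hot loop; it is an illustration of the point above, not something spelled out in the thread.

```julia
# Hypothetical container: named tables plus ancillary metadata.
# Values are typed `Any` on purpose, mimicking an under-typed container;
# in practice they would probably be DataFrames.
struct PipelineState
    tables::Dict{String,Any}
    meta::Dict{Symbol,Any}
end

# Convenience constructor that starts with empty dictionaries.
PipelineState() = PipelineState(Dict{String,Any}(), Dict{Symbol,Any}())

# Typed kernel: compiled for the concrete element type of `rows`.
total_amount(rows::Vector{<:NamedTuple}) = sum(r.amount for r in rows)

# Outer method: does the untyped lookup once, then dispatches to the kernel.
total_amount(state::PipelineState, name::String) = total_amount(state.tables[name])

state = PipelineState()
state.tables["orders"] = [(id = 1, amount = 10.0), (id = 2, amount = 5.5)]
state.meta[:source] = "local sample"

println(total_amount(state, "orders"))  # prints 15.5
```

Only the single lookup `state.tables[name]` is dynamically dispatched here; the summation itself runs in a method specialized on the concrete row type.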

Brian Chen (Apr 21 2022 at 19:38):

One question that might help determine whether Dagger is sufficient for this use case is how much persistence/orchestration functionality you need for this ETL pipeline. Dagger has a web dashboard for tracking tasks, but to my knowledge no pipeline persistence like many Airflow-esque tools have. Think of a daemon that tracks when files are added to a bucket and can auto-feed them through the pipeline, and that can restart itself gracefully and log when it encounters errors. Yes, all this can be done with or without Dagger, but you'd have to roll more of your own code for it.
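
To give a rough idea of what "rolling your own" could look like, here is a minimal sketch using Base Julia only. It assumes a local directory stands in for the bucket, and `parse_file` and `process_new!` are hypothetical names. It does a single polling pass with error logging; a real daemon would also need scheduling, persistence of the `seen` set, and restarts, none of which are shown.

```julia
# Hypothetical parser; stands in for the real html/json parsing step.
parse_file(path::AbstractString) = length(read(path, String))

# Process every file in `dir` that is not yet in `seen`.
# A failing file is logged and skipped so one bad input does not stop the pass.
# Files are marked as seen even on failure, so they are not retried forever.
function process_new!(seen::Set{String}, dir::AbstractString)
    results = Dict{String,Int}()
    for name in readdir(dir)
        name in seen && continue
        try
            results[name] = parse_file(joinpath(dir, name))
        catch err
            @warn "parsing failed" file = name exception = err
        end
        push!(seen, name)
    end
    return results
end

# Demo on a temporary directory with one sample file.
dir = mktempdir()
write(joinpath(dir, "a.html"), "<p>hi</p>")

seen = Set{String}()
first_pass = process_new!(seen, dir)    # picks up a.html
second_pass = process_new!(seen, dir)   # nothing new, so this is empty
```

Calling `process_new!` from a loop with `sleep` between passes would approximate the polling behaviour; whether failed files should be retried is a design choice left open here.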

Davi Sales Barreira (Apr 21 2022 at 19:44):

Thanks again, @Expanding Man. Now I get your point. I have two separate issues. The first one is indeed running on a cluster, with tasks that scale, and Dagger.jl seems to do that for me. The second problem is actually just writing a bunch of functions, as you mentioned. So I was just wondering how to "organize" the whole process of getting data from many locations and returning a clean table. Hence, it's more about organizing and documenting than actually coding. The tip on putting dataframes in a dict seems quite handy.

Expanding Man (Apr 21 2022 at 19:49):

Opinions will differ, but for me a good bit of coding wisdom is not to be too eager to make things more complicated than they need to be. Running your process in a huge distributed environment is probably going to be complicated, and there's probably no way around using Dagger.jl for that, and even that you might find to be lacking features. However, if your intention is to run on one machine, simply getting it organized should not require anything that any programming language cannot already offer you. For example, if you were doing this in Python instead, I suspect it would be a bad move to use something like Luigi or Dask just to write a few functions down.
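
As a rough illustration of "just a few functions", here is a sketch in Base Julia where each ETL stage is a plain function and the pipeline is their composition. The function names and the toy string input are hypothetical; in a real project these would likely live in a module and operate on files and tables.

```julia
# Extract: split raw text into non-empty lines.
extract(raw::AbstractString) = split(raw, '\n'; keepempty = false)

# Transform: trim whitespace and normalize case.
transform(lines) = [uppercase(strip(l)) for l in lines]

# Load: store the cleaned rows under integer keys.
load(rows) = Dict(i => r for (i, r) in enumerate(rows))

# The pipeline is just the stages chained together.
run_pipeline(raw) = raw |> extract |> transform |> load

table = run_pipeline("alpha\n beta \n")
```

Each stage can be tested and documented on its own, and swapping a stage does not require any framework.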

Expanding Man (Apr 21 2022 at 19:50):

Though admittedly sometimes it can be hard to tell when exactly you've arrived at the point of actually needing something more complicated.

Davi Sales Barreira (Apr 22 2022 at 00:11):

Advice much appreciated!

Nils (Apr 22 2022 at 08:07):

I have never used it and I'm not sure it's ever really taken off, but a package that advertises itself as "clearing the pipeline jungle" is FeatureTransforms.jl, which was presented at last year's JuliaCon (blog post including the talk here).

DrChainsaw (Apr 22 2022 at 08:14):

Expanding Man said:

It might be nice if we had a slightly-higher-level wrapper to Dagger.jl that made this a little easier

Fwiw, I use FileTrees.jl a lot for data wrangling. It is in itself quite a nice abstraction of the "Dict of DataFrames", with some very handy utilities for slicing around and map-reducing.

It can make use of Dagger to do stuff in parallel as well, so you can set up a lazy pipeline of transformations and let Dagger.jl produce the end result. I have found that it is very often possible to use the same code to do eager local processing on a smaller subset and then fire away at the whole dataset on a cluster once you think the code is correct.

I can't vouch for any part of it being "production grade" though.

Sundar R (Apr 29 2022 at 08:05):

I recently came across Pipelines.jl (https://cihga39871.github.io/Pipelines.jl/stable/), which in turn mentions JobSchedulers.jl. I'm not sure how relevant it is to your use case, as this is not my wheelhouse and I don't completely understand where each of these options stands, but maybe it's helpful :shrug:


Last updated: Nov 22 2024 at 04:41 UTC