Stream: helpdesk (published)

Topic: ✔ Artifacts for projects that are not packages

Notification Bot (Nov 26 2021 at 13:40):

Yakir Luc Gagnon has marked this topic as resolved.

Rik Huijzer (Nov 28 2021 at 19:54):

Why not make a package for the paper? I do that and find that it works quite well, but maybe there is an easier way?

Júlio Hoffimann (Nov 29 2021 at 00:27):

I use DataDeps.jl for papers. Maybe you don't need the Artifacts framework after all @Yakir Luc Gagnon ?

Júlio Hoffimann (Nov 29 2021 at 00:28):

The package works nicely and was designed with your use case in mind. Register a URL and use the data as if it were already downloaded to disk.
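
A minimal sketch of the "register a URL" workflow described above, assuming DataDeps.jl is installed. The dependency name, the message, the URL and the file name are hypothetical placeholders, and no checksum is given here (DataDeps.jl accepts one as an optional fourth argument).

```julia
using DataDeps

# Register the data once, e.g. at the top of the analysis script.
# The name, message and URL below are hypothetical placeholders.
register(DataDep(
    "PaperData",
    "Raw measurements accompanying the paper (placeholder description).",
    "https://example.org/paper/measurements.csv",
))

# Resolving datadep"PaperData" returns a local directory. The first call
# downloads the file (after asking for confirmation); later calls reuse it.
load_raw() = read(joinpath(datadep"PaperData", "measurements.csv"), String)
```

Nothing is downloaded until `load_raw()` is actually called, so registration itself is cheap.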

Yakir Luc Gagnon (Nov 29 2021 at 08:59):

First DataDeps:
I love DataDeps and use it regularly. It's truly great. I find though that there are a few of these packages that overlap a bit, and since the Artifacts framework is intrinsic to Julia I'd give it a shot. I actually don't have a strong opinion on whether it's better or worse for my use-case, I'm not sure there is even a good answer to that. But this is the first time I use Artifacts for something like this. And so far it works very nicely.

Secondly, package versus project for scientific papers:
This is a great question. I did both by now, here's what I learned.

Using a package

Advantages

you can leverage the updating functionality of Pkg to get the people you work with to easily update their versions (i.e. before publishing).
you can use the Test framework to include tests in your publication
possibly more extensive documentation

Disadvantages

users need to run using MyPackage and run_main() every time they want to generate the results of the study, instead of just include("generate_results.jl")
we include a Manifest.toml file in the package to make sure it will always work, which is not terribly natural for a package to do.

Using a project

Advantages

it's easier for the reader to parse
fewer moving parts

Disadvantages

basically the inverse of the advantages of using a package...
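
The two ways of running the analysis compared above can be sketched side by side. `MyPackage`, `run_main` and `generate_results.jl` are the names used in the message. The analysis itself (a mean of three made-up numbers) is only a placeholder, and the module is defined inline here so that the snippet is self-contained.

```julia
# Package-style: the analysis lives in a module and is run through an entry point.
module MyPackage

export run_main

# Placeholder for the real analysis; returns a made-up summary value.
function run_main()
    samples = [1.0, 2.0, 3.0]
    return sum(samples) / length(samples)
end

end # module

using .MyPackage   # with an installed package this would be `using MyPackage`
result_from_package = run_main()

# Project-style: the same analysis is a plain script that is simply included.
script = joinpath(mktempdir(), "generate_results.jl")
write(script, """
samples = [1.0, 2.0, 3.0]
sum(samples) / length(samples)
""")
result_from_script = include(script)   # include returns the last evaluated value

println(result_from_package == result_from_script)
```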

I'd love to hear your opinions on this. I also think that this heavily depends on your specific field of research. People that do a lot of simulations should find DrWatson.jl amazing, people that don't might find it too large or opinionated (rightly or not)?

j-fu (Nov 29 2021 at 09:14):

Good discussion here. I did some write-up on projects here: https://j-fu.github.io/marginalia/julia/project-workflow/
In short, I tend to side with the idea to set a project up as a package with Manifest.toml.
Currently trying this out with my colleagues, so far it seems to work.
A follow-up on packages and on local registries is in the making...

Yakir Luc Gagnon (Nov 29 2021 at 10:30):

wow, I love this! I really like the idea of keeping a packages repo and deving those. then everything ships together and there shouldn't be any issues whatsoever to reproduce (in my specific case I depend on two of my own packages, one registered and one not). Hmmm... I might change to this! Thank you @j-fu

Yakir Luc Gagnon (Nov 30 2021 at 09:51):

So actually @j-fu , I'm not so convinced anymore: the pros and cons I listed above are for an actual package versus a project. The framework you nicely presented in that blog seems (I might be wrong) like a project. Here's why:
You generated a package and with it all its functional sub-directories, that's true, but then by requiring that the user start a session from within this package's path and activate it there, you've effectively rendered that package into a project. Or it's more correct to say that the user interacts with that code as a project, not as a package. Interacting with it as a package would be starting Julia anywhere and just using MyProject (assuming it's been added to the current environment).

I still really like the ideas presented in your blog and the whole framework, but I think what you've described is a project, not a package.

j-fu (Nov 30 2021 at 10:11):

Indeed, I agree, I described a project, not a package - this was my intention. Instantiating a project as a package (as B. Kaminski advocates) helps with the possibility to run tests, documentation generation etc.

The second point (which I learned from reading the tutorials of Dr.Watson.jl) is the possibility to develop project-local packages within this project due to the ability of Pkg to use a relative path in Manifest.toml. This IMHO helps with early states of new packages, where just ideas are tried out etc. This initial development would take place locally within the project.
Only at later stages - e.g. when it should be shared between several projects - a budding package would be registered. I envision to have some intermediate level registry (using LocalRegistry.jl) for versions 0.0.x , after that, re-registration in e.g. General starting with 0.1.0 would be possible.
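
A sketch of the project-local package idea, assuming a throwaway directory stands in for the paper's repository. The sub-package name `MySubPackage` and the `packages` folder are hypothetical. Only standard `Pkg` calls are used (`activate`, `generate`, `develop`).

```julia
using Pkg

# Throwaway application directory standing in for the paper's repository.
app_dir = mktempdir()

cd(app_dir) do
    Pkg.activate(".")                       # the application environment
    mkpath("packages")
    # Generate a skeleton package inside the repository (name is hypothetical).
    Pkg.generate(joinpath("packages", "MySubPackage"))
    # Track it by path instead of by a registered version.
    Pkg.develop(path=joinpath("packages", "MySubPackage"))
end

# The manifest now lists the sub-package together with the path it is tracked from.
manifest = read(joinpath(app_dir, "Manifest.toml"), String)
println(manifest)
```

Because the sub-package is tracked by path inside the repository, cloning the repository brings the application and its budding packages along together.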

Yakir Luc Gagnon (Nov 30 2021 at 11:16):

Love this, yeah, I agree. As I said before, I really like keeping accompanying packages within the project's repo. It makes the whole thing completely reproducible. Thanks for the useful discussion!

Sukera (Nov 30 2021 at 12:19):

any package is also always a project, no?

j-fu (Nov 30 2021 at 14:21):

Ok, you triggered me to do some RTFM (for my own alphabetization, should have done this earlier...): the Pkg glossary is quite clear on this: https://pkgdocs.julialang.org/v1.6/glossary/#Glossary and introduces "Applications", which include "simulation/analytics code accompanying a scientific paper".

So in that sense we are talking here about applications vs. packages. Will update my post when I find time...

Sukera (Nov 30 2021 at 14:27):

a project is just a directory with a Project.toml in it. A package is a project with a special directory structure (e.g. usually has src, test...) and is registered. An application is a self-contained project, commonly with a Manifest.toml committed as well (for reproducibility), and it may also be compiled to a static binary (e.g. using PackageCompiler)

Sukera (Nov 30 2021 at 14:36):

in general, for a publication you want to definitely include the Manifest.toml, since it allows anyone reading your paper to reproduce your exact julia environment by running ]instantiate
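
As a sketch, the same step can be done from a script instead of the Pkg REPL. The helper name `reproduce_environment` and the example path are hypothetical, and the check assumes the manifest uses the default file name Manifest.toml.

```julia
using Pkg

"""
Activate the project in `project_dir` and install the exact package versions
recorded in its committed Manifest.toml.
"""
function reproduce_environment(project_dir::AbstractString)
    manifest = joinpath(project_dir, "Manifest.toml")
    isfile(manifest) || error("No Manifest.toml in $project_dir; exact versions are not pinned.")
    Pkg.activate(project_dir)
    Pkg.instantiate()
    return manifest
end

# Usage, assuming the paper's repository was cloned to a hypothetical path:
# reproduce_environment("path/to/cloned-paper-repo")
```

Refusing to continue without a manifest is a deliberate choice in this sketch: with only a Project.toml, Pkg would resolve to whatever versions are current, which is not the environment the paper was produced with.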

j-fu (Nov 30 2021 at 14:40):

Exactly, after all, the committed manifest makes the difference. Besides reproducibility, this allows for dev'ed packages with relative paths, which provides a reasonably low-key way to start collaborative package development before registration. One would have an application monorepo with subpackages in it and can register the subpackages when they are sufficiently mature.

j-fu (Nov 30 2021 at 14:42):

git clone and ]instantiate are also sufficient for this approach.

Sukera (Nov 30 2021 at 14:52):

you can also ]add a git repo directly, but that also requires a Manifest

Sukera (Nov 30 2021 at 14:52):

I'm not too fond of monorepos :)

Yakir Luc Gagnon (Nov 30 2021 at 18:36):

A monorepo for an application used to reproduce the stats and figures of a scientific publication is great. You wouldn't want all the code used to reproduce the published article in a bunch of different places. Think of it this way, all your projects are probably in one monorepo anyways, /home/sukera/ :stuck_out_tongue:

Mason Protter (Nov 30 2021 at 18:40):

Depends on your needs, but I often find it’s worth the time to split out my workflow into packages

Sukera (Nov 30 2021 at 18:43):

I don't have a .git in my home directory, no :p

Yakir Luc Gagnon (Nov 30 2021 at 19:21):

I feel like we're talking over each other's heads: I too hate having things in one single monolithic repo, and my "workflow" also works best if it's split into composable units. But, if you're gonna publish a scientific article, and you want to provide a single link to a repository that contains everything the reader would need to reproduce said article, I'd argue it's better if all the moving pieces are in one tracked place.

Sukera (Nov 30 2021 at 19:24):

I'm not arguing against that

Mason Protter (Nov 30 2021 at 20:22):

Isn’t that why we have Project.toml and Manifest.toml files?

Mason Protter (Nov 30 2021 at 20:23):

I’ve certainly always found Julia projects split into packages waaaay more reproducible than anything else out there in the scientific community.

“Run this make file” is not reproducible.

Mason Protter (Nov 30 2021 at 20:23):

Or even worse is “here’s my Matlab script”

Sukera (Nov 30 2021 at 20:24):

my point is that a monorepo implies keeping things that should be tracked via separate versions under the same version - and that's not a good thing

Sukera (Nov 30 2021 at 20:24):

I'm not at all opposed to having subprojects (in terms of having subdirectories with their own Project.toml)

Sukera (Nov 30 2021 at 20:25):

as long as those subprojects are semantically supposed to be tracked with the same version, it's fine

Sukera (Nov 30 2021 at 20:26):

one example for that would be having a subdirectory for plotting, while the main project doesn't require plotting packages

Mason Protter (Nov 30 2021 at 20:39):

I’m agreeing with you. I’m disagreeing with @Yakir Luc Gagnon

Yakir Luc Gagnon (Nov 30 2021 at 22:29):

I hope we are all talking about the same thing here. Different research fields might legitimately lend themselves to one way more than the other. In my field, behavioral biology (or more specifically visual ecology) for instance, we have tons of raw videos, we (mostly manually) track animals behaving in them, calibrate some of those results, and analyze the stats. There isn't tons in there that could easily get separated out into packages that would be ultra useful for the community.

My point is that things that are mostly (even only) useful for one specific publication shouldn't need to be semver-tracked, with coverage reports and continuous integration, full documentation, and in the general registry. They can start off as a small module, in a sub-directory of the monorepo in which said Application-project resides. Doing so allows the researcher to think in a composable manner, containing pieces of the analysis within modules, while freeing her/him from maintaining a separate package that only this publication will use. It's kind of neat.

This snapshot, this project that is an application, doesn't need to be scattered across github/lab in a meticulous semver network of packages. It's gonna exist and be used literally only for that one publication. And of course, if more publications/people find parts of it useful, let those inner modules bud out as fully fledged packages!

I honestly think we're thinking of different things when we say "publications". I have a feeling that when you publish you publish super useful code or analysis tool while I publish "findings"... I'd love to hear how I could improve this process, which is why I started this thread, so please let me know where I've misunderstood you.

Sukera (Nov 30 2021 at 22:30):

Again, we're not talking about forcefully ripping things apart into multiple packages and registering them.

Sukera (Nov 30 2021 at 22:32):

and I think we (or I, at least) are very much thinking of code associated with your publication, which serves purely as a tool to achieve your analysis. Your focus is not on the artifact (i.e. the code) itself - that's fine and we're not talking about forcefully separating the two.

Sukera (Nov 30 2021 at 22:33):

what we _are_ talking about is tracking changes in that code and (most importantly) _providing it as part of your publication_, even if it's not the main point of your work. This immensely helps in reproducing your results.

Sukera (Nov 30 2021 at 22:36):

However, at the same time, if you _do_ find yourself writing the same code over and over and over again for the same kinds of publications - factoring them out into their own thing (i.e. package) and referencing _that_ with a consistent version makes both your and your colleagues' life easier, because they don't have to rely on a fragile setup where things can change all the time. To that extent, a Project.toml and a Manifest.toml you can ]instantiate are extremely useful tools - no more, no less. Neither of those two imply that you _have_ to create a package, register, publish & maintain it though - it just makes it easier to use code _that isn't directly shipped with the code directly associated with your publication_ or that's shared by multiple publications.

Yakir Luc Gagnon (Nov 30 2021 at 22:54):

I'm actually not sure if you're responding to me or Mason..

Who said anything about not using a Project and Manifest file...? Did you see j-fu's outlined suggestion in his blog? That is the monorepo I was promoting. Of course we'll use Project and Manifest files, testing, docs, and a readme -- for the Application, the monorepo. That's how we'd make it reproducible. The fact we don't maintain composable chunks in separate packages elsewhere on the web makes it very self contained.

Honestly, I think we're just talking over each other's heads now.

Sukera (Nov 30 2021 at 22:57):

I was replying to you, and I'm fairly certain Mason & I agree on this :)

Mason Protter (Nov 30 2021 at 22:58):

Yeah. Usually what I try to do when I'm doing research for a paper is that I'll make a local package on my computer (or a remote machine). Rarely does this thing ever get published anywhere.

At first, the functions in this package are basically just wrappers around a script that I can call from the REPL or a notebook, but then as I start doing more and more stuff and I get a feel for what sub-functionality is being duplicated a lot, I'll start splitting out parts of these things into their own utility functions. By the end of the project, the package ends up having a hierarchical structure with some very general purpose utilities being wrapped by some quite specialized pipeline functions. That way, almost all the work I'm doing at the REPL / notebook is just writing f(x, y), inspecting the output and then changing things in the package and continuing.

Sometimes through this process I'll notice some group of functions that I'm pretty sure I'll be using again at a future date so I turn that into its own separate module.
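
A rough sketch of the layered structure described above: a few general-purpose utilities wrapped by one specialized pipeline function. The module name, the function names and the data are all made up for illustration and are not taken from anyone's actual project.

```julia
module PaperAnalysis

export zscore, figure1_data

# General-purpose utilities: small, reusable, easy to try out at the REPL.
mean_of(xs) = sum(xs) / length(xs)
std_of(xs) = sqrt(sum((x - mean_of(xs))^2 for x in xs) / (length(xs) - 1))
zscore(xs) = [(x - mean_of(xs)) / std_of(xs) for x in xs]

# Specialized pipeline function: wraps the utilities for one (hypothetical) figure
# by dropping values whose z-score exceeds the cutoff.
function figure1_data(raw::AbstractVector{<:Real}; cutoff::Real = 2.0)
    z = zscore(raw)
    return [x for (x, zi) in zip(raw, z) if abs(zi) <= cutoff]
end

end # module

using .PaperAnalysis
cleaned = figure1_data([1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 10.0])
println(cleaned)
```

If the utilities turn out to be useful elsewhere, they are already isolated enough to be moved into their own module or package.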

Mason Protter (Nov 30 2021 at 22:58):

Definitely not saying to always split things up all the time

Mason Protter (Nov 30 2021 at 22:59):

But it pays to put yourself in a position where it's easy to split things into sub-repos

Sukera (Nov 30 2021 at 22:59):

@Yakir Luc Gagnon I guess there's a difference in terminology - in Software Engineering, a "monorepo" is a collection of distinct, unrelated projects (that should have their own versions) being tracked in a single repo with a single version instead. This is what I'm arguing against.

Sukera (Nov 30 2021 at 22:59):

That's also my hangup with calling your approach a "monorepo" - it isn't! it's just a single repo for a single project (or it seems to be, if I understand you correctly).

Sukera (Nov 30 2021 at 23:01):

One famous example of a monorepo is (or at least used to be) Windows, the operating system. Everything, from the kernel to Windows Explorer to the graphics stack, is/was a single repo with a single version: the Windows version and its build number.

Yakir Luc Gagnon (Nov 30 2021 at 23:01):

blah! no one wants that! I'll argue against that too!

You understood me perfectly well. And I was seemingly using the term monorepo incorrectly (to me, mono - single, repo - repository).

Sukera (Nov 30 2021 at 23:02):

No worries! :) Not all meanings for words can be inferred from their linguistic origins alone after all

Sukera (Nov 30 2021 at 23:04):

I guess it's one more reason for why a universal translator is an insanely hard project - not even the same language can be translated to itself :joy:

j-fu (Nov 30 2021 at 23:58):

Sorry for having introduced "That-What-Must-Not-Be-Named" (monorepo) into the discussion. It was just meant as a bit of an analogy :sweat: ...

Yakir Luc Gagnon (Dec 01 2021 at 07:42):

Yeah, I tried to understand where and what went wrong. You introduced it and I just went with it cause it made total sense to me. I'm glad at least we understood each other.

Thanks all for a great discussion (love !)

Last updated: Oct 02 2023 at 04:34 UTC