Yet another idea for solving the multiple outputs problem

Alexandr Burdiyan

Oct 17, 2020, 9:08:09 AM
to redo
See this Twitter discussion for background: https://twitter.com/burdiyan/status/1317038438378557440

It seems like Twitter is not the right medium for the discussion I'd like to have, so I'm sending this email.

I ended up creating a simpler example showing the same problem here: https://github.com/burdiyan/redo-virtual-targets-issue/tree/simple

Quoting the proposed solution from the simpler example link:

"What if there were a way for a rule to explicitly declare its outputs? Like `redo-ifchange` there could be `redo-output` (or similar) that would mark the indicated files as static, but generated, and add them as a dependency for the rule running `redo-output`.

I still don't quite understand what static files are in the redo world though :) But I've tested it and it seems to work.

It would be super cool if you could run `redo-output` either before or after the outputs are actually built. The parent process would check that the outputs were generated after the rule has executed. If some of the previously declared outputs disappear, the rule is marked as dirty.

I'm pretty sure this idea would have lots of implications that I'm not aware of, but I find the idea of explicitly declaring outputs pretty elegant. It's a bit like Bazel, but with the simplicity, imperative style, and dynamism of `redo`."
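
To make the idea concrete, here is a minimal sketch of what a rule using the hypothetical `redo-output` might look like (the generator command and file names are made up):

    # a.do -- a virtual target whose rule generates two files.
    ./generate --out foo.json --out bar.json    # hypothetical generator
    redo-output foo.json bar.json               # hypothetical: declare both files
                                                # as generated outputs of this rule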

Prakhar Goel

Oct 17, 2020, 11:38:09 AM
to Alexandr Burdiyan, redo
Q: how does redo link the outputs to the do script, then? The do script
hasn't run yet when some other target calls `redo-ifchange` on any of
the multiple outputs (the first time around, at least).

--
________________________
Warm Regards
Prakhar Goel

Alexandr Burdiyan

Oct 17, 2020, 11:42:47 AM
to Prakhar Goel, redo
I implicitly assumed it was clear that the multiple output files themselves cannot actually be `redo-ifchange`'d, only the virtual target that produces them.

So if a.do uses `redo-output`, then in b.do you'd have to `redo-ifchange a` and then *use* the generated outputs.
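
A minimal sketch of the consuming side, under the same assumptions:

    # b.do -- depend on the virtual target, then *use* its outputs.
    redo-ifchange a
    cat foo.json bar.json > "$3"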

Alexandr Burdiyan

Oct 19, 2020, 6:40:50 AM
to redo
Another approach to this that I'd like to experiment with is to allow rules to write a *list* of output file names into $4, and then have the builder process check whether these files were actually created, modified, and so on. I guess I like it less than a separate `redo-output` program, but it may be easier to implement without too much new code. Then again, another magic positional argument for rules doesn't seem like a very elegant solution.

Ideally, I'd like these output files to behave as follows (a sketch follows the list):

- They are marked as generated.
- If a previously declared output disappears, the target that declared it is dirty and needs to rebuild.
- If a previously declared output is touched from the outside, warn and skip the build, same as with current $3 outputs.
- If an output is declared but doesn't appear, the build fails.
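
A minimal sketch of what a rule using that hypothetical $4 might look like (the $4 semantics are an assumption, not anything redo implements today; the generator is made up):

    # gen.do -- hypothetical: write the names of produced outputs into $4.
    ./generate --out foo.json --out bar.json    # hypothetical generator
    printf '%s\n' foo.json bar.json > "$4"      # builder reads this list afterwards

The builder process would then read $4 after the script exits, record each listed file as a generated output, and enforce the checks above.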

Nils Dagsson Moskopp

Jul 31, 2022, 1:25:33 AM
to Alexandr Burdiyan, redo
My default idea for multiple outputs is to have the rule produce a
tarball containing all of the outputs. Then I make each output depend on
the tarball and extract the single file from it.
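
A minimal sketch of that pattern, with made-up file names and generator:

    # files.tar.do -- run the generator once in a scratch dir, bundle the outputs.
    redo-ifchange input.src
    tmp=$(mktemp -d)
    ./generate input.src --outdir "$tmp"    # hypothetical generator
    tar -C "$tmp" -cf "$3" foo.json bar.json
    rm -r "$tmp"

    # foo.json.do -- each output depends only on the tarball.
    redo-ifchange files.tar
    tar -xOf files.tar foo.json > "$3"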

So … in what ways does this not solve your problems?

Greetings,
Nils

Joseph Garvin

Aug 18, 2023, 5:15:00 PM
to Nils Dagsson Moskopp, Alexandr Burdiyan, redo
Resurrecting an old thread, since I'm trying out implementing something like redo-output... One problem with tarballs is that they have O(n) lookup: you can't seek directly to the file you want to extract; you have to skip over every file that happens to sit in front of it inside the tarball. Admittedly that limitation is specific to tar, and you could work around it with something like zip, but tar is available by default on Unix systems while zip is not.

The bigger problem is that this prevents granular dependencies: letting the build system know you only depend on inner_file_1 is better than depending on files.tar, because the build system can specifically check whether inner_file_1 changed and avoid rebuilding if it hasn't, even if other files inside the tarball have changed.

Prakhar Goel

Aug 18, 2023, 10:01:03 PM
to Joseph Garvin, Nils Dagsson Moskopp, Alexandr Burdiyan, redo
Can't you already do this with redo-stamp?

Joseph Garvin

Aug 20, 2023, 1:46:41 PM
to Prakhar Goel, Nils Dagsson Moskopp, Alexandr Burdiyan, redo
For the second part of the problem, yeah: you could use redo-stamp on only the inner file you actually use.
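
A sketch of that, reusing the made-up file names from earlier and relying on the redo-stamp behavior in apenwarr's redo (the target's stamp becomes a hash of stdin):

    # inner_file_1.do -- extract one file, then stamp its content so that
    # dependents rebuild only when this particular file actually changes.
    redo-ifchange files.tar
    tar -xOf files.tar inner_file_1 > "$3"
    redo-stamp < "$3"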

Nils Dagsson Moskopp

Aug 25, 2023, 3:20:22 PM
to Joseph Garvin, Alexandr Burdiyan, redo
Joseph Garvin <joseph....@gmail.com> writes:

> letting the build system know you only depend on inner_file_1 is better
> than depending on files.tar, because the build system can specifically
> check whether inner_file_1 changed and avoid rebuilding if it hasn't,
> even if other files inside the tarball have changed

I am pretty sure that for granular dependencies you do not want to
depend on files.tar, but on inner_file_1, which is extracted using
inner_file_1.do from files.tar and depends solely on files.tar.

I talked with people about multiple outputs at Chaos Communication Camp
recently and some pretty creative approaches exist; also people seem to
reject the tarball solution mainly on aesthetic grounds.

A similar thing occurs when you tell people who want out-of-tree builds
to just mount an overlay filesystem. The majority of people recoil, but
would clearly be capable of doing so, i.e. their OS offers FS overlays.
People seem to think that it is “weird” and should not be used, despite
accepting that Docker and other software built on overlay filesystems are okay.

Anyway, if it is stupid and it works, it is not stupid. And if it feels
clever and does not work, it is not particularly useful or clever, like
the DAG-toposort “optimizations” which can never beat a naive recursive
top-down build implementation on correctness (if you believe otherwise,
read “Build Systems à la Carte” again and meditate on the formalisms) …

That being said, I would very much prefer to see a real-world issue you
encountered that could not be solved with tar and what you did instead.

Since I use redo for real problems, architecture astronautics annoys me.
So, please show and tell your problem & solution. Maybe do a blog post?

Joseph Garvin

Sep 7, 2023, 4:48:48 PM
to Nils Dagsson Moskopp, Alexandr Burdiyan, redo
Tar is a non-starter for large data. If you put 1000 files in it, extracting the 1000th file requires walking the 999 record headers in front of it. Even if you pick a different format that fixes this, you double your I/O: assuming the producer app writes standalone files, you have to copy them into the tar, and assuming the consumer app reads standalone files, you have to copy them back out.

Not a big deal for tiny source/object files, but it's a deal breaker for, say, 10+ GB ML samples/models.

Nils Dagsson Moskopp

Sep 8, 2023, 11:46:59 AM
to Joseph Garvin, Alexandr Burdiyan, redo
Joseph Garvin <k04...@gmail.com> writes:

> Tar is a non-starter for large data. If you put 1000 files in it,
> extracting the 1000th file requires walking the 999 record headers in
> front of it. Even if you pick a different format that fixes this, you
> double your I/O: assuming the producer app writes standalone files, you
> have to copy them into the tar, and assuming the consumer app reads
> standalone files, you have to copy them back out.
>
> Not a big deal for tiny source/object files, but it's a deal breaker
> for, say, 10+ GB ML samples/models.

Show me a real case where this is the first problem you encounter. In my
experience, once you handle gigabytes of data, hashing the data to see if
it is up to date becomes the bottleneck before pretty much everything else.

And if you never hash the data, you are going to have useless rebuilds …

Joseph Garvin

Sep 9, 2023, 6:27:41 PM
to Nils Dagsson Moskopp, Alexandr Burdiyan, redo
This is one of the use cases for redo-stamp. You can hash whatever subset of the file you think is appropriate, to avoid needing to hash the entire thing. If you're generating a big machine-learning model that is one big block of a gajillion random-ish floats, then generally speaking, when you tweak any input to the model generation they all change at once. It's extremely unlikely that you happen to build a model where, say, the first 1000 floats are the same but the rest are different. Also, you are usually doing experiments that change metadata that is tiny compared to the model. So hashing the metadata, the size, and the first N bytes of floats is good enough in practice.
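
A sketch of that partial-stamp trick; the trainer command, file names, and byte count are all made up:

    # model.bin.do -- train, then stamp only the metadata, size, and first N bytes.
    redo-ifchange train-config.yaml
    ./train --config train-config.yaml --out "$3"   # hypothetical trainer
    {
        cat train-config.yaml    # tiny metadata that drives the experiments
        wc -c < "$3"             # model size
        head -c 4096 "$3"        # first N bytes of floats
    } | redo-stamp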