*** Michael Raitza [2021-01-13 19:41]:
>If the file size of your dependency changed, the target is OOD. There
>is no lying about that.
Fully agree. A size check is cheap (practically free) and the target
is always OOD if the size differs. Good optimization, I will certainly
add it!
>> But checksums are the only thing that are guaranteed to make
>> deterministic decisions without errors.
>
>I am sorry to be blunt, but this statement is plain wrong. All hash
>functions collide and you must always provide a means to resolve such
>collisions.
From a purely mathematical point of view you are right that collisions
are possible. But from a practical point of view, with *cryptographic*
collision-resistant functions with long enough digest output, that
probability is so low that I can neither agree with you, nor imagine
even a single collision event happening during my lifetime. Currently
none of the popular cryptographic hashes with *at least* 256-bit output
has been "broken" enough to produce a collision. goredo uses 256-bit
BLAKE3 output -- enough to forget about any practical collision
probability.
I really cannot accept any argument about a possible hash collision,
even over data that an adversary sends me. Cryptography (at least
encryption and hashing) is all about providing solutions that trade a
security probability for practical usability. A one-time pad is
perfectly secure, but can hardly be used in practice (except by the
military). RC4 is not secure, but good (not CPU hungry) for
obfuscation purposes. 3DES is secure enough for most tasks, but slow.
AES/Salsa20 are both fast and secure (the probability of their (known)
breakage is negligible) and that is why they are used so much. A
negligible probability equals zero to me in practice. Seriously, I
would sooner expect to see a broken file "size" reported by the OS
(easily!) than a practical hash collision of a long enough digest.
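To put a number on "negligible" (a standard birthday-bound estimate;
the workload figures are mine, chosen only for scale): even hashing a
billion files per second for a hundred years gives

\[
n \approx 10^{9}\,\text{files/s} \times 100\,\text{years}
  \approx 3 \cdot 10^{18} \approx 2^{61.5},
\qquad
p \approx \frac{n^{2}}{2 \cdot 2^{256}}
  \approx \frac{2^{123}}{2^{257}} = 2^{-134} \approx 10^{-40}.
\]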
>Now we must (in my opinion) give the user the means to choose between two modes:
>[...]
>Regarding "time", I have a modified suggestion to you:
>[...]
>This is all very well for targets created by redo. For originals, that
>should not be tracked by size/content but "time", the user needs a tool
>to update the original's "build generation" in the redo database
I feel I fully understand your suggestion. So basically (however you
already described it easily and well enough :-)):
* you separate targets into two kinds: redo-created targets
* and "originals" (I call them "sources"; let's call them redo-originals)
* you consider that "time" tracking (your variant of a counter in the
database/state) of redo-created targets is enough
* but one must also have a way to "touch" redo-originals (obviously)
If we assume that inode-based tracking of *newly* created redo-created
targets works in practice, then there is no need for "counter" database
states. In theory every OS can fail at everything, and (I presume) that
is why you completely throw away the inode as a possible source of
information. I agree with that. But in practice I believe that
inode-based information for *newly* created files works correctly on
any more or less usable/popular OS (let's forget about FUSE for now).
Everything I wrote before about possible failures of inode information
concerned only and only redo-original files. redo-c, goredo, apenwarr/*
-- all of them create temporary files and rename them to the
redo-created target. The probability of a new file colliding with an
old one in its inode and *time information is negligible in practice
(for me). So I am sure that (in practice, on non-FUSE filesystems)
redo-created targets are sufficiently tracked for OOD by inode
information (ctime) alone.
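A minimal shell sketch of that create-and-rename pattern (not the
actual code of any of those implementations; $target and
produce_output are hypothetical, and GNU coreutils' sync(1), which
fsync()s the files given to it, is assumed):

    tmp="$target.tmp.$$"
    produce_output > "$tmp"        # temporary file gets a fresh inode/ctime
    sync "$tmp"                    # fsync the file's data and metadata
    mv "$tmp" "$target"            # atomic rename(2) over the old target
    sync "$(dirname "$target")"    # fsync the directory entry as well

An old target is never modified in place, so its inode information
cannot end up half-updated.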
I really want to underline the probabilities. Both hash collisions
(with a cryptographic, long enough digest) and completely broken
filesystems (among those used in practice; again, let's forget about
FUSE) giving the same inode information to newly created/renamed files
have negligible probability *in practice*. Even non-ECC RAM has a much
higher risk of flipping a bit *anywhere*, ruining literally everything
we can program in software.
So, if we take current apenwarr/redo (and assume it does fsync()s),
then it satisfies your suggestion for redo-created targets. If we take
current goredo (which by default fsync()s even directories after file
renaming) and comment out its hash checking after the ctime check,
then it also satisfies it. inode-based tracking of fsync-ed, newly
created and renamed redo-created targets seems to be satisfactory.
Let's return to redo-originals. You basically suggest that the user
decides how to track originals: do whatever he wants and "record" his
decision by a redo-touch execution. I have nothing against that
acceptable claim and suggestion in general. For me, exactly the
redo-original targets were in question, with all the really (in
practice!) possible inabilities to track them by inode information.
mmap the redo-original file, overwrite some part, do fdatasync(), power
off the computer -- the data is changed, but the inode update (ctime
and everything else) was not finished (it was not written): an
absolutely possible accident in practice.
Actually I think it is an acceptable decision just to redo everything
(clean all) if we are not sure about the consistency of our data, if
we saw a breakage. So I am not totally against inode-based tracking of
redo-originals, even though they clearly can be in an inconsistent
state, especially after failures. But most people/computers live
pretty well even without ECC RAM in practice.
apenwarr/redo gives us the ability to track redo-originals by inode
information by default, or you can optionally redo-stamp them. If you
wish for something different, like your example of a log message, or
perhaps "redo if atime (and only atime!) changed", or "redo if utmpx
contains the fact of a user logging in" -- you have to create your own
"hacks" (additional intermediate redo-created targets) that will be
under either inode or redo-stamp tracking, as in the sketch below.
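For instance, the utmpx case fits in a tiny proxy target; a sketch
(logged-in.do is a made-up name; who(1) reads the utmp/utmpx records):

    # logged-in.do
    redo-always                  # run this script on every build...
    who | tee "$3" | redo-stamp  # ...but go OOD only when the set of
                                 # logged-in users actually changed

Real targets then redo-ifchange logged-in and are rebuilt only when a
login actually happened.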
If we assume that such "utmpx checks" are pretty rare in practice,
then it is OK to force the user to create those intermediate
hack/proxy targets, in my opinion of course. Is it OK to replace
redo-touch with redo-stamp? I think yes. There is no difference
between the hypothetical:
some_metainfo_extractor | redo_dep_compare && redo-touch || :
and
some_metainfo_extractor | redo-stamp
where redo-stamp just compares its input with some previously recorded
value and decides whether "touch"-ing is necessary.
So basically the current apenwarr/redo, with inode-based tracking of
redo-created targets, optional inode-based tracking of redo-originals,
and the ability to redo-stamp, seems satisfactory. We have two
tracking mechanisms: inode-based and stamp-based.
But actually, in my (yeah, programmer-biased) practice the usage
frequency of redo-stamp for redo-created targets is pretty high. In C
projects, which redo-always check environment variables, command
paths, pkg-config-s and similar things (I want to rebuild only the
necessary things if one of my libraries' versions or the CFLAGS
environment variable changes), redo-stamp is used often. And I am sure
that most other use-cases won't be hurt by redo-stamp-ing either.
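A sketch of such a target (cc-env.do and the zlib dependency are
made-up names):

    # cc-env.do
    redo-always
    { command -v cc
      echo "$CFLAGS"
      pkg-config --modversion zlib
    } | tee "$3" | redo-stamp    # dependents go OOD only when the
                                 # toolchain/flags/library versions change

Compilation targets then simply redo-ifchange cc-env.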
So, if we have inode-based redo-ifchange and the frequently used
redo-stamp together -- can we replace redo-ifchange with redo-stamp? I
believe that from the practical point of view of real redo usage
scenarios -- yes, we can do it safely. If there are occasions where it
may hurt, then we make hack/proxy intermediate targets with the
necessary output for redo-stamp's decision. If replacing inode-based
redo-ifchange with redo-stamp removes a huge quantity of redo-stamp
invocations from .do scripts and adds just a small number of
intermediate hack/proxy targets, then it is worth it. It gives more
satisfaction to most users than it burdens them.
So, in practice (in my opinion) redo-stamp, used as both redo-ifchange
and redo-touch, covers the most frequent use-cases and requires you
neither to write redo-stamp in so many .do files, nor to make the
redo-stamp/ifchange/touch decision for redo-original targets. You
cannot be satisfied with inode-based redo-ifchange alone, because of
the untrustworthy redo-originals. But redo-stamp can easily be used
for purely inode-based tracking (through an intermediate target), and
for any other kind as well. And redo-stamp tracking is a good default
for most cases out of the box.
Of course, let's return to the real world. If our target downloads
(only once) some huge tarball (a project's dependency, for example),
then won't it be too expensive to hash it during each and every build?
How many targets are really dangerous in the sense that their content
may be changed without any inode alteration? Won't it still be a good
optimization to check size/ctime/whatever and skip the hash check if
everything is the same? *In practice* I think it is a safe
optimization to skip, on every run, the hashing of files that noone
touches at all during their whole lifetime; roughly as in the sketch
below. If we are sure that some files definitely can be modified for
various reasons, we can just use redo-always to forcefully check their
content.
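The idea of that optimization in shell (BSD stat(1) flags; GNU would
be stat -c '%s %Z'; b3sum(1) stands in for the implementation's
internal BLAKE3 call; $dep and $recorded are hypothetical):

    new=$(stat -f '%z %c' "$dep")   # size and ctime: nearly free
    if [ "$new" = "$recorded" ]; then
        :                           # inode data unchanged: skip hashing
    else
        b3sum "$dep"                # only now pay for hashing the content
    fi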
My conclusions:
* "counter"-based state is just a reimplementation of the filesystem.
Filesystems work in practice with newly created/renamed redo-created
targets. FUSE filesystems possibly do not work, and "counter" state
would be a salvation there. But it is added complexity (always bad),
and hopefully there exist no FUSE-based people running redo and having
issues with it right now
* hash-based tracking (redo-stamp-ing) is the best default behaviour,
and it can also be used, relatively easily and without altering the
redo implementation, to do whatever tracking you wish for
(smth.mtime.do: stat %m smth -- spelled out after this list). No,
practical collisions have zero probability
* inode-based optimizations are acceptable for most practical
use-cases, not burning computer resources in vain. redo-always-ing or
touch-ing can override an inode-based OOD-skip decision when necessary
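The smth.mtime.do one-liner from the second conclusion, spelled out
(BSD stat(1) syntax; GNU equivalent: stat -c %Y smth):

    # smth.mtime.do
    redo-always
    stat -f %m smth | tee "$3" | redo-stamp

Targets that should react only to smth's mtime then redo-ifchange
smth.mtime instead of smth itself.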
Yeah, there are many assumptions: the frequency of practical
use-cases, no FUSE, fsync-ing. That is far from an ideal solution
independent of the OS and of any kind of real-world tasks/assumptions.
But it is small and compact (less code -- less probability of errors
in it, a probability much higher than *anything* we have discussed so
far, except for FUSE :-)). The probability of non-ECC RAM ruining my
ZFS is higher than of anything bad happening with anything related to
redo. And goredo chooses to checksum targets by default (with
inode-based decisions that skip it) because of convenience for most
user tasks and the (much!) lower complexity of goredo's code itself,
without sacrificing the ability to easily do whatever tracking you
wish for (unlike inode-based tracking, which anyway requires some
additional tool/way to make guaranteed decisions, like comparing
cryptographic checksums).
>I would like to hear your thoughts on this! I am still convinced that
>for practicality we want both modes of dependency tracking. I know that
>I use that regularly for interactive development, and it is very
>convenient to have.
Possibly I have missed your thoughts somehow and I am walking around
my own ones again and again. But I am not convinced that there are
only two possible modes of dependency tracking. There can be any
number of them (purely atime-based, utmpx-based, flag-file based,
whatever). The existence of the hashing *ability* is a must, because
it gives guarantees (zero probability of collisions in practice with a
cryptographic hash of long enough digest) of confidence. Hash-based
tracking can replace every other kind of dependency tracking
relatively easily, without complexity in the redo implementation, and
with high flexibility. And my practice has not convinced me that
purely inode-based tracking is frequently useful.