*** Michael Raitza [2021-01-13 19:41]:
>If the file size of your dependency changed, the target is OOD. There
>is no lying about that.
Fully agree. A size check is cheap (practically free) and the target
is always OOD if the size differs. Good optimization, I will certainly
add it!
>> But checksums are the only thing that are guaranteed to make
>> deterministic decisions without errors.
>
>I am sorry to be blunt, but this statement is plain wrong. All hash
>functions collide and you must always provide a means to resolve such
>collisions.
From a purely mathematical point of view you are right that collisions
are possible. But from a practical point of view, with *cryptographic*
collision-resistant functions with long enough digest output, that
probability is so low that I can neither agree with you, nor imagine
even a single collision event happening during my lifetime. Currently
none of the popular cryptographic hashes with *at least* 256-bit output
has been "broken" enough to produce a collision. goredo uses 256-bit
BLAKE3 output -- enough to forget about any practical collision
probability.
I really cannot accept any argument about a possible hash collision,
even over data that an adversary sends me. Cryptography (at least
encryption and hashing) is all about providing solutions that trade a
security probability for practical usability. A one-time pad is
perfectly secure, but can hardly be used in practice (except by the
military). RC4 is not secure, but good (not CPU hungry) for
obfuscation purposes. 3DES is secure enough for most tasks, but slow.
AES/Salsa20 are both fast and secure (the probability of their (known)
breakage is negligible) and that is why they are used so much. A
negligible probability equals zero to me in practice. Seriously, I
would sooner expect to see a broken file "size" reported by the OS
(easily!) than a practical hash collision of a long enough digest.
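To put a number on "negligible" (a standard birthday-bound estimate;
the workload figures are mine, chosen only for scale): even hashing a
billion files per second for a hundred years gives

\[
n \approx 10^{9}\,\text{files/s} \times 100\,\text{years}
  \approx 3 \cdot 10^{18} \approx 2^{61.5},
\qquad
p \approx \frac{n^{2}}{2 \cdot 2^{256}}
  \approx \frac{2^{123}}{2^{257}} = 2^{-134} \approx 10^{-40}.
\]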
>Now we must (in my opinion) give the user the means to choose between two modes:
>[...]
>Regarding "time", I have a modified suggestion to you:
>[...]
>This is all very well for targets created by redo. For originals, that
>should not be tracked by size/content but "time", the user needs a tool
>to update the original's "build generation" in the redo database
I feel I fully understand your suggestion. So basically (however you
already described it easily and well enough :-)):
* you separate targets into two kinds: redo-created targets
* and "originals" (I call them "sources"; let's call them redo-originals)
* you consider that "time" tracking (your variant of a counter in the
database/state) of redo-created targets is enough
* but one must also have a way to "touch" redo-originals (obviously)
If we assume that inode-based tracking of *newly* created redo-created
targets works in practice, then there is no need for "counter" database
states. In theory every OS can fail at everything, and (I presume) that
is why you completely throw away the inode as a possible source of
information. I agree with that. But in practice I believe that
inode-based information for *newly* created files works correctly on
any more or less usable/popular OS (let's forget about FUSE for now).
Everything I wrote before about possible failures of inode information
concerned only and only redo-original files. redo-c, goredo, apenwarr/*
-- all of them create temporary files and rename them to the
redo-created target. The probability of a new file colliding with an
old one in its inode and *time information is negligible in practice
(for me). So I am sure that (in practice, on non-FUSE filesystems)
redo-created targets are sufficiently tracked for OOD by inode
information (ctime) alone.
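A minimal shell sketch of that create-and-rename pattern (not the
actual code of any of those implementations; $target and
produce_output are hypothetical, and GNU coreutils' sync(1), which
fsync()s the files given to it, is assumed):

    tmp="$target.tmp.$$"
    produce_output > "$tmp"        # temporary file gets a fresh inode/ctime
    sync "$tmp"                    # fsync the file's data and metadata
    mv "$tmp" "$target"            # atomic rename(2) over the old target
    sync "$(dirname "$target")"    # fsync the directory entry as well

An old target is never modified in place, so its inode information
cannot end up half-updated.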
I really want to underline the probabilities. Both hash collisions
(with a cryptographic, long enough digest) and completely broken
filesystems (among those used in practice; again, let's forget about
FUSE) giving the same inode information to newly created/renamed files
have negligible probability *in practice*. Even non-ECC RAM has a much
higher risk of flipping a bit *anywhere*, ruining literally everything
we can program in software.
So, if we take current apenwarr/redo (and assume it does fsync()s),
then it satisfies your suggestion for redo-created targets. If we take
current goredo (which by default fsync()s even directories after file
renaming) and comment out its hash checking after the ctime check,
then it also satisfies it. inode-based tracking of fsync-ed, newly
created and renamed redo-created targets seems to be satisfactory.
Let's return to redo-originals. You basically suggest that the user
decides how to track originals: do whatever he wants and "record" his
decision by a redo-touch execution. I have nothing against that
acceptable claim and suggestion in general. For me, exactly the
redo-original targets were in question, with all the really (in
practice!) possible inabilities to track them by inode information.
mmap the redo-original file, overwrite some part, do fdatasync(), power
off the computer -- the data is changed, but the inode update (ctime
and everything else) was not finished (it was not written): an
absolutely possible accident in practice.
Actually I think it is an acceptable decision just to redo everything
(clean all) if we are not sure about the consistency of our data, if
we saw a breakage. So I am not totally against inode-based tracking of
redo-originals, even though they clearly can be in an inconsistent
state, especially after failures. But most people/computers live
pretty well even without ECC RAM in practice.
apenwarr/redo gives us the ability to track redo-originals by inode
information by default, or you can optionally redo-stamp them. If you
wish for something different, like your example of a log message, or
perhaps "redo if atime (and only atime!) changed", or "redo if utmpx
contains the fact of a user logging in" -- you have to create your own
"hacks" (additional intermediate redo-created targets) that will be
under either inode or redo-stamp tracking, as in the sketch below.
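For instance, the utmpx case fits in a tiny proxy target; a sketch
(logged-in.do is a made-up name; who(1) reads the utmp/utmpx records):

    # logged-in.do
    redo-always                  # run this script on every build...
    who | tee "$3" | redo-stamp  # ...but go OOD only when the set of
                                 # logged-in users actually changed

Real targets then redo-ifchange logged-in and are rebuilt only when a
login actually happened.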
If we assume that such "utmpx checks" are pretty rare in practice,
then it is OK to force the user to create those intermediate
hack/proxy targets, in my opinion of course. Is it OK to replace
redo-touch with redo-stamp? I think yes. There is no difference
between the hypothetical:
some_metainfo_extractor | redo_dep_compare && redo-touch || :
and
some_metainfo_extractor | redo-stamp
where redo-stamp just compares its input with some previously recorded
value and decides whether "touch"-ing is necessary.
So basically the current apenwarr/redo, with inode-based tracking of
redo-created targets, optional inode-based tracking of redo-originals,
and the ability to redo-stamp, seems satisfactory. We have two
tracking mechanisms: inode-based and stamp-based.
But actually, in my (yeah, programmer-biased) practice the usage
frequency of redo-stamp for redo-created targets is pretty high. In C
projects, which redo-always check environment variables, command
paths, pkg-config-s and similar things (I want to rebuild only the
necessary things if one of my libraries' versions or the CFLAGS
environment variable changes), redo-stamp is used often. And I am sure
that most other use-cases won't be hurt by redo-stamp-ing either.
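A sketch of such a target (cc-env.do and the zlib dependency are
made-up names):

    # cc-env.do
    redo-always
    { command -v cc
      echo "$CFLAGS"
      pkg-config --modversion zlib
    } | tee "$3" | redo-stamp    # dependents go OOD only when the
                                 # toolchain/flags/library versions change

Compilation targets then simply redo-ifchange cc-env.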
So, if we have inode-based redo-ifchange and the frequently used
redo-stamp together -- can we replace redo-ifchange with redo-stamp? I
believe that from the practical point of view of real redo usage
scenarios -- yes, we can do it safely. If there are occasions where it
may hurt, then we make hack/proxy intermediate targets with the
necessary output for redo-stamp's decision. If replacing inode-based
redo-ifchange with redo-stamp removes a huge quantity of redo-stamp
invocations from .do scripts and adds just a small number of
intermediate hack/proxy targets, then it is worth it. It gives more
satisfaction to most users than it burdens them.
So, in practice (in my opinion) redo-stamp, used as both redo-ifchange
and redo-touch, covers the most frequent use-cases and requires you
neither to write redo-stamp in so many .do files, nor to make the
redo-stamp/ifchange/touch decision for redo-original targets. You
cannot be satisfied with inode-based redo-ifchange alone, because of
the untrustworthy redo-originals. But redo-stamp can easily be used
for purely inode-based tracking (through an intermediate target), and
for any other kind as well. And redo-stamp tracking is a good default
for most cases out of the box.
Of course, let's return to the real world. If our target downloads
(only once) some huge tarball (a project's dependency, for example),
then won't it be too expensive to hash it during each and every build?
How many targets are really dangerous in the sense that their content
may be changed without any inode alteration? Won't it still be a good
optimization to check size/ctime/whatever and skip the hash check if
everything is the same? *In practice* I think it is a safe
optimization to skip, on every run, the hashing of files that noone
touches at all during their whole lifetime; roughly as in the sketch
below. If we are sure that some files definitely can be modified for
various reasons, we can just use redo-always to forcefully check their
content.
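The idea of that optimization in shell (BSD stat(1) flags; GNU would
be stat -c '%s %Z'; b3sum(1) stands in for the implementation's
internal BLAKE3 call; $dep and $recorded are hypothetical):

    new=$(stat -f '%z %c' "$dep")   # size and ctime: nearly free
    if [ "$new" = "$recorded" ]; then
        :                           # inode data unchanged: skip hashing
    else
        b3sum "$dep"                # only now pay for hashing the content
    fi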
My conclusions:
* "counter"-based state is just a reimplementation of the filesystem.
Filesystems work in practice with newly created/renamed redo-created
targets. FUSE filesystems possibly do not work, and "counter" state
would be a salvation there. But it is added complexity (always bad),
and hopefully there exist no FUSE-based people running redo and having
issues with it right now
* hash-based tracking (redo-stamp-ing) is the best default behaviour,
and it can also be used, relatively easily and without altering the
redo implementation, to do whatever tracking you wish for
(smth.mtime.do: stat %m smth -- spelled out after this list). No,
practical collisions have zero probability
* inode-based optimizations are acceptable for most practical
use-cases, not burning computer resources in vain. redo-always-ing or
touch-ing can override an inode-based OOD-skip decision when necessary
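The smth.mtime.do one-liner from the second conclusion, spelled out
(BSD stat(1) syntax; GNU equivalent: stat -c %Y smth):

    # smth.mtime.do
    redo-always
    stat -f %m smth | tee "$3" | redo-stamp

Targets that should react only to smth's mtime then redo-ifchange
smth.mtime instead of smth itself.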
Yeah, there are many assumptions: the frequency of practical
use-cases, no FUSE, fsync-ing. That is far from an ideal solution
independent of the OS and of any kind of real-world tasks/assumptions.
But it is small and compact (less code -- less probability of errors
in it, a probability much higher than *anything* we have discussed so
far, except for FUSE :-)). The probability of non-ECC RAM ruining my
ZFS is higher than of anything bad happening with anything related to
redo. And goredo chooses to checksum targets by default (with
inode-based decisions that skip it) because of convenience for most
user tasks and the (much!) lower complexity of goredo's code itself,
without sacrificing the ability to easily do whatever tracking you
wish for (unlike inode-based tracking, which anyway requires some
additional tool/way to make guaranteed decisions, like comparing
cryptographic checksums).
>I would like to hear your thoughts on this! I am still convinced that
>for practicality we want both modes of dependency tracking. I know that
>I use that regularly for interactive development, and it is very
>convenient to have.
Possibly I have missed your thoughts somehow and I am walking around
my own ones again and again. But I am not convinced that there are
only two possible modes of dependency tracking. There can be any
number of them (purely atime-based, utmpx-based, flag-file based,
whatever). The existence of the hashing *ability* is a must, because
it gives guarantees (zero probability of collisions in practice with a
cryptographic hash of long enough digest) of confidence. Hash-based
tracking can replace every other kind of dependency tracking
relatively easily, without complexity in the redo implementation, and
with high flexibility. And my practice has not convinced me that
purely inode-based tracking is frequently useful.