redo-whichdo

157 views

Skip to first unread message

Avery Pennarun

unread,

Oct 6, 2018, 5:59:50 AM10/6/18

to redo

Hi all,

I've had some spare time to hack on redo again lately. I haven't
actually changed much code in my copy of redo, nor have I merged other
people's test suites or bug fixes or new features. However, I did use
redo to build a fairly complex build system for my very-obsolete
wvstreams package (see other thread), and this uncovered a few loose
ends in my existing redo implementation.

Most notably, when converting the wvstreams build system, I wanted to
copy the feature from its old Makefile-based build where you could
create an out/ directory and all the output files would be put in
there. This lets you configure the package differently (different
optimization settings, cross compilers, etc) in different output
directories, but share the same source code, which is nice if you're
trying to debug stuff and want to make sure it still works in a
cross-platform way.

This turns out to be a little complicated in redo, because redo
expects the .do file for a target to generally be in the same
directory, or a parent directory, of the target. When we separate
input and output trees like this, the .do files are naturally in the
input tree (since the output tree is empty), which means redo doesn't
know where to find them.

I remember a bunch of potential ideas for fixing this over the years,
like allowing 'do' subdirs with .do files (and the 'do' subdir could
be a symlink to the input dir), or an environment based search path
like $DOPATH, or explicit support in redo itself for separating input
and output dirs. None of those seemed very elegant, and moreover,
might not work at all (since it's not obvious exactly which files
should be in the input vs output dir at which times).

I decided to go with a more explicit approach: a default.do at the top
of the out/ hierarchy that just delegates build commands to .do files
in the source directory, by literally running the right .do file with
slightly twiddled command-line parameters. The .do files in question
have to understand the separation of input and output dirs (which I
called $SRCDIR and $OUT in my wvstreams build scripts), but this
method works pretty well.

The only catch is that $OUT/default.do has to find the right .do file
to run in $SRCDIR, and choosing a .do file is a pretty complicated
algorithm. I wanted to make the smallest possible change to redo to
make this possible, so I added a new command, redo-whichdo, that just
finds the .do file for a given target. Its output format is a bit
weird:

$ redo-whichdo ../wvstreams/x/y/z/foo.o

- ../wvstreams/x/y/z/foo.o.do
- ../wvstreams/x/y/z/default.o.do
- ../wvstreams/x/y/z/default.do
- ../wvstreams/x/y/default.o.do
- ../wvstreams/x/y/default.do
- ../wvstreams/x/default.o.do
- ../wvstreams/x/default.do
+ ../wvstreams/default.o.do
1 x/y/z/foo.o
2 x/y/z/foo

The '-' lines show .do files that we tried, but which turned out not
to exist. The "delegator" .do file needs to run read-ifcreate on this
file, because if it starts existing, the target will need to be
rebuilt using the new .do file.

The '+' line, if any, shows the one .do file that's the right match.
We would 'redo-ifchange' on that file to declare a dependency on it,
then execute it. Once we find a match, there are no more '-' lines
because there is no need to continue the search; all other names would
be lower priority, so even if they got created, it wouldn't affect our
target.

The '1' and '2' lines correspond to $1 and $2, respectively, the first
two parameters you'd pass to the .do file (after doing
chdir(dirname(dofile)), because we always run a .do file with cwd set
to the directory containing the .do, not the one containing the
target). $3 is the output filename, which can be anything, but for
the purposes of our delegator default.do, we'll just pass it along
unchanged.

The syntax is specially designed to be easily parseable with sh 'read
x y' syntax, even for filenames that contain spaces or weird
characters. The exception is newlines: I know filenames can contain
newlines on Unix, but let's be honest, that's just awful, so
redo-whichdo intentionally aborts if you try it.

As a bonus, the redo-whichdo command helps answer a burning question
in many people's minds, which is: how does redo decide which .do file
to use to build my target, anyway? Well, the output is a pretty
concise answer to that question, for any specific target, and might be
helpful for debugging.

None of this is really set in stone yet. If someone has better ideas
for the syntax or functionality, we can still change it, since this is
still an 'unreleased' version of redo.

Comments?

Have fun,

Avery

Nils Dagsson Moskopp

unread,

Oct 7, 2018, 9:11:44 PM10/7/18

to Avery Pennarun, redo

Avery Pennarun <apen...@gmail.com> writes:

> […]

>
> Most notably, when converting the wvstreams build system, I wanted to
> copy the feature from its old Makefile-based build where you could
> create an out/ directory and all the output files would be put in
> there. This lets you configure the package differently (different
> optimization settings, cross compilers, etc) in different output
> directories, but share the same source code, which is nice if you're
> trying to debug stuff and want to make sure it still works in a
> cross-platform way.
>
> This turns out to be a little complicated in redo, because redo
> expects the .do file for a target to generally be in the same
> directory, or a parent directory, of the target. When we separate
> input and output trees like this, the .do files are naturally in the
> input tree (since the output tree is empty), which means redo doesn't
> know where to find them.
>
> I remember a bunch of potential ideas for fixing this over the years,
> like allowing 'do' subdirs with .do files (and the 'do' subdir could
> be a symlink to the input dir), or an environment based search path
> like $DOPATH, or explicit support in redo itself for separating input
> and output dirs. None of those seemed very elegant, and moreover,
> might not work at all (since it's not obvious exactly which files
> should be in the input vs output dir at which times).

I would attempt to use union mounts if source files & build results
should exist in different folders. Why did you decide against that?

Wikipedia „Union mount“ <https://en.wikipedia.org/wiki/Union_mount>

Union mounts seem very elegant to me; they also do not require code
in redo to handle this … Am I missing something about union mounts?

Greetings,
--
Nils Dagsson Moskopp // erlehmann
<http://dieweltistgarnichtso.net>

signature.asc

Nathaniel Filardo

unread,

Oct 8, 2018, 3:13:13 AM10/8/18

to ni...@dieweltistgarnichtso.net, Avery Pennarun, redo

On Mon, Oct 8, 2018 at 2:11 AM Nils Dagsson Moskopp <ni...@dieweltistgarnichtso.net> wrote:

I would attempt to use union mounts if source files & build results
should exist in different folders. Why did you decide against that?

Wikipedia „Union mount“ <https://en.wikipedia.org/wiki/Union_mount>

Union mounts seem very elegant to me; they also do not require code
in redo to handle this … Am I missing something about union mounts?

Speaking from extensive experience, union mounts are awkward to use during development:

one must be very careful to always open source files through the source half of the union,

lest changes get committed to the build half, resulting in all kinds of confusion later.

There are not actually any good implementations of the union mount concept in Linux, either;

the best I have found is a FUSE implementation, though even that must have the kernel-side

file cache disabled. In-kernel implementations like aufs and overlayfs either have known

failure modes, are insufficiently expressive, and/or are immature. Linux does not, unlike

Plan 9, really enjoy letting users rearrange the single, shared namespace.

Cheers,

--nwf;

Avery Pennarun

unread,

Oct 9, 2018, 12:28:38 AM10/9/18

to Nathaniel Filardo, Nils Dagsson Moskopp, redo

On Mon, Oct 8, 2018 at 3:13 AM, Nathaniel Filardo <nwfi...@gmail.com> wrote:
> On Mon, Oct 8, 2018 at 2:11 AM Nils Dagsson Moskopp
> <ni...@dieweltistgarnichtso.net> wrote:
>> I would attempt to use union mounts if source files & build results
>> should exist in different folders. Why did you decide against that?
>>
>> Wikipedia „Union mount“ <https://en.wikipedia.org/wiki/Union_mount>
>>
>> Union mounts seem very elegant to me; they also do not require code
>> in redo to handle this … Am I missing something about union mounts?
>
> Speaking from extensive experience, union mounts are awkward to use during
> development:
> one must be very careful to always open source files through the source half
> of the union,
> lest changes get committed to the build half, resulting in all kinds of
> confusion later.

So, I'm not actually opposed to union mounts in general, but I agree
that the specific options we have available seem to all be rather
weak.

I don't think the problem is inherently intractable. For example, a
git-aware union mount could separate files into two categories:
tracked by the .git/index, and not. The former would be shared
between multiple union mounts, and the latter would be unique per
union mount. 'git add' would migrate a file from one to the other.
It seems nobody has invented this though.

As for redo, I think we should allow for the possibility of union
mounts (which I guess we do; they're supposed to be transparent, after
all). But I also think we should support any other workflow people
want to use. Union mounts are, as above, not very good right now, and
they certainly aren't cross-platform portable yet (although this will
probably slowly improve as things like Docker get more popular). Any
source package would be unwise to *depend* on union mounts if they
want to be portable, so I'd like to offer alternatives. redo-whichdo
looks like a useful tool for constructing various alternatives. (One
of my reasons for implementing the separate-objdir-from-srcdir feature
in the wvstreams+redo build system was mainly to see if I could do it
elegantly in redo, and if not, to improve redo until I could. That's
not necessarily an endorsement of the technique itself :))

Have fun,

Avery

Nils Dagsson Moskopp

unread,

Oct 12, 2018, 9:39:51 PM10/12/18

to Avery Pennarun, redo

Avery Pennarun <apen...@gmail.com> writes:

> The syntax is specially designed to be easily parseable with sh 'read
> x y' syntax, even for filenames that contain spaces or weird
> characters. The exception is newlines: I know filenames can contain
> newlines on Unix, but let's be honest, that's just awful, so
> redo-whichdo intentionally aborts if you try it.

Null bytes can never occur in filenames and might be a useful delimiter
when the output device is not a terminal. The Bash read builtin is able
to work with them. I am in the process of finding out how to do this in
plain Bourne shell, since my redo implementation might also have issues
with newlines in filenames (I have not tested it, just some suspicion).

This is how to consume lines delimited by nulls in Bourne Again Shell …

$ printf 'a b\0c d\0e f\0' |while read -r -d '' x y; do printf '%s,%s\n' "${x}" "${y}"; done
a,b
c,d
e,f

A way to account for even the laziest programmer in […]-delimited data
would be to base64 encode all fields, and use a delimiter that doesn't
occur in the base64 output; it is obviously not wise performance-wise.

Before you declare me crazy: I am perfectly aware that a filename with
a newline at the end is probably bogus. I use this as an exercize – to
see how I could make my own redo implementation cleaner and more safe.

> As a bonus, the redo-whichdo command helps answer a burning question
> in many people's minds, which is: how does redo decide which .do file
> to use to build my target, anyway? Well, the output is a pretty
> concise answer to that question, for any specific target, and might be
> helpful for debugging.

I agree.

> None of this is really set in stone yet. If someone has better ideas
> for the syntax or functionality, we can still change it, since this is
> still an 'unreleased' version of redo.
>
> Comments?

I am working on a redo-whichdo implemented in Bourne shell (like the
rest of my redo version) to explore (or maybe validate) your design;
I am going to post that implementation and learnings when I am done.

It could very well happen that I must concede that you are right and
there exists no way to properly handle filenames with newlines etc..

signature.asc

Avery Pennarun

unread,

Oct 12, 2018, 11:43:59 PM10/12/18

to Nils Dagsson Moskopp, redo

On Fri, Oct 12, 2018 at 9:39 PM Nils Dagsson Moskopp
<ni...@dieweltistgarnichtso.net> wrote:
> Avery Pennarun <apen...@gmail.com> writes:
> > The syntax is specially designed to be easily parseable with sh 'read
> > x y' syntax, even for filenames that contain spaces or weird
> > characters. The exception is newlines: I know filenames can contain
> > newlines on Unix, but let's be honest, that's just awful, so
> > redo-whichdo intentionally aborts if you try it.
>
> Null bytes can never occur in filenames and might be a useful delimiter
> when the output device is not a terminal. The Bash read builtin is able
> to work with them. I am in the process of finding out how to do this in
> plain Bourne shell, since my redo implementation might also have issues
> with newlines in filenames (I have not tested it, just some suspicion).

Yeah, null separators are the "right" solution for this sort of
problem, except that they're almost unusable in standard POSIX sh.
GNU extensions like "find -print0", "xargs -0", etc are all very nice,
but if you can't rely on them everywhere, they aren't a great choice
for portable programs. This is also true for bashisms in .do files.

AFAIK, there is nothing in apenwarr/redo that fundamentally wouldn't
work with newlines in filenames. But the proposed redo-whichdo output
format would have odd problems, as would the outputs of redo-targets,
redo-sources, and redo-ood.

It seems better to just disallow newlines in filenames (and abort if
we see one) on the grounds that anyone using them is insane.

> A way to account for even the laziest programmer in […]-delimited data
> would be to base64 encode all fields, and use a delimiter that doesn't
> occur in the base64 output; it is obviously not wise performance-wise.

It's not just performance: parsing base64 just makes everything less
human-readable and more gross.

> Before you declare me crazy: I am perfectly aware that a filename with
> a newline at the end is probably bogus. I use this as an exercize – to
> see how I could make my own redo implementation cleaner and more safe.

Assertions can provide clean and safe code too :)

Spaces in filenames are also annoying, though not as bad as newlines.
I just found some bugs in minimal/do yesterday related to spaces. I
think any use of xargs also suffers from spaces, if you don't use -0,
which you mostly can't because it's nonportable. Sigh.

> I am working on a redo-whichdo implemented in Bourne shell (like the
> rest of my redo version) to explore (or maybe validate) your design;
> I am going to post that implementation and learnings when I am done.

Cool, it'll be interesting to hear about.

> It could very well happen that I must concede that you are right and
> there exists no way to properly handle filenames with newlines etc..

Well, at least three ways exist:
1. depend on null separators
or
2. make the entire output consist of only a single filename
or
3. use a fancy encoding (eg. C's backslash encoding, http's URL encoding, etc)

I would just rather not resort to any of them, because they are all
hard to use in sh.

I often wish plan 9's rc shell had caught on. At least it can handle
arbitrary lists of strings, although I don't think they ever found a
good way to share those lists with programs outside rc.

Have fun,

Avery

Nils Dagsson Moskopp

unread,

Oct 18, 2018, 8:27:49 PM10/18/18

to Avery Pennarun, redo

Avery Pennarun <apen...@gmail.com> writes:

> On Fri, Oct 12, 2018 at 9:39 PM Nils Dagsson Moskopp
> <ni...@dieweltistgarnichtso.net> wrote:
>> Avery Pennarun <apen...@gmail.com> writes:
>> > The syntax is specially designed to be easily parseable with sh 'read
>> > x y' syntax, even for filenames that contain spaces or weird
>> > characters. The exception is newlines: I know filenames can contain
>> > newlines on Unix, but let's be honest, that's just awful, so
>> > redo-whichdo intentionally aborts if you try it.
>>
>> Null bytes can never occur in filenames and might be a useful delimiter
>> when the output device is not a terminal. The Bash read builtin is able
>> to work with them. I am in the process of finding out how to do this in
>> plain Bourne shell, since my redo implementation might also have issues
>> with newlines in filenames (I have not tested it, just some suspicion).
>
> Yeah, null separators are the "right" solution for this sort of
> problem, except that they're almost unusable in standard POSIX sh.
> GNU extensions like "find -print0", "xargs -0", etc are all very nice,
> but if you can't rely on them everywhere, they aren't a great choice
> for portable programs. This is also true for bashisms in .do files.

I guess you can always introduce an explicit “tr '\0' '\n'”, which makes
it clear that the code is bound to be buggy.

> AFAIK, there is nothing in apenwarr/redo that fundamentally wouldn't
> work with newlines in filenames. But the proposed redo-whichdo output
> format would have odd problems, as would the outputs of redo-targets,
> redo-sources, and redo-ood.

I am currently trying out the following approach: My implementation of
redo-whichdo delimits filenames with newlines if stdout is a terminal,
while delimiting filenames with nullbytes if stdout is not a terminal.

; redo-whichdo redo-sh.tar.gz
/home/erlehmann/bin/redo-sh.tar.gz.do
/home/erlehmann/bin/default.tar.gz.do
/home/erlehmann/bin/default.gz.do

; redo-whichdo redo-sh.tar.gz |xargs -n 1
xargs: WARNING: a NUL character occurred in the input. It cannot be passed through in the argument list. Did you mean to use the --null option?
/home/erlehmann/bin/redo-sh.tar.gz.do

; redo-whichdo redo-sh.tar.gz |xargs -0 -n 1
/home/erlehmann/bin/redo-sh.tar.gz.do
/home/erlehmann/bin/default.tar.gz.do
/home/erlehmann/bin/default.gz.do

I currently thing that this approach gives the following advantages:

a) The visual output is useful, primarily because the start of most
strings is the same. Of course, you can always go the newline-slash
route to make even that visually confusing.

b) Splitting the input is easy if you have GNU coreutils / busybox.

c) No one can depend on newlines as delimiters when parsing output.

> It seems better to just disallow newlines in filenames (and abort if
> we see one) on the grounds that anyone using them is insane.

I see newlines in filenames as a good litmus test. If your tooling can
handle the newline, it is probably robust enough to handle a lot more.

But then again I am a person who sends 2 gigabytes of small headers to
ensure that an HTTP library is behaving well when encountering malice.

>> A way to account for even the laziest programmer in […]-delimited data
>> would be to base64 encode all fields, and use a delimiter that doesn't
>> occur in the base64 output; it is obviously not wise performance-wise.
>
> It's not just performance: parsing base64 just makes everything less
> human-readable and more gross.

The idea is that the simplest thing that could work (first splitting
fields on delimiters, then base64-decoding any field) is the correct
approach. Any protocol that permits the parser to have shortcuts can
and will be parsed by shortcutters, so that must be made impossible.

>> Before you declare me crazy: I am perfectly aware that a filename with
>> a newline at the end is probably bogus. I use this as an exercize – to
>> see how I could make my own redo implementation cleaner and more safe.
>
> Assertions can provide clean and safe code too :)

I do not understand that remark – please elaborate.

> Spaces in filenames are also annoying, though not as bad as newlines.
> I just found some bugs in minimal/do yesterday related to spaces. I
> think any use of xargs also suffers from spaces, if you don't use -0,
> which you mostly can't because it's nonportable. Sigh.

As I have only test my own redo implementation with GNU coreutils and
busybox, I strongly suspect it is not-portable already. I will either
find some portable way to handle this, or reject POSIX and substitute
GNU (and busybox).

>> I am working on a redo-whichdo implemented in Bourne shell (like the
>> rest of my redo version) to explore (or maybe validate) your design;
>> I am going to post that implementation and learnings when I am done.
>
> Cool, it'll be interesting to hear about.

The first thing I did was change the output format to make it possible
to use redo-whichdo inside redo-ifchange for (nonexistence) dependency
management. My implementation of redo-whichdo outputs dofile filenames
until it finds one that exists – if it finds a dofile, it exits with a
status code indicating success, and when it reaches a filesystem root,
it exits with a status code indicating failure. On success, every line
except for the last one specifies a non-existence dependency; the last
line is the dofile we were looking for, and therefore the dependency …

; redo-whichdo redo-sh.tar.gz
/home/erlehmann/bin/redo-sh.tar.gz.do
/home/erlehmann/bin/default.tar.gz.do
/home/erlehmann/bin/default.gz.do

; redo-whichdo x.y.z
/home/erlehmann/bin/x.y.z.do
/home/erlehmann/bin/default.y.z.do
/home/erlehmann/bin/default.z.do
/home/erlehmann/bin/default.do
/home/erlehmann/default.y.z.do
/home/erlehmann/default.z.do
/home/erlehmann/default.do
/home/default.y.z.do
/home/default.z.do
/home/default.do
/default.y.z.do
/default.z.do
/default.do
# exited 1

As mentioned, newlines are used for terminals, nullbytes for others.

>> It could very well happen that I must concede that you are right and
>> there exists no way to properly handle filenames with newlines etc..
>
> Well, at least three ways exist:
> 1. depend on null separators
> or
> 2. make the entire output consist of only a single filename
> or
> 3. use a fancy encoding (eg. C's backslash encoding, http's URL encoding, etc)
>
> I would just rather not resort to any of them, because they are all
> hard to use in sh.

I want to make parsing as easy as possible, but not easier. Breaking
parsing attempts with a bad tool is a way to ensure that people will
not use them. Besides, I am still not convinced that null delimiters
are impossible to use in portable ways, as POSIX tr can handle them.

This whole thing reminds me of the Fedora 19 name „Schrödinger's Cat“,
which led to tools that could not handle umlauts and apostrophes being
fixed … or maybe not, because someone changed the U+0027 APOSTROPHE to
U+2019 RIGHT SINGLE QUOTATION MARK – thus masking the issue with quote
characters not being handled right instead of fixing it.

<https://lwn.net/Articles/545741/>

I have not yet attempted to use redo-whichdo inside redo-ifchange to
generate dependencies – have you tried it? That would be the primary
use case of redo-whichdo, or so I guess.

> I often wish plan 9's rc shell had caught on. At least it can handle
> arbitrary lists of strings, although I don't think they ever found a
> good way to share those lists with programs outside rc.

I actually use the rc shell as my standard shell and I do not think it
can handle null bytes better. It is, however, a lot better than Bourne
shell ever was.

> Have fun,

I doubt it; programming is not fun.

> Avery

signature.asc

Avery Pennarun

unread,

Oct 20, 2018, 12:22:31 AM10/20/18

to Nils Dagsson Moskopp, redo

On Thu, Oct 18, 2018 at 8:27 PM Nils Dagsson Moskopp

<ni...@dieweltistgarnichtso.net> wrote:
> Avery Pennarun <apen...@gmail.com> writes:
> > On Fri, Oct 12, 2018 at 9:39 PM Nils Dagsson Moskopp
> > <ni...@dieweltistgarnichtso.net> wrote:
> >> Avery Pennarun <apen...@gmail.com> writes:
> >> > The syntax is specially designed to be easily parseable with sh 'read
> >> > x y' syntax, even for filenames that contain spaces or weird
> >> > characters. The exception is newlines: I know filenames can contain
> >> > newlines on Unix, but let's be honest, that's just awful, so
> >> > redo-whichdo intentionally aborts if you try it.
> >>
> >> Null bytes can never occur in filenames and might be a useful delimiter
> >> when the output device is not a terminal. The Bash read builtin is able
> >> to work with them. I am in the process of finding out how to do this in
> >> plain Bourne shell, since my redo implementation might also have issues
> >> with newlines in filenames (I have not tested it, just some suspicion).
> >
> > Yeah, null separators are the "right" solution for this sort of
> > problem, except that they're almost unusable in standard POSIX sh.
> > GNU extensions like "find -print0", "xargs -0", etc are all very nice,
> > but if you can't rely on them everywhere, they aren't a great choice
> > for portable programs. This is also true for bashisms in .do files.
>
> I guess you can always introduce an explicit “tr '\0' '\n'”, which makes
> it clear that the code is bound to be buggy.

I guess. I'm still nervous about it. redo is fundamentally tied to
sh scripting; \0 is very unfriendly to sh scripts. Every use of \0 is
another encouragement to dive into non-posix features like xargs -0 or
find -print0.

Generally speaking, operating system portability is one of the biggest
problems for build systems; you're often trying to build a package
that contains portability workarounds for itself, but you can't build
it without portability.

Note that redo-sources, redo-targets, and redo-ood are already defined
in such a way that filenames containing newlines won't work.

> > AFAIK, there is nothing in apenwarr/redo that fundamentally wouldn't
> > work with newlines in filenames. But the proposed redo-whichdo output
> > format would have odd problems, as would the outputs of redo-targets,
> > redo-sources, and redo-ood.
>
> I am currently trying out the following approach: My implementation of
> redo-whichdo delimits filenames with newlines if stdout is a terminal,
> while delimiting filenames with nullbytes if stdout is not a terminal.
>
> ; redo-whichdo redo-sh.tar.gz
> /home/erlehmann/bin/redo-sh.tar.gz.do
> /home/erlehmann/bin/default.tar.gz.do
> /home/erlehmann/bin/default.gz.do

As python programmers like to say, explicit is better than implicit: I
don't like the idea of such a fundamental format change depending on
whether stdout is a tty or not. It's sort of okay when
enabling/disabling, say, ANSI colour codes, because they don't change
the true format of the output. But 'find -print0' doesn't stop
printing nulls just because it's on a tty. This seems risky to me.

> b) Splitting the input is easy if you have GNU coreutils / busybox.

That's not very helpful. I don't have to install those things if I
want to build a package that uses make+autoconf.

> > It seems better to just disallow newlines in filenames (and abort if
> > we see one) on the grounds that anyone using them is insane.
>
> I see newlines in filenames as a good litmus test. If your tooling can
> handle the newline, it is probably robust enough to handle a lot more.
>
> But then again I am a person who sends 2 gigabytes of small headers to
> ensure that an HTTP library is behaving well when encountering malice.

If you sent 2GB of small headers, I expect a well-behaved server to
realize you're an attacker and cut you off. This is my proposal for
redo: just detect that a filename contains a newline and abort with a
meaningful error message. Those files never happen unless someone is
being malicious, and the safest thing to do is stop right away.

> >> A way to account for even the laziest programmer in […]-delimited data
> >> would be to base64 encode all fields, and use a delimiter that doesn't
> >> occur in the base64 output; it is obviously not wise performance-wise.
> >
> > It's not just performance: parsing base64 just makes everything less
> > human-readable and more gross.
>
> The idea is that the simplest thing that could work (first splitting
> fields on delimiters, then base64-decoding any field) is the correct
> approach. Any protocol that permits the parser to have shortcuts can
> and will be parsed by shortcutters, so that must be made impossible.

That's, by far, not the simplest thing that can possibly work. base64
encoder/decoders are more complicated than all the rest of redo syntax
put together :)

The simplest thing is always to restrict your inputs and outputs so
that it's hard to make mistakes, not to add more layers of encoding.
More layers always means more surprises.

> >> Before you declare me crazy: I am perfectly aware that a filename with
> >> a newline at the end is probably bogus. I use this as an exercize – to
> >> see how I could make my own redo implementation cleaner and more safe.
> >
> > Assertions can provide clean and safe code too :)
>
> I do not understand that remark – please elaborate.

I just mean that assert statements preventing all inputs and outputs
that contain newlines would solve the problem in a very clean and
predictable way.

> > Spaces in filenames are also annoying, though not as bad as newlines.
> > I just found some bugs in minimal/do yesterday related to spaces. I
> > think any use of xargs also suffers from spaces, if you don't use -0,
> > which you mostly can't because it's nonportable. Sigh.
>
> As I have only test my own redo implementation with GNU coreutils and
> busybox, I strongly suspect it is not-portable already. I will either
> find some portable way to handle this, or reject POSIX and substitute
> GNU (and busybox).

apenwarr/redo and minimal/do try very hard to be portable. I probably
still screwed up in a few small places, but I've at least tested with
various shells and non-GNU operating systems. I'm pretty sure that's
very important.

Cross-OS portability is actually so important that I'm somewhat
tempted to bundle a whole sh and extra tools along with redo that
everyone is using the same one. It's currently way too easy to write
.do files containing unnecessary bashisms. We should somehow help
people to avoid that.

> The first thing I did was change the output format to make it possible
> to use redo-whichdo inside redo-ifchange for (nonexistence) dependency
> management. My implementation of redo-whichdo outputs dofile filenames
> until it finds one that exists – if it finds a dofile, it exits with a
> status code indicating success, and when it reaches a filesystem root,
> it exits with a status code indicating failure. On success, every line
> except for the last one specifies a non-existence dependency; the last
> line is the dofile we were looking for, and therefore the dependency …

Hmm, that's a pretty good idea. I like the much simpler output
syntax: now it's just a list of do files separated by a separator
character.

What you're sacrificing is auto-calculating $1 and $2, which is needed
for a "proxy.do" kind of script that wants to call a different .do
file. Maybe that's too fancy and the logic for that should be moved
into proxy.do anyway, I guess.

Maybe I'll try that out and see how hard it is. Calculating relative
paths in sh is unfortunately a little complicated, even though it's a
one-liner in python.

> I want to make parsing as easy as possible, but not easier. Breaking
> parsing attempts with a bad tool is a way to ensure that people will
> not use them. Besides, I am still not convinced that null delimiters
> are impossible to use in portable ways, as POSIX tr can handle them.

tr apparently does provide this, but if you have to use tr to "fix"
the NULs, you are writing portable but not-newline-safe code. If
their choice is between newline-safety and portability, I'm pretty
sure almost everyone will choose the latter. At that point they're
not protected but they do have more error-prone and longer scripts,
with extra fork/execs scattered around.

> This whole thing reminds me of the Fedora 19 name „Schrödinger's Cat“,
> which led to tools that could not handle umlauts and apostrophes being
> fixed … or maybe not, because someone changed the U+0027 APOSTROPHE to
> U+2019 RIGHT SINGLE QUOTATION MARK – thus masking the issue with quote
> characters not being handled right instead of fixing it.

Without a doubt, we need to test properly with all sorts of extra
characters, especially unicode characters.

We should be able to handle filenames that contain anything but
newline (shouldn't have been allowed) and nul (already not allowed).
We *might* want to disallow other control characters < ascii 32 also,
since they're obviously asking for trouble.

Note that make has already survived this long without even allowing
spaces in filenames. So perfection isn't really a requirement here.

> I have not yet attempted to use redo-whichdo inside redo-ifchange to
> generate dependencies – have you tried it? That would be the primary
> use case of redo-whichdo, or so I guess.

My own redo-ifchange already creates a dependency automatically on the
main .do file and the nonexistent .do files leading up to it.

I did successfully write a .do script that bounces to other .do
scripts using redo-whichdo in the original format I described. See my
'wvstreams' to the mailing list for a link to an example of that in
action. I think it should be pretty easy to parse my original format
or your suggested format (without $1 and $2 lines and using the error
code) from sh.

> > I often wish plan 9's rc shell had caught on. At least it can handle
> > arbitrary lists of strings, although I don't think they ever found a
> > good way to share those lists with programs outside rc.
>
> I actually use the rc shell as my standard shell and I do not think it
> can handle null bytes better. It is, however, a lot better than Bourne
> shell ever was.

No, I don't think it handles null bytes, but I think it supposedly
does somehow manage to export lists into the environment and then take
them out again, even if those lists contain spaces. I assume it does
that through some kind of encoding. If there was a standard encoding
that all shell stuff used automatically (even if it was base64...)
then I would want to use that. Unfortunately, "whitespace" separation
is the real standard, but nowadays that's too restrictive. Newline
seems like an okay compromise.

> > Have fun,
>
> I doubt it; programming is not fun.

I guess that's a matter of opinion :)

Avery

Nils Dagsson Moskopp

unread,

Oct 20, 2018, 11:21:35 AM10/20/18

to Avery Pennarun, redo

I do not have time right now to reply to the rest of your mail, but the
redo architecture is not fundamentally tied to sh scripting. Consider a
mad person's build scripts written in Python (“all.do” and “target.do”)
that work perfectly fine with my redo implementation, building “target”
exactly twice – as it is invalidated between redo-ifchange invocations.

; cat all.do
#!/usr/bin/env python3.5
From subprocess import run
From time import sleep

with open('source', 'w') as f:
f.write('foo')

run(['redo-ifchange', 'target'])

with open('target', 'r') as f:
target_contents_1 = f.read()

sleep(1)

with open('source', 'w') as f:
f.write('bar')

run(['redo-ifchange', 'target'])

with open('target', 'r') as f:
target_contents_2 = f.read()

if target_contents_1 == target_contents_2:
exit(1)

; cat target.do
#!/usr/bin/env python3.5
From subprocess import run
From datetime import datetime

run(['redo-ifchange', 'source'])
print(datetime.now().timestamp())

; redo
redo all
redo target
redo target

signature.asc

Avery Pennarun

unread,

Oct 20, 2018, 4:47:40 PM10/20/18

to Nils Dagsson Moskopp, redo

On Sat, Oct 20, 2018 at 11:21 AM Nils Dagsson Moskopp
<ni...@dieweltistgarnichtso.net> wrote:
> I do not have time right now to reply to the rest of your mail, but the
> redo architecture is not fundamentally tied to sh scripting. Consider a
> mad person's build scripts written in Python (“all.do” and “target.do”)
> that work perfectly fine with my redo implementation, building “target”
> exactly twice – as it is invalidated between redo-ifchange invocations.

Yeah, you're completely right. One of the nicest things about redo is
that you can write your .do scripts in any language. Theoretically.

Of course, almost nobody does. python is an especially bad choice
because of its comparatively long startup times, given how many .do
scripts typically run for a large build. Shells like sh tend to start
faster, which pays off for jobs like this. rc would presumably be
just as fast.

Maybe it would be more appropriate to say that sh is the lowest common
denominator: everything about redo should work with sh. Virtually any
non-sh language can handle the same inputs and outputs, plus more.

Avery

Nils Dagsson Moskopp

unread,

Oct 28, 2019, 4:43:55 PM10/28/19

to Avery Pennarun, redo

My redo-implementation in Bourne Shell uses redo-whichdo(1) internally …
and you are completely correct – I do use xargs(1) with the “-0” option.

Why did I do that? Because it works on GNU systems & 8 year old BusyBox.
On the OS X systems I installed GNU utilities to get it to work, though.

> Generally speaking, operating system portability is one of the biggest
> problems for build systems; you're often trying to build a package
> that contains portability workarounds for itself, but you can't build
> it without portability.
>
> Note that redo-sources, redo-targets, and redo-ood are already defined
> in such a way that filenames containing newlines won't work.

This is not a big problem, though.

>> > AFAIK, there is nothing in apenwarr/redo that fundamentally wouldn't
>> > work with newlines in filenames. But the proposed redo-whichdo output
>> > format would have odd problems, as would the outputs of redo-targets,
>> > redo-sources, and redo-ood.
>>
>> I am currently trying out the following approach: My implementation of
>> redo-whichdo delimits filenames with newlines if stdout is a terminal,
>> while delimiting filenames with nullbytes if stdout is not a terminal.
>>
>> ; redo-whichdo redo-sh.tar.gz
>> /home/erlehmann/bin/redo-sh.tar.gz.do
>> /home/erlehmann/bin/default.tar.gz.do
>> /home/erlehmann/bin/default.gz.do
>
> As python programmers like to say, explicit is better than implicit: I
> don't like the idea of such a fundamental format change depending on
> whether stdout is a tty or not. It's sort of okay when
> enabling/disabling, say, ANSI colour codes, because they don't change
> the true format of the output. But 'find -print0' doesn't stop
> printing nulls just because it's on a tty. This seems risky to me.

I guess I should add a “-0” option to my implementation of redo-whichdo.

>> b) Splitting the input is easy if you have GNU coreutils / busybox.
>
> That's not very helpful. I don't have to install those things if I
> want to build a package that uses make+autoconf.

That is true. On the other hand, the only system where I did not have
either all the GNU utilities or BusyBox out of the box has been OS X.

I did regret my choice of non-portable ways to solve problems until I
fixed some issues so that my redo implementation worked okay on OS X.

>> > It seems better to just disallow newlines in filenames (and abort if
>> > we see one) on the grounds that anyone using them is insane.
>>
>> I see newlines in filenames as a good litmus test. If your tooling can
>> handle the newline, it is probably robust enough to handle a lot more.
>>
>> But then again I am a person who sends 2 gigabytes of small headers to
>> ensure that an HTTP library is behaving well when encountering malice.
>
> If you sent 2GB of small headers, I expect a well-behaved server to
> realize you're an attacker and cut you off. This is my proposal for
> redo: just detect that a filename contains a newline and abort with a
> meaningful error message. Those files never happen unless someone is
> being malicious, and the safest thing to do is stop right away.

Full recognition before processing: Never take input you can not handle!

It also leaves a nice upgrade path open: “This thing did not work and it
works now.”

>> >> A way to account for even the laziest programmer in […]-delimited data
>> >> would be to base64 encode all fields, and use a delimiter that doesn't
>> >> occur in the base64 output; it is obviously not wise performance-wise.
>> >
>> > It's not just performance: parsing base64 just makes everything less
>> > human-readable and more gross.
>>
>> The idea is that the simplest thing that could work (first splitting
>> fields on delimiters, then base64-decoding any field) is the correct
>> approach. Any protocol that permits the parser to have shortcuts can
>> and will be parsed by shortcutters, so that must be made impossible.
>
> That's, by far, not the simplest thing that can possibly work. base64
> encoder/decoders are more complicated than all the rest of redo syntax
> put together :)
>
> The simplest thing is always to restrict your inputs and outputs so
> that it's hard to make mistakes, not to add more layers of encoding.
> More layers always means more surprises.

What I meant was creating a file format so that the simplest thing that
could possible consume it also has to be correct. Another example would
be to XOR data packets with a checksum you can easily get having parsed
the previous packet properly. This is not the simplest thing that could
possibly work – but if you had to work with such a format, you would do
as intended and take no shortcuts during parsing to avoid bruteforcing.

>> >> Before you declare me crazy: I am perfectly aware that a filename with
>> >> a newline at the end is probably bogus. I use this as an exercize – to
>> >> see how I could make my own redo implementation cleaner and more safe.
>> >
>> > Assertions can provide clean and safe code too :)
>>
>> I do not understand that remark – please elaborate.
>
> I just mean that assert statements preventing all inputs and outputs
> that contain newlines would solve the problem in a very clean and
> predictable way.

This is correct for all software – and I fully support this statement.

>> > Spaces in filenames are also annoying, though not as bad as newlines.
>> > I just found some bugs in minimal/do yesterday related to spaces. I
>> > think any use of xargs also suffers from spaces, if you don't use -0,
>> > which you mostly can't because it's nonportable. Sigh.
>>
>> As I have only test my own redo implementation with GNU coreutils and
>> busybox, I strongly suspect it is not-portable already. I will either
>> find some portable way to handle this, or reject POSIX and substitute
>> GNU (and busybox).
>
> apenwarr/redo and minimal/do try very hard to be portable. I probably
> still screwed up in a few small places, but I've at least tested with
> various shells and non-GNU operating systems. I'm pretty sure that's
> very important.

Please tell me how you work on portable scripts. Which tools do you use?

> Cross-OS portability is actually so important that I'm somewhat
> tempted to bundle a whole sh and extra tools along with redo that
> everyone is using the same one. It's currently way too easy to write
> .do files containing unnecessary bashisms. We should somehow help
> people to avoid that.

I thought of having a simple grep to check for bashisms, but the problem
with even something simple like “^function ” is that it could turn up in
a here-document.

>> The first thing I did was change the output format to make it possible
>> to use redo-whichdo inside redo-ifchange for (nonexistence) dependency
>> management. My implementation of redo-whichdo outputs dofile filenames
>> until it finds one that exists – if it finds a dofile, it exits with a
>> status code indicating success, and when it reaches a filesystem root,
>> it exits with a status code indicating failure. On success, every line
>> except for the last one specifies a non-existence dependency; the last
>> line is the dofile we were looking for, and therefore the dependency …
>
> Hmm, that's a pretty good idea. I like the much simpler output
> syntax: now it's just a list of do files separated by a separator
> character.

It also means that if the exit code indicates success, the last dofile
is a dependency of that target – while other dofiles are non-existence
dependencies of the target. If the exit code indicates failure, a file
that exists and was given as argument to redo-whichdo must be a source
file and not a target.

> What you're sacrificing is auto-calculating $1 and $2, which is needed
> for a "proxy.do" kind of script that wants to call a different .do
> file. Maybe that's too fancy and the logic for that should be moved
> into proxy.do anyway, I guess.

Could you elaborate on the use cases that need this proxy.do dofile? I
mean, it kind of takes away the easy mental model for how redo dofiles
work. I get that it is possible, I do not understand how it is useful.

> Maybe I'll try that out and see how hard it is. Calculating relative
> paths in sh is unfortunately a little complicated, even though it's a
> one-liner in python.

What exactly do you need to do? If you figured it out, what did you do?

>> I want to make parsing as easy as possible, but not easier. Breaking
>> parsing attempts with a bad tool is a way to ensure that people will
>> not use them. Besides, I am still not convinced that null delimiters
>> are impossible to use in portable ways, as POSIX tr can handle them.
>
> tr apparently does provide this, but if you have to use tr to "fix"
> the NULs, you are writing portable but not-newline-safe code. If
> their choice is between newline-safety and portability, I'm pretty
> sure almost everyone will choose the latter. At that point they're
> not protected but they do have more error-prone and longer scripts,
> with extra fork/execs scattered around.

I will think about it. You were right, explicit is better than implicit.

>> This whole thing reminds me of the Fedora 19 name „Schrödinger's Cat“,
>> which led to tools that could not handle umlauts and apostrophes being
>> fixed … or maybe not, because someone changed the U+0027 APOSTROPHE to
>> U+2019 RIGHT SINGLE QUOTATION MARK – thus masking the issue with quote
>> characters not being handled right instead of fixing it.
>
> Without a doubt, we need to test properly with all sorts of extra
> characters, especially unicode characters.
>
> We should be able to handle filenames that contain anything but
> newline (shouldn't have been allowed) and nul (already not allowed).
> We *might* want to disallow other control characters < ascii 32 also,
> since they're obviously asking for trouble.

Is anything other than ANSI escape codes ruining the terminal possible?

> Note that make has already survived this long without even allowing
> spaces in filenames. So perfection isn't really a requirement here.

The make utility has already survived this long without even rebuilding
a target when its build rules change – even though it is a very obvious
flaw. I think that bar is set way too low for our redo implementations.

>> I have not yet attempted to use redo-whichdo inside redo-ifchange to
>> generate dependencies – have you tried it? That would be the primary
>> use case of redo-whichdo, or so I guess.
>
> My own redo-ifchange already creates a dependency automatically on the
> main .do file and the nonexistent .do files leading up to it.

I am now using redo-whichdo inside redo-ifchange and it works very well,
an issue being that redo-ifchange is now slower. I think I can fix that.

> I did successfully write a .do script that bounces to other .do
> scripts using redo-whichdo in the original format I described. See my
> 'wvstreams' to the mailing list for a link to an example of that in
> action. I think it should be pretty easy to parse my original format
> or your suggested format (without $1 and $2 lines and using the error
> code) from sh.

I guess: <https://github.com/apenwarr/wvstreams/blob/master/delegate.od>

I have looked at it and still do not get it. To extract the targets, one
could just call redo-targets and copy those over, or am I mistaken here?

I wonder if there is any other use case.

>> > I often wish plan 9's rc shell had caught on. At least it can handle
>> > arbitrary lists of strings, although I don't think they ever found a
>> > good way to share those lists with programs outside rc.
>>
>> I actually use the rc shell as my standard shell and I do not think it
>> can handle null bytes better. It is, however, a lot better than Bourne
>> shell ever was.
>
> No, I don't think it handles null bytes, but I think it supposedly
> does somehow manage to export lists into the environment and then take
> them out again, even if those lists contain spaces. I assume it does
> that through some kind of encoding. If there was a standard encoding
> that all shell stuff used automatically (even if it was base64...)
> then I would want to use that. Unfortunately, "whitespace" separation
> is the real standard, but nowadays that's too restrictive. Newline
> seems like an okay compromise.

The rc shell I have uses ASCII 001 SOH (start of heading) as a list separator:

--- snip ---
; env -i rc -c 'a=(1 2 3 4 5); env' |hd
00000000 50 41 54 48 3d 2f 75 73 72 2f 6c 6f 63 61 6c 2f |PATH=/usr/local/|
00000010 62 69 6e 3a 2f 75 73 72 2f 62 69 6e 3a 2f 62 69 |bin:/usr/bin:/bi|
00000020 6e 3a 2e 0a 61 3d 31 01 32 01 33 01 34 01 35 0a |n:..a=1.2.3.4.5.|
00000030
--- snap ---

>> > Have fun,
>>
>> I doubt it; programming is not fun.
>
> I guess that's a matter of opinion :)

That's just like, your opinion, man. ;)

> Avery

Nils

P.S.: I found 2 differences in our implementations, see my test scripts
for them: <http://news.dieweltistgarnichtso.net/bin/redo-sh.html#tests>

---snip---
# 2019-10-28T21:39+01:00 /tmp
; redo-whichdo default.do
/tmp/default.do.do
/default.do.do
/default.do
# exited 1
# 2019-10-28T21:39+01:00 /tmp
; $HOME/share/src/apenwarr/redo/bin/redo-whichdo default.do
default.do.do
default.do.do
default.do
../default.do.do
../default.do
# exited 1
---snap---

P.P.S.: I do not mean the relative paths, though those irritate me.