On Thu, Oct 18, 2018 at 8:27 PM Nils Dagsson Moskopp
<ni...@dieweltistgarnichtso.net> wrote:
> Avery Pennarun <apen...@gmail.com> writes:
> > On Fri, Oct 12, 2018 at 9:39 PM Nils Dagsson Moskopp
> > <ni...@dieweltistgarnichtso.net> wrote:
> >> Avery Pennarun <apen...@gmail.com> writes:
> >> > The syntax is specially designed to be easily parseable with sh 'read
> >> > x y' syntax, even for filenames that contain spaces or weird
> >> > characters. The exception is newlines: I know filenames can contain
> >> > newlines on Unix, but let's be honest, that's just awful, so
> >> > redo-whichdo intentionally aborts if you try it.
> >>
> >> Null bytes can never occur in filenames and might be a useful delimiter
> >> when the output device is not a terminal. The Bash read builtin is able
> >> to work with them. I am in the process of finding out how to do this in
> >> plain Bourne shell, since my redo implementation might also have issues
> >> with newlines in filenames (I have not tested it, just some suspicion).
> >
> > Yeah, null separators are the "right" solution for this sort of
> > problem, except that they're almost unusable in standard POSIX sh.
> > GNU extensions like "find -print0", "xargs -0", etc are all very nice,
> > but if you can't rely on them everywhere, they aren't a great choice
> > for portable programs. This is also true for bashisms in .do files.
>
> I guess you can always introduce an explicit “tr '\0' '\n'”, which makes
> it clear that the code is bound to be buggy.
I guess. I'm still nervous about it. redo is fundamentally tied to
sh scripting; \0 is very unfriendly to sh scripts. Every use of \0 is
another encouragement to dive into non-posix features like xargs -0 or
find -print0.
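To make that concrete, here is a sketch (not from any real .do file) of what the tr workaround looks like when consuming NUL-delimited data in plain POSIX sh:

```shell
# Hypothetical sketch: convert a NUL-delimited stream to newlines so
# a plain 'while read' loop can consume it.  Portable, but it silently
# reintroduces the original bug as soon as a filename itself contains
# a newline.
printf '%b' 'a b\0c d\0' | tr '\0' '\n' |
while IFS= read -r name; do
  printf 'got: %s\n' "$name"
done
```

Note the extra pipeline stage and fork/exec of tr just to get back to the newline-delimited world that sh can actually handle.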
Generally speaking, operating system portability is one of the biggest
problems for build systems; you're often trying to build a package
that contains portability workarounds for itself, but you can't build
it without portability.
Note that redo-sources, redo-targets, and redo-ood are already defined
in such a way that filenames containing newlines won't work.
> > AFAIK, there is nothing in apenwarr/redo that fundamentally wouldn't
> > work with newlines in filenames. But the proposed redo-whichdo output
> > format would have odd problems, as would the outputs of redo-targets,
> > redo-sources, and redo-ood.
>
> I am currently trying out the following approach: My implementation of
> redo-whichdo delimits filenames with newlines if stdout is a terminal,
> while delimiting filenames with nullbytes if stdout is not a terminal.
>
> ; redo-whichdo redo-sh.tar.gz
> /home/erlehmann/bin/redo-sh.tar.gz.do
> /home/erlehmann/bin/default.tar.gz.do
> /home/erlehmann/bin/default.gz.do
As python programmers like to say, explicit is better than implicit: I
don't like the idea of such a fundamental format change depending on
whether stdout is a tty or not. It's sort of okay when
enabling/disabling, say, ANSI colour codes, because they don't change
the true format of the output. But 'find -print0' doesn't stop
printing nulls just because it's on a tty. This seems risky to me.
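For concreteness, the tty-dependent behaviour in question looks something like this sketch (not actual code from any redo implementation):

```shell
# Hypothetical sketch of delimiter selection based on whether stdout
# is a terminal -- the implicit format switch being questioned above.
if [ -t 1 ]; then
  delim='\n'   # a human is watching: newline-delimited
else
  delim='\0'   # pipe or file: NUL-delimited
fi
# %b makes printf expand the backslash escape portably.
printf '%b' "file1${delim}file2${delim}"
```

Run through a pipe, `[ -t 1 ]` is false and downstream tools see NULs; run interactively, they see newlines. That kind of silent format switch is exactly what makes bugs hard to reproduce.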
> b) Splitting the input is easy if you have GNU coreutils / busybox.
That's not very helpful. I don't have to install those things if I
want to build a package that uses make+autoconf.
> > It seems better to just disallow newlines in filenames (and abort if
> > we see one) on the grounds that anyone using them is insane.
>
> I see newlines in filenames as a good litmus test. If your tooling can
> handle the newline, it is probably robust enough to handle a lot more.
>
> But then again I am a person who sends 2 gigabytes of small headers to
> ensure that an HTTP library is behaving well when encountering malice.
If you sent 2GB of small headers, I expect a well-behaved server to
realize you're an attacker and cut you off. This is my proposal for
redo: just detect that a filename contains a newline and abort with a
meaningful error message. Those files never happen unless someone is
being malicious, and the safest thing to do is stop right away.
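A minimal sketch of that policy in sh; check_name is a made-up helper, not part of any redo implementation:

```shell
# Hypothetical "abort on newline" check: refuse to emit any filename
# that contains a newline, with a meaningful error, instead of
# producing output that parsers would silently misread.
nl='
'
check_name() {
  case "$1" in
    *"$nl"*)
      echo "redo: filename contains a newline, aborting" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$1"
}

check_name 'normal file.txt'                  # accepted, printed as-is
check_name "evil${nl}name" || echo 'rejected' >&2
```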
> >> A way to account for even the laziest programmer in […]-delimited data
> >> would be to base64 encode all fields, and use a delimiter that doesn't
> >> occur in the base64 output; it is obviously not wise performance-wise.
> >
> > It's not just performance: parsing base64 just makes everything less
> > human-readable and more gross.
>
> The idea is that the simplest thing that could work (first splitting
> fields on delimiters, then base64-decoding any field) is the correct
> approach. Any protocol that permits the parser to have shortcuts can
> and will be parsed by shortcutters, so that must be made impossible.
That's far from the simplest thing that could possibly work. base64
encoders and decoders are more complicated than all the rest of redo
syntax put together :)
The simplest thing is always to restrict your inputs and outputs so
that it's hard to make mistakes, not to add more layers of encoding.
More layers always means more surprises.
> >> Before you declare me crazy: I am perfectly aware that a filename with
> >> a newline at the end is probably bogus. I use this as an exercise – to
> >> see how I could make my own redo implementation cleaner and more safe.
> >
> > Assertions can provide clean and safe code too :)
>
> I do not understand that remark – please elaborate.
I just mean that assert statements preventing all inputs and outputs
that contain newlines would solve the problem in a very clean and
predictable way.
> > Spaces in filenames are also annoying, though not as bad as newlines.
> > I just found some bugs in minimal/do yesterday related to spaces. I
> > think any use of xargs also suffers from spaces, if you don't use -0,
> > which you mostly can't because it's nonportable. Sigh.
>
> As I have only tested my own redo implementation with GNU coreutils and
> busybox, I strongly suspect it is not-portable already. I will either
> find some portable way to handle this, or reject POSIX and substitute
> GNU (and busybox).
apenwarr/redo and minimal/do try very hard to be portable. I probably
still screwed up in a few small places, but I've at least tested with
various shells and non-GNU operating systems. I'm pretty sure that's
very important.
Cross-OS portability is actually so important that I'm somewhat
tempted to bundle a whole sh and extra tools along with redo so that
everyone is using the same one. It's currently way too easy to write
.do files containing unnecessary bashisms. We should somehow help
people to avoid that.
> The first thing I did was change the output format to make it possible
> to use redo-whichdo inside redo-ifchange for (nonexistence) dependency
> management. My implementation of redo-whichdo outputs dofile filenames
> until it finds one that exists – if it finds a dofile, it exits with a
> status code indicating success, and when it reaches a filesystem root,
> it exits with a status code indicating failure. On success, every line
> except for the last one specifies a non-existence dependency; the last
> line is the dofile we were looking for, and therefore the dependency …
Hmm, that's a pretty good idea. I like the much simpler output
syntax: now it's just a list of do files separated by a separator
character.
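Under that scheme, a sh consumer could look something like this sketch (the sample lines stand in for real redo-whichdo output):

```shell
# Hypothetical consumer of the proposed redo-whichdo output: every
# line but the last names a nonexistence dependency; the last line is
# the dofile that was actually found.
whichdo_output='default.tar.gz.do
default.gz.do
default.do'

last=
while IFS= read -r dofile; do
  # The previously read line was not the final one, so it names a
  # dofile that must continue to not exist.
  [ -n "$last" ] && printf 'nonexistent: %s\n' "$last"
  last=$dofile
done <<EOF
$whichdo_output
EOF
printf 'found: %s\n' "$last"
```

Because the loop reads from a here-document rather than a pipe, `$last` survives the loop, so the final dofile is available afterwards without any subshell tricks.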
What you're sacrificing is auto-calculating $1 and $2, which is needed
for a "proxy.do" kind of script that wants to call a different .do
file. Maybe that's too fancy and the logic for that should be moved
into proxy.do anyway, I guess.
Maybe I'll try that out and see how hard it is. Calculating relative
paths in sh is unfortunately a little complicated, even though it's a
one-liner in python.
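For reference, the python one-liner is presumably os.path.relpath, which a script could shell out to; this assumes a python3 on PATH, which is exactly the kind of dependency portable .do files can't rely on:

```shell
# Hypothetical helper: compute a relative path by shelling out to
# python's os.path.relpath (assumes python3 is installed).
relpath() {
  python3 -c 'import os.path, sys; print(os.path.relpath(sys.argv[1], sys.argv[2]))' "$1" "$2"
}

relpath /home/user/bin/default.do /home/user/bin/sub   # -> ../default.do
```

Doing the same computation in pure sh means normalizing both paths, stripping the common prefix, and emitting one `..` per remaining component of the second path, which is why it stops being a one-liner.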
> I want to make parsing as easy as possible, but not easier. Breaking
> parsing attempts with a bad tool is a way to ensure that people will
> not use them. Besides, I am still not convinced that null delimiters
> are impossible to use in portable ways, as POSIX tr can handle them.
tr apparently does provide this, but if you have to use tr to "fix"
the NULs, you are writing portable but not-newline-safe code. If
their choice is between newline-safety and portability, I'm pretty
sure almost everyone will choose the latter. At that point they're
not protected but they do have more error-prone and longer scripts,
with extra fork/execs scattered around.
> This whole thing reminds me of the Fedora 19 name „Schrödinger's Cat“,
> which led to tools that could not handle umlauts and apostrophes being
> fixed … or maybe not, because someone changed the U+0027 APOSTROPHE to
> U+2019 RIGHT SINGLE QUOTATION MARK – thus masking the issue with quote
> characters not being handled right instead of fixing it.
Without a doubt, we need to test properly with all sorts of extra
characters, especially unicode characters.
We should be able to handle filenames that contain anything but
newline (shouldn't have been allowed) and nul (already not allowed).
We *might* want to disallow other control characters < ascii 32 also,
since they're obviously asking for trouble.
Note that make has already survived this long without even allowing
spaces in filenames. So perfection isn't really a requirement here.
> I have not yet attempted to use redo-whichdo inside redo-ifchange to
> generate dependencies – have you tried it? That would be the primary
> use case of redo-whichdo, or so I guess.
My own redo-ifchange already creates a dependency automatically on the
main .do file and the nonexistent .do files leading up to it.
I did successfully write a .do script that bounces to other .do
scripts using redo-whichdo in the original format I described. See my
'wvstreams' post to the mailing list for a link to an example of that in
action. I think it should be pretty easy to parse my original format
or your suggested format (without $1 and $2 lines and using the error
code) from sh.
> > I often wish plan 9's rc shell had caught on. At least it can handle
> > arbitrary lists of strings, although I don't think they ever found a
> > good way to share those lists with programs outside rc.
>
> I actually use the rc shell as my standard shell and I do not think it
> can handle null bytes better. It is, however, a lot better than Bourne
> shell ever was.
No, I don't think it handles null bytes, but it supposedly does
somehow manage to export lists into the environment and then take
them out again, even if those lists contain spaces. I assume it does
that through some kind of encoding. If there was a standard encoding
that all shell stuff used automatically (even if it was base64...)
then I would want to use that. Unfortunately, "whitespace" separation
is the real standard, but nowadays that's too restrictive. Newline
seems like an okay compromise.
> > Have fun,
>
> I doubt it; programming is not fun.
I guess that's a matter of opinion :)
Avery