Three diffs

Andras Farkas

unread,

Jun 14, 2020, 1:19:03 AM6/14/20

to Martijn Dekker, Korn Shell

Hello! Here are three diffs I've written in the past that can make it into
https://github.com/ksh93/ksh

The most obvious is that PS1 is incorrectly stated in the man page, sh.1, as
``$''
rather than
``$ ''
https://github.com/att/ast/commit/d2044373cf93f8f45bfd04d995697837f55a63ba
In the current repository, this would affect:
https://github.com/ksh93/ksh/blob/master/src/cmd/ksh93/sh.1

Another diff is to fix printf %H
https://github.com/att/ast/commit/309e759f67ec7a62527ec4d24004bec99117f606
https://github.com/att/ast/pull/1431
As it is now, output is incompatible with _both_ HTML and with XML,
unlike what the man page states, thus making printf %H unusable for
its intended purpose. (unless the user is fine with invalid html or
xml pages)
In the current repository, this'd affect:
https://github.com/ksh93/ksh/blob/master/src/cmd/ksh93/bltins/print.c

Another diff is to fix the _cd function included
https://groups.google.com/forum/#!msg/korn-shell/TVpHd5rrI5g/-Aw-A9gCAQAJ
In the current repository, this would affect:
https://github.com/ksh93/ksh/tree/master/src/cmd/ksh93/fun

I'm willing to put in the work to rewrite these diffs for the current
repository. Just putting them out here first in case not all three
are desired.
I consider all of these to be bug fixes and thus acceptable.

Martijn Dekker

unread,

Jun 14, 2020, 3:08:13 AM6/14/20

to Korn Shell, Andras Farkas

Op 14-06-20 om 07:18 schreef Andras Farkas:

> Hello! Here are three diffs I've written in the past that can make it into
> https://github.com/ksh93/ksh
>
> The most obvious is that PS1 is incorrectly stated in the man page, sh.1, as
> ``$''
> rather than
> ``$ ''
> https://github.com/att/ast/commit/d2044373cf93f8f45bfd04d995697837f55a63ba
> In the current repository, this would affect:
> https://github.com/ksh93/ksh/blob/master/src/cmd/ksh93/sh.1

Thanks. Fixed.

> Another diff is to fix printf %H
> https://github.com/att/ast/commit/309e759f67ec7a62527ec4d24004bec99117f606
> https://github.com/att/ast/pull/1431
> As it is now, output is incompatible with _both_ HTML and with XML,
> unlike what the man page states, thus making printf %H unusable for
> its intended purpose. (unless the user is fine with invalid html or
> xml pages)
> In the current repository, this'd affect:
> https://github.com/ksh93/ksh/blob/master/src/cmd/ksh93/bltins/print.c

That is easy to backport, so I'll go ahead and do it.

> Another diff is to fix the _cd function included
> https://groups.google.com/forum/#!msg/korn-shell/TVpHd5rrI5g/-Aw-A9gCAQAJ
> In the current repository, this would affect:
> https://github.com/ksh93/ksh/tree/master/src/cmd/ksh93/fun

I think a cleaner fix would be to just set a positional parameter to
"$dir" if there aren't any positional parameters and "$dir" is
non-empty, so that we can get rid of the "${dir:-$@}" expansion that
caused the bug by preventing $@ from expanding to zero words. "$@"
behaves in special ways and being clever with it is generally
counterproductive.

diff --git a/src/cmd/ksh93/fun/dirs b/src/cmd/ksh93/fun/dirs
index 0329970..8c29ad1 100755
--- a/src/cmd/ksh93/fun/dirs
+++ b/src/cmd/ksh93/fun/dirs
@@ -55,7 +55,8 @@ function _cd
case $dir in
\~*) dir=$HOME${dir#\~}
esac
- \cd "${dir:-$@}" >| /dev/null || return 1
+ (($#)) || set -- ${dir:+"$dir"}
+ \cd "$@" >| /dev/null || return 1
dir=${OLDPWD#$HOME/}
case $TERM in
630)

On a related side note, I just acquired a second-hand copy of "The New
KornShell Command and Programming Language" by Bolsky & Korn, 1995. This
'dirs' script is listed there in full on pages 274-276, including this
bug (wrong use of $@). It also has extra comments not included in the
distribution, including this comment right before the buggy line:

# \cd prevents alias substitution.
# You can use command cd with 12/28/93 and newer.

That is wrong because ksh93 adds an obnoxious default alias:

alias command="command "

which continues alias substitution after 'command' and therefore
disables precisely the sort of use that this comment in the official
book recommends. (It also causes 'command times' to be a syntax error.
See what the 'times' alias in 93u+ expands to, to understand why.)

For 93u+m I got rid of that default alias plus another one like it:
https://github.com/ksh93/ksh/commit/61d9bca5
It made the behaviour of 'command' violate POSIX. Whoever still wants
that behaviour can easily set the alias themselves.

> I'm willing to put in the work to rewrite these diffs for the current
> repository. Just putting them out here first in case not all three
> are desired.
> I consider all of these to be bug fixes and thus acceptable.

Agreed. No worries, they're easy to backport so I'll just do it.

- Martijn

--
|| modernish -- harness the shell
|| https://github.com/modernish/modernish
||
|| KornShell lives!
|| https://github.com/ksh93/ksh

Andras Farkas

unread,

Jun 14, 2020, 4:06:16 AM6/14/20

to Martijn Dekker, Korn Shell

On Sun, Jun 14, 2020 at 3:08 AM Martijn Dekker <mar...@inlv.org> wrote:
> Op 14-06-20 om 07:18 schreef Andras Farkas:

> > The most obvious is that PS1 is incorrectly stated in the man page, sh.1, as
> > ``$''
> > rather than
> > ``$ ''
> > https://github.com/att/ast/commit/d2044373cf93f8f45bfd04d995697837f55a63ba
> > In the current repository, this would affect:
> > https://github.com/ksh93/ksh/blob/master/src/cmd/ksh93/sh.1
>
> Thanks. Fixed.

:D

> > Another diff is to fix printf %H
> > https://github.com/att/ast/commit/309e759f67ec7a62527ec4d24004bec99117f606
> > https://github.com/att/ast/pull/1431
> > As it is now, output is incompatible with _both_ HTML and with XML,
> > unlike what the man page states, thus making printf %H unusable for
> > its intended purpose. (unless the user is fine with invalid html or
> > xml pages)
> > In the current repository, this'd affect:
> > https://github.com/ksh93/ksh/blob/master/src/cmd/ksh93/bltins/print.c
>
> That is easy to backport, so I'll go ahead and do it.

Alright!

> > Another diff is to fix the _cd function included
> > https://groups.google.com/forum/#!msg/korn-shell/TVpHd5rrI5g/-Aw-A9gCAQAJ
> > In the current repository, this would affect:
> > https://github.com/ksh93/ksh/tree/master/src/cmd/ksh93/fun
>
> I think a cleaner fix would be to just set a positional parameter to
> "$dir" if there aren't any positional parameters and "$dir" is
> non-empty, so that we can get rid of the "${dir:-$@}" expansion that
> caused the bug by preventing $@ from expanding to zero words. "$@"
> behaves in special ways and being clever with it is generally
> counterproductive.
>
> diff --git a/src/cmd/ksh93/fun/dirs b/src/cmd/ksh93/fun/dirs
> index 0329970..8c29ad1 100755
> --- a/src/cmd/ksh93/fun/dirs
> +++ b/src/cmd/ksh93/fun/dirs
> @@ -55,7 +55,8 @@ function _cd
> case $dir in
> \~*) dir=$HOME${dir#\~}
> esac
> - \cd "${dir:-$@}" >| /dev/null || return 1
> + (($#)) || set -- ${dir:+"$dir"}
> + \cd "$@" >| /dev/null || return 1
> dir=${OLDPWD#$HOME/}
> case $TERM in
> 630)

Code looks good at first... but I think this messes up mcd. In
testing, mcd gives me:
ksh93: mcd[106]: _cd[60]: cd: 3: [No such file or directory]
for many of my choices, even for choices that worked before this diff.
Of course, this is with me selecting menu item 3, and clearly, cd
eventually is being told "cd 3"
Thus, I don't think the diff pasted here should be committed. I'll
look into this more in-depth sometime soon.

> On a related side note, I just acquired a second-hand copy of "The New
> KornShell Command and Programming Language" by Bolsky & Korn, 1995. This
> 'dirs' script is listed there in full on pages 274-276, including this
> bug (wrong use of $@). It also has extra comments not included in the
> distribution, including this comment right before the buggy line:

Exactly. I spotted this bug while reading that book on a vacation,
and submitted my fix when I came home and checked to see if it was
also in the /fun/ directory of the code.

>
> # \cd prevents alias substitution.
> # You can use command cd with 12/28/93 and newer.
>
> That is wrong because ksh93 adds an obnoxious default alias:
>
> alias command="command "
>
> which continues alias substitution after 'command' and therefore
> disables precisely the sort of use that this comment in the official
> book recommends. (It also causes 'command times' to be a syntax error.
> See what the 'times' alias in 93u+ expands to, to understand why.)

Definitely lotsa weird stuff in the book and in the code, I agree.

> For 93u+m I got rid of that default alias plus another one like it:
> https://github.com/ksh93/ksh/commit/61d9bca5
> It made the behaviour of 'command' violate POSIX. Whoever still wants
> that behaviour can easily set the alias themselves.

*thumbs-up*

> > I'm willing to put in the work to rewrite these diffs for the current
> > repository. Just putting them out here first in case not all three
> > are desired.
> > I consider all of these to be bug fixes and thus acceptable.
>
> Agreed. No worries, they're easy to backport so I'll just do it.

Thanks so much for checking on these and replying!
Glad to help, when I can.

Martijn Dekker

unread,

Jun 14, 2020, 11:01:19 AM6/14/20

to Korn Shell, Andras Farkas

Op 14-06-20 om 09:08 schreef Martijn Dekker:

>> Another diff is to fix printf %H
>> https://github.com/att/ast/commit/309e759f67ec7a62527ec4d24004bec99117f606
>>
>> https://github.com/att/ast/pull/1431
>> As it is now, output is incompatible with _both_ HTML and with XML,
>> unlike what the man page states, thus making printf %H unusable for
>> its intended purpose. (unless the user is fine with invalid html or
>> xml pages)
>> In the current repository, this'd affect:
>> https://github.com/ksh93/ksh/blob/master/src/cmd/ksh93/bltins/print.c
>
> That is easy to backport, so I'll go ahead and do it.

Actually, I don't think that fix is entirely right.

Removing the space to &nsbp; conversion is certainly right. It's bogus.
Spaces and non-breaking spaces are different characters.

Removing the ' to ' conversion is more problematic. While the use
of bare ' and " characters in the content between HTML tags is perfectly
valid, this is not the case in the values of HTML tag attributes, which
are quoted with either ' or ". Which means that, depending on the
quoting used, either ' or " need to be changed to their HTML entities to
use them in attribute values. Since 'printf %H' has no way of knowing
the context, they should both always be converted to be safe.

Since ' is not a valid HTML entity before HTML5, I'll just change
it to the corresponding decimal ASCII code entity, '.

And hey, look at that, there is another fix to be done: in UTF-8
locales, 'printf %H' simply deletes all non-ASCII characters!

- M.

Andras Farkas

unread,

Jun 14, 2020, 11:54:54 AM6/14/20

to Martijn Dekker, Korn Shell

On Sun, Jun 14, 2020 at 11:01 AM Martijn Dekker <mar...@inlv.org> wrote:
> Op 14-06-20 om 09:08 schreef Martijn Dekker:
> >> Another diff is to fix printf %H
> >> https://github.com/att/ast/commit/309e759f67ec7a62527ec4d24004bec99117f606
> >>
> >> https://github.com/att/ast/pull/1431
> >> As it is now, output is incompatible with _both_ HTML and with XML,
> >> unlike what the man page states, thus making printf %H unusable for
> >> its intended purpose. (unless the user is fine with invalid html or
> >> xml pages)
> >> In the current repository, this'd affect:
> >> https://github.com/ksh93/ksh/blob/master/src/cmd/ksh93/bltins/print.c
> >
> > That is easy to backport, so I'll go ahead and do it.
>
> Actually, I don't think that fix is entirely right.
>
> Removing the space to &nsbp; conversion is certainly right. It's bogus.
> Spaces and non-breaking spaces are different characters.

Agreed.

> Removing the ' to ' conversion is more problematic. While the use
> of bare ' and " characters in the content between HTML tags is perfectly
> valid, this is not the case in the values of HTML tag attributes, which
> are quoted with either ' or ". Which means that, depending on the
> quoting used, either ' or " need to be changed to their HTML entities to
> use them in attribute values. Since 'printf %H' has no way of knowing
> the context, they should both always be converted to be safe.

Yeah, I did realise that when I wrote my diff, since I hadn't thought
of your solution. That is true...

> Since ' is not a valid HTML entity before HTML5, I'll just change
> it to the corresponding decimal ASCII code entity, '.

Alright! Ideally, I do wish it was possible for users to request the
output be tailored for XML or for old HTML.
I thought I was being on the safe side with what I did (in terms of
validity) but your solution (including the numeric code) sounds good
too, since it protects people using '' for attributes.
*thumbs-up!*
:D

> And hey, look at that, there is another fix to be done: in UTF-8
> locales, 'printf %H' simply deletes all non-ASCII characters!

RIP. I see.

Martijn Dekker

unread,

Jun 14, 2020, 12:20:02 PM6/14/20

to korn-...@googlegroups.com, Andras Farkas

Op 14-06-20 om 17:54 schreef Andras Farkas:

> On Sun, Jun 14, 2020 at 11:01 AM Martijn Dekker <mar...@inlv.org> wrote:

[...]

>> Since ' is not a valid HTML entity before HTML5, I'll just change
>> it to the corresponding decimal ASCII code entity, '.
> Alright! Ideally, I do wish it was possible for users to request the
> output be tailored for XML or for old HTML.

Since this is compatible with every version of both, I think it'll do fine.

But I'm almost of a mind to change " to " as well... it looks
inelegant to have one kind of quote converted to a numeric entity and
the other not. Plus, " is actually one byte shorter than " and
when converting large texts containing a lot of quotes, that could add
up to being somewhat significant. What do you think?

> I thought I was being on the safe side with what I did (in terms of
> validity) but your solution (including the numeric code) sounds good
> too, since it protects people using '' for attributes.
> *thumbs-up!*
> :D

:)

>> And hey, look at that, there is another fix to be done: in UTF-8
>> locales, 'printf %H' simply deletes all non-ASCII characters!
> RIP. I see.

I'll create an issue about that on github. I can't immediately see a
solution.

In the C locale, 'printf %H' converts all high-byte characters to
numeric entities. But it should not need to do that at all. Both HTML
and XML are pure ASCII, so there is never any conflict.

I've tried removing what appears to be the code that convert these, but
that does not eliminate the bug that they get deleted in UTF-8 locales.

- Martijn

Andras Farkas

unread,

Jun 14, 2020, 12:40:40 PM6/14/20

to Martijn Dekker, Korn Shell

On Sun, Jun 14, 2020 at 12:20 PM Martijn Dekker <mar...@inlv.org> wrote:
>
> Op 14-06-20 om 17:54 schreef Andras Farkas:
> > On Sun, Jun 14, 2020 at 11:01 AM Martijn Dekker <mar...@inlv.org> wrote:
> [...]
> >> Since ' is not a valid HTML entity before HTML5, I'll just change
> >> it to the corresponding decimal ASCII code entity, '.
> > Alright! Ideally, I do wish it was possible for users to request the
> > output be tailored for XML or for old HTML.
>
> Since this is compatible with every version of both, I think it'll do fine.

Agreed!

> But I'm almost of a mind to change " to " as well... it looks
> inelegant to have one kind of quote converted to a numeric entity and
> the other not. Plus, " is actually one byte shorter than " and
> when converting large texts containing a lot of quotes, that could add
> up to being somewhat significant. What do you think?

I'd rather it stay "! I often read and write HTML by hand, and
readability in code is far more important than saving a few bytes. If
someone has a special need to save bytes, they'll have to figure it
out themselves, I think.
So I'd rather all four of the named entities compatible everywhere
continue using their named versions, including quot.

Martijn Dekker

unread,

Jun 14, 2020, 12:48:35 PM6/14/20

to korn-...@googlegroups.com, Andras Farkas

Op 14-06-20 om 18:40 schreef Andras Farkas:

Fair. " it is then.

Thanks for the feedback.

- M.

Martijn Dekker

unread,

Aug 10, 2020, 6:43:42 PM8/10/20

to korn-...@googlegroups.com, Andras Farkas

Op 14-06-20 om 17:19 schreef Martijn Dekker:

> Op 14-06-20 om 17:54 schreef Andras Farkas:
>> On Sun, Jun 14, 2020 at 11:01 AM Martijn Dekker <mar...@inlv.org> wrote:
[...]

>>> And hey, look at that, there is another fix to be done: in UTF-8
>>> locales, 'printf %H' simply deletes all non-ASCII characters!
>> RIP. I see.
>
> I'll create an issue about that on github. I can't immediately see a
> solution.

[...]

Meanwhile I've learned how the AST multibyte macros work while fixing
another bug. See https://github.com/ksh93/ksh/issues/5 for discussion
related to that.

So I've now been able to make 'printf %H' (as well as 'printf %#H')
UTF-8-aware and also fix a number of other problems with these,
including those discussed earlier in this thread.

So I think it's now properly fixed in the current git code:
https://github.com/ksh93/ksh/commit/8477d2ce

Please test and let me know how it works.

Andras Farkas

unread,

Aug 10, 2020, 9:25:34 PM8/10/20

to Martijn Dekker, Korn Shell

Thanks for the work!
:D

However, in the commit you link:
  is incompatible with XML (and the documentation currently
claims output of %H is compatible with both HTML and XML, which only
lets us use four named values: gt, lt, amp, quot)
Options include:
1. Just outputting the non-breaking space exactly as it was input
2. Using the numeric version (like what was chosen instead of apos)
which is  
I greatly prefer the first of these two options, as I think only five
characters should be changed in any way by %H: < > & " ' as these are
the characters semantically meaningful to XML and HTML and what a user
expects to be changed.
From the man page:
A %H format can be used instead of %s to cause characters in arg
**that are special in HTML and XML** to be output as their entity
name.
(emphasis mine)

Also, the comment /* ' (' is nonstandard) */ isn't quite right.
' is standard in XML, XHTML, and HTML5, but not in HTML. A
better comment might be /* ' (' isn't available in HTML) */

Again, thanks for the work! Thanks also for fixing %#H: I had never
looked into it before.

Martijn Dekker

unread,

Aug 11, 2020, 1:36:16 AM8/11/20

to Andras Farkas, Korn Shell

Op 11-08-20 om 02:25 schreef Andras Farkas:

>   is incompatible with XML (and the documentation currently
> claims output of %H is compatible with both HTML and XML, which only
> lets us use four named values: gt, lt, amp, quot)

Good point, I'd forgotten about XML.

> Options include:
> 1. Just outputting the non-breaking space exactly as it was input
> 2. Using the numeric version (like what was chosen instead of apos)
> which is  
> I greatly prefer the first of these two options, as I think only five
> characters should be changed in any way by %H: < > & " ' as these are
> the characters semantically meaningful to XML and HTML and what a user
> expects to be changed.

The second option is what would happen by default, as all non-printable
and non-graph ("invisible") characters except the ASCII space are
changed to numeric values. This most closely matches the old behaviour
while fixing what was broken about it, so I'm inclined to keep it,
although I'm open to further arguments.

I also think encoding non-graph characters (other than the regular ASCII
space, carriage return and newline) is more user friendly, because it
makes the difference between the various kinds of spaces visible when
reading the code. This difference is important, because adjacent ASCII
spaces and control characters are collapsed into one space in XML and
HTML, whereas the various other Unicode spaces are rendered as they are.
So if the rendering ends up with strange spacing, it would be mystifying
if you cannot see the Unicode spaces causing it in the code.

So it appears we have different views on user expectations. Could you
elaborate on why you think a user would expect only those five
characters to be changed?

Andras Farkas

unread,

Aug 11, 2020, 2:05:05 AM8/11/20

to Martijn Dekker, Korn Shell

On Tue, Aug 11, 2020 at 1:36 AM Martijn Dekker <mar...@inlv.org> wrote:
> Op 11-08-20 om 02:25 schreef Andras Farkas:
> >   is incompatible with XML (and the documentation currently
> > claims output of %H is compatible with both HTML and XML, which only
> > lets us use four named values: gt, lt, amp, quot)
>
> Good point, I'd forgotten about XML.
>
> > Options include:
> > 1. Just outputting the non-breaking space exactly as it was input
> > 2. Using the numeric version (like what was chosen instead of apos)
> > which is  
> > I greatly prefer the first of these two options, as I think only five
> > characters should be changed in any way by %H: < > & " ' as these are
> > the characters semantically meaningful to XML and HTML and what a user
> > expects to be changed.
>
> The second option is what would happen by default, as all non-printable
> and non-graph ("invisible") characters except the ASCII space are
> changed to numeric values. This most closely matches the old behaviour
> while fixing what was broken about it, so I'm inclined to keep it,
> although I'm open to further arguments.

Ah, true! I'm fine with this, too.
So, I'm fine with the second of these two options being implemented,
now, for sure.

> I also think encoding non-graph characters (other than the regular ASCII
> space, carriage return and newline) is more user friendly, because it
> makes the difference between the various kinds of spaces visible when
> reading the code. This difference is important, because adjacent ASCII
> spaces and control characters are collapsed into one space in XML and
> HTML, whereas the various other Unicode spaces are rendered as they are.
> So if the rendering ends up with strange spacing, it would be mystifying
> if you cannot see the Unicode spaces causing it in the code.

Ah, true, I don't disagree with this! It really could go either way:
there are cases where numerically-encoding Unicode spaces will make
HTML more readable, and some cases where it'll be less readable. I
still do have a preference for the first option, as it means reading
the text of an HTML file in a text editor is more natural.
I suppose the question is "What should ksh do versus what should be
left for the user to do before or after it?" What's a more generic
solution?

> So it appears we have different views on user expectations. Could you
> elaborate on why you think a user would expect only those five
> characters to be changed?

Well, my reasoning was that for someone acquainted with HTML and XML,
they probably only hope for protection from the five semantically
important characters in text they intend to put in HTML/XML elements.
You do remind me, though, that someone already using Korn shell for
this might be interested in some other characters being encoded too.
So, this is fine.

Again, thanks so much. :D

Martijn Dekker

unread,

Aug 11, 2020, 3:24:16 AM8/11/20

to Andras Farkas, Korn Shell

Op 11-08-20 om 07:04 schreef Andras Farkas:

> Well, my reasoning was that for someone acquainted with HTML and XML,
> they probably only hope for protection from the five semantically
> important characters in text they intend to put in HTML/XML elements.

I've come around to your original viewpoint, but for a different reason.

&#n; numeric entities are always Unicode, so the generation of numeric
entities was only valid if you were in a UTF-8 locale processing UTF-8
text. In another locale, say ISO-8859-1, it would happily generate
numeric entities for non-graph characters > 127 with ISO-8859-1 values,
which is invalid.

So, that's not worth doing. I'll commit a fix that only encodes the
required five characters. Thanks for your testing and input!

Andras Farkas

unread,

Aug 11, 2020, 7:55:25 PM8/11/20

to Martijn Dekker, Korn Shell

On Tue, Aug 11, 2020 at 3:24 AM Martijn Dekker <mar...@inlv.org> wrote:
>
> Op 11-08-20 om 07:04 schreef Andras Farkas:
> > Well, my reasoning was that for someone acquainted with HTML and XML,
> > they probably only hope for protection from the five semantically
> > important characters in text they intend to put in HTML/XML elements.
>
> I've come around to your original viewpoint, but for a different reason.
>
> &#n; numeric entities are always Unicode, so the generation of numeric
> entities was only valid if you were in a UTF-8 locale processing UTF-8
> text. In another locale, say ISO-8859-1, it would happily generate
> numeric entities for non-graph characters > 127 with ISO-8859-1 values,
> which is invalid.
>
> So, that's not worth doing. I'll commit a fix that only encodes the
> required five characters. Thanks for your testing and input!

Thanks!
https://github.com/ksh93/ksh/commit/e01801572da15f81d05e47883001f9a31b23f695
looks good to me!

Reply all

Reply to author

Forward