
Minor updates to sort ?


Robert Elz

May 29, 2016, 1:04:41 AM
Inspired by Paul Goyette's question (on netbsd-users) I took a look at sort,
and I'd like to commit the following updates if no-one objects.

The only changes that should affect behaviour are the addition of the POSIX
-C option, which is identical to -c but doesn't write messages to stderr
if the input file is not sorted, and bug fixes in the processing of -R.
With those fixes, when -R sets the record separator to something other than
\n, \n no longer remains a field separator whatever a later -t sets (if
-t preceded -R it already worked correctly, but not the other way around).
Setting the separator to \n by value (that is, using -R 10) is also
handled correctly now.

Aside from that the changes are more or less cosmetic: they enforce using
only one of -c, -C and -m (which make no sense used together), make
usage() reflect reality (including formatting it so that it stops assuming
it is outputting to an 80 column display, and reflecting the man page
changes mentioned next), fix a minor bug in a comment, remove the unused
'x' option (what was that?) from SORT_OPTS (no effect, it generates usage()
either way) and sort the option processing (R comes before S...)

In the man page, -C is documented, the synopsis is split to show the
different usage (only one file allowed) for -C or -c, and perhaps most
importantly, the names "field1" and "field2" are changed to "kstart" and
"kend" to make it (a little more) clear that the -k argument does not specify
or use a field as such, but designates the start and end of the sort key
(with the designators using fields as an addressing object, which is all
fields are used for in sort, unlike awk, cut, etc.). -R is also now fully
documented.

There are no changes (at all) to anything actually related to sorting...

Anyone object to these changes? (patch appended)

kre

Index: msort.c
===================================================================
RCS file: /cvsroot/src/usr.bin/sort/msort.c,v
retrieving revision 1.30
diff -u -r1.30 msort.c
--- msort.c 5 Feb 2010 21:58:42 -0000 1.30
+++ msort.c 29 May 2016 05:00:34 -0000
@@ -365,7 +365,7 @@
  * check order on one file
  */
 void
-order(struct filelist *filelist, struct field *ftbl)
+order(struct filelist *filelist, struct field *ftbl, int quiet)
 {
 	get_func_t get = SINGL_FLD ? makeline : makekey;
 	RECHEADER *crec, *prec, *trec;
@@ -387,10 +387,14 @@
 		exit(0);
 	while (get(fp, crec, crec_end, ftbl) == 0) {
 		if (0 < (c = cmp(prec, crec))) {
+			if (quiet)
+				exit(1);
 			crec->data[crec->length-1] = 0;
 			errx(1, "found disorder: %s", crec->data+crec->offset);
 		}
 		if (UNIQUE && !c) {
+			if (quiet)
+				exit(1);
 			crec->data[crec->length-1] = 0;
 			errx(1, "found non-uniqueness: %s",
 			    crec->data+crec->offset);
Index: sort.1
===================================================================
RCS file: /cvsroot/src/usr.bin/sort/sort.1,v
retrieving revision 1.34
diff -u -r1.34 sort.1
--- sort.1 29 May 2013 15:00:35 -0000 1.34
+++ sort.1 29 May 2016 05:00:34 -0000
@@ -67,16 +67,26 @@
.Nd sort or merge text files
.Sh SYNOPSIS
.Nm
-.Op Fl bcdfHilmnrSsu
+.Op Fl bdfHilmnrSsu
.Oo
.Fl k
-.Ar field1 Ns Op Li \&, Ns Ar field2
+.Ar kstart Ns Op Li \&, Ns Ar kend
.Oc
.Op Fl o Ar output
.Op Fl R Ar char
.Op Fl T Ar dir
.Op Fl t Ar char
.Op Ar
+.Nm
+.Fl c|C
+.Op Fl bdfilnru
+.Oo
+.Fl k
+.Ar kstart Ns Op Li \&, Ns Ar kend
+.Op Fl t Ar char
+.Oc
+.Op Fl R Ar char
+.Op Ar file
.Sh DESCRIPTION
The
.Nm
@@ -101,6 +111,10 @@
produces no output.
See also
.Fl u .
+.It Fl C
+Identical to
+.Fl c
+without the error messages in the case of unsorted input.
.It Fl H
Ignored for compatibility with earlier versions of
.Nm .
@@ -137,9 +151,13 @@
option, check that there are no lines with duplicate keys.
.El
.Pp
-The following options override the default ordering rules.
-When ordering options appear independent of key field
-specifications, the requested field ordering rules are
+The following options,
+which should be given before any
+.Fl k
+options, override the default ordering rules.
+When ordering options appear independent of,
+and before, key field specifications,
+the requested field ordering rules are
applied globally to all sort keys.
When attached to a specific key (see
.Fl k ) ,
@@ -224,12 +242,21 @@
This should be used with discretion;
.Fl R Aq Ar alphanumeric
usually produces undesirable results.
+If char is not a single character, then it
+specifies the value of the desired record
+separator as an integer specified in any
+of the normal NNN, 0ooo, or 0xXXX ways,
+or as an octal value preceded by \e.
+Caution: do not attempt to specify Ctl-A
+as
+.Dq -R 1
+which will not do what was intended at all!
The default record separator is newline.
-.It Fl k Ar field1 Ns Op Li \&, Ns Ar field2
+.It Fl k Ar kstart Ns Op Li \&, Ns Ar kend
Designates the starting position,
-.Ar field1 ,
+.Ar kstart ,
and optional ending position,
-.Ar field2 ,
+.Ar kend ,
of a key field.
The
.Fl k
@@ -265,16 +292,16 @@
Fields are specified
by the
.Fl k
-.Ar field1 Ns Op \&, Ns Ar field2
+.Ar kstart Ns Op \&, Ns Ar kend
argument.
A missing
-.Ar field2
+.Ar kend
argument defaults to the end of a line.
.Pp
The arguments
-.Ar field1
+.Ar kstart
and
-.Ar field2
+.Ar kend
have the form
.Ar m Ns Li \&. Ns Ar n
and can be followed by one or more of the letters
@@ -284,7 +311,7 @@
.Cm r ,
which correspond to the options discussed above.
A
-.Ar field1
+.Ar kstart
position specified by
.Ar m Ns Li \&. Ns Ar n
.Pq Ar m , n No \*[Gt] 0
@@ -296,7 +323,7 @@
A missing
.Li \&. Ns Ar n
in
-.Ar field1
+.Ar kstart
means
.Ql \&.1 ,
indicating the first character of the
@@ -314,7 +341,7 @@
field.
.Pp
A
-.Ar field2
+.Ar kend
position specified by
.Ar m Ns Li \&. Ns Ar n
is interpreted as
@@ -451,7 +478,7 @@
Thus performance depends highly on efficient choice of sort keys, and the
.Fl b
option and the
-.Ar field2
+.Ar kend
argument of the
.Fl k
option should be used whenever possible.
Index: sort.c
===================================================================
RCS file: /cvsroot/src/usr.bin/sort/sort.c,v
retrieving revision 1.61
diff -u -r1.61 sort.c
--- sort.c 16 Sep 2011 15:39:29 -0000 1.61
+++ sort.c 29 May 2016 05:00:34 -0000
@@ -117,7 +117,7 @@
 main(int argc, char *argv[])
 {
 	int ch, i, stdinflag = 0;
-	char cflag = 0, mflag = 0;
+	char mode = 0;
 	char *outfile, *outpath = 0;
 	struct field *fldtab;
 	size_t fldtab_sz, fld_cnt;
@@ -145,9 +145,9 @@
 	fldtab = emalloc(fldtab_sz * sizeof(*fldtab));
 	memset(fldtab, 0, fldtab_sz * sizeof(*fldtab));

-#define SORT_OPTS "bcdD:fHik:lmno:rR:sSt:T:ux"
+#define SORT_OPTS "bcCdD:fHik:lmno:rR:sSt:T:u"

-	/* Convert "+field" args to -f format */
+	/* Convert "+field" args to -k format */
 	fixit(&argc, argv, SORT_OPTS);

 	if (!(tmpdir = getenv("TMPDIR")))
@@ -158,8 +158,10 @@
 		case 'b':
 			fldtab[0].flags |= BI | BT;
 			break;
-		case 'c':
-			cflag = 1;
+		case 'c': case 'C': case 'm':
+			if (mode)
+				usage("Incompatible operation modes");
+			mode = ch;
 			break;
 		case 'D': /* Debug flags */
 			for (i = 0; optarg[i]; i++)
@@ -179,15 +181,33 @@

 			setfield(optarg, &fldtab[++fld_cnt], fldtab[0].flags);
 			break;
-		case 'm':
-			mflag = 1;
-			break;
 		case 'o':
 			outpath = optarg;
 			break;
 		case 'r':
 			REVERSE = 1;
 			break;
+		case 'R':
+			if (REC_D != '\n')
+				usage("multiple record delimiters");
+			REC_D = *optarg;
+			if (optarg[1] != '\0') {
+				char *ep;
+				int t = 0;
+
+				if (optarg[0] == '\\')
+					optarg++, t = 8;
+				REC_D = (int)strtol(optarg, &ep, t);
+				if (*ep != '\0' || REC_D < 0 ||
+				    REC_D >= (int)__arraycount(d_mask))
+					errx(2, "invalid record delimiter %s",
+					    optarg);
+			}
+			if (REC_D == '\n')
+				break;
+			d_mask['\n'] = d_mask[' '];
+			d_mask[REC_D] = REC_D_F;
+			break;
 		case 's':
 			/*
 			 * Nominally 'stable sort', keep lines with equal keys
@@ -213,30 +233,11 @@
 			SEP_FLAG = 1;
 			d_mask[' '] &= ~FLD_D;
 			d_mask['\t'] &= ~FLD_D;
+			d_mask['\n'] &= ~FLD_D;
 			d_mask[(u_char)*optarg] |= FLD_D;
 			if (d_mask[(u_char)*optarg] & REC_D_F)
 				errx(2, "record/field delimiter clash");
 			break;
-		case 'R':
-			if (REC_D != '\n')
-				usage("multiple record delimiters");
-			REC_D = *optarg;
-			if (REC_D == '\n')
-				break;
-			if (optarg[1] != '\0') {
-				char *ep;
-				int t = 0;
-				if (optarg[0] == '\\')
-					optarg++, t = 8;
-				REC_D = (int)strtol(optarg, &ep, t);
-				if (*ep != '\0' || REC_D < 0 ||
-				    REC_D >= (int)__arraycount(d_mask))
-					errx(2, "invalid record delimiter %s",
-					    optarg);
-			}
-			d_mask['\n'] = d_mask[' '];
-			d_mask[REC_D] = REC_D_F;
-			break;
 		case 'T':
 			/* -T tmpdir */
 			tmpdir = optarg;
@@ -254,13 +255,13 @@
 		/* Don't sort on raw record if keys match */
 		posix_sort = 0;

-	if (cflag && argc > optind+1)
+	if ((mode == 'c' || mode == 'C') && argc > optind+1)
 		errx(2, "too many input files for -c option");
 	if (argc - 2 > optind && !strcmp(argv[argc-2], "-o")) {
 		outpath = argv[argc-1];
 		argc -= 2;
 	}
-	if (mflag && argc - optind > (MAXFCT - (16+1))*16)
+	if (mode == 'm' && argc - optind > (MAXFCT - (16+1))*16)
 		errx(2, "too many input files for -m option");

 	for (i = optind; i < argc; i++) {
@@ -309,8 +310,8 @@
 		num_input_files = argc - optind;
 	}

-	if (cflag) {
-		order(&filelist, fldtab);
+	if (mode == 'c' || mode == 'C') {
+		order(&filelist, fldtab, mode == 'C');
 		/* NOT REACHED */
 	}

@@ -348,7 +349,7 @@
 			err(2, "output file %s", outfile);
 	}

-	if (mflag)
+	if (mode == 'm')
 		fmerge(&filelist, num_input_files, outfp, fldtab);
 	else
 		fsort(&filelist, num_input_files, outfp, fldtab);
@@ -393,13 +394,20 @@
 static void
 usage(const char *msg)
 {
+	const char *pn = getprogname();
+
 	if (msg != NULL)
 		(void)fprintf(stderr, "%s: %s\n", getprogname(), msg);
 	(void)fprintf(stderr,
-	    "usage: %s [-bcdfHilmnrSsu] [-k field1[,field2]] [-o output]"
-	    " [-R char] [-T dir]", getprogname());
+	    "usage: %s [-bdfHilmnrSsu] [-k kstart[,kend]] [-o output]"
+	    " [-R char] [-T dir]\n", pn);
 	(void)fprintf(stderr,
 	    " [-t char] [file ...]\n");
+	(void)fprintf(stderr,
+	    " or: %s -[cC] [-bdfilnru] [-k kstart[,kend]] [-o output]"
+	    " [-R char]\n", pn);
+	(void)fprintf(stderr,
+	    " [-t char] [file]\n");
 	exit(2);
 }

Index: sort.h
===================================================================
RCS file: /cvsroot/src/usr.bin/sort/sort.h,v
retrieving revision 1.35
diff -u -r1.35 sort.h
--- sort.h 5 Aug 2015 07:10:03 -0000 1.35
+++ sort.h 29 May 2016 05:00:34 -0000
@@ -191,7 +191,7 @@
int makeline(FILE *, RECHEADER *, u_char *, struct field *);
void makeline_copydown(RECHEADER *);
int optval(int, int);
-__dead void order(struct filelist *, struct field *);
+__dead void order(struct filelist *, struct field *, int);
void putline(const RECHEADER *, FILE *);
void putrec(const RECHEADER *, FILE *);
void putkeydump(const RECHEADER *, FILE *);

Robert Elz

May 29, 2016, 6:54:55 AM
Date: Sat, 28 May 2016 22:29:51 -0700
From: Alistair Crooks <a...@pkgsrc.org>
Message-ID: <CAN5gJXqdQR2Di0+RdH8esA+m...@mail.gmail.com>

| Living on the edge and top posting aggressively, these changes would be good
| to have.

OK, but I will wait a day or two more to allow for more comments (or
more suggested improvements.)

I was asked (in off-list mail) to also add something to the manual about
how multiple -k args work. This is more verbose than what was suggested...

sort compares records by comparing the key fields selected by -k
arguments, from first given to last, until discovering a difference. If
there are no -k arguments, the whole record is treated as the key. After
exhausting the -k arguments, if no difference has been found, then the
result depends upon the -u and -S option settings. With -u the records
are considered identical, and one is suppressed. Otherwise with -s set
(default) the records are left in their original order, or with -S (POSIX
mode) the whole record is considered as a tie breaker.

I also (now) have (currently anyway) in the proposed updated man page, the
following commented text immediately after the paragraph above ...

.\"
.\" If you fail to understand why it doesn't matter which order
.\" the records are output when they are wholly identical, there
.\" is nothing that this man page can say that will help!
.\"

(This is because there is a comment in the posix spec about this
being unspecified...)

kre

Paul Goyette

May 29, 2016, 7:20:25 AM
(Commenting in-line, not top-posting!)

On Sun, 29 May 2016, Robert Elz wrote:

> Date: Sat, 28 May 2016 22:29:51 -0700
> From: Alistair Crooks <a...@pkgsrc.org>
> Message-ID: <CAN5gJXqdQR2Di0+RdH8esA+m...@mail.gmail.com>
>
> | Living on the edge and top posting aggressively, these changes would be good
> | to have.

I also agree - these are all good changes.

> OK, but I will wait a day or two more to allow for more comments (or
> more suggested improvements.)
>
> I was asked (in off-list mail) to also add something to the manual about
> how multiple -k args work. This is more verbose than what was suggested...

(I made this off-list comment...)

> sort compares records by comparing the key fields selected by -k
> arguments, from first given to last, until discovering a difference. If
> there are no -k arguments, the whole record is treated as the key. After
> exhausting the -k arguments, if no difference has been found, then the
> result depends upon the -u and -S option settings. With -u the records
> are considered identical, and one is suppressed. Otherwise with -s set
> (default) the records are left in their original order, or with -S (POSIX
> mode) the whole record is considered as a tie breaker.

Maybe update 2nd sentence?

If there are no -k arguments, the whole record is treated as a
single key.
>
> I also (now) have (currently anyway) in the proposed updated man page, the
> following commented text immediately after the paragraph above ...
>
> .\"
> .\" If you fail to understand why it doesn't matter which order
> .\" the records are output when they are wholly identical, there
> .\" is nothing that this man page can say that will help!
> .\"
>
> (This is because there is a comment in the posix spec about this
> being unspecified...)

Posix can be so helpful at times, and at other times, not so much ...



+------------------+--------------------------+------------------------+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+

Robert Elz

May 29, 2016, 8:41:32 AM
Date: Sun, 29 May 2016 19:19:45 +0800 (PHT)
From: Paul Goyette <pa...@whooppee.com>
Message-ID: <Pine.NEB.4.64.1...@pokey.whooppee.com>

| Maybe update 2nd sentence?

Done (tentatively, for now, of course).

kre

Alistair Crooks

May 30, 2016, 1:09:47 AM
Living on the edge and top posting aggressively, these changes would be good
to have.

Thanks,
Alistair