
I used to be a Unix snob


paul.d...@gmail.com

Feb 10, 2015, 12:00:32 AM
Not a guru by any means, mind you, but pretty smug about all the things that were easy to do in Unix (well, Cygwin) compared to one other major unnamed OS that had taken over the world.

Then one day, after hours of troubleshooting an analysis, I discovered that "uniq" doesn't really return the unique lines in a file. Unlike DISTINCT in SQL, unlike Excel's copy unique records, and unlike uniquification in Matlab.

From that day on, I got a new appreciation for the reality that every environment has its warts.

Then I started to wrap up to get to bed well after midnight.

Janis Papanagnou

Feb 10, 2015, 6:28:40 AM
On 10.02.2015 06:00, paul.d...@gmail.com wrote:
>
> Then one day, after hours of troubleshooting an analysis, I discovered that
> "uniq" doesn't really return the unique lines in a file. Unlike DISTINCT
> in SQL, unlike Excel's copy unique records, and unlike uniquification in
> Matlab.

Mind, uniq != distinct, and a file is (unlike a database) a sequential
medium, and the Unix tools typically also have to work on streams of data.

>
> From that day on, I got a new appreciation for the reality that every
> environment has its warts.

From your posting it's unclear what you think the warts are. The functions
supported by uniq are flexibly controllable by options and well documented.

uniq - report or omit repeated lines

Repeated means _consecutive_ duplicates. Thus you need to sort the data if
you want the duplicates removed. The sort(1) command also has a -u switch.
And it's all documented in the man page:

Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use `sort -u' without `uniq'.

If you also want to remove non-consecutive duplicates across the whole file
without sorting the data, you have to choose another approach. One way to
achieve that is

awk '!a[$0]++'

You can put that code into a function or executable file called 'distinct'.
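To see the difference concretely, here is a small shell session (the sample data is made up for illustration):

```shell
# Sample input with non-adjacent duplicates.
printf '%s\n' a b a b > input.txt

uniq input.txt             # a b a b  - nothing adjacent to squash
sort -u input.txt          # a b      - deduplicated, but in sorted order
awk '!a[$0]++' input.txt   # a b      - deduplicated, original order kept
```

The awk filter prints a line only the first time its counter is zero, so it behaves like DISTINCT while preserving input order.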

>
> Then I started to wrap up to get to bed well after midnight.

Tomorrow everything certainly looks more optimistic. :-)

Janis

paul.d...@gmail.com

Feb 10, 2015, 7:13:40 AM
On Tuesday, February 10, 2015 at 6:28:40 AM UTC-5, Janis Papanagnou
wrote:
> On 10.02.2015 06:00, paul.domaskis_AT_gmail.com wrote:
>> Then one day, after hours of troubleshooting an analysis, I
>> discovered that "uniq" doesn't really return the unique lines in a
>> file. Unlike DISTINCT in SQL, unlike Excel's copy unique records,
>> and unlike uniquification in Matlab.
>
> Mind, uniq != distinct, and a file is (contrary to a database) a
> sequential medium, and the Unix tools typically have to work also on
> streams of data.
>
>> From that day on, I got a new appreciation for the reality that
>> every environment has its warts.
>
> From your posting it's unclear what you think the warts are. The
> functions supported by uniq are flexible controllable by options and
> well documented.
>
> uniq - report or omit repeated lines
>
> Repeated means _consecutive_ duplicates. Thus you need to sort the
> data if you want the duplicates removed. The sort(1) command has
> also a -u switch. And it's all documented in the man page:

<...snip...>

Well, uniq is an obvious play on the word "unique". Removing
consecutive duplicate lines is far, far different from uniquifying
the line. Yes, it is described in the man page, but the wart (as I
see it) is that the command name is misleading. Perhaps the missing
"e" in the name can serve as a constant reminder that it does a
local unique among adjacent lines rather than ensuring complete
uniqueness in a file (or stream).

Janis Papanagnou

Feb 10, 2015, 7:29:35 AM
On 10.02.2015 13:13, paul.d...@gmail.com wrote:
>
> Well, uniq is an obvious play on the word "unique".

Or just an abbreviated form to save characters. (Mind the infamous
'creat' system call? Just silly!)

> Removing
> consecutive duplicate lines is far, far different from uniquifying
> the line. Yes, it is described in the man page, but the wart (as I
> see it) is that the command name is misleading.

I think what one associates with that name depends primarily on one's
personal background. Actually, depending on the chosen options, uniq can do
very different things. In the Unix world it's not the best approach to
"just use" the commands without reading what they actually do.

Janis

paul.d...@gmail.com

Feb 10, 2015, 10:26:27 AM
On Tuesday, February 10, 2015 at 7:29:35 AM UTC-5, Janis Papanagnou
wrote:
Of course, the more familiar one is with the quirks of a language or
computing environment, the less likely one is to be tripped up by
misleading command names. That goes for any language. But a command
name that does not accurately convey the command's function is simply
a highly nonideal situation. It *requires* that you not trust the
apparent meaning of commands in the language, and requires that you
look it up every time (unless you use it every day) or else get bitten
in the rear. Languages (or executable names) don't have to be crafted
like that.

Dan Espen

Feb 10, 2015, 10:34:45 AM
The tiniest bit of reflection should tell the user that
"uniq" would _not_ deal with non-adjacent duplicates.

Adjacent duplicates can be dealt with trivially, non-adjacent
requires a much different approach.

I can't say why the OP might have thought otherwise,
but I don't think that's a common misconception.

--
Dan Espen

Keith Keller

Feb 10, 2015, 1:50:07 PM
On 2015-02-10, paul.d...@gmail.com <paul.d...@gmail.com> wrote:
>
> Then one day, after hours of troubleshooting an analysis, I discovered that "uniq" doesn't really return the unique lines in a file. Unlike DISTINCT in SQL, unlike Excel's copy unique records, and unlike uniquification in Matlab.

Of course it doesn't. Its man page says so very explicitly. Sometimes
man pages can be a pain to deal with, especially when you don't know the
command you want, or have to navigate an enormous man page (I'm looking
at you, bash!), but in this case, you clearly know the command, the man
page is short and easy to read, so you really have no excuse for not
knowing this is how uniq works.

--keith

--
kkeller...@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information

Kaz Kylheku

Feb 10, 2015, 2:56:20 PM
On 2015-02-10, paul.d...@gmail.com <paul.d...@gmail.com> wrote:
> Well, uniq is an obvious play on the word "unique". Removing
> consecutive duplicate lines is far, far different from uniquifying
> the line.

You have an entirely valid point.

I can't think of any language in which any cognate of the word "unique" refers
to squashing runs of identical items to a single instance, rather than
eliminating duplicates throughout the sequence.

Kenny McCormack

Feb 10, 2015, 3:10:18 PM
In article <201502101...@kylheku.com>,
Thank you for that. Your other posts were making you out to be the typical
Unix snob - that is, the sort who say that "It's in the man page" or "It's
in the standards document(s)", as if that ends all discussions and answers
all questions.

The fact is that the name "uniq" *is* somewhat misleading - and, as you
say, doesn't correspond to the usage of that word in any known spoken
language. So, the question becomes: What would have been a better name for
it?

One that I've used in some instances and which bears some consideration is
"squeeze".

--
Atheism:
It's like being the only sober person in the car, and nobody will let you drive.

Kaz Kylheku

Feb 10, 2015, 3:11:50 PM
On 2015-02-10, Keith Keller <kkeller...@wombat.san-francisco.ca.us> wrote:
> On 2015-02-10, paul.d...@gmail.com <paul.d...@gmail.com> wrote:
>>
>> Then one day, after hours of troubleshooting an analysis, I discovered that
>> "uniq" doesn't really return the unique lines in a file. Unlike DISTINCT in
>> SQL, unlike Excel's copy unique records, and unlike uniquification in
>> Matlab.
>
> Sometimes man pages can be a pain to deal with, especially when you don't
> know the command you want, or have to navigate an enormous man page (I'm
> looking at you, bash!), but in this case, you clearly know the command, the
> man page is short and easy to read, so you really have no excuse for not
> knowing this is how uniq works.

I like enormous man pages, because there is only one X such that "man X" gets
you everything. You just need a good pager with forward and backward regex
searching, like less.

In fact, I like enormous man pages so much that I wrote a man page which troffs
to a PDF that exceeds 260 pages in letter size with reasonable margins, and
that's with no table of contents or index.

Woe to the user who can't find what he or she is looking for in a man page and
has to use GNU Info. Or who has to first read a man page to figure out what
other man page he or she must read. (I'm looking at you, Perl!)

Kaz Kylheku

Feb 10, 2015, 3:34:15 PM
On 2015-02-10, Kenny McCormack <gaz...@shell.xmission.com> wrote:
> In article <201502101...@kylheku.com>,
> Kaz Kylheku <k...@kylheku.com> wrote:
>>On 2015-02-10, paul.d...@gmail.com <paul.d...@gmail.com> wrote:
>>> Well, uniq is an obvious play on the word "unique". Removing
>>> consecutive duplicate lines is far, far different from uniquifying
>>> the line.
>>
>>You have an entirely valid point.
>>
>>I can't think of any language in which any cognate of the word "unique" refers
>>to squashing runs of identical items to a single instance, rather than
>>eliminating duplicates throughout the sequence.
>
> Thank you for that. Your other posts were making you out to be the typical
> Unix snob - that is, the sort who say that "It's in the man page" or "It's
> in the standards document(s)", as if that ends all discussions and answers
> all questions.

I just found a counterexample: D language's uniq:

http://dlang.org/phobos/std_algorithm.html#.uniq

*facepalm*

Evidently, this is consciously based on Unix.

> The fact is that the name "uniq" *is* somewhat misleading - and, as you
> say, doesn't correspond to the usage of that word in any known spoken
> language. So, the question becomes: What would have been a better name for
> it?
>
> One that I've used in some instances and which bears some consideration is
> "squeeze".

As far as I can tell, repetitions of an item in a sequence are called "runs" by
English-speaking discrete mathematicians. The familiar term "run-length
encoding" refers to a compression scheme that replaces runs of items with a
single item and the run length for that item.

Something which eliminates runs can just be called "noruns".

sort | noruns

That's too long, so of course "nr" it has to be. Other possibilities
could be variations on "suppress", "squelch", "condense" and such.

:)

Oh, oh, I have a good one, haha:

sort | ecan # telephony term: "echo cancelation". :) :) :)

Harkens back to Bell Labs and everything, and also rides on an existing meaning
of echo in character processing:

ecan <<!
Hey
Hey
Hey
Can you hear me?
Can you hear me?
Can you hear me?
!
Hey
Can you hear me?

Barry Margolin

Feb 10, 2015, 3:43:43 PM
In article <00ab03de-72fe-4ec1...@googlegroups.com>,
Compared to most Unix commands, "uniq" is a paragon of accuracy.
Consider one of the oldest, most common commands: "cat". Who would ever
guess that this is the name for printing the contents of a file on the
terminal? Practically every other OS has used more obvious names like
"print", "type", or "list".

Then there are the interactive versions, "more" and "less". "more" is
named for the fact that it displays a "--More--" prompt at the end of
each screenful, so it's reasonably mnemonic. But "less" is simply a play
on words. If someone doesn't know about "more" (and why would a
newbie?), the name "less" makes absolutely no sense.

Face it, the Unix CLI was not designed for the kind of people who need
everything spelled out for them. If you need that, we have GUIs and
menus now.

--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***

Barry Margolin

Feb 10, 2015, 3:48:56 PM
In article <mbd8dd$89s$2...@dont-email.me>, Dan Espen <des...@verizon.net>
wrote:

> The tiniest bit of reflection should tell the user that
> "uniq" would _not_ deal with non-adjacent duplicates.

Why wouldn't it? Because it would have to save all the lines it has
seen, and generate the results after reading all the input? "sort" has
to do that, so why would it be obvious that "uniq" wouldn't do something
similar?

There are some programs that can generate output on the fly as the input
is coming, others that have to wait until they get everything. While the
category is often intuitively obvious ("tr" doesn't need to wait, "sort"
does), there are cases like "uniq" where the designer had to make an
arbitrary decision.
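For what it's worth, the adjacent-only behaviour is exactly what makes streaming trivial: only the previous line needs to be remembered. A rough awk equivalent of plain uniq (my sketch, not the real implementation) is:

```shell
# Print a line only when it differs from the previous one;
# a single line of state, so output flows as input arrives.
printf '%s\n' a a b b a | awk 'NR == 1 || $0 != prev { print } { prev = $0 }'
# a
# b
# a
```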

Kaz Kylheku

Feb 10, 2015, 3:53:58 PM
On 2015-02-10, Barry Margolin <bar...@alum.mit.edu> wrote:
> Then there are the interactive versions, "more" and "less". "more" is
> named for the fact that it displays a "--More--" prompt at the end of
> each screenful, so it's reasonably mnemonic. But "less" is simply a play
> on words. If someone doesn't know about "more" (and why would a
> newbie?), the name "less" makes absolutely no sense.

less is not just a play on "more". In fact, the name recognizes the fact that
one purpose of pagers is to reduce output. By searching for what you want, you
end up displaying and reading much less material than the complete output of
the command.

Kaz Kylheku

Feb 10, 2015, 3:57:43 PM
On 2015-02-10, Barry Margolin <bar...@alum.mit.edu> wrote:
> In article <mbd8dd$89s$2...@dont-email.me>, Dan Espen <des...@verizon.net>
> wrote:
>
>> The tiniest bit of reflection should tell the user that
>> "uniq" would _not_ deal with non-adjacent duplicates.
>
> Why wouldn't it? Because it would have to save all the lines it has
> seen, and generate the results after reading all the input? "sort" has
> to do that, so why would it be obvious that "uniq" wouldn't do something
> similar.

Not to mention that "unique" means "occurs only once; has no equal".

A "unique individual" is not merely different from his two next door
neighbors.

Barry Margolin

Feb 10, 2015, 4:01:51 PM
In article <201502101...@kylheku.com>,
Kaz Kylheku <k...@kylheku.com> wrote:

That sounds like a retcon to me. The Wikipedia page agrees with my
explanation:

Mark Nudelman initially wrote less during 1983-85, in the need of a
version of more able to do backward scrolling of the displayed text. The
name came from the joke of doing "backwards more."

Stephane Chazelas

Feb 10, 2015, 4:10:10 PM
2015-02-10 12:28:37 +0100, Janis Papanagnou:
[...]
> uniq - report or omit repeated lines
>
> Repeated means _consecutive_ duplicates. Thus you need to sort the data if
> you want the duplicates removed. The sort(1) command has also a -u switch.
> And it's all documented in the man page:
>
> Note: 'uniq' does not detect repeated lines unless they are adjacent.
> You may want to sort the input first, or use `sort -u' without `uniq'.
[...]

That description is still misleading.

uniq doesn't remove repeated lines; it removes all but the first
of a sequence of consecutive lines that sort the same.

$ printf '%b\n' '\u2460' '\u2461' '\u2462' | recode ..dump
UCS2 Mne Description

2460 1-o circled digit one
000A LF line feed (lf)
2461 2-o circled digit two
000A LF line feed (lf)
2462 3-o circled digit three
000A LF line feed (lf)

$ printf '%b\n' '\u2460' '\u2461' '\u2462' | uniq | recode ..dump
UCS2 Mne Description

2460 1-o circled digit one
000A LF line feed (lf)

In all UTF-8 locales on my (GNU) system, those characters (like
thousands of others) have no sorting order defined, so they just
sort the same.

If you want uniq to report unique lines, you need to fix the
locale to C (for sort as well)

LC_ALL=C sort | LC_ALL=C uniq

or:

LC_ALL=C sort -u

--
Stephane

Stephane Chazelas

Feb 10, 2015, 4:15:09 PM
2015-02-10 15:43:39 -0500, Barry Margolin:
[...]
> Compared to most Unix commands, "uniq" is a paragon of accuracy.
> Consider one of the oldest, most common commands: "cat". Who would ever
> guess that this is the name for printing the contents of a file on the
> terminal?

cat is not any more the command for printing the contents of a
file than paste or cut -b1- or sed '' or dd or grep '^' or tail
-n+1...

It's the command to concatenate.

Like paste or dd, it can also be used to print the contents of a
file.

--
Stephane

frank.w...@gmail.com

Feb 10, 2015, 4:17:29 PM
From KazKylheku:
>Something which eliminates runs can just be called "noruns".
>
> sort | noruns
>
>That's too long, so of course "nr" it has to be.

'nr' is "number" or "numerical recipes". Here are some
which are the same length as 'uniq':

frsl: filter redundant sequential lines
fdsl: filter duplicate sequential lines
rdsl: remove duplicate sequential lines
rrsl: remove redundant sequential lines

Or 'ersl', 'edsl', with "eliminate" as the first word. I
think if we try hard enough we can make a four-letter
acronym that is already a word.

How about 'uniq' = "universally negate identicals quickly"?

Frank

Kaz Kylheku

Feb 10, 2015, 4:48:23 PM
On 2015-02-10, Stephane Chazelas <stephane...@gmail.com> wrote:
> 2015-02-10 15:43:39 -0500, Barry Margolin:
> [...]
>> Compared to most Unix commands, "uniq" is a paragon of accuracy.
>> Consider one of the oldest, most common commands: "cat". Who would ever
>> guess that this is the name for printing the contents of a file on the
>> terminal?
>
> cat is not any more the command for printing the contents of a
> file than paste or cut -b1- or sed '' or dd or grep '^' or tail
> -n+1...
>
> It's the command to concatenate.

It's the command to catenate, named by someone who perhaps knew
that "concatenate" contains a redundant "con-" prefix.

When pieces are catenated, of course it is together that they are catenated
("con").

Things are never catenated apart. Or ... is that really so?

If we coin the word "decatenate" or "discatenate" (break the links in something
so that shorter pieces result), then we need "concatenate".

Usage:

"Jack reached into the fridge and decatenated a couple of links of sausage,
painstakingly arranging them for the most even heat exposure on the same
thickly varnished piece of aluminum-foil that has been lining the
rack of his toaster oven for the past seven years."

Sounds good to me!

Kaz Kylheku

Feb 10, 2015, 4:55:54 PM
On 2015-02-10, frank.w...@gmail.com <frank.w...@gmail.com> wrote:
> From KazKylheku:
>>Something which eliminates runs can just be called "noruns".
>>
>> sort | noruns
>>
>>That's too long, so of course "nr" it has to be.
>
> 'nr' is "number" or "numerical recipes". Here are some

"Nr" isn't number in the English-speaking world; the contraction is "No.".

The connection to "numerical recipes" is just fishing. In that case "ls"
can't be used for listing files because it stands for "low sodium".
Not to mention, "lip service", "Levi Strauss" and, of course,
"lectori salutem".

"mv" is momentum ( mass x velocity ) to any geek, so out of the question.

> which are the same length as 'uniq':
>
> frsl: filter redundant sequential lines

Fellow of the Royal Society of Literature.

> fdsl: filter duplicate sequential lines

Fiber-optic digital subscriber line.

> rdsl: remove duplicate sequential lines

Remote DSL.

> rrsl: remove redundant sequential lines

Round-Robin Scheduling List

Hactar

Feb 10, 2015, 5:08:06 PM
In article <mbctjc$uqk$1...@news.m-online.net>,
Case in point: "killall"

--
-eben QebWe...@vTerYizUonI.nOetP ebmanda.redirectme.net:81
LIBRA: A big promotion is just around the corner for someone
much more talented than you. Laughter is the very best medicine,
remember that when your appendix bursts next week. -- Weird Al

Dan Espen

Feb 10, 2015, 5:37:46 PM
Barry Margolin <bar...@alum.mit.edu> writes:

> In article <mbd8dd$89s$2...@dont-email.me>, Dan Espen <des...@verizon.net>
> wrote:
>
>> The tiniest bit of reflection should tell the user that
>> "uniq" would _not_ deal with non-adjacent duplicates.
>
> Why wouldn't it? Because it would have to save all the lines it has
> seen, and generate the results after reading all the input? "sort" has
> to do that, so why would it be obvious that "uniq" wouldn't do something
> similar.

Because a uniq would have to do more than a sort.
After it sorts to find and remove the duplicates,
it then has to sort again to get back to the original order,
then write its output.
To get back to the original order, you'd need to sequence-number
each line in the original file, then remove the numbers at the end.
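That decorate-sort-undecorate scheme can be sketched as a pipeline (a sketch assuming GNU sort, whose -s flag makes the first input occurrence of each key the one that survives -u):

```shell
# Number the lines, deduplicate on the text (field 2 onward),
# restore original order by line number, then strip the numbers.
printf '%s\n' b a b a |
  cat -n | sort -s -k2 -u | sort -n | cut -f2-
# b
# a
```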

> There are some programs that can generate output on the fly as the input
> is coming, others that have to wait until they get everything. While the
> category is often intuitively obvious ("tr" doesn't need to wait, "sort"
> does), there are cases like "uniq" where the designer had to make an
> arbitrary decision.

Still seems to me like it's more than reasonable to expect a
"uniq" command to handle the much simpler
adjacent duplicates problem.

--
Dan Espen

Janis Papanagnou

Feb 10, 2015, 6:02:27 PM
On 10.02.2015 19:44, Keith Keller wrote:
>
> Of course it doesn't. Its man page says so very explicitly. Sometimes
> man pages can be a pain to deal with, especially when you don't know the
> command you want,

I occasionally still use the (really old) apropos command for that
purpose. Not widely known, it seems, but quite helpful if you don't
know what command(s) to look for.

> or have to navigate an enormous man page (I'm looking at you, bash!),

Extremely long man pages repel me also, but they are still better than
the following man page entry (that you often find on Linux systems)...

"The full documentation for ... is maintained as a Texinfo manual."

Doh!

Janis

Janis Papanagnou

Feb 10, 2015, 6:27:33 PM
On 10.02.2015 21:10, Kenny McCormack wrote:
>
> The fact is that the name "uniq" *is* somewhat misleading - and, as you
> say, doesn't correspond to the usage of that word in any known spoken
> language. So, the question becomes: What would have been a better name for
> it?

Ignoring the history of Unix commands (and their naming), I agree that a
selection of good names would be helpful. Has the OP wrongly assumed that
the names had been chosen sensibly, and in consequence just not read
the manuals or tutorials? Anyway. Recklessly inferring function from name
is bad enough - i.e. without testing or reading the documentation first -
and in Unix you have all sorts of inhomogeneous and inconsistent names,
acronyms, or just plays on words, as command names.

Janis

Just typed "rm -rf *" and it didn't execute the 'Read Manual' program
with option 'Read Fast' about '*' (stars and stellar issues). - Have fun.
;-p

paul.d...@gmail.com

Feb 10, 2015, 8:59:15 PM
On Tuesday, February 10, 2015 at 10:34:45 AM UTC-5, Dan Espen wrote:
The tiniest bit of reflection would probably be based on the meaning of the word "unique", which has a specific meaning. I can't fathom why the respondent would think otherwise.

paul.d...@gmail.com

Feb 10, 2015, 9:01:33 PM
On Tuesday, February 10, 2015 at 1:50:07 PM UTC-5, Keith Keller wrote:
> On 2015-02-10, paul.domaskis_at_gmail.com <paul.domaskis_at_gmail.com> wrote:
> >
> > Then one day, after hours of troubleshooting an analysis, I discovered that "uniq" doesn't really return the unique lines in a file. Unlike DISTINCT in SQL, unlike Excel's copy unique records, and unlike uniquification in Matlab.
>
> Of course it doesn't. Its man page says so very explicitly. Sometimes
> man pages can be a pain to deal with, especially when you don't know the
> command you want, or have to navigate an enormous man page (I'm looking
> at you, bash!), but in this case, you clearly know the command, the man
> page is short and easy to read, so you really have no excuse for not
> knowing this is how uniq works.

I didn't find this man page difficult. I'm just noting that the command name is misleading, so you really can't trust the functionality that it suggests.

paul.d...@gmail.com

Feb 10, 2015, 9:05:18 PM
On Tuesday, February 10, 2015 at 2:56:20 PM UTC-5, Kaz Kylheku wrote:
To be fair, C++ does that. However, it would be better if the name were more accurate.

paul.d...@gmail.com

Feb 10, 2015, 9:14:00 PM
On Tuesday, February 10, 2015 at 3:43:43 PM UTC-5, Barry Margolin wrote:
> In article <00ab03de-72fe-4ec1...@googlegroups.com>,
Cat makes sense. It concatenates files.

This isn't an issue of having command behaviour spelled out to you -- that's exactly what man is for. (and gosh forbid, info, but let's not get into a natter about that). It's about choosing a good name for a command so that the language does a better job of communicating. "less" is a joke, and in that sense, is unforgettable. But "uniq"? Maybe there is a punch line behind it. I should try to dig more into the stories behind Unix.

Barry Margolin

Feb 10, 2015, 9:17:28 PM
In article <20150210211...@chaz.gmail.com>,
Stephane Chazelas <stephane...@gmail.com> wrote:

> 2015-02-10 15:43:39 -0500, Barry Margolin:
> [...]
> > Compared to most Unix commands, "uniq" is a paragon of accuracy.
> > Consider one of the oldest, most common commands: "cat". Who would ever
> > guess that this is the name for printing the contents of a file on the
> > terminal?
>
> cat is not any more the command for printing the contents of a
> file than paste or cut -b1- or sed '' or dd or grep '^' or tail
> -n+1...
>
> It's the command to concatenate.

If someone asked you what command they should use to just print a file
on the terminal, what would you tell them to use?

Barry Margolin

Feb 10, 2015, 9:19:53 PM
In article <mbe16j$rt2$1...@dont-email.me>, Dan Espen <des...@verizon.net>
wrote:

> Barry Margolin <bar...@alum.mit.edu> writes:
>
> > In article <mbd8dd$89s$2...@dont-email.me>, Dan Espen <des...@verizon.net>
> > wrote:
> >
> >> The tiniest bit of reflection should tell the user that
> >> "uniq" would _not_ deal with non-adjacent duplicates.
> >
> > Why wouldn't it? Because it would have to save all the lines it has
> > seen, and generate the results after reading all the input? "sort" has
> > to do that, so why would it be obvious that "uniq" wouldn't do something
> > similar.
>
> Because a uniq would have to do more than a sort.
> After it sorts to find and remove the duplicates,
> it then has to sort again to get back to the original order,
> then write it's output.
> To get back to the original order, you'd need to sequence number
> each line in the original file, then remove the numbers at the end.

Or it could just do what the typical awk solution does. Check if the
line is in a hash table. If not, add it to the hash and print it.

Which means I made a mistake in my earlier reply -- this doesn't require
waiting until all the input has been read.
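Spelled out, that hash-table check looks like this (illustrative awk, equivalent to the terser '!a[$0]++' idiom mentioned earlier in the thread):

```shell
# Emit each line the first time it is seen; the associative array
# acts as the hash table, and output is produced on the fly.
printf '%s\n' x y x y | awk '{ if (!($0 in seen)) { print; seen[$0] = 1 } }'
# x
# y
```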

paul.d...@gmail.com

Feb 10, 2015, 9:22:07 PM
For sure. And it would be even more reasonable to expect that command to be named "undup[licate]" or something more reflective of the simpler function.

paul.d...@gmail.com

Feb 10, 2015, 9:26:45 PM
I don't think the OP had any intent of suggesting that we could go back in time and change how the name was chosen, or that we should abandon the name and cause countless compatibility problems going forward. It seemed like the OP was simply pointing out a pothole. However, the OP might be amused by the overdefensiveness that has been shown toward this, and the equating of "it is written in the document" with "it is good to choose misleading names" (not by all, but by some).

Barry Margolin

Feb 10, 2015, 9:28:11 PM
In article <bb8a02e0-3af7-484b...@googlegroups.com>,
paul.d...@gmail.com wrote:

> Cat makes sense. It concatenates files.

It only makes sense after the fact, IMHO. If you were thinking "Hmm,
what command would I use to print a file?", I don't think one of your
first 5 ideas would be "Oh, yeah, the command to concatenate files with
each other".

It's more of a weird corner case that happens to be the simplest way to
accomplish it: if you concatenate just a single file, you just get that
file, and if you don't redirect its output, it goes to the terminal.

Most other systems would provide a more mnemonic command that does it.
Unix's original philosophy included minimality: one tool for each job.
So if you can figure out a way to make an existing tool do what you
need, there's no need for a new command. Never mind that it's not
obvious which command you have to search for to do it.

paul.d...@gmail.com

Feb 10, 2015, 9:30:24 PM
On Tuesday, February 10, 2015 at 9:17:28 PM UTC-5, Barry Margolin wrote:
> In article <20150210211...@chaz.gmail.com>,
> Stephane Chazelas <stephane...@gmail.com> wrote:
>
> > 2015-02-10 15:43:39 -0500, Barry Margolin:
> > [...]
> > > Compared to most Unix commands, "uniq" is a paragon of accuracy.
> > > Consider one of the oldest, most common commands: "cat". Who would ever
> > > guess that this is the name for printing the contents of a file on the
> > > terminal?
> >
> > cat is not any more the command for printing the contents of a
> > file than paste or cut -b1- or sed '' or dd or grep '^' or tail
> > -n+1...
> >
> > It's the command to concatenate.
>
> If someone asked you what command they should use to just print a file
> on the terminal, what would you tell them to use?

These days, it would be opened by an app (likely a PDF viewer or an M$ app) and a printer would be chosen from the app. Even text would be rendered to PDF or pasted into Word.

paul.d...@gmail.com

Feb 10, 2015, 9:33:21 PM
On Tuesday, February 10, 2015 at 9:28:11 PM UTC-5, Barry Margolin wrote:
> In article <bb8a02e0-3af7-484b...@googlegroups.com>,
> paul.d...@gmail.com wrote:
>
> > Cat makes sense. It concatenates files.
>
> It only makes sense after the fact, IMHO. If you were thinking "Hmm,
> what command would I use to print a file?", I don't think one of your
> first 5 ideas would be "Oh, yeah, the command to concatenate files with
> each other".
>
> It's more of a weird corner case that happens to be the simplest way to
> accomplish it: if you concatenate just a single file, you just get that
> file, and if you don't redirect its output, it goes to the terminal.
>
> Most other systems would provide a more mnemonic command that does it.
> Unix's original philosophy included minimality: one tool for each job.
> So if you can figure out a way to make an existing tool do what you
> need, there's no need for a new command. Never mind that it's not
> obvious which command you have to search for to do it.

Frankly, I would "less" it or vim it. But those commands have a good reason and story behind them. "uniq", I think, is just an inaccuracy. The moral was that you gotta watch out. Be paranoid when invoking Unix commands. Be very, very paranoid.

Actually, that's dramatizing it, because as far as I'm concerned, "uniq" is an exception. Other commands can be cryptic, but not misleading.

Dan Espen

Feb 10, 2015, 9:37:39 PM
Barry Margolin <bar...@alum.mit.edu> writes:

> In article <mbe16j$rt2$1...@dont-email.me>, Dan Espen <des...@verizon.net>
> wrote:
>
>> Barry Margolin <bar...@alum.mit.edu> writes:
>>
>> > In article <mbd8dd$89s$2...@dont-email.me>, Dan Espen <des...@verizon.net>
>> > wrote:
>> >
>> >> The tiniest bit of reflection should tell the user that
>> >> "uniq" would _not_ deal with non-adjacent duplicates.
>> >
>> > Why wouldn't it? Because it would have to save all the lines it has
>> > seen, and generate the results after reading all the input? "sort" has
>> > to do that, so why would it be obvious that "uniq" wouldn't do something
>> > similar.
>>
>> Because a uniq would have to do more than a sort.
>> After it sorts to find and remove the duplicates,
>> it then has to sort again to get back to the original order,
>> then write its output.
>> To get back to the original order, you'd need to sequence number
>> each line in the original file, then remove the numbers at the end.
>
> Or it could just do what the typical awk solution does. Check if the
> line is in a hash table. If not, add it to the hash and print it.
>
> Which means I made a mistake in my earlier reply -- this doesn't require
> waiting until all the input has been read.

No problem, clearly we're trying to design this new "uniq" command
on the fly.

A hash table requires a synonym chain
and retention of all the original keys. Also some notion of
the original order.

So, I think a hash table is nothing more than a type of sort.

--
Dan Espen

Kaz Kylheku

unread,
Feb 10, 2015, 9:39:16 PM2/10/15
to
On 2015-02-11, Barry Margolin <bar...@alum.mit.edu> wrote:
> In article <20150210211...@chaz.gmail.com>,
> Stephane Chazelas <stephane...@gmail.com> wrote:
>
>> 2015-02-10 15:43:39 -0500, Barry Margolin:
>> [...]
>> > Compared to most Unix commands, "uniq" is a paragon of accuracy.
>> > Consider one of the oldest, most common commands: "cat". Who would ever
>> > guess that this is the name for printing the contents of a file on the
>> > terminal?
>>
>> cat is not any more the command for printing the contents of a
>> file than paste or cut -b1- or sed '' or dd or grep '^' or tail
>> -n+1...
>>
>> It's the command to concatenate.
>
> If someone asked you what command they should use to just print a file
> on the terminal, what would you tell them to use?

I'd say:

* use "less" for generally viewing arbitrary files; it's safe if
the file is huge or contains arbitrary binary codes.

* if you know the file contains text, but may be large then
use the head and tail commands to dump a sample from the top
or bottom.

* if you know the file is reasonable small and contains nothing but text, then
dump the whole thing with "cat file" (whose purpose is to catenate
together files: "cat file1 file2 file3 ... > outputfile", hence the name).
Do not cat files of unknown size or content.

Dumping a file to the terminal is less useful than a n00b might think. The data
will go by far faster than you can read, and you need that many lines of
scrollback history to go back and view it. If your display is 25 lines tall,
and has no scrollback, then the net effect of "tail -24 file" is
indistinguishable from "cat file", yet much faster.

Dan Espen

unread,
Feb 10, 2015, 9:41:42 PM2/10/15
to
undup? I sort of like dedup.

As far as "reasonable".
Why is it reasonable?
Bear in mind that you don't hear a request for a whole file deduplicator
every day.

I've heard of sort -u.
Apparently that's good for almost all usages.
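
For anyone following along, the adjacency behaviour is easy to demonstrate
(a minimal sketch; any POSIX shell should do):

```shell
# uniq only collapses *adjacent* duplicates; the non-adjacent 'a' survives.
printf 'a\nb\na\n' | uniq       # a, b, a
# sort -u deduplicates the whole input, at the cost of reordering it.
printf 'a\nb\na\n' | sort -u    # a, b
```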

--
Dan Espen

Kaz Kylheku

unread,
Feb 10, 2015, 9:46:42 PM2/10/15
to
On 2015-02-11, Barry Margolin <bar...@alum.mit.edu> wrote:
> In article <bb8a02e0-3af7-484b...@googlegroups.com>,
> paul.d...@gmail.com wrote:
>
>> Cat makes sense. It concatenates files.
>
> It only makes sense after the fact, IMHO. If you were thinking "Hmm,
> what command would I use to print a file?", I don't think one of your
> first 5 ideas would be "Oh, yeah, the command to concatenate files with
> each other".
>
> It's more of a weird corner case that happens to be the simplest way to
> accomplish it: if you concatenate just a single file, you just get that
> file, and if you don't redirect its output, it goes to the terminal.
>
> Most other systems would provide a more mnemonic command that does it.
> Unix's original philosophy included minimality: one tool for each job.

On original Unix, you were likely to be working with some crappy terminal
devices.

Catting a file to your tty is not something you would usually want to do if the
tty is a hard-copy machine printing everything on paper, or is running at 2400
baud, or is a display terminal with no scrollback history.

Actually the original Unix way to print a file *carefully* would be this:

$ ed file
1
Here is the first line
[Enter]
here is the second line
[Enter]
here is the third line.
q

Nicely see the file line by line, and q when you've had enough.

I'm not sure how far back the "more" command goes; but it's basically
just an optimized version of this which works with pipes.

Stephen Fisher

unread,
Feb 10, 2015, 11:00:20 PM2/10/15
to
On 2015-02-10, Kaz Kylheku <k...@kylheku.com> wrote:

> Or who has to first read a man page to figure out what other man page
> he or she must read. (I'm looking at you, Perl!)

curses is evil like that too.

Stephen Fisher

unread,
Feb 10, 2015, 11:10:23 PM2/10/15
to
On 2015-02-10, Kenny McCormack <gaz...@shell.xmission.com> wrote:

> The fact is that the name "uniq" *is* somewhat misleading - and, as
> you say, doesn't correspond to the usage of that word in any known
> spoken language. So, the question becomes: What would have been a
> better name for it?

This discussion got me to wondering where the uniq command came from.
BSD shows it back to 1980:

http://svnweb.freebsd.org/csrg/usr.bin/uniq/uniq.c?revision=1145&view=markup

With the next update not until 1989 with a replacement that better
handles various command line options. Neither mentions a history of the
command before that, but it was already established by then.

Aragorn

unread,
Feb 11, 2015, 7:37:31 AM2/11/15
to
On Wednesday 11 February 2015 03:17, Barry Margolin conveyed the
following to comp.unix.shell...

> In article <20150210211...@chaz.gmail.com>,
> Stephane Chazelas <stephane...@gmail.com> wrote:
>
>> 2015-02-10 15:43:39 -0500, Barry Margolin:
>> [...]
>> > Compared to most Unix commands, "uniq" is a paragon of accuracy.
>> > Consider one of the oldest, most common commands: "cat". Who would
>> > ever guess that this is the name for printing the contents of a
>> > file on the terminal?
>>
>> cat is not any more the command for printing the contents of a
>> file than paste or cut -b1- or sed '' or dd or grep '^' or tail
>> -n+1...
>>
>> It's the command to concatenate.
>
> If someone asked you what command they should use to just print a file
> on the terminal, what would you tell them to use?

In UNIX, a terminal is a file as well. ;-)

That said, I generally use /usr/bin/less [*], but if the file is too
long, then perhaps I will combine that with /usr/bin/head or
/usr/bin/tail, or both. ;-)


[*] I do also on occasion use the buzzing lapwarmer, if it's a _very_
short file ─ e.g. /etc/inittab or any of the /etc/pam.d/* files.
For shell scripts, I generally use /usr/bin/less.

--
= Aragorn =

http://www.linuxcounter.net - registrant #223157

Ben Bacarisse

unread,
Feb 11, 2015, 8:03:15 AM2/11/15
to
Yes, it's mentioned in passing in the 1978 collection of papers about
Unix in the (now) famous special edition of the Bell System Technical
Journal (July-August 1978, vol. 57, no. 6, part 2). It predates v7 but
by how much I am not sure.

--
Ben.

Barry Margolin

unread,
Feb 11, 2015, 10:59:58 AM2/11/15
to
In article <mbef8b$fva$1...@dont-email.me>, Dan Espen <des...@verizon.net>
Why does it need the original order? Since it prints the line when it's
read, it will automatically be in order. The hash table is just used to
prevent printing duplicates, not to save everything that was read.
That's how the awk solution works, isn't it?

Barry Margolin

unread,
Feb 11, 2015, 11:01:57 AM2/11/15
to
In article <201502101...@kylheku.com>,
Kaz Kylheku <k...@kylheku.com> wrote:

> On 2015-02-11, Barry Margolin <bar...@alum.mit.edu> wrote:
> > In article <20150210211...@chaz.gmail.com>,
> > Stephane Chazelas <stephane...@gmail.com> wrote:
> >
> >> 2015-02-10 15:43:39 -0500, Barry Margolin:
> >> [...]
> >> > Compared to most Unix commands, "uniq" is a paragon of accuracy.
> >> > Consider one of the oldest, most common commands: "cat". Who would ever
> >> > guess that this is the name for printing the contents of a file on the
> >> > terminal?
> >>
> >> cat is not any more the command for printing the contents of a
> >> file than paste or cut -b1- or sed '' or dd or grep '^' or tail
> >> -n+1...
> >>
> >> It's the command to concatenate.
> >
> > If someone asked you what command they should use to just print a file
> > on the terminal, what would you tell them to use?
>
> I'd say:
>
> * use "less" for generally viewing arbitrary files; it's safe if
> the file is huge or contains arbitrary binary codes.

That doesn't "just print the file", it prints it and displays prompts,
etc.

I didn't think I needed to spell out that I was talking about the simple
command that does nothing but print a file, since I'm talking about
"cat".

Kenny McCormack

unread,
Feb 11, 2015, 11:03:48 AM2/11/15
to
In article <barmar-AF9FEC....@88-209-239-213.giganet.hu>,
Barry Margolin <bar...@alum.mit.edu> wrote:
...
>Why does it need the original order? Since it prints the line when it's
>read, it will automatically be in order. The hash table is just used to
>prevent printing duplicates, not to save everything that was read.
>That's how the awk solution works, isn't it?

The canonical AWK solution:

!x[$0]++

Does end up storing in memory every unique line of input.

So, for example, if you have 1000 lines of input, consisting of 800 unique
lines, you will be storing 800 lines of input in your string storage space.
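
Spelled out, the one-liner keeps the first occurrence of each line while
preserving input order (a quick sketch):

```shell
# !x[$0]++ is true only the first time a line is seen, so awk prints it once,
# in original order; later repeats increment the counter and are suppressed.
printf 'b\na\nb\nc\na\n' | awk '!x[$0]++'    # b, a, c
```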

--
Both the leader of the Mormon Church and the leader of the Catholic
church claim infallibility. Is it any surprise that these two orgs
revile each other? Anybody with any sense knows that 80-yr old codgers
are hardly infallible. Some codgers this age do well to find the crapper
in time and remember to zip-up.

Barry Margolin

unread,
Feb 11, 2015, 11:03:55 AM2/11/15
to
In article <201502101...@kylheku.com>,
Kaz Kylheku <k...@kylheku.com> wrote:

> On 2015-02-11, Barry Margolin <bar...@alum.mit.edu> wrote:
> > In article <bb8a02e0-3af7-484b...@googlegroups.com>,
> > paul.d...@gmail.com wrote:
> >
> >> Cat makes sense. It concatenates files.
> >
> > It only makes sense after the fact, IMHO. If you were thinking "Hmm,
> > what command would I use to print a file?", I don't think one of your
> > first 5 ideas would be "Oh, yeah, the command to concatenate files with
> > each other".
> >
> > It's more of a weird corner case that happens to be the simplest way to
> > accomplish it: if you concatenate just a single file, you just get that
> > file, and if you don't redirect its output, it goes to the terminal.
> >
> > Most other systems would provide a more mnemonic command that does it.
> > Unix's original philosophy included minimality: one tool for each job.
>
> On original Unix, you were likely to be working with some crappy terminal
> devices.
>
> Catting a file to your tty is not something you would usually want to do if
> the
> tty is a hard-copy machine printing everything on paper, or is running at
> 2400
> baud, or is a display terminal with no scrollback history.

Whether it's something you actually want to do is a separate issue.
Assuming you DO want to do it (e.g. you wanted to make a hardcopy
listing to put in your file cabinet), "cat" is what you would have used

Kaz Kylheku

unread,
Feb 11, 2015, 11:19:16 AM2/11/15
to
On 2015-02-11, Aragorn <thor...@telenet.be.invalid> wrote:
> That said, I generally use /usr/bin/less [*], but if the file is too
> long, then perhaps I will combine that with /usr/bin/head or
> /usr/bin/tail, or both. ;-)

Why would you? After less-ing the file, you can jump to the end by
typing G (same as in vi). There is your tail.

Dan Espen

unread,
Feb 11, 2015, 11:21:46 AM2/11/15
to
You're right on the order.

> That's how the awk solution works, isn't it?

I have no idea.
I just know that hashes on their own can't identify a duplicate,
just a potential duplicate.
Better hashes generate fewer synonyms, but there is always the
possibility of a collision, so the hash must retain the original key (the
full line) and normally builds a synonym chain.

Better hashes start with a reasonably sized hash table and can
expand that table as the number of entries climbs.

--
Dan Espen

Barry Margolin

unread,
Feb 11, 2015, 11:35:56 AM2/11/15
to
In article <mbfuh1$o3n$3...@news.xmission.com>,
gaz...@shell.xmission.com (Kenny McCormack) wrote:

> In article <barmar-AF9FEC....@88-209-239-213.giganet.hu>,
> Barry Margolin <bar...@alum.mit.edu> wrote:
> ...
> >Why does it need the original order? Since it prints the line when it's
> >read, it will automatically be in order. The hash table is just used to
> >prevent printing duplicates, not to save everything that was read.
> >That's how the awk solution works, isn't it?
>
> The canonical AWK solution:
>
> !x[$0]++
>
> Does end up storing in memory every unique line of input.
>
> So, for example, if you have 1000 lines of input, consisting of 800 unique
> lines, you will be storing 800 lines of input in your string storage space.

But it doesn't have to store the order, or all the repetitions. I was
responding to this:

A hash table requires a synonym chain
and retention of all the original keys. Also some notion of
the original order.

Barry Margolin

unread,
Feb 11, 2015, 11:40:11 AM2/11/15
to
In article <mbfvhi$rlo$1...@dont-email.me>, Dan Espen <des...@verizon.net>
I said a "hash table", not just a hash. But I really was just using that
as a representative of any efficiently searched list of previous
strings. Hash tables are the most common way to implement this these
days (e.g. it's how PHP associative arrays, Perl hashes, Python
dictionaries, and Javascript objects are implemented).

Aragorn

unread,
Feb 11, 2015, 2:42:21 PM2/11/15
to
On Wednesday 11 February 2015 17:19, Kaz Kylheku conveyed the following
to comp.unix.shell...
It may be handy to filter down the output by a number of lines.
/var/log/messages springs to mind.

Kenny McCormack

unread,
Feb 11, 2015, 9:48:35 PM2/11/15
to
In article <barmar-4DD93C....@88-209-239-213.giganet.hu>,
Barry Margolin <bar...@alum.mit.edu> wrote:
...
(I wrote)
>> So, for example, if you have 1000 lines of input, consisting of 800 unique
>> lines, you will be storing 800 lines of input in your string storage space.
>
>But it doesn't have to store the order, or all the repetitions. I was
>responding to this:
>
(Someone else wrote)
>>A hash table requires a synonym chain
>>and retention of all the original keys. Also some notion of
>>the original order.

You are 100% correct.

--
b w r w g y b r y b

Piotr Meyer

unread,
May 2, 2015, 5:17:47 AM5/2/15
to
On 11.02.2015, Barry Margolin <bar...@alum.mit.edu> wrote:
[...]

>> > If someone asked you what command they should use to just print a file
>> > on the terminal, what would you tell them to use?
>>
>> I'd say:
>>
>> * use "less" for generally viewing arbitrary files; it's safe if
>> the file is huge or contains arbitrary binary codes.
>
> That doesn't "just print the file", it prints it and displays prompts,
> etc.
>
> I didn't think I needed to spell out that I was talking about the simple
> command that does nothing but print a file, since I'm talking about
> "cat".

I'd like to suggest avoiding 'cat' for displaying anything (for example
short scripts) on the terminal because of its behaviour: escape codes in
the file are passed straight through and interpreted by the terminal.
For example: http://unix.stackexchange.com/a/108269

So, from my POV, for "printing a file", there are better solutions:
- vis(1) - http://netbsd.gw.com/cgi-bin/man-cgi?vis+1 - but I don't see
this command in my Ubuntu
- less(1) - but not 'more'; although on some systems 'more' is provided
by less itself and is safe, this varies
- alias show='cat -v' and typing 'show unknownscript.sh' instead
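
To illustrate the escape-code problem (cat -v renders the ESC byte
visibly as ^[ instead of letting the terminal interpret it):

```shell
# Plain cat would send the ESC sequence to the terminal, which interprets it
# (here it would switch the foreground colour); cat -v prints it visibly.
printf 'hello \033[31mred\033[0m\n' | cat -v    # hello ^[[31mred^[[0m
```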

--
Piotr 'aniou' Meyer

Stephane Chazelas

unread,
May 4, 2015, 9:05:12 AM5/4/15
to
2015-05-02 11:14:09 +0200, Piotr Meyer:
[...]
> So, from my POV, for "printing a file", there are better solutions:
> - vis(1) - http://netbsd.gw.com/cgi-bin/man-cgi?vis+1 - but I don't see
> this command in my Ubuntu
> - less(1) - but not 'more' - although on some systems 'more' is provided
> by less itself and is secure, but it varies

Beware that some systems provide with an automatically enabled
lesspipe. On those systems, less should not be used on untrusted
files as it allows arbitrary code execution.

Also note that there are plenty of characters in Unicode that
can fool you (the space character to start with, but all the
spacings and zero-width ones and the many characters with
similar shapes). See also http://unix.stackexchange.com/a/110756

less will also do post-processing that can mangle the input
(like the \b processing).

See

printf 'rm -rf \b\b\b\b\b\b\becho ~\n' | less

for instance.

> - alias show='cat -v' and typing 'show unknownscript.sh' instead
[...]

cat -v is not portable and its output is ambiguous (and doesn't
show trailing spaces or disambiguate space vs tab). cat -vte
(also cat -A with some implementations) is slightly better.

sed -n l

would be better. In any case, that's fine for English files, not
so much for other languages that can't be expressed with ASCII
alone.
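
As a quick illustration of what the l command shows (POSIX behaviour;
GNU and BSD sed agree here):

```shell
# sed's l command makes the tab and the trailing space visible:
# tab becomes \t, and each line is terminated with a $ marker.
printf 'a\tb \n' | sed -n l    # a\tb $
```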

--
Stephane