can unix sort routine sort w/ non ASCII collating sequence?

br...@ausfac7.austin.ibm.com

Jul 2, 1993, 4:35:55 PM
I need to sort using the unix sort, but with the EBCDIC
collating sequence. Is this possible? It's not mentioned in
the man pages for sort, so I'm kind of doubtful...

Thanks!

Rudolf Newskij

Jul 5, 1993, 5:18:46 AM
br...@ausfac7.austin.ibm.com () writes:

Since sort uses the collating sequence defined in $LC_COLLATE,
it should be possible. There is an article in info named
"How to Create a New Collation Order". If you don't yet have a
file with the EBCDIC collation sequence, go make one.

>Thanks!
You're welcome.
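
To illustrate the point about $LC_COLLATE: a minimal sketch showing that sort takes its ordering from the locale environment. The words function and its sample data are made up for the example. Only the C locale is used here because it is the one locale every system has; an EBCDIC-ordered locale would have to be built first, as discussed in this thread.

```shell
# Hypothetical sample input; any mixed-case word list will do.
words() { printf '%s\n' banana Apple cherry apple Banana; }

# LC_ALL overrides LC_COLLATE, so this forces plain byte-order collation:
# all uppercase letters sort before all lowercase ones.
words | LC_ALL=C sort

# To collate with another installed locale, name it instead. Locale names
# vary by system (list them with `locale -a`), so this line is left as a
# commented-out placeholder:
#   words | LC_COLLATE=some_locale sort
```

In the C locale the list comes out as Apple, Banana, apple, banana, cherry. A locale with a different LC_COLLATE definition may order the same input differently, which is the hook an EBCDIC collation would use.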

--
// You won't find a big fool with a small .signature

John Russell

Jul 3, 1993, 4:48:07 PM
The dd command can convert to EBCDIC, so perhaps running the input
through dd, sorting, then using dd to convert back to ASCII would work.

John

Roy Johnson

Jul 7, 1993, 1:14:10 PM
br...@ausfac7.austin.ibm.com () writes:

You might try converting to and from EBCDIC using dd(1):

dd conv=ascii if=file.ibm | sort | dd conv=ebcdic > newfile.ibm

Or something like that.
--
-------- Roy Johnson ---- rjoh...@shell.com ---- Speaking for myself --------
"When the only tool you have is Perl, the whole | "Hooray for snakes!"
world begins to look like your oyster." -- Me | -- The Simpsons (29 Apr 93)

br...@ausfac7.austin.ibm.com

Jul 8, 1993, 12:29:35 PM

This would give the same results as just doing a straight unix sort,
because the sort command is still using the ASCII collating sequence
on the characters translated from EBCDIC. The only solution I've seen
that would sort using the EBCDIC collating sequence is to change the
collating sequence pointed to by the $LC_COLLATE environment variable.
There is an article describing this in info called
"How To Create a New Collation Order".

I am still looking for a *cleaner* way, but don't think there is one.
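
A small sketch of the objection above, assuming GNU dd and od (the function names and the three-line sample are invented for the example). It builds a tiny EBCDIC sample, runs it through the convert, sort, convert-back round trip, and dumps the result as hex.

```shell
# Build a tiny EBCDIC sample from ASCII text (assumes GNU dd).
ebcdic_sample() { printf 'a\nA\n1\n' | dd conv=ebcdic 2>/dev/null; }

# The suggested round trip: EBCDIC -> ASCII, sort, ASCII -> EBCDIC.
round_trip() {
    ebcdic_sample |
        dd conv=ascii 2>/dev/null |
        LC_ALL=C sort |
        dd conv=ebcdic 2>/dev/null
}

# Dump the re-encoded result as hex bytes.
round_trip | od -An -v -tx1
```

With the dd conversion tables I know of, the dump reads f1 25 c1 25 81 25, that is, the records 1, A, a. That is ASCII order. In EBCDIC byte order the same records would run a (0x81), A (0xC1), 1 (0xF1), so the round trip changes the encoding but not the collation.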

-Bryan

Jack Seltzer

Jul 8, 1993, 6:24:58 PM

I could not get this to work. The problem seems to be that
sort expects its input as a list with one item per record, while dd
converts a set of items (one per record) into a single one-record string.
The output file has the right data, as verified by running it through od,
but sort sees it as one record and consequently no sorting is done.
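
One plausible explanation for the single-record behaviour, not confirmed anywhere in this thread: when text is converted to EBCDIC, the ASCII newline (0x0A) is translated too, so the bytes sort relies on to find line ends are gone. A quick check, assuming GNU dd and od; to_ebcdic_hex is a made-up helper name.

```shell
# Convert stdin to EBCDIC and print the resulting bytes as one hex string.
to_ebcdic_hex() {
    dd conv=ebcdic 2>/dev/null | od -An -v -tx1 | tr -d ' \n'
}

# Two one-character lines, "b" and "a", each ending in an ASCII newline.
printf 'b\na\n' | to_ebcdic_hex
echo
```

This prints 82258125: the letters became 0x82 and 0x81, and each newline became 0x25. There is no 0x0A left in the stream, so a sort that splits on 0x0A sees one long record. Real mainframe files with fixed-length records would show the same symptom for a different reason, since they carry no line terminators at all.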
--
Jack Seltzer ja...@cheetah.acs.udel.edu
ja...@chopin.udel.edu

Nigel Keam

Jul 9, 1993, 4:14:28 PM

This may contain answers to your questions:

IBM ITSC has a publication with the following title:

"Keys to Sort and Search for Culturally Expected Results"
[ I kid you not ]... the Order number is: GG24-3516-00

From memory it deals primarily with foreign languages, and whether Mc and
Mac should come before Ma... but it may well be applicable to non-ASCII
character sets also

Nigel Keam - Pencom Software

Mark Brown

Jul 12, 1993, 10:57:56 AM

I'm sorry, but I know for a fact that we don't handle EBCDIC, which was
the original question....here's why:

We support several 'locales' (French, German, Japanese, etc...) but all
of these locales are in codesets that contain ascii as a subset. No one
to my knowledge has created an EBCDIC (or other non-ascii-subset)
codeset for AIX...nor for any other vendor that I know of. Hence, no way
to tell sort that you want the EBCDIC (which EBCDIC? there are over 80
dialects!) sorting/collating sequence.

FYI, there *is* a new codeset invented by ISO called "UNICODE" that does
not subset ascii, even though it does contain all of the ascii
charset...but this is for a grand future plan to absorb all
codesets into this "one true codeset that contains everyone's
characters"....

cheers,
mark


--
Mark Brown IBM AWS Austin, TX.| Fear the Government
(512) 838-3926 VNET: MBROWN@AUSVM6| that fears your privacy.
MAIL: mbr...@austin.ibm.com| Keep personal cryptography legal.
DISCLAIMER: My views are independent of IBM official policy.

Rudolf Newskij

Jul 13, 1993, 4:34:44 AM
mbr...@testsys.austin.ibm.com (Mark Brown) writes:

>| nigel@tarawera (Nigel Keam) writes:
>|This may contain answers to your questions:
>|
>|IBM ITSC has a publication with the following title:
>|
>|"Keys to Sort and Search for Culturally Expected Results"
>| [ I kid you not ]... the Order number is: GG24-3516-00
>|
>|From memory it deals primarily with foreign languages, and whether Mc and
>|Mac should come before Ma... but it may well be applicable to non-ASCII
>|character sets also
>|
>|Nigel Keam - Pencom Software

>I'm sorry, but I know for a fact that we don't handle EBCDIC, which was
>the original question....here's why:

>We support several 'locales' (French, German, Japanese, etc...) but all
>of these locales are in codesets that contain ascii as a subset. No one
>to my knowledge has created an EBCDIC (or other non-ascii-subset)
>codeset for AIX...nor for any other vendor that I know of. Hence, no way
>to tell sort that you want the EBCDIC (which EBCDIC? there are over 80
>dialects!) sorting/collating sequence.

So you are going to say you cannot set up a new collating
sequence as it's described in infoexplorer ("How to Create
a New Collation Order", "Locale Definition Source File
Format", "Character Set Description (charmap) Source File
Format" and other articles) ?
So, why is all that stuff there along with the localedef
command? What about setting LC_COLLATE to tell sort which
collating sequence you want?

The only difficulty in sorting EBCDIC data is that these
data are usually in record format. You can fix that using
an appropriate dd command. Also you have to specify the
field delimiter, which would be a @ if the fields are
separated by EBCDIC spaces (an EBCDIC space is byte 0x40,
which is @ in ASCII).
Even without a specialized EBCDIC locale, the following
would do the job, if only text sorting is desired:
dd if=dinosaur cbs=$LRECL conv=unblock | sort -A -t@ +1
This would sort on the byte values of the 2nd field in
every record. Since in EBCDIC the order of characters is
<most punctuations> < 'a' < 'z' < 'A' < 'Z' < '0' < '9'
this would yield reasonable results.
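
The character order claimed above can be checked without any EBCDIC data at hand. This is a rough sketch, assuming GNU dd and od, that sorts ordinary ASCII text lines by the byte values their EBCDIC encoding would have. ebcdic_key and ebcdic_sort are hypothetical helper names, and this is not the sort -A command from the post, only a way to see the resulting order.

```shell
# Print the EBCDIC bytes of the argument as a hex string. Fixed-width
# lowercase hex compares the same way the raw bytes would.
ebcdic_key() {
    printf '%s' "$1" | dd conv=ebcdic 2>/dev/null | od -An -v -tx1 | tr -d ' \n'
}

# Decorate each line with its key, sort bytewise on it, strip the key again.
ebcdic_sort() {
    while IFS= read -r line; do
        printf '%s\t%s\n' "$(ebcdic_key "$line")" "$line"
    done | LC_ALL=C sort | cut -f2-
}

printf '%s\n' 9 A a Z z 0 - | ebcdic_sort
```

The output order is - a z A Z 0 9, which matches the ordering described above. It is slow (one dd per line) and only meant for small experiments.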

The only thing *I* wonder is why someone should want to do
such things on UNIX. Why not convert the data? Why can't
the mainframe sort the data?

Mark Brown

Jul 15, 1993, 11:08:30 AM
| ne...@r7013er1.return-online.de (Rudolf Newskij) writes:

|>I (mbr...@testsys.austin.ibm.com (Mark Brown)) wrote:
|>I'm sorry, but I know for a fact that we don't handle EBCDIC, which was
|>the original question....here's why:
|
|>We support several 'locales' (French, German, Japanese, etc...) but all
|>of these locales are in codesets that contain ascii as a subset. No one
|>to my knowledge has created an EBCDIC (or other non-ascii-subset)
|>codeset for AIX...nor for any other vendor that I know of. Hence, no way
|>to tell sort that you want the EBCDIC (which EBCDIC? there are over 80
|>dialects!) sorting/collating sequence.
|
|So you are going to say you cannot set up a new collating
|sequence as it's described in infoexplorer ("How to Create
|a New Collation Order", "Locale Definition Source File
|Format", "Character Set Description (charmap) Source File
|Format" and other articles) ?

No.

I said "no one to my knowledge has created", as quoted above.

Create one according to the pubs, and then you can sort by EBCDIC.


|So, why is all that stuff there along with the localedef
|command? What about setting LC_COLLATE to tell sort which
|collating sequence you want?

Create an EBCDIC locale, and you can do all that stuff.

|The only difficulty in sorting EBCDIC data is that these
|data are usually in record format. You can fix that using
|an appropriate dd command. Also you have to specify the
|field delimiter, which would be a @ if the fields are
|separated by EBCDIC spaces.

You also have to know *which* EBCDIC dialect you are handling. dd knows
only two of the more common ones; I wasn't joking when I said there were
many.

|Even without a specialized EBCDIC locale, the following
|would do the job, if only text sorting is desired:
| dd if=dinosaur cbs=$LRECL conv=unblock | sort -A -t@ +1
|This would sort on the byte values of the 2nd field in
|every record. Since in EBCDIC the order of characters is
| <most punctuations> < 'a' < 'z' < 'A' < 'Z' < '0' < '9'
|this would yield reasonable results.

What leads you to think that ASCII is a contiguous subset inside the
EBCDIC codeset? 'Cause it isn't.

What this gives you is EBCDIC data sorted in an ASCII order. Not
necessarily what the doctor ordered. If you want EBCDIC data sorted in
EBCDIC collation sequence, you have to build a locale.

--
Mark Brown IBM AWS Austin, TX.| *[WANTED]* $10,000 Reward.
(512) 838-3926 VNET: MBROWN@AUSVM6| Schrodinger's Cat.
MAIL: mbr...@austin.ibm.com| DEAD OR ALIVE

Rudolf Newskij

Jul 16, 1993, 8:52:07 AM
mbr...@testsys.austin.ibm.com (Mark Brown) writes:

>No.
>I said "no one to my knowledge has created", as quoted above.
>Create one according to the pubs, and then you can sort by EBCDIC.

Okay, so this was a misunderstanding on my side. Since
it seems fairly easy to make a new locale, it made no
sense to me when you said "... you can't ..."


>|Even without a specialized EBCDIC locale, the following
>|would do the job, if only text sorting is desired:
>| dd if=dinosaur cbs=$LRECL conv=unblock | sort -A -t@ +1
>|This would sort on the byte values of the 2nd field in
>|every record. Since in EBCDIC the order of characters is
>| <most punctuations> < 'a' < 'z' < 'A' < 'Z' < '0' < '9'
>|this would yield reasonable results.

>What leads you to think that ASCII is a contiguous subset inside the
>EBCDIC codeset? 'Cause it isn't.

I don't think that. But my manual says that -A sorts on a
byte-by-byte basis (i.e. the bytes are interpreted as
small numbers).
Also note that dd is only used to unblock the data,
not to convert it.
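
For readers who have not used dd this way, a minimal sketch of unblocking on its own, assuming GNU dd. The records function and its two 8-byte records are made up, and the padding here is ASCII spaces so the effect is visible on a terminal.

```shell
# Two 8-byte fixed-length records, padded with ASCII spaces, no newlines.
records() { printf 'SMITH   JONES   '; }

# cbs gives the record length. conv=unblock drops the trailing spaces of
# each record and appends a newline. No character-set conversion is asked for.
records | dd cbs=8 conv=unblock 2>/dev/null
```

This prints SMITH and JONES on separate lines, which is the line-per-record form sort needs. With real EBCDIC data and no conversion, the padding bytes are 0x40 rather than ASCII spaces, so they would not be stripped and would show up as @ characters.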


>What this gives you is EBCDIC data sorted in an ASCII order. Not
>necessarily what the doctor ordered. If you want EBCDIC data sorted in
>EBCDIC collation sequence, you have to build a locale.

I said:
>|would do the job, if only text sorting is desired:
                            ^^^^^

and I said:
>|this would yield reasonable results.
                   ^^^^^^^^^^
Of course, it does *not* give *perfect* results. And it
would completely fail on numeric fields...

I don't know which EBCDIC charsets you are referring to,
but I guess the character order there is still something
like:


>| <most punctuations> < 'a' < 'z' < 'A' < 'Z' < '0' < '9'

BTW, I suggested building such a locale two weeks ago;
maybe my posting was lost somewhere.

The very best way to sort EBCDIC data still is
to punch the following on cards

//XYZGFOO JOB REGION=512K
//S EXEC SORT
//SORTIN DD DSN=UGLY.MAINFRAM.DATA,DISP=SHR
//SORTOUT DD DSN=UGLY.MAINFRAM.SORTED.DATA,DISP=(,KEEP),
UNIT=DASD
//SYSIN DD *
(your sort statements go here)
//
and then feed it to the RDR of some MVS-iron. :-)
--
#!/bin/ksh
set $(type csh); (echo '#!/bin/ksh';echo echo Not supported!) >$3

Michael Phillips

Jul 16, 1993, 11:48:04 AM
ne...@r7013er1.return-online.de (Rudolf Newskij) writes:
:
: The very best way to sort EBCDIC data still is

: to punch the following on cards
:
: //XYZGFOO JOB REGION=512K
: //S EXEC SORT
: //SORTIN DD DSN=UGLY.MAINFRAM.DATA,DISP=SHR
: //SORTOUT DD DSN=UGLY.MAINFRAM.SORTED.DATA,DISP=(,KEEP),
: UNIT=DASD
: //SYSIN DD *
: (your sort statements go here)
: //
: and then feed it to the RDR of some MVS-iron. :-)

Gee, what's wrong with VM/CMS, or VSE?
Even DOS rel. 26 does a fine job sorting
EBCDIC data ;^) ;^) ;^) ;^)
--
Michael G. Phillips
mg...@dbsoftware.com
(404)239-2766 "Just because it worked doesn't mean it works." -- me
