Sorting a Unicode file from the Windows command line

happytoday

Feb 14, 2012, 5:34:02 AM

I am trying to sort a file on a Unicode field given by (position, length)
under Berkeley Unix (Windows version). I tried the msort3.exe utility but
could not get it to work. Is there a command-line utility, or a
Perl/sed/awk program, that sorts a file on a UTF-8 column specified by
start position and length?

I cannot install Cygwin.

My lines look like this:

1 ÷ن·ع¤ ¹ن´ ل¨كم ·ـ±¤ ·نك،ق¤ ’’–
•—–گڈگ 1 1 1 1 ’’–•—–
گڈگ ¹ك،صع
¤A master_file_HOLD.log ك،صع¤ âڈغم¤ ¹ؤق èقن·ـ غـ®ع¤ ÷ہ±
·ـ±ـ أ
گ‘
2 ÷¤مئ¹ لنبت هقہ±
–”—ک—–گڈگ 1 1 1 2 –”—ک—–
گڈگ ¹ك،صع
¤A master_file_HOLD.log ك،صع¤ âڈ¹ؤـ ¤¹¨آ أ ”ڈ

Eric Pement

Feb 14, 2012, 10:34:40 AM

On Feb 14, 4:34 am, happytoday <ehabaziz2...@gmail.com> wrote:
> I am trying to sort a file on a Unicode field given by (position, length)
> under Berkeley Unix (Windows version). I tried the msort3.exe utility but
> could not get it to work. Is there a command-line utility, or a
> Perl/sed/awk program, that sorts a file on a UTF-8 column specified by
> start position and length?

You should try GNU sort, which does run under Windows. Note the need
to set some environment variables. From the info pages:

(1) If you use a non-POSIX locale (e.g., by setting `LC_ALL' to
`en_US'), then `sort' may produce output that is sorted differently
than you're accustomed to. In that case, set the `LC_ALL' environment
variable to `C'. Note that setting only `LC_COLLATE' has two problems.
First, it is ineffective if `LC_ALL' is also set. Second, it has
undefined behavior if `LC_CTYPE' (or `LANG', if `LC_CTYPE' is unset) is
set to an incompatible value. For example, you get undefined behavior
if `LC_CTYPE' is `ja_JP.PCK' but `LC_COLLATE' is `en_US.UTF-8'.

A comment on stackoverflow.com says this:

... keep in mind that GNU sort depends on a correct locale setting
(the LC_* environment variables, and specifically the LC_COLLATE one).
LC_COLLATE (or LC_ALL) should be set to a locale with UTF-8 support
(e.g. en_US.UTF-8 or el_GR.UTF-8), preferably in the language that you
are interested in.

To sort on a start position and end position, do this:

sort -t x -k 1.M,1.N

where 'x' is a character known to exist nowhere in the file, M is the
start column number, and N is the end column number.

Eric

George Mpouras

Feb 14, 2012, 12:40:24 PM

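#!c:/Perl/bin/perl.exe
# Have a happy sorting

use utf8;    # source and __DATA__ are UTF-8 (replaces the deprecated "use encoding 'utf8'")
my @File;

# Field start position (0-based) => field length, both in characters
my %Positions_and_lenghts = (
    0 => 1 ,
    2 => 1 ,
    4 => 4 ,
    9 => 6 ,
);

# For an external file, open it for reading and loop over <FILE> instead of <DATA>
#open FILE, '<:utf8', 'c:/ome/utf8/file' or die "$^E\n";
binmode STDOUT, ':utf8';

# Cut each line into its fields, in order of start position
while (my $line = <DATA>) {
    my $row;
    foreach my $POS (sort {$a<=>$b} keys %Positions_and_lenghts) {
        push @{$row}, substr $line, $POS, $Positions_and_lenghts{$POS}
    }
    push @File, $row
}
#use Data::Dumper; print Dumper(\@File);exit;

# Sort on the first field, then on the third
foreach my $row (sort {$a->[0] cmp $b->[0] || $a->[2] cmp $b->[2]} @File) {
    print "@{$row}\n"
}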

__DATA__
1 a αα δδδ
1 b ββ γγγ
2 a αα βββ
2 b ββ ααα

tch...@perl.com

Feb 15, 2012, 4:08:12 PM

I think you should consider using the ucsort script. It is made
for sorting Unicode in the UTF-8 encoding, because it sorts
via the Unicode Collation Algorithm. By default it sorts the entire
line without thinking about fields, but there are options to account
for those. They are not like the regular sort program's options.
For example, to sort using just characters 11-20 of each line, you
might do

$ ucsort --pre='s/^.{10}(.{10}).*/$1/' < inputfile > outputfile

You can get the beta version from

http://training.perl.com/scripts/ucsort

--tom
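
For readers without ucsort at hand, the same idea can be sketched directly with the core Unicode::Collate module: take a fixed character range from each line and compare those ranges with the Unicode Collation Algorithm. This is only an illustration, not ucsort's actual implementation; the helper name sort_by_field and the sample lines are made up for the example, and offsets are 0-based character offsets.

```perl
use strict;
use warnings;
use utf8;                      # this source file contains UTF-8 literals
use Unicode::Collate;          # core module implementing the UCA

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: sort lines on the $len characters that start at
# 0-based character offset $start, using default UCA ordering.
sub sort_by_field {
    my ($start, $len, @lines) = @_;
    my $collator = Unicode::Collate->new;
    # Compute one binary sort key per line, then compare the keys.
    my @decorated = map {
        my $field = length($_) > $start ? substr($_, $start, $len) : '';
        [ $collator->getSortKey($field), $_ ]
    } @lines;
    return map { $_->[1] } sort { $a->[0] cmp $b->[0] } @decorated;
}

# Made-up sample lines; the key is the 3 characters at offset 7.
my @lines = ("1 a αα δδδ", "1 b ββ γγγ", "2 a αα βββ", "2 b ββ ααα");
print "$_\n" for sort_by_field(7, 3, @lines);
```

The input must already be decoded to characters (here via use utf8 for literals) so that substr counts characters rather than bytes.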

tch...@perl.com

Feb 15, 2012, 4:13:25 PM

On Tuesday, February 14, 2012 8:34:40 AM UTC-7, Eric Pement wrote:

> A comment on stackoverflow.com says this:
>
> ... keep in mind that GNU sort depends on a correct locale setting
> (the LC_* environment variables, and specifically the LC_COLLATE one).
> LC_COLLATE (or LC_ALL) should be set to a locale with UTF-8 support
> (e.g. en_US.UTF-8 or el_GR.UTF-8), preferably in the language that you
> are interested in.

This is the problem with all the vendor-locale things: they are unreliable,
and they require particular locale settings. This is not reasonable in
a Unicode world. Much better to use Unicode::Collate and if necessary also
Unicode::Collate::Locale. For example, a pure-Perl solution for sorting
according to German phonebook conventions, and with uppercase
before lowercase, is:

$ ucsort --locale=de__phonebook --upper_before_lower
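
Inside a Perl program, the two options in that command line correspond to constructor arguments of Unicode::Collate::Locale. The sketch below assumes a Perl recent enough to ship that module (it is bundled with newer core releases); the sample names are invented, and the exact resulting order is left for the reader to inspect.

```perl
use strict;
use warnings;
use utf8;                        # this source file contains UTF-8 literals
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# German phonebook tailoring, with uppercase ordered before lowercase.
my $collator = Unicode::Collate::Locale->new(
    locale             => 'de__phonebook',
    upper_before_lower => 1,
);

# Made-up sample data for illustration only.
my @names = ('Müller', 'Mueller', 'Muller', 'müller');
print "$_\n" for $collator->sort(@names);
```

No LC_* environment variables are involved; the ordering rules come from the module itself.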

happytoday

Feb 17, 2012, 6:00:16 AM

On Feb 14, 7:40 pm, "George Mpouras" wrote:

Regarding the Perl script, I tried:

C:\COMPILER\Perl\bin>perl -c sorting.pl sorting.txt
sorting.pl syntax OK

use encoding 'utf8';
my @File;
my %Positions_and_lenghts = (
    0 => 1 ,
    2 => 1 ,
    4 => 4 ,
    70 => 9 ,
);

#For external file
open FILE, '>:utf8', 'c:\compiler\bin\perl\sorting.txt' or die "$^E\n";
binmode STDOUT, ':utf8';

while (my $line = <DATA>) {
    my $row;
    foreach my $POS (sort {$a<=>$b} keys %Positions_and_lenghts) {
        push @{$row}, substr $line, $POS, $Positions_and_lenghts{$POS}
    }
    push @File, $row
}

#use Data::Dumper; print Dumper(\@File);exit;

foreach my $row (sort {$a->[0] cmp $b->[0] || $a->[2] cmp $b->[2]} @File) {
    print "@{$row}\n"
}

But nothing happened.

Jim Gibson

Feb 17, 2012, 1:30:11 PM

In article
<c2e04dbd-3586-4eb8...@db5g2000vbb.googlegroups.com>,

happytoday <ehabaz...@gmail.com> wrote:

>
> Regarding the Perl script, I tried:
>
> C:\COMPILER\Perl\bin>perl -c sorting.pl sorting.txt
> sorting.pl syntax OK
>

[program snipped]

>
> But nothing happened.

If you mean nothing happened after you entered the following:

perl -c sorting.pl sorting.txt

that is because the -c flag tells perl to check the syntax of your
program but not run it. That is what happened. Is that what you did?
If not, please explain further.

--
Jim Gibson

happytoday

Feb 17, 2012, 2:57:26 PM

First, about the program suggested by George. I ran:
C:\COMPILER\Perl\bin>perl -e sorting.pl sorting.txt

But again nothing happened.

About ucsort: how can I run it under Windows? The script is written
for Unix.

Jim Gibson

Feb 17, 2012, 3:21:42 PM

In article
<55a53580-0b11-4b35...@9g2000vbq.googlegroups.com>,

happytoday <ehabaz...@gmail.com> wrote:

> First, about the program suggested by George. I ran:
> C:\COMPILER\Perl\bin>perl -e sorting.pl sorting.txt
>
> But again nothing happened.

Why are you using the '-e' option? That is for executing Perl programs
that are entered on the command line. You should use:

perl sorting.pl sorting.txt

See 'perldoc perlrun'

--
Jim Gibson
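
One further point the replies above do not raise: the script as posted reads its input from the DATA handle, which only yields lines when the script file itself ends with a __DATA__ section, so the sorting.txt argument is never read. In addition, an open mode of '>' opens a file for writing and empties it. A minimal sketch of reading the file named on the command line instead follows; the helper names read_utf8_lines and field, and the key position (0, 1), are assumptions made for illustration, and the comparison is plain code point order rather than proper collation.

```perl
use strict;
use warnings;

# Hypothetical helper: return the decoded lines of a UTF-8 text file.
sub read_utf8_lines {
    my ($path) = @_;
    # '<' opens for reading; '>' would open for writing and empty the file.
    open my $in, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!\n";
    chomp(my @lines = <$in>);
    close $in;
    return @lines;
}

# Hypothetical helper: the $len characters at 0-based offset $start,
# or an empty string when the line is too short.
sub field {
    my ($line, $start, $len) = @_;
    return length($line) > $start ? substr($line, $start, $len) : '';
}

# Run as: perl sorting.pl sorting.txt
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    my ($start, $len) = (0, 1);    # assumed key position, adjust as needed
    my @lines = read_utf8_lines($ARGV[0]);
    print "$_\n"
        for sort { field($a, $start, $len) cmp field($b, $start, $len) } @lines;
}
```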

ehabaz...@gmail.com

Mar 4, 2012, 6:35:36 AM

I assure you that nothing happened. How can I determine the character set of the data file?
My Perl script is like this:

use encoding 'utf8';
my @File;
my %Positions_and_lenghts = (
    110 => 12 ,
);

#For external file
open FILE, '>:utf8', 'c:\compiler\perl\bin\sorting.txt' or die "$^E\n";

binmode STDOUT, ':utf8';

while (my $line = <DATA>) {
    my $row;
    foreach my $POS (sort {$a<=>$b} keys %Positions_and_lenghts) {
        push @{$row}, substr $line, $POS, $Positions_and_lenghts{$POS}
    }
    push @File, $row
}

#use Data::Dumper; print Dumper(\@File);exit;

foreach my $row (sort {$a->[0] cmp $b->[0] || $a->[2] cmp $b->[2]} @File) {
    print "@{$row}\n"
}

My data file looks like this:

1 ·ـ±ـ فنك¤¹¨¤ غ·،ت فدق گ“”ڈڈ‘“‘ڈڈڈگ 0 0 0 0 گ“”ڈڈ‘“‘ڈڈڈگ ن¹آA ¹ؤـ ç¹ك،صع¤ ç¹ك،صع¤ ç¹ك،صع¤ ¹ـ±ً¤ ©¹·ع¤ وم،ص¹آع¤ أم± èطہ گک
2 ·ن؟م¨¤ ·ـ±¤ ·ن±م ·ـ±¤ گڈکڈ““•–ڈڈڈگ 0 0 0 0 گڈکڈ““•–ڈڈڈگ ك؟قع¤A ¹ؤـ ç¹ك،صع¤ ç¹ك،صع¤ èنزمقـع¤ - ë¤·àآع¤ ؟ط¹ـ ¹·،ق
3 فًہع¤ ·¨ت ·ـ±ـ ·مـ±ـ غ·،ت گڈکڈ““–ڈگڈڈگ 0 0 0 0 گڈکڈ““–ڈگڈڈگ ك؟قع¤A ¹ؤـ ç¹ك،صع¤ ç¹ك،صع¤ âڈڈ..ç¹ك،صع¤ ءـآ ÷نت ءنمہع¤ ¹ہ® أ ÷ـ ÷،ـت أ ¤ گ”
4 لھ،±آ ·¤م®ع¤ ·¨ت ÷نہ± ê·®ـ
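
On the question of recognizing the character set: there is no fully reliable way to detect an arbitrary encoding, but it is possible to test whether a file is at least well-formed UTF-8. The sketch below uses the core Encode module in strict mode; the helper name is_valid_utf8 is made up for this example. A file that passes may still be plain ASCII, and a file that fails is probably in some other encoding, such as a legacy Windows code page, which this check cannot identify.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input instead of
    # silently substituting replacement characters.
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Run as: perl checkutf8.pl sorting.txt   (script name is an assumption)
if (@ARGV) {
    open my $in, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    my $bytes = do { local $/; <$in> };    # slurp the whole file as bytes
    close $in;
    print is_valid_utf8($bytes) ? "valid UTF-8\n" : "not valid UTF-8\n";
}
```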