Fast Linear Hex Search

Mazin07

Sep 23, 2005, 9:06:26 PM
I need a Perl script to search through a file to find specific hex
strings, NOT ASCII. Ideally, it should not read more of the file than
it needs, and should quit when it does find it. The file is about 4 mb, with
8 million hex digits.

How should I go about doing this in perl? Obviously I don't want to
read the whole thing into memory, and I need to stay away from dodgy
functions that turn hex into ascii.

If anybody's interested, it is 8 million digits of pi, in hex form (2
digits per byte). I already converted from ASCII into that packed
format, but now I need to search it.

There's a slow PHP version if you care. http://ericjiang.co.nr/pi.php

Jeff Schwab

Sep 24, 2005, 9:45:24 AM

What do you mean by "in hex form"? Do you just mean that each 4-bit
nybble represents an unsigned integer from 0 to 15?

jerry davis

Sep 24, 2005, 10:41:16 AM
On Friday 23 September 2005 08:06 pm, Mazin07 wrote:
> I need a Perl script to search through a file to find specific hex
> strings, NOT ASCII. Ideally, it should not read more of the file than
> it needs, and should quit when it does find it. The file is about 4 mb, with
> 8 million hex digits.
>
> How should I go about doing this in perl? Obviously I don't want to
> read the whole thing into memory, and I need to stay away from dodgy
> functions that turn hex into ascii.

Look at the functions sysopen and sysread; they will do the trick.
With sysread you can specify how many bytes to read.
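For instance, a minimal sketch of that chunked approach (the file name,
chunk size, and packed needle are illustrative assumptions; the overlap
keeps a match that straddles a chunk boundary from being missed):

    use strict;
    use warnings;
    use Fcntl;                                # for O_RDONLY

    my $needle = "\x14\x15\x92\x65";          # packed bytes to find
    sysopen my $fh, "pi.bin", O_RDONLY or die "pi.bin: $!";

    my ($buf, $offset) = ("", 0);
    while (sysread $fh, my $block, 65536) {
        $buf .= $block;
        my $pos = index $buf, $needle;
        if ($pos >= 0) {
            print "found at byte ", $offset + $pos, "\n";
            last;                             # quit as soon as it is found
        }
        my $keep = length($needle) - 1;       # keep a boundary overlap
        $offset += length($buf) - $keep;
        $buf = substr $buf, -$keep;
    }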


>
> If anybody's interested, it is 8 million digits of pi, in hex form (2
> digits per byte). I already converted from ASCII into that packed
> format, but now I need to search it.

This sounds cool. I bought a book on pi from B&N, and it had, I think,
10,000 digits of pi.


>
> There's a slow PHP version if you care. http://ericjiang.co.nr/pi.php

--
Registered Linux User: 275424
Today's Fortune: The only constant is change.

Ilya Zakharevich

Sep 24, 2005, 5:43:33 PM
[A complimentary Cc of this posting was sent to
jerry davis
<jfd...@wi.rr.com>], who wrote in article <200509240941...@wi.rr.com>:
> > If anybody's interested, it is 8 million digits of pi, in hex form (2
> > digits per byte). I already converted from ASCII into that packed
> > format, but now I need to search it.
>
> This sounds cool. I bought a book on pi from B&N, and it had, I think,
> 10,000 digits of pi.

perl -MMath::Pari=:prec=10000,Pi -wle "print Pi"

AFAIK, the algorithm takes O(n^(3/2)). Getting a million digits may
take several minutes on contemporary machines. 8e6 should be feasible
too; it may take about an hour (but if the machine's cache is small,
it can be much slower). In both cases you may need to increase the
size of the PARI stack; see the docs.

Hope this helps,
Ilya

Mazin07

Sep 24, 2005, 3:25:28 PM
jerry davis wrote:
> On Friday 23 September 2005 08:06 pm, Mazin07 wrote:
>
>>I need a Perl script to search through a file to find specific hex
>>strings, NOT ASCII. Ideally, it should not read more of the file than
>>it needs, and should quit when it does find it. The file is about 4 mb, with
>>8 million hex digits.
>>
>>How should I go about doing this in perl? Obviously I don't want to
>>read the whole thing into memory, and I need to stay away from dodgy
>>functions that turn hex into ascii.
>
>
> Look at the functions sysopen and sysread; they will do the trick.
> With sysread you can specify how many bytes to read.

I'm not too sure of the best way to go about searching it. With
sysopen, would I have to incrementally read more bytes? That would
mean a lot of looping; how would that affect the run speed?
And how do I get it to give me hex instead of ASCII? I don't want to
continually convert ASCII to hex.

Keith Thompson

Sep 25, 2005, 8:31:36 PM
Mazin07 <maz...@gmail.com> writes:
> I need a Perl script to search through a file to find specific hex
> strings, NOT ASCII. Ideally, it should not read more of the file
> than it needs, and should quit when it does find it. The file is about 4
> mb, with 8 million hex digits.
>
> How should I go about doing this in perl? Obviously I don't want to
> read the whole thing into memory, and I need to stay away from dodgy
> functions that turn hex into ascii.
>
> If anybody's interested, it is 8 million digits of pi, in hex form (2
> digits per byte). I already converted from ASCII into that packed
> format, but now I need to search it.

If it's 2 digits per byte, it's not in hex form. Hexadecimal is a
(typically ASCII) text representation of base-16 numbers, using the
characters '0'..'9', 'a'..'f'. It typically takes 8 bits per
hexadecimal digit.

The format you're describing sounds like plain binary. Or is it
binary-coded decimal?

If it's binary-coded decimal, it's going to look something like this:

0011 0001 0100 0001 0101 1001 ...
(3) (1) (4) (1) (5) (9) ...
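In Perl terms, the conversion the OP describes is what pack and unpack
with an "H*" template do (a quick sketch; two digits per byte):

    my $packed = pack "H*", "14159265";   # four bytes: 0x14 0x15 0x92 0x65
    print unpack "H*", $packed;           # prints "14159265" again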

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Keith Thompson

Sep 25, 2005, 8:33:10 PM
Mazin07 <maz...@gmail.com> writes:
> I need a Perl script to search through a file to find specific hex
> strings, NOT ASCII. Ideally, it should not read more of the file
> than it needs, and should quit when it does find it. The file is about 4
> mb, with 8 million hex digits.
>
> How should I go about doing this in perl? Obviously I don't want to
> read the whole thing into memory, and I need to stay away from dodgy
> functions that turn hex into ascii.
>
> If anybody's interested, it is 8 million digits of pi, in hex form (2
> digits per byte). I already converted from ASCII into that packed
> format, but now I need to search it.

One more thing: Why don't you want to read the whole thing into
memory? On most modern systems, 4 megabytes isn't an unreasonable
amount.

Sebastien Nadeau

Sep 26, 2005, 9:26:27 AM
At 20:33 2005-09-25, you wrote:

>One more thing: Why don't you want to read the whole thing into
>memory? On most modern systems, 4 megabytes isn't an unreasonable
>amount.

Yes, that is exactly what I would do.

And for search purposes, since you search on decimal digits, it would
be great to put them all in a hash (since memory shouldn't be a
problem), like this:
{1} => 1
{2} => 4
{3} => 1
{4} => 5
{5} => 9
...
Then sort the hash on the values and you would easily be able to
compute stats like: how many 4s are there? how many times are
identical consecutive digits encountered? etc.
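For the frequency stats specifically, a small count hash would avoid
both the 8-million-entry hash and the sort; a sketch, assuming $pi
holds the digits as one ASCII string:

    my %count;
    $count{$_}++ for split //, $pi;
    printf "%d occurs %d times\n", $_, $count{$_} || 0 for 0 .. 9;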

Sébastien


Sébastien Nadeau
Computer technician
Université Laval Library
(418) 656-2131 ext. 6815
Notice of Confidentiality
<http://www.rec.ulaval.ca/lce/securite/confidentialite.htm>

Keith Thompson

Sep 26, 2005, 8:37:17 PM
Sebastien Nadeau <sebastie...@bibl.ulaval.ca> writes:
> At 20:33 2005-09-25, you wrote:
>>One more thing: Why don't you want to read the whole thing into
>>memory? On most modern systems, 4 megabytes isn't an unreasonable
>>amount.
>
> Yes, that is exactly what I would do.
>
> And for search purposes, since you search on decimal digits, it would
> be great to put them all in a hash (since memory shouldn't be a
> problem), like this:
> {1} => 1
> {2} => 4
> {3} => 1
> {4} => 5
> {5} => 9
> ...
> Then sort the hash on the values and you would easily be able to
> compute stats like: how many 4s are there? how many times are
> identical consecutive digits encountered? etc.

Even if 4 megabytes of data is reasonable, once you start doing things
like creating hashes with 8 million elements, or sorting
8-million-element arrays, you might start having resource problems.

Mazin07

Sep 26, 2005, 7:01:41 PM
Keith Thompson wrote:
> Mazin07 <maz...@gmail.com> writes:
>
>>I need a Perl script to search through a file to find specific hex
>>strings, NOT ASCII. Ideally, it should not read more of the file
>>than it needs, and should quit when it does find it. The file is about 4
>>mb, with 8 million hex digits.
>>
>>How should I go about doing this in perl? Obviously I don't want to
>>read the whole thing into memory, and I need to stay away from dodgy
>>functions that turn hex into ascii.
>>
>>If anybody's interested, it is 8 million digits of pi, in hex form (2
>>digits per byte). I already converted from ASCII into that packed
>>format, but now I need to search it.
>
>
> If it's 2 digits per byte, it's not in hex form. Hexadecimal is a
> (typically ASCII) text representation of base-16 numbers, using the
> characters '0'..'9', 'a'..'f'. It typically takes 8 bits per
> hexadecimal digit.
>
> The format you're describing sounds like plain binary. Or is it
> binary-coded decimal?
>
> If it's binary-coded decimal, it's going to look something like this:
>
> 0011 0001 0100 0001 0101 1001 ...
> (3) (1) (4) (1) (5) (9) ...
>


Binary-coded decimal actually. I was referring to it like that because
the digits 1 and 2 would show as 12 in hex.

so.. (3.) 14159265 would be

14 15 92 65
when viewed in a hex editor.

So what's the best function to search for strings of numbers in that,
aside from turning it into an ASCII string in memory? What kind of
functions does Perl have that would be suited to that? I searched
Perl's manual, but didn't find anything.

Or would it be better to just expand the whole thing into a human-readable
ASCII string in memory? It seems like that would take all the advantage
out of it.

Yitzchak Scott-Thoennes

Sep 27, 2005, 6:43:56 AM
On Mon, Sep 26, 2005 at 11:01:41PM +0000, Mazin07 wrote:
> Binary-coded decimal actually. I was referring to it like that because
> the digits 1 and 2 would show as 12 in hex.
>
> so.. (3.) 14159265 would be
>
> 14 15 92 65
> when viewed in a hex editor.
>
> So what's the best function to search for strings of numbers in that,
> aside from turning it into an ASCII string in memory? What kind of
> functions does Perl have that would be suited to that? I searched
> Perl's manual, but didn't find anything.
>
> Or would it be better to just expand the whole thing into a human-readable
> ASCII string in memory? It seems like that would take all the advantage
> out of it.

If you mean you want to search for a string of digits like 14159265
and there may be an odd number of digits, or you want to find them
whether they start at an odd or even position, you have enormously
complicated things by packing your data that way. You could come
up with a regex that would work, but it's much less trouble just
to have your file not packed.
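The unpacked route really is short; as a sketch (assuming $packed
holds the file's bytes):

    my $digits = unpack "H*", $packed;     # back to one ASCII digit string
    my $pos = index $digits, "14159265";   # any digits, any alignment
    print $pos >= 0 ? "found at digit $pos\n" : "not found\n";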

Sebastien Nadeau

Sep 27, 2005, 11:22:08 AM
At 20:37 2005-09-26, you wrote:

>Even if 4 megabytes of data is reasonable, once you start doing things
>like creating hashes with 8 million elements, or sorting
>8-million-element arrays, you might start having resource problems.

I agree with that.

In theory sorting 8 000 000 digits with a quicksort does not take much
longer than sorting an array of 100 000. Of course this is no longer
true if you need to swap memory pages while sorting 8 000 000. But
even printing Hello World will be a pain if memory is full. I should
have said "if memory is not an issue on your system" instead of
"memory shouldn't be a problem".

Mark Jason Dominus

Sep 27, 2005, 9:51:47 AM
Mazin07 says:
>
> Binary-coded decimal actually. I was referring to it like that because
> the digits 1 and 2 would show as 12 in hex.
>
> so.. (3.) 14159265 would be
>
> 14 15 92 65
> when viewed in a hex editor.
>
> So what's the best function to search for strings of numbers in that,
> aside from turning it into an ASCII string in memory? What kind of
> functions does Perl have that would be suited to that?

It doesn't. Hardly anyone ever does what you're doing, so there's no
built-in function for it.

> Or would it be better to just expand the whole thing into a human-readable
> ASCII string in memory?

That's what I would think.

> It seems like that would take all the advantage out of it.

Now you're like the guy who stores all his groceries by gluing them to
the ceiling. It keeps the stuff out of the way, and the ceiling was
formerly just wasted space. But then the guy wants to know how he can
cook the food without unsticking it from the ceiling. "You mean I
have to take it down again to eat it?" he asks. "It seems like that
would take all the advantage out of it."

Yeah, I guess so. The house doesn't have a special stove for cooking
food that's attached to the ceiling, but that's not necessarily a
defect in the design of the house.

Well, anyway, I am now getting to the point of this article, which is
to present some code that searches BCD data without unpacking it. I
did it as an exercise, not because I wanted to recommend this as a
worthwhile thing to do.

The idea is this. Suppose you are looking for $needle = "123456" in
some BCD-coded target string $haystack. The first thing you can do is
to use

index($haystack, bcd_pack($needle))

where bcd_pack turns "123456" into "\x12\x34\x56".

This searches for the needle directly, and index() uses a fast
string-search algorithm. If you find "\x12\x34\x56", you win. But if
you don't find it, you're not finished yet, because the haystack might
contain (for example) "\x91\x23\x45\x69" which should also match. So
you need to do a trick. You trim off the "1" and the "6" from
"123456" and search for "\x23\x45" instead; if you find any matches,
you make sure they are preceded by something of the form X1 and
followed by something of the form 6X. If so, you win.

There are some other details you have to worry about, but I think at
this point looking at the code would be easier than following an
explanation, so here it is:

================================================================


#!/usr/bin/perl
use strict 'vars';

# Utility function
#
# $haystack is the string to be searched, in BCD form
# $needle is the thing we're looking for, also in BCD form
# we only get a successful match if $needle appears,
# immediately preceded by $head and followed by $tail
#
# Either or both of head and tail may be undefined
#
# Returns leftmost position at which there is a match,
# or -1 if there is none
sub _search_bcd_util {
    my ($haystack, $needle, $head, $tail) = @_;
    my $nlen = length($needle);
    my $cur_pos = defined($head) ? 1 : 0;
    my $pos;
    while ($cur_pos <= length($haystack)) {
        $pos = index($haystack, $needle, $cur_pos);
        return -1 if $pos == -1;
        next if defined($head)
            && (substr($haystack, $pos-1, 1) & "\x0f") ne $head;
        next if defined($tail)
            && (substr($haystack, $pos+$nlen, 1) & "\xf0") ne $tail;
        return $pos;
    } continue {
        $cur_pos = $pos+1;
    }
    return -1;
}

sub search_bcd {
    my ($haystack, $needle) = @_;
    my $hlen = length $haystack;
    my $nlen = length $needle;

    my $odd  = $nlen % 2;
    my $bcd1 = pack "H*", substr($needle, 0, $nlen - $odd);
    my $bcd2 = pack "H*", substr($needle, 1, $nlen - 2 + $odd);
    my $tail = pack "H2", substr($needle, -1, 1) . "0";
    my $head = pack "H2", "0" . substr($needle, 0, 1);

    my $found1 = _search_bcd_util($haystack, $bcd1, undef, $odd ? $tail : undef);
    my $found2 = _search_bcd_util($haystack, $bcd2, $head, $odd ? undef : $tail);

    return -1 if $found1 < 0 && $found2 < 0;
    return $found1 == -1     ? 2*$found2 - 1 :
           $found2 == -1     ? 2*$found1     :
           $found1 < $found2 ? 2*$found1     :
                               2*$found2 - 1 ;
}

if ($ARGV[0] eq "TEST") {
    require Test::More;
    Test::More->import("no_plan");
    my $pi = "141592653589793238";
    my $haystack = pack "H*", $pi;

    is(search_bcd($haystack, "653"),  6);
    is(search_bcd($haystack, "535"),  7);
    is(search_bcd($haystack, "6535"), 6);
    is(search_bcd($haystack, "5358"), 7);

    for my $i (0 .. length($pi)-2) {
        for my $j (1 .. length($pi)-$i-1) {
            my $needle = substr($pi, $i, $j);
            is(search_bcd($haystack, $needle), index($pi, $needle),
               "Looking for <$needle> at position $i");
        }
    }
    for my $failure (qw(0 12 43 142 999 373 142857)) {
        is(search_bcd($haystack, $failure), -1,
           "search for absent '$failure'");
    }
}


================================================================

This seems to work OK, although I did not make an effort to test it
carefully, so I am not entirely sure. I have not tried to optimize it
for speed. I don't know if this would be worth doing for your
application, either with or without speed optimizations. I suspect
not, but who knows?

"Hope this helps."

Mazin07

Sep 27, 2005, 4:10:24 PM
I give up trying it with Perl. Consider the thread closed, but thanks
to everyone who tried to help me. Maybe some low-level language will
suffice.

Ilya Zakharevich

Sep 27, 2005, 5:19:44 PM
[A complimentary Cc of this posting was sent to
Yitzchak Scott-Thoennes
<stho...@efn.org>], who wrote in article <2005092710...@efn.org>:

> If you mean you want to search for a string of digits like 14159265
> and there may be an odd number of digits, or you want to find them
> whether they start at an odd or even position, you have enormously
> complicated things by packing your data that way.

"Enormously"??? I doubt it. Fixed strings are easy:

/\x14\x15\x92\x65/ || /[\x01...\xF1]\x41\x59\x26[\x50-\x5F]/

(of course, this leading [\x01...\xF1] should be completely spelled out).
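For what it's worth, a sketch of generating those two patterns for an
arbitrary digit string (helper names are mine; the nibble classes are
built rather than spelled out by hand):

    sub bcd_regexes {
        my ($d) = @_;         # digit string, e.g. "14159265"
        my $low = sub {       # class: any byte whose low nibble is $_[0]
            my $n = hex $_[0];
            '[' . join('', map { sprintf '\x%02X', $_*16 + $n } 0..15) . ']';
        };
        my $high = sub {      # class: any byte whose high nibble is $_[0]
            my $n = hex $_[0];
            sprintf '[\x%02X-\x%02X]', $n*16, $n*16 + 15;
        };
        my $bytes = sub {     # exact bytes for an even-length digit run
            join '', map { sprintf '\x%02X', hex } unpack '(A2)*', $_[0];
        };
        my $k = length $d;
        my $even = $k % 2
            ? $bytes->(substr $d, 0, $k-1) . $high->(substr $d, -1)
            : $bytes->($d);
        my $odd = $k % 2
            ? $low->(substr $d, 0, 1) . $bytes->(substr $d, 1)
            : $low->(substr $d, 0, 1) . $bytes->(substr $d, 1, $k-2)
              . $high->(substr $d, -1);
        return ($even, $odd);   # use as: $haystack =~ /$even|$odd/
    }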

What *is* enormously complicated is looking for regular expressions in
the expanded string - without expansion...

Hope this helps,
Ilya

Yitzchak Scott-Thoennes

Sep 27, 2005, 7:50:51 PM
mailed and posted

I didn't mean to imply that it was objectively difficult, just that it
increases the complexity of a solution by a factor of 3 or more.

Just a routine to compute the regex to use for a given series of digits
is going to be as complicated as the entire code needed to search the file
for the digits in ASCII form.

Mark Jason Dominus

Sep 28, 2005, 9:49:27 AM
Mark Rafn:
> Be sure to keep us informed how this works out. I can't see how choosing a
> different language is going to help with your mismatch between storage format
> and search parameter.

Yeah. If anything, it will get worse. The basic problem is that the
smallest piece of data that is easy to work with is a single byte
(which is pretty much the definition of a byte) and the data format
he's chosen stores a digit in a smaller space than that.

The only way I can imagine to solve the problem in a lower-level
language is to do pretty much the same thing that I did in Perl, but
translated into the lower-level language. And maybe you'd have to
implement index() from scratch too.

You see this kind of thing a lot in beginning programmers. They still
have a very hazy idea of what the computer is doing and why it does
what it does; the conceptual connection between what they put in and
what comes out is still unclear to them. The output of a computer
program is an extremely complicated function of the input, and if you
don't have a good understanding of the very complicated internal
mechanism, it appears to be random.

This is why beginning programmers try putting random code into their
programs in hopes that it will work, and why it is so hard to get them
to try to read and understand an error message.

Here I think the same problem is manifesting in a different way.
Programs work by magic, so Mazin's approach is to look for the
appropriate magic incantation in Perl that will magically solve his
problem. If Perl doesn't have the right incantation, then he'll go
try a different language that might have it.

Malcolm Dew-Jones

Sep 28, 2005, 1:19:05 PM
Mark Jason Dominus (m...@plover.com) wrote:
: Mark Rafn:

: > Mazin07 <maz...@gmail.com> wrote:
: > >I give up trying it with Perl. Consider the thread closed, but thanks
: > >to everyone who tried to help me. Maybe some low-level language will
: > >suffice.
: >
: > Be sure to keep us informed how this works out. I can't see how choosing a
: > different language is going to help with your mismatch between storage format
: > and search parameter.

: Yeah. If anything, it will get worse. The basic problem is that the
: smallest piece of data that is easy to work with is a single byte
: (which is pretty much the definition of a byte) and the data format
: he's chosen stores a digit in a smaller space than that.


And the silly thing is that he's not really saving much space. It's not
as if he's got some great 10:1 compression ratio happening, or gigabytes
of data to squeeze into memory.

Premature optimization!

--

This programmer available for rent.

Ilya Zakharevich

Sep 28, 2005, 5:23:18 PM
[A complimentary Cc of this posting was sent to
Malcolm Dew-Jones
<yf...@victoria.tc.ca>], who wrote in article <433ad089$1...@news.victoria.tc.ca>:

> And the silly thing is that he's not really saving much space. It's not
> as if he's got some great 10:1 compression ratio happening, or gigabytes
> of data to squeeze into memory.

Sorry, but this argument is, in general, BS-ish. Most of the time it
is not the compression ratio which matters, but whether the working set
fits in one of the "caches" (especially if you consider the memory as
the cache of the swap space ;-).

So if it is on a Sparc with 8M second-level cache, his choice of
keeping the data compressed *may* be beneficial in some situations.

Hope this helps,
Ilya

Mark Manning

Sep 29, 2005, 1:18:11 PM
Hi everyone - just signed up. Used to be on the old (circa early
1990's) group. Anyway, this sounds like a problem presented by a
teacher to his students (IMHO). After all, with today's systems having
from 128MB and up to 8GB of memory what is wrong with just sucking 8MB
worth of data into memory and working on it there? A simple index
command could do the searching for you. Now, if it were 8GB of data -
that would be something - but 8MB? Sounds like a teacher told them not
to just suck it into memory and do that. The teacher probably wanted
them to learn how to use files to do searching for information and since
PI doesn't repeat - that makes it harder to do the searching since any
string will occur only once. The guy would probably do better by going
to a site about PI and looking up where the strings start (and no - I
haven't gone out there and looked but there is a site for everything
else so why not where strings in PI start?). Anyway, just my $0.02 worth.

Mark

Keith Thompson

Sep 29, 2005, 5:54:15 PM
Sebastien Nadeau <sebastie...@bibl.ulaval.ca> writes:
> At 20:37 2005-09-26, you wrote:
>
>>Even if 4 megabytes of data is reasonable, once you start doing things
>>like creating hashes with 8 million elements, or sorting
>>8-million-element arrays, you might start having resource problems.
>
> I agree with that.
>
> In theory sorting 8 000 000 digits with a quicksort does not take much
> longer than sorting an array of 100 000.

Um, what theory would that be?

Quicksort is O(n * log n). Sorting 8,000,000 items should require
about 110 times as many operations as sorting 100,000 items. If the
time to sort 8,000,000 items is very short, there's not much real
difference -- but I just tried an experiment, and the program crashed
before it finished sorting.
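(To spell the estimate out: (8,000,000 * log 8,000,000) divided by
(100,000 * log 100,000) is 80 * (6.9 / 5), which is about 110, using
base-10 logs.)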

Keith Thompson

Sep 29, 2005, 6:03:07 PM
Mark Jason Dominus <m...@plover.com> writes:
> Mark Rafn:
>> Mazin07 <maz...@gmail.com> wrote:
>> >I give up trying it with Perl. Consider the thread closed, but thanks
>> >to everyone who tried to help me. Maybe some low-level language will
>> >suffice.
>>
>> Be sure to keep us informed how this works out. I can't see how
>> choosing a different language is going to help with your mismatch
>> between storage format and search parameter.
>
> Yeah. If anything, it will get worse. The basic problem is that the
> smallest piece of data that is easy to work with is a single byte
> (which is pretty much the definition of a byte) and the data format
> he's chosen stores a digit in a smaller space than that.
>
> The only way I can imagine to solve the problem in a lower-level
> language is to do pretty much the same thing that I did in Perl, but
> translated into the lower-level language. And maybe you'd have to
> implement index() from scratch too.

Well, maybe. There are languages that implement packed arrays.
In Pascal, for example, I think the syntax would be
packed array [lo..hi] of 0..15
Ada has a similar feature.

This should be a good match for the way he's storing the data, and
he can use ordinary array indexing to search the array.

Of course, the compiler is going to have to transform all the indexing
operations into something that extracts data from individual bytes.
Depending on how clever the compiler is, this may or may not result in
faster code than what you'd get if you manually coded it in Perl or C.
It will almost certainly be slower than an unpacked array (one element
per byte) -- but cache size issues make it unclear what the best
tradeoff is going to be.
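Perl can fake such a packed array with plain nibble arithmetic (a
sketch; the function name is mine, and vec() would be an alternative):

    # digit at 0-based position $i of a packed-BCD string, where the
    # earlier digit of each pair sits in the high nibble
    sub digit_at {
        my ($bcd, $i) = @_;
        my $byte = ord substr $bcd, $i >> 1, 1;
        return $i % 2 ? $byte & 0x0F    # odd position: low nibble
                      : $byte >> 4;     # even position: high nibble
    }

    my $haystack = pack "H*", "141592653589793238";
    print digit_at($haystack, 0), digit_at($haystack, 4), "\n";  # "19"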

Ilya Zakharevich

Sep 29, 2005, 6:56:53 PM
[A complimentary Cc of this posting was sent to
Mark Manning
<mar...@airmail.net>], who wrote in article <11jo8ep...@corp.supernews.com>:

> Hi everyone - just signed up. Used to be on the old (circa early
> 1990's) group. Anyway, this sounds like a problem presented by a
> teacher to his students (IMHO). After all, with today's systems having
> from 128MB and up to 8GB of memory what is wrong with just sucking 8MB
> worth of data into memory and working on it there?

Well, my watch still does not have 8M of builtin memory...

> PI doesn't repeat - that makes it harder to do the searching since any
> string will occur only once.

Doubtlessly, you wrote this before your morning coffee, right?

Hope this helps,
Ilya

John Bokma

Sep 29, 2005, 8:39:32 PM
Keith Thompson <ks...@mib.org> wrote:

> Well, maybe. There are languages that implement packed arrays.
> In Pascal, for example, I think the syntax would be
> packed array [lo..hi] of 0..15
> Ada has a similar feature.

And don't forget the Z80. I recall how I once calculated a Mersenne
prime whose exponent I knew, by repeated doubling, using BCD to store
the number :-) The Z80 has some BCD operations; I can't recall them
all though.

--
John MexIT: http://johnbokma.com/mexit/
personal page: http://johnbokma.com/
Experienced programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html

Sebastien Nadeau

Sep 30, 2005, 9:51:48 AM
At 17:54 2005-09-29, you wrote:

>Sebastien Nadeau <sebastie...@bibl.ulaval.ca> writes:
> > At 20:37 2005-09-26, you wrote:
> >
> >>Even if 4 megabytes of data is reasonable, once you start doing things
> >>like creating hashes with 8 million elements, or sorting
> >>8-million-element arrays, you might start having resource problems.
> >
> > I agree with that.
> >
> > In theory sorting 8 000 000 digits with a quicksort does not take much
> > longer than sorting an array of 100 000.
>
>Um, what theory would that be?
>
>Quicksort is O(n * log n). Sorting 8,000,000 items should require
>about 110 times as many operations as sorting 100,000 items. If the
>time to sort 8,000,000 items is very short, there's not much real
>difference -- but I just tried an experiment, and the program crashed
>before it finished sorting.

Exactly the theory you explained. There's not an order of magnitude
between sorting 100,000 and 8,000,000 items. Only a multiplicative
constant. Quicksort in the worst case is O(n^2), but will generally
run in O(n * log n) when well implemented. I agree with you however:
I talked before testing. So I tried this script:

use strict;
my @digits;
my $time1 = time;
foreach my $i (1 .. 1000000) {
    push @digits, int(rand 10);
}
my @sorted = sort @digits;
my $time2 = time;
my $total = $time2 - $time1;
print "$total seconds\n";
exit;

And the first time it sorted 1,000,000 digits in 3 seconds.
It took 5 seconds to sort 2,000,000 and 8 seconds to sort 3,000,000.
That's exactly what I was expecting.

However, sorting 4,000,000 took 20 seconds (that's what it printed)
and then Perl crashed (or I was not patient enough to wait the whole
day so I killed the process). Indeed, I didn't even try with 8,000,000.

Since quicksort is an in-place sort, it shouldn't be a memory problem,
but on every occasion the perl process memory usage jumped to reach
the limit of available memory on my machine (I'm running with 512
megs of physical memory).

Can someone explain that?

Mark Manning

Sep 30, 2005, 10:41:42 AM
Tea actually - but yes. I was very tired and when I get tired I usually
start stating the obvious. :-)

Martin wrote to me about this as well, and his message gave me an idea
for how this person's problem could be solved. What do you think of this:

1. Build $index_1[0-9][0->#]
2. Start loop looking for strings.
2a. Use $index_1 to index into the BCD string.
3. Build reduced search index ($index_2[0->#])
using a max of two, three, or four BCD characters.
4. Use reduced $index_2 as starting point to look for string.
5. End of loop

Sort of like a divide and conquer algorithm but not quite.

Think that might work?
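A rough instantiation of that idea, assuming the digits are also
available as one ASCII string in $pi (names are mine):

    my %index;                           # first two digits => positions
    for my $pos (0 .. length($pi) - 2) {
        push @{ $index{substr $pi, $pos, 2} }, $pos;
    }

    sub find_digits {                    # assumes needles of 2+ digits
        my ($needle) = @_;
        for my $pos (@{ $index{substr $needle, 0, 2} || [] }) {
            return $pos if substr($pi, $pos, length $needle) eq $needle;
        }
        return -1;
    }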

T Beck

Sep 30, 2005, 1:15:17 PM

Sebastien Nadeau wrote:
> At 17:54 2005-09-29, you wrote:
>
> Exactly the theory you explained. There's not an order of magnitude
> between sorting 100,000 and 8,000,000 items. Only a multiplicative
> constant. Quicksort in the worst case is O(n^2), but will generally
> run in O(n * log n) when well implemented. I agree with you however:
> I talked before testing. So I tried this script:

Since big O is defined as the asymptotic worst-case performance of an
algorithm, how can it be "worst case O()" and "average O()"? (This is
really a question; I have no doubt in my mind that I've missed
something, or forgotten something from my school days)


> Since quicksort is an in-place sort, it shouldn't be a memory problem,
> but on every occasion the perl process memory usage jumped to reach
> the limit of available memory on my machine (I'm running with 512
> megs of physical memory).
>
> Can someone explain that?

Quicksort is a recursive algorithm, which would mean that every call to
it is going to consume just a bit more stack. Do that a few million
times and you're chipping away at your RAM.

--T Beck

Yitzchak Scott-Thoennes

Sep 30, 2005, 3:57:27 PM
On Fri, Sep 30, 2005 at 10:15:17AM -0700, T Beck wrote:
> Sebastien Nadeau wrote:
> > Since quicksort is an in-place sort, it shouldn't be a memory problem,
> > but on every occasion the perl process memory usage jumped to reach
> > the limit of available memory on my machine (I'm running with 512
> > megs of physical memory).
> >
> > Can someone explain that?
>
> Quicksort is a recursive algorithm, which would mean that every call to
> it is going to consume just a bit more stack. Do that a few million
> times and you're chipping away at your RAM.

But quicksort isn't going to recurse a few million times except on
pathological data.

But in any case, perl no longer defaults to using quicksort.
See: http://search.perl.org/perldoc/sort

Yitzchak Scott-Thoennes

Sep 30, 2005, 3:58:34 PM
On Fri, Sep 30, 2005 at 12:57:27PM -0700, Yitzchak Scott-Thoennes wrote:
> But in any case, perl no longer defaults to using quicksort.
> See: http://search.perl.org/perldoc/sort

I mean http://search.cpan.org/perldoc/sort

Mark Jason Dominus

Sep 30, 2005, 4:01:30 PM
"T Beck" <Tracy...@Infineon.com>:

> Quicksort is a recursive algorithm, which would mean that every call to
> it is going to consume just a bit more stack.

The call consumes memory, but the memory is released when the call returns.

In nearly all cases, the maximum amount of stack space used when
sorting N items is only O(log(N)).
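Guaranteeing that bound takes the standard trick of recursing into the
smaller partition and iterating over the larger one; a sketch of the
pattern (partition() is a hypothetical helper, not shown):

    sub qsort_frame {
        my ($ary, $lo, $hi) = @_;
        while ($lo < $hi) {
            my $p = partition($ary, $lo, $hi);  # hypothetical helper
            if ($p - $lo < $hi - $p) {
                qsort_frame($ary, $lo, $p - 1); # recurse: smaller half
                $lo = $p + 1;                   # iterate: larger half
            } else {
                qsort_frame($ary, $p + 1, $hi);
                $hi = $p - 1;
            }
        }
    }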

John Bokma

Sep 30, 2005, 6:09:51 PM
Yitzchak Scott-Thoennes <stho...@efn.org> wrote:

> On Fri, Sep 30, 2005 at 10:15:17AM -0700, T Beck wrote:
>> Sebastien Nadeau wrote:
>> > Since quicksort is an in-place sort, it shouldn't be a memory problem,
>> > but on every occasion the perl process memory usage jumped to reach
>> > the limit of available memory on my machine (I'm running with 512
>> > megs of physical memory).
>> >
>> > Can someone explain that?
>>
>> Quicksort is a recursive algorithm, which would mean that every call to
>> it is going to consume just a bit more stack. Do that a few million
>> times and you're chipping away at your RAM.
>
> But quicksort isn't going to recurse a few million times except on
> pathological data.

And even then, IIRC, one call is tail recursion (so it can be optimized
away), and for a small number of items one shouldn't use quicksort in
the first place :-).

Mark Manning

Oct 1, 2005, 1:21:50 AM
Yes - kind-of. I've had this happen many times and someone on the Perl
list explained it to me years ago. So I will try to remember what they
said.....

Basically, Perl allocates memory but does not free up memory as quickly
as it gobbles it up. That is to say that the garbage collection routine
waits a while to see if the memory is going to be used again. So if the
Quicksort is going at a speed of, say five, the garbage collector is
only going at a speed of between one and three. So memory is growing
simply because Perl's garbage collector hasn't realized that some of the
memory can be released. That is also why you see it go up, then down,
then farther up, then down a bit, and up again. The garbage collector
is trying to keep up but the priority is on doing the quicksort. It is
assumed that you have enough memory to handle doing the sort in the
first place.

You can write your own Quicksort and alleviate the memory problems since
you can make the original array global versus local. Then all you are
doing is moving pointers into the array around rather than the array
itself all the time. It is a bit slower than the built-in Quicksort,
but not by that much.

An excellent (although out of print) source of information on sorting
algorithms is the old Nibble magazine (from the Apple ][ era). It
contained no fewer than six different sorting algorithms written out
completely in flowcharts and Applesoft. These, in turn, were taken
from Donald Knuth's books on mathematics et al. (Tomes, actually.)

For alphabetic sorting, I have found the Postal Sort to be the quickest
and easiest to use. If you want more info just e-mail me.

Mark

Sebastien Nadeau wrote:

> Can someone explain that?

Mark Jason Dominus

Sep 30, 2005, 10:51:15 PM

"T Beck" <Tracy...@Infineon.com>:

>
> Since big O is defined as the asymptotic worst-case performance of an
> algorithm, how can it be "worst case O()" and "average O()"? (This is
> really a question; I have no doubt in my mind that I've missed
> something, or forgotten something from my school days)

That's not the definition. In general, a quantity Q(x) is said to be
O(f(x)) for some function f(x) if there is some constant C so that,
for sufficiently large x,

Q(x) < C * f(x)

If the running time of an algorithm is O(f(x)), then

running time < C * f(x)

for some constant C and sufficiently large x. Clearly, this is a
statement about the worst case of the algorithm.

But you can also say that the average running time of some algorithm
is O(f(x)), which means that there is a constant C, etc., so that

average running time < C * f(x)

So it is correct to say that the running time of quicksort is not
O(x log x), but the average running time of quicksort *is* O(x log x).

Hope this helps.

Tassilo v. Parseval

Oct 1, 2005, 2:54:47 AM
Thus spoke Mark Manning:

> Yes - kind-of. I've had this happen many times and someone on the Perl
> list explained it to me years ago. So I will try to remember what they
> said.....

You remember (or were told) wrongly.

> Basically, Perl allocates memory but does not free up memory as quickly
> as it gobbles it up. That is to say that the garbage collection routine
> waits a while to see if the memory is going to be used again. So if the
> Quicksort is going at a speed of, say five, the garbage collector is
> only going at a speed of between one and three. So memory is growing
> simply because Perl's garbage collector hasn't realized that some of the
> memory can be released. That is also why you see it go up, then down,
> then farther up, then down a bit, and up again. The garbage collector
> is trying to keep up but the priority is on doing the quicksort. It is
> assumed that you have enough memory to handle doing the sort in the
> first place.

I am inclined to say that this is gibberish. Perl's garbage collector
has no speed at which it is going. It is a simple reference-counting
mechanism. Once the reference-count of a variable drops to zero its
memory is not exactly freed but rather marked as reusable.

What you describe requires an asynchronous mode of operation, using a
dedicated garbage-collecting-thread. There is no such thing in perl.
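The reference counting is easy to observe, for what it's worth; a
small sketch using the core module Devel::Peek:

    use Devel::Peek;

    my $x = "hello";
    Dump($x);        # the dump shows REFCNT = 1: only $x holds it
    my $r = \$x;     # taking a reference bumps the count
    Dump($x);        # now the dump shows REFCNT = 2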

> You can write your own Quicksort and alleviate the memory problems since
> you can make the original array global versus local. Then all you are
> doing is moving pointers into the array around rather than the array
> itself all the time. It is a bit slower than the built-in Quicksort,
> but not by that much.

Have you any benchmarks to back that claim? I have one here, and
depending on the size of the array to be sorted, my handrolled version
is between 9 and 13 times slower than the built-in:

#!/usr/bin/perl -w

use Benchmark qw/cmpthese/;

sub qsort {
    my ($cmp, $ary, $left, $right) = @_;

    return if $left >= $right;

    my $piv = ($left+$right) / 2;
    @$ary[$left, $piv] = @$ary[$piv, $left];
    my $last = $left;

    *main::b = \$ary->[$left];

    for ($left+1 .. $right) {
        *main::a = \$ary->[$_];
        ++$last and @$ary[$_, $last] = @$ary[$last, $_] if &$cmp < 0;
    }

    @$ary[$left, $last] = @$ary[$last, $left];

    qsort($cmp, $ary, $left, $last-1);
    qsort($cmp, $ary, $last+1, $right);
}

cmpthese(1, {   # for huge lists, 1 iteration is already painful enough
    psort => sub {
        my @ary = reverse 1 .. 2_000_000;
        qsort(sub { return $a <=> $b }, \@ary, 0, $#ary);
    },
    csort => sub {
        my @ary = reverse 1 .. 2_000_000;
        @ary = sort { $a <=> $b } @ary;
    },
});

Granted, this is not the most efficient qsort; however, choosing the
middle element as the pivot is the best choice for an already
pre-sorted list, so that should make up a bit for the lack of
optimization. Also note that the requirement of doing a real in-place
sort makes this routine quite ugly, with its explicit passing-around
of the sub-array bounds.

What can be said is that psort uses only around a half or a third of
the memory that csort uses for a list with 2 million elements. The
latter is heavily swapping on my machine, and yet it's still much
faster.

Therefore I conclude a handrolled qsort is only a useful option when
perl's sort bails out with out-of-memory.

Tassilo
--
use bigint;
$n=71423350343770280161397026330337371139054411854220053437565440;
$m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);

Dr.Ruud

Oct 4, 2005, 8:22:52 AM
T Beck:

> Quicksort is a recursive algorithm

There are non-recursive Quicksort-implementations too.
http://en.wikipedia.org/wiki/Quicksort

For some kinds of data: Radix, Bucket and Postman's sort are interesting
alternatives.
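For this thread's data (single digits 0 to 9) the bucket family
degenerates into a counting sort, which is O(n) and needs no recursion
at all; a sketch:

    my @count = (0) x 10;
    $count[$_]++ for @digits;                      # tally each digit
    my @sorted = map { ($_) x $count[$_] } 0 .. 9;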

--
Affijn, Ruud

"Gewoon is een tijger."


Mark Manning

Oct 2, 2005, 12:21:21 PM
Tassilo v. Parseval wrote:
> Thus spoke Mark Manning:

>
> You remember (or were told) wrongly.
That may be. :-)

> I am inclined to say that this is gibberish. Perl's garbage collector
> has no speed at which it is going. It is a simple reference-counting
> mechanism. Once the reference-count of a variable drops to zero its
> memory is not exactly freed but rather marked as reusable.
>
> What you describe requires an asynchronous mode of operation, using a
> dedicated garbage-collecting-thread. There is no such thing in perl.

No. I did not mean to convey that the garbage collector is a separate
thread or process (although it may be) - only that it has to wait and
verify that the memory is available for recouping. Therefore, it runs
slower (or maybe I should say at a slower rate) than the quicksort
itself is using up memory. Further, from what I understand about Perl -
which can be wrong also - Perl expects you to have the needed memory so
it can complete whatever function you are calling/using. Thus, the
quicksort itself has a higher priority than recouping memory. Even if
Perl isn't multithreaded (which I thought they had put in back in the
90's so Perl could be used with Apache more readily and handle multiple
threads at one time), the quicksort routine would then be called and,
since there aren't multiple threads, the quicksort routine would not
release the memory until it was through using it. Which again would
mean that the quicksort routine had higher priority than the garbage
collector as the GC could not run until after QS had finished.

If, however, Perl does use multithreading, then it would stand to
reason, IMHO, that the GC would be put on a separate thread so it could
monitor and try to grab memory back without having to wait for any
process Perl is running otherwise to complete. Applesoft, from the old
Apple][+ days had a GC which would interrupt programs running on the
Apple][+ to do garbage collection. This was slow and tedious because
your program would just stop functioning until the GC was through. With
the advent of ProDOS, Apple figured out a way to make the garbage
collector steal time from a running program to do the GCing. With
MS-DOS, QBasic used a similar scheme to recover their memory. Perl's
first implementation used the same type of scheme from QBasic but when
multithreading came along I thought they had changed over to using
threads for these types of things. From what I have read (and
understand) Java uses this type of an approach so the program doesn't
have to slow down and so does PHP.

>>You can write your own Quicksort and alleviate the memory problems since
>>you can make the original array global versus local. Then all you are
>>doing is moving pointers into the array around rather than the array
>>itself all the time. It is a bit slower than the built-in Quicksort,
>>but not by that much.
>
>
> Have you any benchmarks to back that claim? I have one here, and
> depending on the size of the array to be sorted, my handrolled version
> is between 9 and 13 times slower than the built-in:

The term "a bit more" is not a dependent clause to "It will run as fast
as" or even "It will run nearly as fast as". It is just slower. Thus
"a bit more" is a relative term and not to be taken as a precise
measurement of time, size, length, breath, width, or any other
measurement term. Between 9 and 13 times slower is "a bit more slower".
Just as between 1 and 2 times slower is "a bit more slower" and 1,000
to 1,000,000 is "a bit more slower" - only usually used to denote
sarcasm because it is really a bit more slower rather just a little
slower. ;-)

> Granted, this is not the most efficient qsort; however, choosing the
> middle element as the pivot is the best choice for an already
> pre-sorted list, so that should make up a bit for the lack of
> optimization. Also note that the requirement of doing a real in-place
> sort makes this routine quite ugly, with its explicit passing-around
> of the sub-array bounds.

And that is a bad thing because....?

>
> What can be said is that psort uses only around a half or a third of
> the memory that csort uses for a list with 2 million elements. The
> latter is heavily swapping on my machine, and yet it's still much
> faster.
>
> Therefore I conclude a handrolled qsort is only a useful option when
> perl's sort bails out with out-of-memory.

Wasn't that what we were talking about? That he had run a QS and had
run out of memory? So the suggestion was you could write your own QS
and it would run slower but by an unknown amount - just a bit slower. :-)

Sebastien Nadeau

Oct 3, 2005, 9:53:37 AM
Tassilo v. Parseval wrote:

> [...] my handrolled version


> is between 9 and 13 times slower than the built-in:

> [...]


> Therefore I conclude a handrolled qsort is only a useful option when
> perl's sort bails out with out-of-memory.

Yes, and it seems to be the case here. 9 to 13 times is not so
dramatic, since it's a constant factor rather than a blow-up of orders
of magnitude. Especially if the built-in version crashes, I'm willing
to wait a little longer if the sort is not part of a time-critical
routine.

Anyway, just before leaving Friday I tried:

    use sort "_quicksort";

since Perl's default sort is now a merge sort. And indeed, it sorted
a list of 4,000,000 digits in 18 seconds without crashing.

Sébastien
