My longest string ever (so far) - over 300 million characters!

mjc

unread,

Jan 16, 2010, 7:15:22 PM1/16/10

to

And it was in a real application.

I had to extract 25-bit fields from a 90MB binary file, with frames of
10,000 fields indicated by a 33-bit sync value. The words I was
interested in were indicated by being preceded by a special tag word.

My first step was to convert the binary file to hex text using od. I
then wrote some gawk code to read the text file and extract the (32-
bit) words preceded by the tag word. There were 9 million of them.

I concatenated them into a single string of 72 million hex characters
(had to do byte-swapping along the way), and then, one character at a
time, converted that into a string of 0's and 1's 300 million
characters long. I could then easily (using index) search for the sync
pattern (independent of any word boundaries) and find the data I
wanted.

The total run time was just under 7 minutes (under Red Hat 5.1).

Some optimizations I had to do:

To build up the string of 9 million hex words, I had to group them 256
words at a time before concatenating them to the big string. When I
just did one word at a time, I took forever - I had to stop it.

Similarly, When converting the hex to binary, I converted groups of
256 characters at a time before appending them to the big binary
string.

Thinking about it now, I could probably combine the gathering of the
hex words with the conversion to binary - my program was a revision of
one where that combining wasn't done.

Anyway, it's nice that gawk can handle really long strings.

Martin Cohen

P.S. It would be nice if index had another argument which specified
where the search would start. Instead, after finding the sync pattern
and the fields, I had to remove the sync pattern from the big string
so I could find the next pattern, which occurred over 1000 times.

Janis Papanagnou

unread,

Jan 17, 2010, 1:42:38 AM1/17/10

to

mjc wrote:
> [...]

> P.S. It would be nice if index had another argument which specified
> where the search would start.

I seem to recall that this was planned for a future gawk release.

> Instead, after finding the sync pattern
> and the fields, I had to remove the sync pattern from the big string
> so I could find the next pattern, which occurred over 1000 times.

By removing you mean copying into a new string using substr() with
no third argument, I suppose.

Janis

Aharon Robbins

unread,

Jan 22, 2010, 7:20:49 AM1/22/10

to

In article <hiubgv$l5j$1...@news.m-online.net>,

Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>mjc wrote:
>> [...]
>> P.S. It would be nice if index had another argument which specified
>> where the search would start.
>
>I seem to recall that this was planned for a future gawk release.

I don't remember that, actually. But I've added it to the FUTURES file
so that maybe I'll remember to think about doing it.

Arnold
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL

Janis Papanagnou

unread,

Jan 22, 2010, 11:09:48 AM1/22/10

to

Aharon Robbins wrote:
> In article <hiubgv$l5j$1...@news.m-online.net>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> mjc wrote:
>>> [...]
>>> P.S. It would be nice if index had another argument which specified
>>> where the search would start.
>> I seem to recall that this was planned for a future gawk release.
>
> I don't remember that, actually. But I've added it to the FUTURES file
> so that maybe I'll remember to think about doing it.

So I probably misremembered, and it was just part of a discussion.

Thanks for adding that to the possible future features list.

Janis

>
> Arnold

Anton Treuenfels

unread,

Jan 22, 2010, 7:25:24 PM1/22/10

to

"Aharon Robbins" <arn...@skeeve.com> wrote in message
news:hjc56m$1c6$2...@news.bytemine.net...

>>mjc wrote:
>>> [...]
>>> P.S. It would be nice if index had another argument which specified
>>> where the search would start.
>

> I don't remember that, actually. But I've added it to the FUTURES file
> so that maybe I'll remember to think about doing it.

Well as long as you've done that anyway...how about making negative indices
legal and meaning to set the start position from the end of the string
instead of the beginning? Ie:

index(searchstr, targetstr, -10)

starts searching at the tenth-to-last character of "searchstr".

I stole that idea from Javascript myself. It's pretty easy to implement:

function startpos(pos, str) {

if ( pos < 0 )
pos += length(str) + 1

if ( pos > 0 && pos <= length(str) )
return( pos )

return( 0 )
}

so I used it for my versions of index(), rindex(), substr(), ord() and maybe
others I forget. Handiest for getting that last character off a string via
substr(), but essentially free for the others, so why not? (I said to
myself)

- Anton Treuenfels

Aharon Robbins

unread,

Jan 24, 2010, 3:17:23 AM1/24/10

to

I can think about it. But it sounds like a start down the slippery slope
towards perl. The length() call on a string is essentially free, since gawk
maintains the string length, and I think it's clearer to write

lastind = length(str) - 1
lastchar = substgr(str, lastind, 1)

Arnold

In article <qd2dnZFFtpeiRMbW...@earthlink.com>,

Janis Papanagnou

unread,

Jan 24, 2010, 6:22:36 AM1/24/10

to

Aharon Robbins wrote:
> I can think about it. But it sounds like a start down the slippery slope
> towards perl.

Uh-oh! (But I don't think so in this case.)

> The length() call on a string is essentially free, since gawk
> maintains the string length, and I think it's clearer to write
>
> lastind = length(str) - 1
> lastchar = substgr(str, lastind, 1)

To add my 2 cents; added functions are better than syntactical
changes, and added _optional_ function arguments better than new
functions or a changed function signature.[*] Having to write
(and think about correctness) of the above two lines code would
make it worth adding that optional parameter. (There's a typo in
that code, BTW.) And for other than extracting the last character
you'd have to make changes in two places in the two lines of code,
or add another variable to write the code better maintainable.

Janis

[*] Not changing anything may seem be the best from another POV.

Ed Morton

unread,

Jan 24, 2010, 11:07:12 AM1/24/10

to

On 1/24/2010 5:22 AM, Janis Papanagnou wrote:
> Aharon Robbins wrote:
>> I can think about it. But it sounds like a start down the slippery slope
>> towards perl.
>
> Uh-oh! (But I don't think so in this case.)
>
>> The length() call on a string is essentially free, since gawk
>> maintains the string length, and I think it's clearer to write
>>
>> lastind = length(str) - 1
>> lastchar = substgr(str, lastind, 1)
>
> To add my 2 cents; added functions are better than syntactical
> changes, and added _optional_ function arguments better than new
> functions or a changed function signature.[*] Having to write
> (and think about correctness) of the above two lines code would
> make it worth adding that optional parameter. (There's a typo in
> that code, BTW.) And for other than extracting the last character
> you'd have to make changes in two places in the two lines of code,
> or add another variable to write the code better maintainable.
>
> Janis
>
> [*] Not changing anything may seem be the best from another POV.

That sounds good to me. It's great to have gawk extensions for things that
either take a lot of code to write or are conceptually tricky (e.g. gensub() and
asort() or having split() build an array of the separators), but it's not
difficult or even interesting code to implement the changes being discussed
here. If I ever needed either functionality, even if I remembered there was a
gawk extension for it, it'd take as long to ready the man page to see the syntax
as it'd take to just write the couple of lines of code to implement it in
standard awk.

Ed.

mjc

unread,

Jan 24, 2010, 5:16:01 PM1/24/10

to

> >> In article <qd2dnZFFtpeiRMbWnZ2dnUVZ_vWdn...@earthlink.com>,

Since I started this, here is my suggestion:

In index(str, look, prev), prev means the position just before where
you want to start searching. This would allow the following loop:

k = 0;
while ( (k = index(str, look, k)) > 0 )
{
... with substr(str, k) or substr(str, k, length(look));
}

This would also allow the substr(..., -n) usage.