fseek() extension to gawk version 4.1

Håkon Hægland

Jan 10, 2014, 5:24:53 AM

I wondered if there is an fseek() extension to gawk version 4.1?
If not, would it be difficult to write one?

In the following question:

http://stackoverflow.com/questions/21031756/can-i-speed-up-awk-program-using-nr-function

the OP asks for a method to jump to a specific byte position.
I do not know whether this really is a problem: even for very large files,
gawk should be able to skip to a given position quite quickly. For
instance, if we would like to start at line number 20 million:

NR>=20000000 { do something }

should be fast? Anyway, it could still be interesting to test this.

Here is how I would like it to work: In the BEGIN block you write

@load "fseek"
BEGIN {
    fseek(20000000)
}

then everything else works as usual. So in effect, the only difference
is that gawk believes that the input file is much smaller than it
really is, and starts at byte position 20000000 instead of at the
beginning of the file. (So NR now counts from that position.)
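
Until such an extension exists, the effect described above can be approximated from outside gawk. The following is only a minimal sketch, not the proposed fseek() extension: it assumes a POSIX `tail`, and the helper name `awk_from_byte` is made up for illustration. awk only ever sees the input from the given byte offset onward, so NR counts from that position.

```shell
# Hypothetical helper: run awk on FILE starting at byte OFFSET (0-based).
# tail -c +K begins output at the K-th byte (1-based), hence the +1.
awk_from_byte() {
    offset=$1; file=$2; shift 2
    tail -c "+$((offset + 1))" "$file" | awk "$@"
}

# Demo: skip the first 6 bytes ("line1" plus its newline).
demo=$(mktemp)
printf 'line1\nline2\nline3\n' > "$demo"
awk_from_byte 6 "$demo" '{ print NR ": " $0 }'
rm -f "$demo"
```

If the offset falls in the middle of a line, the first record awk sees is only the tail end of that line, so the caller has to pick offsets that coincide with record boundaries.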

Janis Papanagnou

Jan 10, 2014, 6:44:08 AM

On 10.01.2014 11:24, Håkon Hægland wrote:
> I wondered if there is an fseek() extension to gawk version 4.1?
> If not, would it be difficult to write one?
>
> In the following question:
>
> http://stackoverflow.com/questions/21031756/can-i-speed-up-awk-program-using-nr-function
>
> the OP asks for a method to jump to a specific byte position.

I understood from that article that the OP wanted to jump to a specific
_record_ ("by telling it a starting point setting the NR"), not to
a byte offset (as usually used in seek functions).

> I do not know whether this really is a problem: even for very large files,
> gawk should be able to skip to a given position quite quickly. For
> instance, if we would like to start at line number 20 million:
>
> NR>=20000000 { do something }
>
> should be fast? Anyway, it could still be interesting to test this.
>
> Here is how I would like it to work: In the BEGIN block you write
>
> @load "fseek"
> BEGIN {
>     fseek(20000000)
> }
>
> then everything else works as usual. So in effect, the only difference
> is that gawk believes that the input file is much smaller than it
> really is, and starts at byte position 20000000 instead of at the
> beginning of the file. (So NR now counts from that position.)

You seem to be mixing concepts: byte positions and record numbers.

The point with seek is that you can directly address the position
and jump to it, while with a record number you have no chance to
do that, even in the simple non-regexp RS case.

It's an open question how to interrogate the (byte) position in case
the awk program wants to interrupt processing of the current
slice, and how to pass it to the environment (to be able to pass it
back again for a later invocation). If you used stdout for that you
could hardly use stdout for regular output; you'd be forced to write
files then. Not very nice.

BTW, the OP at stackoverflow may also not have a good approach if
he loops in shell instead of letting awk do the looping. The OP
also has the option to do the processing in shell alone. The newer
versions of ksh93, for instance, support seek operations with their
redirection operators.

Then again, it's worth asking whether there's really any performance
issue. I checked it on my system with a test file of 860332414
bytes that contains 2951832 lines of text. The positioning simply
done with

NR<2000000 { next }
{ exit(0) }

at about two thirds of the way through the file required less than a second.
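
A comparable check can be reproduced roughly as follows. This is a sketch under assumptions: the helper name `skip_to_record` is invented here, and the generated file of 3 million short lines is only stand-in data, far smaller than the 860 MB file mentioned above, so timings will not be comparable.

```shell
# Hypothetical helper: skip to record N of FILE, print it, and stop.
# Earlier records are still read one by one; they are just not processed.
skip_to_record() {
    awk -v n="$1" 'NR < n { next } { print NR ": " $0; exit 0 }' "$2"
}

# Stand-in data: 3 million numbered lines.
data=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 3000000; i++) print "record " i }' > "$data"
skip_to_record 2000000 "$data"
rm -f "$data"
```

Running the same awk command under the shell's `time` gives a rough measurement on a given system.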

Janis

Anton Treuenfels

Jan 10, 2014, 9:25:17 AM

"Janis Papanagnou" <janis_pa...@hotmail.com> wrote in message
news:laome8$rqc$1...@news.m-online.net...

> The point with seek is that you can directly address the position
> and jump to it, while with a record number you have no chance to
> do that, even in the simple non-regexp RS case.

Well, with unstructured text files, yes. A text file of fixed-length lines
would be an exception.
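
For that fixed-length case the byte offset of a record follows directly from its number. A minimal sketch, assuming every line has exactly the same number of bytes plus one newline; the helper name `from_fixed_record` is hypothetical:

```shell
# Hypothetical helper: output FILE from record N (1-based) onward, assuming
# every line is exactly WIDTH bytes followed by a single newline.
from_fixed_record() {
    n=$1; width=$2; file=$3
    offset=$(( (n - 1) * (width + 1) ))    # bytes occupied by records 1..N-1
    tail -c "+$((offset + 1))" "$file"     # tail -c +K is 1-based
}

# Demo: four 4-byte lines; start at record 3.
demo=$(mktemp)
printf 'aaaa\nbbbb\ncccc\ndddd\n' > "$demo"
from_fixed_record 3 4 "$demo" | awk '{ print NR ": " $0 }'
rm -f "$demo"
```

Here awk numbers the records from 1 again, so its NR is relative to the starting record.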

- Anton Treuenfels

Håkon Hægland

Jan 10, 2014, 9:51:17 AM

On Friday, January 10, 2014 12:44:08 PM UTC+1, Janis Papanagnou wrote:
> The point with seek is that you can directly address the position
> and jump to it, while with a record number you have no chance to
> do that, even in the simple non-regexp RS case.
>
>
> It's an open question how to interrogate the (byte) position in case
> the awk program wants to interrupt processing of the current
> slice, and how to pass it to the environment (to be able to pass it
> back again for a later invocation). If you used stdout for that you
> could hardly use stdout for regular output; you'd be forced to write
> files then. Not very nice.

Thanks for the reply. Do we need to interrogate the byte position? I agree that in some cases it would be nice. On the other hand, if the byte position to seek to is given a priori or as input to the awk program, then I do not see a need for that.

Still, from your answer, I now see an additional or separate question: how to determine the current byte position in an awk program. One idea could be to call length($0) on each record and sum the results (plus one per record for the newline) until a given line NR is reached. But this would assume that each character is a single byte. Another option could be to build another extension "ftell" using the GNU C library ftell() function.
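
The length-summing idea can be sketched like this. It assumes the default RS (a single newline terminating every record) and forces the C locale so that length() counts bytes rather than multibyte characters; the helper name `byte_offset_of_record` is made up for illustration.

```shell
# Hypothetical helper: print the 0-based byte offset at which record N starts.
# Assumes RS is a newline; LC_ALL=C makes length() count bytes.
byte_offset_of_record() {
    LC_ALL=C awk -v n="$1" '
        BEGIN   { pos = 0 }
        NR == n { print pos; exit }
                { pos += length($0) + 1 }   # +1 for the newline terminator
    ' "$2"
}

# Demo: the first line contains a two-byte UTF-8 character (octal 303 251).
demo=$(mktemp)
printf 'h\303\251llo\nworld\n' > "$demo"
byte_offset_of_record 2 "$demo"    # prints 7: six bytes plus the newline
rm -f "$demo"
```

This still reads every record before N, so it only pays off when the offset is stored and reused for a later run.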

Janis Papanagnou

Jan 10, 2014, 1:58:14 PM

Yes, you can always find special cases. For a general gawk feature
that isn't the point, though.

Janis

Janis Papanagnou

Jan 10, 2014, 2:07:06 PM

On 10.01.2014 15:51, Håkon Hægland wrote:
>
> Thanks for the reply. Do we need to interrogate the byte position? I agree
> that in some cases it would be nice. On the other hand, if the byte
> position to seek to is given a priori or as input to the awk program, then I
> do not see a need for that.
>
> On the other hand, from your answer, I now see an additional or separate
> question on how to determine the current byte position in an awk program.

> [...]

Well, my answer was primarily meant to show that the proposal may not be
well thought through. I don't think it's a good idea. We would first have
to get a sensible answer as to how to design that feature, and I haven't
seen anything yet that makes sense to me. I also don't see how it would fit
into the awk paradigm. As said before, from what I read, I suspect the
other forum's OP hasn't put together his shell/awk hybrid algorithm in a
sensible way, and I fear we are suggesting ill-fitting features for awk
that wouldn't be necessary if the issue had been handled in an appropriate
way in the first place.

Janis