fseek and ftell abilities for GAWK

Marc de Bourget

Nov 20, 2016, 3:18:50 PM
Compared to TAWK, I miss one thing most: fseek and ftell.

I need it often because I have to deal with very big files.
With TAWK, I can create hashes with the line numbers as key
and the ftell value of the start of this line as value and
then jump easily later to this line saving memory and time.

Are there plans to implement it?
If it is already implemented, please correct me.
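
As a rough illustration of the indexing technique described above (not TAWK's or gawk's implementation), here is a minimal sketch in C, where fseek and ftell come from. The helper names build_line_index and read_line_at are made up for this example. It assumes a seekable file and newline-terminated records.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: scan fp from the start and record the offset at
 * which each line begins. Returns a malloc'ed array holding *count
 * offsets (caller frees), or NULL on allocation or ftell failure. */
long *build_line_index(FILE *fp, size_t *count)
{
    assert(fp != NULL && count != NULL);
    size_t cap = 16, n = 0;
    long *offsets = malloc(cap * sizeof *offsets);
    if (offsets == NULL)
        return NULL;

    rewind(fp);
    long start = 0;   /* offset where the current line starts */
    int in_line = 0;  /* nonzero once a byte of the current line was seen */
    int c;
    while ((c = getc(fp)) != EOF) {
        if (!in_line) {
            if (n == cap) {
                long *tmp = realloc(offsets, (cap *= 2) * sizeof *offsets);
                if (tmp == NULL) {
                    free(offsets);
                    return NULL;
                }
                offsets = tmp;
            }
            offsets[n++] = start;
            in_line = 1;
        }
        if (c == '\n') {
            start = ftell(fp);  /* the next line begins right here */
            if (start < 0) {
                free(offsets);
                return NULL;
            }
            in_line = 0;
        }
    }
    *count = n;
    return offsets;
}

/* Hypothetical helper: jump to line `lineno` (1-based) and read it into
 * buf. Returns buf, or NULL if the line does not exist or I/O fails. */
char *read_line_at(FILE *fp, const long *offsets, size_t count,
                   size_t lineno, char *buf, int bufsize)
{
    if (lineno < 1 || lineno > count)
        return NULL;
    if (fseek(fp, offsets[lineno - 1], SEEK_SET) != 0)
        return NULL;
    return fgets(buf, bufsize, fp);
}
```

Only one offset per line is kept in memory, so a later read_line_at(fp, idx, n, 3, buf, sizeof buf) goes straight to line 3 without rereading the file.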

Marc de Bourget

Jan 5, 2017, 9:12:01 AM
Hi GAWK developers, since it is already implemented in C,
maybe it is not that hard to implement in GAWK, is it?
For me, fseek/ftell is the GAWK feature I miss most.

Ed Morton

Jan 5, 2017, 10:41:56 AM
Marc, the way to communicate with the gawk developers is by email to
bug-...@gnu.org, not by posting in this NG. You can post here to ask other
users questions, and sometimes you MIGHT get a gawk developer to read and
respond to your post, but YMMV.

Ed.

Marc de Bourget

Jan 5, 2017, 11:07:20 AM
Hi Ed, ok. I have posted it there.

Kenny McCormack

Jan 5, 2017, 11:16:12 AM
In article <19fb97a2-56e9-4fa7...@googlegroups.com>,
Marc:
I was about to give you a "technical" answer to your question - that
is, a typical Unix-y "You could do that" sort of response. I think that it
should be possible to do this in more-or-less straightforward fashion in a
GAWK extension library. However, I then re-thought it, and realized that
this is really a policy question, not a technical one. From a policy
perspective, I think both of the following are legitimate issues:

1) Although I do think it is probably do-able via an extension library
- and I was about to spend a bit of time messing around with it, until I
realized that it's not really a technical question - I don't think it is
possible to do it without, as Dennis Ritchie would have said, getting a
little too cozy with the implementation.

I.e., you'd have to have a reliable (and portable!) way to figure out which
fd (file descriptor) the current file is open on. I can do this with
Linux, pretty reliably, by looking at /proc/<pid>/fd, but that's not going
to work on non-Linux systems. It gets ugly real quick, and you realize
something very important: namely, that although you *could* do this in
"user-space", it'd be a lot better if it were done in implementor-space.
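
To make the /proc approach concrete, here is a minimal, Linux-only sketch in C. It is an illustration, not code from this thread or from gawk: the helper name find_fd_for_path is hypothetical, and it assumes procfs is mounted at /proc. It uses /proc/self/fd, which is the calling process's own /proc/<pid>/fd.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (Linux only): return an open file descriptor of
 * this process whose /proc/self/fd entry resolves to the same file as
 * `path`, or -1 if none is found or /proc is unavailable. */
int find_fd_for_path(const char *path)
{
    assert(path != NULL);
    char want[PATH_MAX];
    if (realpath(path, want) == NULL)
        return -1;

    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;  /* no procfs: not Linux, or /proc is not mounted */

    int found = -1;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;  /* skip "." and ".." */
        char link[288], target[PATH_MAX];
        snprintf(link, sizeof link, "/proc/self/fd/%s", ent->d_name);
        ssize_t len = readlink(link, target, sizeof target - 1);
        if (len < 0)
            continue;
        target[len] = '\0';  /* readlink does not NUL-terminate */
        if (strcmp(target, want) == 0) {
            found = atoi(ent->d_name);
            break;
        }
    }
    closedir(dir);
    return found;
}
```

Even with the descriptor in hand, lseek(fd, 0, SEEK_CUR) reports the kernel's offset. An interpreter that reads its input in blocks has probably consumed data past the record the script is looking at, which is part of why this gets ugly outside the implementation.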

And that's the key: There's a class of problems that, although they
*could* be done in user-(including via extension lib)-space, they'd be better
done by the implementors. And that's the rub, when the GAWK developers have
clearly and vociferously declared that they won't do it in implementor-space
if it is at all possible for it to be done in user-space.

2) I really, seriously, and sincerely don't understand why this is an
issue for you (you, personally), because I believe that both of the
following are true:
a) You own TAWK (and are quite happy with it).
b) You work on Windows.

So, you should just do your work with TAWK and be happy. Sadly, I don't
use Windows much nowadays, so I don't get to use TAWK as much as I'd like.

From both the features and the performance points-of-view, TAWK is far
and away the best AWK implementation on the planet. I don't understand why
you are at all concerned with GAWK. To be fair, TAWK has 3 negatives,
listed below, but I don't think any of them should affect you.
a) Runs on only two platforms (Windows, Solaris).
b) Is not open source.
c) Is not currently commercially available.

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/DanaC

Marc de Bourget

Jan 5, 2017, 11:41:51 AM
Thank you Kenny.
I'd like to add the most important TAWK negative:
d) No UTF-8 support.

Whereas, thanks to Kaz's GAWK Cygnal version, UTF-8 works with GAWK
on native Windows. I need to process UTF-8 files in the near future.


Kenny McCormack

Jan 5, 2017, 1:26:43 PM
In article <b1bc0600-6f68-4823...@googlegroups.com>,
Marc de Bourget <marcde...@gmail.com> wrote:
...
>I'd like to add the most important TAWK negative:
>d) No UTF-8 support.
>
>Whereas, thanks to Kaz's GAWK Cygnal version, UTF-8 works with GAWK
>on native Windows. I need to process UTF-8 files in the near future.

Fair point. I've never had to worry about i18n, so it's a non-issue for me.

I think that TAWK was on the verge of having i18n support when it all went
poof! and Pat disappeared into the wilderness.

There's that guy, Paul, who posts here every once in a while. He claims to
have access to (and license to develop and publish) the TAWK code base (at
least for Windows).

I wonder if he might have any plans to make TAWK UTF-8 aware.

--
Rich people pay Fox people to convince middle class people to blame poor people.

(John Fugelsang)

Andrew Schorr

Jan 5, 2017, 9:47:06 PM
Hi,

On Thursday, January 5, 2017 at 11:16:12 AM UTC-5, Kenny McCormack wrote:
> And that's the key: There's a class of problems that, although they
> *could* be done in user-(including via extension lib)-space, they be better
> done by the implementors. And that's the rub, when the GAWK developers have
> clearly and vociferously declared that they won't do it in implementor-space
> if it is at all possible for it to be done in user-space.

I think that these days, I might be considered a gawk developer. But it wasn't always that way. My involvement with gawk started when I wanted to process XML files with gawk. I found that Jurgen Kahrs had patched gawk to do this, and I worked with him to develop further the extension library mechanism. After several years of work, we were able to get our changes merged into mainline gawk. My point is that if fseek/ftell is really valuable to you, then you can take advantage of the fact that gawk is an open-source project and develop a patch to implement this feature. You can then attempt to contribute this patch to the gawk development team and convince them that this is a good change to make. It's not easy! It takes lots of time and effort! That's how it has always been with all of the open-source projects with which I have ever been involved.

If you don't want to go to the trouble of convincing the developers to merge a patch, then the other approach is to do this in an extension library. I think the hooks that we added to gawk to enable it to parse XML files will probably give you the power that you need to implement your own input parser that can support fseek/ftell. But it won't be easy -- parsing awk records is quite tricky, in part because RS may be a regular expression. Nonetheless, nothing is stopping you from pursuing either of these approaches. There is no guarantee of success!

Regards,
Andy


Kaz Kylheku

Jan 8, 2017, 5:32:01 PM
On 2017-01-05, Marc de Bourget <marcde...@gmail.com> wrote:
> Whereas, thanks to Kaz's GAWK Cygnal version, UTF-8 works with GAWK
> on native Windows. I need to process UTF-8 files in the near future.

An issue has been identified in Cygnal. It is tracked as issue #15.

I have a local fix, no public git commit or snapshot.

http://www.kylheku.com/cygnal/issue-15.html

--
TXR Programming Language: http://nongnu.org/txr
Music DIY Mailing List: http://www.kylheku.com/diy
ADA MP-1 Mailing List: http://www.kylheku.com/mp1

Marc de Bourget

Jan 9, 2017, 6:21:22 AM
@Kenny: Yes, I already use Paul's TAWK 6 version. UTF-8 support is not planned.
The greatest advantage of TAWK 6 compared to TAWK 5 is that it overcomes the
2 GB file size limit of TAWK 5 for fseek/ftell operations.

@Andrew: Thank you. My C knowledge is still too limited to add this extension.

@Kaz: Thank you, I think this is no big issue.

Kenny McCormack

Jan 12, 2017, 12:55:28 PM
In article <aa7262fa-ddb4-4291...@googlegroups.com>,
Marc de Bourget <marcde...@gmail.com> wrote:
Marc, how serious are you about making this happen?

Having re-thought this, and having done a little research, I think it could
be implemented, as an extension library. I envision this being a 3-person
job, involving me, you, and Kaz. Basically, I know how to do it, at least
to a first approximation - but I only work on Linux/Unix these days. If I
was able to get something working on Linux, then maybe Kaz could compile it
on Cygwin for you to test. What do you think of that?

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/Rorschach

Marc de Bourget

Jan 12, 2017, 3:53:42 PM
Hi Kenny, sounds great.
However, please be aware that I can't contribute more than testing
Windows exe files. My C knowledge is too basic to provide any code.

Kenny McCormack

Jan 12, 2017, 5:23:59 PM
In article <97ce89a0-4400-4768...@googlegroups.com>,
Marc de Bourget <marcde...@gmail.com> wrote:
...
>Hi Kenny, sounds great.
>However, please be aware that I can't contribute more than testing
>Windows exe files. My C knowledge is too basic to provide any code.

I'm imagining that Kaz would provide you with a .DLL, for you to test with
your GAWK.EXE.

I'm assuming, without any actual knowledge, that when you compile a GAWK
extension in Cygwin, you get a .DLL.

Note, as an almost total aside, that even though the system default
extension for shared libraries on Mac OSX is .dylib, when I build GAWK
extension libraries on OSX, I always give them an extension of .so
(basically, so it looks the same as in Linux) and it works just fine.

So, I could imagine that you could make a .so in Cygwin as well, and it
would all work just fine.

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/Security

Marc de Bourget

Apr 19, 2017, 10:01:29 AM
One further great thing about fseek and ftell is that you can undo a getline by
setting the file pointer back to the desired position before the next getline
call. I have just had a case where no other solution worked. It's the only way
to undo.
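
The "undo" idea can be sketched in C as follows. This is an illustration of the technique, not TAWK code: peek_line is a hypothetical helper that remembers the position with ftell, reads a line, and then seeks back so the next read returns the same line again. It assumes a seekable stream, so it will not work on a pipe.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read the next line into buf without consuming it.
 * Returns buf, or NULL at end of file, on error, or if fp cannot seek. */
char *peek_line(FILE *fp, char *buf, int bufsize)
{
    assert(fp != NULL && buf != NULL && bufsize > 0);
    long mark = ftell(fp);        /* remember where this line starts */
    if (mark < 0)
        return NULL;              /* not seekable, e.g. a pipe */
    char *line = fgets(buf, bufsize, fp);
    if (fseek(fp, mark, SEEK_SET) != 0)  /* undo the read */
        return NULL;
    return line;
}
```

After a call to peek_line, an ordinary fgets on the same stream yields the line that was just peeked.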

Kenny McCormack

Apr 20, 2017, 2:52:31 AM
In article <30770cc7-2ae4-417c...@googlegroups.com>,
Marc de Bourget <marcde...@gmail.com> wrote:
Marc, I'm actually sort of interested enough to take a shot at writing an
extension for this. But I'd like to know what sort of AWK code you have in
mind for using it. Could you give me some sample AWK code, say something
that shows how you currently use these functions in TAWK?
(I.e., what you would like to see working in GAWK.)

--
"I think I understand delicate, but why do I have to wash my hands, and
be standing in cold water when doing it?"

Kaz Kylheku <k...@kylheku.com> in comp.lang.c

Marc de Bourget

Apr 20, 2017, 4:10:58 AM
Yes, here is an example of how I use ftell and fseek with TAWK:

# ftellfseektest.awk
BEGIN {
    # Jump to a specified line, using ftell and fseek
    fnr = 0
    lastftell = 0
    while ((getline < ARGV[1]) > 0) {
        ++fnr
        JUMPARRAY[fnr] = lastftell
        lastftell = ftell(ARGV[1])
    }
    close(ARGV[1])
    fopen(ARGV[1], "r") # Open file for later access with fseek

    # Jump to a specified line, e.g. in this case to line 3:
    fseek(ARGV[1], JUMPARRAY[3], 0)
    # Print this line:
    if ((getline < ARGV[1]) > 0) {
        print $0
    }
    close(ARGV[1])
}