help with block i/o

Cross

Aug 8, 2009, 5:38:13 AM

Hello

I am trying to parse RTF files. The specification says that a reader is a
character reader. However, we know that reading a character at a time is slow,
so I want to read blocks into memory and scan them. I read some time ago that,
using fread(), a complete file can be read into memory without knowing its size.
Please help me with this. Also, suppose I am reading pages of, say, 1024 chars
at a time into memory; the last page is quite likely to end somewhere in the
middle, where the file hits EOF. How can I detect that?

Regards,
Cross

bartc

Aug 8, 2009, 6:07:05 AM

"Cross" <X...@X.tv> wrote in message news:h5jh29$uuo$1...@adenine.netfront.net...

What do your docs for fread() say?

According to mine, fread() on 1024 objects returns 1024, except on the last
block where it may be less.

On a file containing an exact multiple of 1024 objects, the last fread(),
according to the following test, will return 0 (rather than 1024) since eof is
not detectable until you try to read some more. (In C, the end-of-file
indicator is only set by a read that runs into the end of the file; feof()
does not predict it.)

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    FILE *f;
    size_t n;               /* fread returns size_t, not int */
    char buffer[1024];

    f = fopen("input", "rb");
    if (f == NULL) exit(EXIT_FAILURE);

    while (1) { /* trademark bartc loop */

        n = fread(buffer, 1, 1024, f);
        printf("N = %lu EOF = %d\n", (unsigned long)n, feof(f));
        if (n < 1024) break;    /* short read: end of file or error */

    }

    fclose(f);
    return 0;
}

This was tested on successive input files containing 4095, 4096 and 4097
bytes. Input was in binary mode; in text mode anything could happen.

--
Bartc

Cross

Aug 8, 2009, 6:46:51 AM

I am clear now. I can read fixed size pages and depending on the return value of
fread() I can detect EOF. I have to check with text files.
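
A minimal sketch of the page-at-a-time loop discussed above. The helper name count_paged and the 1024-byte page are assumptions for illustration, not code from the thread. A short count from fread() only says that the read stopped early; ferror() and feof() say why.

```c
#include <assert.h>
#include <stdio.h>

#define PAGE_SIZE 1024  /* illustrative page size taken from the question */

/* Hypothetical helper: reads f in PAGE_SIZE pages and returns the total
   number of bytes seen, or -1 if a read error occurred. */
long count_paged(FILE *f)
{
    unsigned char page[PAGE_SIZE];
    long total = 0;
    size_t n;

    assert(f != NULL);
    while ((n = fread(page, 1, sizeof page, f)) > 0) {
        /* Scan page[0] .. page[n-1] here; only n bytes are valid,
           so a short final page needs no special marker. */
        total += (long)n;
    }
    /* fread returned 0: either end of file or an error. */
    if (ferror(f))
        return -1;
    return total;
}
```

Open the file with "rb" when byte counts matter; in text mode the number of characters delivered can differ from the file size.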

Ben Bacarisse

Aug 8, 2009, 8:05:51 AM

Cross <X...@X.tv> writes:

> I am trying to parse rtf files. The specification says that an
> reader is a character reader. However, we know that character read
> is low. So I want to read blocks at a time into memory and scan
> them.

You don't give enough context (maybe you have a very odd C
implementation) but having your program read a character at a time
does not mean the IO will be done a character at a time. All sane C
implementations I have seen do buffered reads on the input for this
very reason.

It seems that one of your first programming design decisions has been
to complicate things (my evidence for that is your post here!) based
on a hypothetical problem (is your current program too slow? by how
much?). If you design it well, you can plug in a block-reading input
layer and measure the improvement.

In short, do you really know there is a problem?

> I read some time ago that using fread() a complete file can be
> read into memory without knowing its size. help me with this. Also,
> suppose I am reading say pages 1024 chars at a time into memory; the
> last page is quite linkely to have the EOF somewhere in between. How
> can I detect that.

Are you working in an odd environment? A 1024 char buffer is tiny by
today's standards (heck, it was small on a PDP-11). I would not be
surprised to find that fread using that size is slower than
char-by-char using the library's own buffering.

--
Ben.

phil-new...@ipal.net

Aug 8, 2009, 11:22:12 AM

On Sat, 08 Aug 2009 13:05:51 +0100 Ben Bacarisse <ben.u...@bsb.me.uk> wrote:

| You don't give enough context (maybe you have a very odd C
| implementation) but having your program read a character at a time
| does not mean the IO will be done a character at a time. All sane C
| implementations I have seen do buffered reads on the input for this
| very reason.
|
| It seems that one of your first programming design decision has been
| to complicate things (my evidence for that is your post here!) based
| on a hypothetical problem (is your current program too slow? by how
| much?). If you design it well, you can plug in a block-reading input
| layer and measure the improvement.
|
| In short, do you really know there is a problem?

[+1: Informative]

|> I read some time ago that using fread() a complete file can be
|> read into memory without knowing its size. help me with this. Also,
|> suppose I am reading say pages 1024 chars at a time into memory; the
|> last page is quite linkely to have the EOF somewhere in between. How
|> can I detect that.
|
| Are you working in an odd environment? A 1024 char buffer is tiny by
| today's standards (heck, it was small on a PDP-11). I would not be
| surprised to find that fread using that size is slower than
| char-by-char using the library's own buffering.

Maybe he could do a read(,,1) version, and a read(,,1048576) version,
and compare which one matches speed with a fgetc() version.

--
-----------------------------------------------------------------------------
| Phil Howard KA9WGN | http://linuxhomepage.com/ http://ham.org/ |
| (first name) at ipal.net | http://phil.ipal.org/ http://ka9wgn.ham.org/ |
-----------------------------------------------------------------------------

bartc

Aug 8, 2009, 12:01:01 PM

"Ben Bacarisse" <ben.u...@bsb.me.uk> wrote in message
news:0.12560184bfaeace8eb9f.2009...@bsb.me.uk...

> Cross <X...@X.tv> writes:
>
>> I am trying to parse rtf files. The specification says that an
>> reader is a character reader. However, we know that character read
>> is low. So I want to read blocks at a time into memory and scan
>> them.
>
> You don't give enough context (maybe you have a very odd C
> implementation) but having your program read a character at a time
> does not mean the IO will be done a character at a time. All sane C
> implementations I have seen do buffered reads on the input for this
> very reason.
>
> It seems that one of your first programming design decision has been
> to complicate things (my evidence for that is your post here!) based
> on a hypothetical problem (is your current program too slow? by how
> much?). If you design it well, you can plug in a block-reading input
> layer and measure the improvement.

It's not that difficult to test. Just comparing fgetc and fread, and only
using them to measure the size of the file, gave speedups of from 5 times to
dozens of times, using fread instead of fgetc, and buffer sizes from 16
bytes to 16KB and a file size of 150MB, tested with lcc-win32, gcc and dmc
under Windows.
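
A sketch of the kind of comparison described above; this is not the actual test program from the post, and the function names are made up for illustration. Both functions only measure the size of the file, one through fgetc() and one through fread() with a caller-chosen buffer size.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Count bytes one character at a time. */
unsigned long size_by_fgetc(FILE *f)
{
    unsigned long n = 0;

    assert(f != NULL);
    while (fgetc(f) != EOF)
        n++;
    return n;
}

/* Count bytes in blocks of buf_size bytes.
   Returns 0 if the buffer cannot be allocated. */
unsigned long size_by_fread(FILE *f, size_t buf_size)
{
    unsigned long n = 0;
    size_t got;
    char *buf;

    assert(f != NULL && buf_size > 0);
    buf = malloc(buf_size);
    if (buf == NULL)
        return 0;
    while ((got = fread(buf, 1, buf_size, f)) > 0)
        n += (unsigned long)got;
    free(buf);
    return n;
}
```

To time them, wrap each call in a pair of clock() readings from <time.h> and rewind() the stream between runs. The operating system's file cache will affect the numbers, so repeat each run several times.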

So I'd say forget character-at-a-time input even if the typical files are
small enough to show little difference. Using block-read now means the
program will scale later.

> Are you working in an odd environment? A 1024 char buffer is tiny by
> today's standards (heck, it was small on a PDP-11). I would not be
> surprised to find that fread using that size is slower than
> char-by-char using the library's own buffering.

Who knows what unknown overheads there are with fgetc in addition to some
internal fread-like function it must call internally? I knocked up my own
inefficient version of fgetc, with a 1KB buffer, and even that was six times
as fast as the standard fgetc, so I don't know what the latter can be getting
up to.
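
For reference, a hand-rolled character reader with a 1KB buffer could look roughly like the sketch below. The struct and function names are hypothetical and this is not the version mentioned in the post; it simply refills a fixed buffer with fread() whenever the buffer runs dry.

```c
#include <assert.h>
#include <stdio.h>

#define MYBUF_SIZE 1024

/* Hypothetical reader state: a 1 KB buffer refilled with fread(). */
struct my_reader {
    FILE *f;
    unsigned char buf[MYBUF_SIZE];
    size_t pos;   /* index of the next unread byte in buf */
    size_t len;   /* number of valid bytes in buf */
};

void my_reader_init(struct my_reader *r, FILE *f)
{
    assert(r != NULL && f != NULL);
    r->f = f;
    r->pos = 0;
    r->len = 0;
}

/* Returns the next byte as a value in 0..255, or EOF at end of file or
   on a read error (use ferror() on the stream to tell them apart). */
int my_getc(struct my_reader *r)
{
    if (r->pos == r->len) {
        r->len = fread(r->buf, 1, sizeof r->buf, r->f);
        r->pos = 0;
        if (r->len == 0)
            return EOF;
    }
    return r->buf[r->pos++];
}
```

The fast path is one comparison and one array access, which is about what a macro getc() does as well.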

--
Bart

Flash Gordon

Aug 8, 2009, 12:39:29 PM

bartc wrote:
>
> "Ben Bacarisse" <ben.u...@bsb.me.uk> wrote in message
> news:0.12560184bfaeace8eb9f.2009...@bsb.me.uk...
>> Cross <X...@X.tv> writes:
>>
>>> I am trying to parse rtf files. The specification says that an
>>> reader is a character reader. However, we know that character read
>>> is low. So I want to read blocks at a time into memory and scan
>>> them.
>>
>> You don't give enough context (maybe you have a very odd C
>> implementation) but having your program read a character at a time
>> does not mean the IO will be done a character at a time. All sane C
>> implementations I have seen do buffered reads on the input for this
>> very reason.
>>
>> It seems that one of your first programming design decision has been
>> to complicate things (my evidence for that is your post here!) based
>> on a hypothetical problem (is your current program too slow? by how
>> much?). If you design it well, you can plug in a block-reading input
>> layer and measure the improvement.
>
> It's not that difficult to test. Just comparing fgetc and fread, and
> only using them to measure the size of the file, gave speedups of from 5
> times to dozens of times, using fread instead of fgetc, and buffer sizes
> from 16 bytes to 16KB and a file size of 150MB, tested with lcc-win32,
> gcc and dmc under Windows.

Well, if you *will* use the character get function that is most likely
to have the biggest inefficiencies... at least try with getc instead of
fgetc.

> So I'd say forget character-at-a-time input even if the typical files
> are small enough to show little difference. Using block-read now means
> the program will scale later.

Try getc and also try setting the buffer size.

>> Are you working in an odd environment? A 1024 char buffer is tiny by
>> today's standards (heck, it was small on a PDP-11). I would not be
>> surprised to find that fread using that size is slower than
>> char-by-char using the library's own buffering.
>
> Who knows what unknown overheads there are with fgetc in addition to
> some internal fread-like function it must call internally? I knocked up
> my own inefficient version of fgetc, with a 1KB buffer, and even that
> was six times as fast as the standard fgetc, so I don't what the latter
> can be getting up to.

Well, there will be a function call or potentially some inefficiencies
if it is a macro due to needing to avoid evaluating the parameter more
than once.

If, on the other hand, you use getc and play with buffer sizes you
should be able to avoid most of the inefficiencies.
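
A small sketch of the getc-plus-buffer-size suggestion, assuming a hypothetical helper name and a caller-chosen buffer size. setvbuf() has to be called after the stream is opened and before any other operation on it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: counts bytes with getc() after asking stdio for a
   fully buffered stream with a buffer of buf_size bytes.
   Returns the byte count, or -1 on a read error. */
long count_with_getc(FILE *f, size_t buf_size)
{
    long n = 0;

    assert(f != NULL);
    /* NULL lets the library allocate the buffer itself. A nonzero return
       means the request was not honoured; the default buffer then stays
       in place, so the result is ignored here. */
    (void)setvbuf(f, NULL, _IOFBF, buf_size);

    while (getc(f) != EOF)
        n++;
    return ferror(f) ? -1 : n;
}
```

Trying a few sizes (for example 4096, 65536 and larger) and timing each is the only reliable way to see whether the buffer size matters on a given system.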
--
Flash Gordon

Gene

Aug 8, 2009, 1:08:23 PM

As others have said, modern stdio libraries are not going to be much
slower than something you write by hand. Probably the reverse. A
better reason to read an entire file is to avoid other copying costs.
For example, if you are using getc to accumulate a token, you are often
copying the token string into a separate buffer. If you read the
entire file, you can represent the token with a couple of integers.
Many programs go farther to reduce copying by memory mapping the input
file.
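
To illustrate representing a token "with a couple of integers": once the whole file is in one buffer, a token can be an offset and a length, with nothing copied. The names below are hypothetical and whitespace-separated tokens are just the simplest case to show.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* A token is a position and a length within the file buffer. */
struct token {
    size_t start;
    size_t len;
};

/* Hypothetical scanner step: finds the next run of non-space bytes in
   buf[*pos .. size-1]. Returns 1 and fills *tok if one is found,
   otherwise returns 0. *pos is advanced past whatever was consumed. */
int next_token(const unsigned char *buf, size_t size, size_t *pos,
               struct token *tok)
{
    size_t i;

    assert(buf != NULL && pos != NULL && tok != NULL);
    i = *pos;
    while (i < size && isspace(buf[i]))
        i++;
    if (i == size) {
        *pos = i;
        return 0;
    }
    tok->start = i;
    while (i < size && !isspace(buf[i]))
        i++;
    tok->len = i - tok->start;
    *pos = i;
    return 1;
}
```

The token text is then buf + tok.start for tok.len bytes; it is not NUL-terminated, so print it with "%.*s" or compare it with memcmp().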

Anyway, this is the kind of code you generally see to read a whole
file. Beware however. A gig of ram looks big until someone tries to
use your program on, say, a 500Mbyte manuscript.

Take care.

------------------

#include <stdio.h>
#include <stdlib.h>

// DEBUG: Replace these with error-catching versions.
#define safe_malloc malloc
#define safe_realloc realloc

// Ought to work with wide chars.
typedef unsigned char CHAR;

// Read a file into an array of CHAR. Return the array and store the number
// of chars read in *read_count_out.
// Use the given initial buffer size (at least 1) and double it as needed.
CHAR *file_as_string(FILE *f, size_t *read_count_out,
                     size_t init_buf_char_count)
{
    // Buffer index where next read should be placed.
    size_t fill_ptr = 0;

    // The read buffer.
    CHAR *buf = safe_malloc(init_buf_char_count * sizeof(CHAR));

    // Number of chars needed to fill the buffer.
    size_t read_char_count = init_buf_char_count;

    // Keep reading until we have the whole file.
    for (;;) {
        // Try to fill the buffer.
        size_t n_chars_read = fread(&buf[fill_ptr], sizeof(CHAR),
                                    read_char_count, f);

        // Stop if the read retrieved fewer chars than requested
        // (end of file or a read error; ferror(f) tells them apart).
        if (n_chars_read < read_char_count) {
            *read_count_out = fill_ptr + n_chars_read;
            // Shrink to fit, keeping one spare element so that realloc
            // is never asked for zero bytes on empty input.
            return safe_realloc(buf, (*read_count_out + 1) * sizeof(CHAR));
        }

        // Move fill pointer to end of read.
        fill_ptr += n_chars_read;

        // Next read should be current size.
        read_char_count = fill_ptr;

        // Expand buffer to accommodate the next read.
        buf = safe_realloc(buf, (fill_ptr + read_char_count) * sizeof(CHAR));
    }
}

int main(int argc, char *argv[])
{
    size_t count;
    CHAR *contents = file_as_string(stdin, &count, 1);

    (void)argc;
    (void)argv;
    // %.*s expects an int precision and a pointer to char.
    printf("%.*s<--eof--\n", (int)count, (char *)contents);
    free(contents);
    return 0;
}

bartc

Aug 8, 2009, 1:58:01 PM

"Flash Gordon" <sm...@spam.causeway.com> wrote in message
news:1bm0l6x...@news.flash-gordon.me.uk...

> bartc wrote:
>>
>> "Ben Bacarisse" <ben.u...@bsb.me.uk> wrote in message
>> news:0.12560184bfaeace8eb9f.2009...@bsb.me.uk...
>>> Cross <X...@X.tv> writes:
>>>
>>>> I am trying to parse rtf files. The specification says that an
>>>> reader is a character reader. However, we know that character read
>>>> is low. So I want to read blocks at a time into memory and scan
>>>> them.

>>> It seems that one of your first programming design decision has been
>>> to complicate things (my evidence for that is your post here!) based
>>> on a hypothetical problem (is your current program too slow? by how
>>> much?). If you design it well, you can plug in a block-reading input
>>> layer and measure the improvement.
>>
>> It's not that difficult to test. Just comparing fgetc and fread, and only
>> using them to measure the size of the file, gave speedups of from 5 times
>> to dozens of times, using fread instead of fgetc, and buffer sizes from
>> 16 bytes to 16KB and a file size of 150MB, tested with lcc-win32, gcc and
>> dmc under Windows.
>
> Well, if you *will* use the character get function that is most likely to
> have the biggest inefficiencies... at least try with getc instead of
> fgetc.
>
>> So I'd say forget character-at-a-time input even if the typical files are
>> small enough to show little difference. Using block-read now means the
>> program will scale later.
>
> Try getc and also try setting the buffer size.

Tried it again and oddly there was more disk activity now, the OS not
caching the file as well, and the fread timings were slower (but still
faster than fgetc).

I've tried getc (that's the macro one?) and yes it was about as fast as
using fread+1KB buffer, when using gcc. With the other two compilers it made
little difference.

But I don't know what the problem is with using fread(), which can if
necessary read the whole file at once. No need to mess about with the
system's xgetc() functions which may be a bottleneck, or maybe not; why take
the risk?

--
Bartc

Barry Schwarz

Aug 8, 2009, 2:17:05 PM

In a production program, you should always check all I/O operations,
regardless of function used or type of file.

--
Remove del for email

Barry Schwarz

Aug 8, 2009, 2:42:11 PM

Why would getc be any better than fgetc?

Gene

Aug 8, 2009, 2:44:22 PM

On Aug 8, 1:58 pm, "bartc" <ba...@freeuk.com> wrote:
> "Flash Gordon" <s...@spam.causeway.com> wrote in message
>
> news:1bm0l6x...@news.flash-gordon.me.uk...
>
> > bartc wrote:
>
> >> "Ben Bacarisse" <ben.use...@bsb.me.uk> wrote in message

A couple of comments:

Are you running Windows? Fread and getc mechanisms on Windows text
files tend to be slowed down by the need to expand \n to \r\n. What
happens if you open in binary mode or use a low level _read()?

With some compilers fgetc() is implemented as an inlinable function,
where it's necessary to crank up optimization settings to get the
benefit.

Cheers

Flash Gordon

Aug 8, 2009, 2:48:00 PM

So use the best compiler available ;-)

Obviously either the getc implementation is better on the gcc based
implementation you used or the optimiser is better with the options you
specified. Oh, and that is another thing, make sure you tell the
compiler to optimise.

> But I don't know what the problem is with using fread(), which can if
> necessary read the whole file at once. No need to mess about with the
> system's xgetc() functions which may be a bottleneck, or maybe not; why
> take the risk?

I would generally do whichever was easier to implement. I'm certainly
not going to bother going to the effort of implementing getc over the
top of fread unless I have a very good reason at the time.
--
Flash Gordon

Keith Thompson

Aug 8, 2009, 3:27:05 PM

Barry Schwarz <schw...@dqel.com> writes:
> On Sat, 08 Aug 2009 17:39:29 +0100, Flash Gordon
> <sm...@spam.causeway.com> wrote:

[...]

>>Well, if you *will* use the character get function that is most likely
>>to have the biggest inefficiencies... at least try with getc instead of
>>fgetc.
>
> Why would getc be any better than fgetc?

The permission to implement getc as a macro that may evaluate its
stream argument more than once (in addition to the required
implementation as an ordinary function) may allow it to be more
efficient than fgetc.

Or it may not, depending on the implementation, but it's safe to
assume that getc is no slower than fgetc.

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Flash Gordon

Aug 8, 2009, 3:13:50 PM

Barry Schwarz wrote:
> On Sat, 08 Aug 2009 17:39:29 +0100, Flash Gordon
> <sm...@spam.causeway.com> wrote:

<snip>

>> Well, if you *will* use the character get function that is most likely
>> to have the biggest inefficiencies... at least try with getc instead of
>> fgetc.
>
> Why would getc be any better than fgetc?

Because the implementation is allowed more freedom in its macro
implementation of getc than of fgetc. getc is allowed to evaluate its
argument multiple times.
--
Flash Gordon

Alan Curry

Aug 8, 2009, 7:29:37 PM

In article <1jhfm.65357$OO7....@text.news.virginmedia.com>,

bartc <ba...@freeuk.com> wrote:
>
>Who knows what unknown overheads there are with fgetc in addition to some
>internal fread-like function it must call internally? I knocked up my own
>inefficient version of fgetc, with a 1KB buffer, and even that was six times
>as fast as the standard fgetc, so I don't what the latter can be getting up
>to.

Does your platform support pthreads? In glibc, all the stdio functions have
pthread locking overhead, wasting an unbelievable amount of time in standard
C programs which don't use pthreads. You can make it not do that by calling
getc_unlocked instead of getc, but then your program isn't standard C
anymore.

The pthread people have ruined the I/O performance of standard C programs.

How they did this without activating the pitchfork mob I'll never understand.

--
Alan Curry

Ben Bacarisse

Aug 8, 2009, 8:00:48 PM

"bartc" <ba...@freeuk.com> writes:

> "Ben Bacarisse" <ben.u...@bsb.me.uk> wrote in message
> news:0.12560184bfaeace8eb9f.2009...@bsb.me.uk...
>> Cross <X...@X.tv> writes:
>>
>>> I am trying to parse rtf files. The specification says that an
>>> reader is a character reader. However, we know that character read
>>> is low. So I want to read blocks at a time into memory and scan
>>> them.
>>
>> You don't give enough context (maybe you have a very odd C
>> implementation) but having your program read a character at a time
>> does not mean the IO will be done a character at a time. All sane C
>> implementations I have seen do buffered reads on the input for this
>> very reason.
>>
>> It seems that one of your first programming design decision has been
>> to complicate things (my evidence for that is your post here!) based
>> on a hypothetical problem (is your current program too slow? by how
>> much?). If you design it well, you can plug in a block-reading input
>> layer and measure the improvement.
>
> It's not that difficult to test. Just comparing fgetc and fread, and
> only using them to measure the size of the file, gave speedups of from
> 5 times to dozens of times, using fread instead of fgetc, and buffer
> sizes from 16 bytes to 16KB and a file size of 150MB, tested with
> lcc-win32, gcc and dmc under Windows.

Easy to test which is faster on its own but that is not what I was
getting at. My question to the OP was "is your program too slow?" and
that is not easy for any of us to test.

My intent was to caution against pre-deciding where the effort should
go in making the program faster. When it is all finished, maybe the
parser itself could do with more work: that might pay bigger dividends
than working on the IO? Who knows?

I'd be more blasé about such choices being made early if everyone who
thinks like this wrote clean interfaces to their super-fast code, but
I've seen too many hair-ball programs with buffer sizes "just right"
for a VAX running BSD 4.2 -- I worry (just a little) every time
someone knows how slow getc is (or malloc or double or array indexing
or function calls or...)

Anyway, even if it turns out to be the only thing slowing down an
otherwise super-fast program, I'd *still* do it only *after* the code
is written and tested. For one thing you'll have a solid set of test
results to do regression testing against -- buffer filling is just that
little bit easier to get wrong than char-at-a-time IO.

<snip>
--
Ben.

Kaz Kylheku

Aug 8, 2009, 8:11:34 PM

On 2009-08-08, Alan Curry <pac...@kosh.dhis.org> wrote:
> In article <1jhfm.65357$OO7....@text.news.virginmedia.com>,
> bartc <ba...@freeuk.com> wrote:
>>
>>Who knows what unknown overheads there are with fgetc in addition to some
>>internal fread-like function it must call internally? I knocked up my own
>>inefficient version of fgetc, with a 1KB buffer, and even that was six times
>>as fast as the standard fgetc, so I don't what the latter can be getting up
>>to.
>
> Does your platform support pthreads? In glibc, all the stdio functions have
> pthread locking overhead, wasting an unbelievable amount of time in standard
> C programs which don't use pthreads.

Really? How do they miraculously call the libpthread functions,
given that you didn't supply the -pthread parameter to gcc, which therefore
didn't add -lpthread to the link?

pete

Aug 8, 2009, 9:03:46 PM

And, perhaps consequently,
getc is more likely to be implemented as a macro
than fgetc is.

--
pete

Alan Curry

Aug 9, 2009, 3:42:40 AM

In article <200908190...@gmail.com>,

You don't need -pthread or -lpthread to get the threaded version of stdio.
It's in libc. flockfile() is called for every character in your getc() loop.
(It's inlined so you won't see it as an extra function call in a debugger.)
You don't have to take my word for it. It's easy to measure:

#include <stdio.h>
int main(void)
{
    int i=0;
    while(getc/*_unlocked*/(stdin)!=EOF)
        ++i;
    printf("%d\n", i);
    return 0;
}

With getc, avg run time, 5 trials: 1.605 sec (wall time)
With getc_unlocked, avg run time, 5 trials: 0.128 sec (wall time)

input file for the tests was 8549258 bytes. 1.605 vs 0.128 - that's 1150%
overhead imposed by pthreads in a program that doesn't use them.

Adding -pthread -lpthread makes no difference. flockfile() is a pthread
function, but it's not in the libpthread library.
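
If a program does share streams between threads, POSIX offers a middle way: take the stream lock once with flockfile(), use getc_unlocked() inside the loop, and release the lock afterwards. The sketch below assumes a POSIX system; none of these three functions is part of standard C, and the helper name is made up.

```c
#define _POSIX_C_SOURCE 200112L  /* for flockfile, funlockfile, getc_unlocked */
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: counts the bytes remaining on f, locking the
   stream once for the whole loop instead of once per character. */
long count_unlocked(FILE *f)
{
    long n = 0;

    assert(f != NULL);
    flockfile(f);                    /* one lock for the whole loop */
    while (getc_unlocked(f) != EOF)
        n++;
    funlockfile(f);
    return n;
}
```

In a single-threaded program the flockfile()/funlockfile() pair can be dropped, which is the getc_unlocked variant measured above.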

--
Alan Curry

James Kuyper

Aug 9, 2009, 7:34:52 AM

Like most absolute statements, that's not quite right. Most of my
programs run in a production environment, but it's an environment in
which program runs only take a few minutes, restarting a program is
relatively trivial, and I/O failures are rare. As a result, I've found
it convenient for my programs to not bother checking for output failures
until just before closing the output file, using ferror(). Whether my
code detects the failure immediately or just before closing, there's
nothing it can do about the failure except report it to the production
system, which reports it to the operators; if they can't figure out what
went wrong, they report it to me.

In general, if there is an output error, it has nothing to do with the
particular place in the program where it first happened; it's usually
due to external factors like a hardware failure or running out of disk
space. Deferring the failure checking simplifies the code quite a bit,
making it easier to understand.

This depends upon the "sticky" nature of stdio error flags. A lot of the
output from my programs is through a third party library that hides the
actual file handle from me, preventing direct use of ferror(), and
provides no sticky error indication of its own. When using that
library, my code has to check each output operation, just as you suggest.

This also does not apply to input operations, since when they fail, the
data that they were supposed to read in may be unusable; when using
fread(), it may even contain trap representations. Also, if an input
operation fails due to end of file, it may be due to a discrepancy
between the file layout expected and the actual layout of the input
file, so knowing exactly where it failed is very important. Therefore,
for input operations my code always tests immediately, and reacts
appropriately.
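
A minimal sketch of the deferred output check described above, under the assumption that nothing useful can be done about a failure at the point where it first occurs. The function name and record format are invented for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes count records without checking each call,
   then checks the sticky error indicator once at the end.
   Returns 0 on success, -1 if any write failed. */
int write_lines(FILE *out, int count)
{
    int i;
    int failed;

    assert(out != NULL);
    for (i = 0; i < count; i++)
        fprintf(out, "record %d\n", i);   /* result deliberately unchecked */

    /* Flush first so buffered data actually reaches the file; a failed
       flush also sets the error indicator that ferror() reports. */
    failed = (fflush(out) == EOF) || ferror(out);
    return failed ? -1 : 0;
}
```

The return value of fclose() deserves the same check, since the final write can still fail when the stream is closed.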

Cross

Aug 9, 2009, 7:37:08 AM

Ben Bacarisse wrote:
> Cross <X...@X.tv> writes:
>
>> I am trying to parse rtf files. The specification says that an
>> reader is a character reader. However, we know that character read
>> is low. So I want to read blocks at a time into memory and scan
>> them.
>
> You don't give enough context (maybe you have a very odd C
> implementation) but having your program read a character at a time
> does not mean the IO will be done a character at a time. All sane C
> implementations I have seen do buffered reads on the input for this
> very reason.
>
> It seems that one of your first programming design decision has been
> to complicate things (my evidence for that is your post here!) based
> on a hypothetical problem (is your current program too slow? by how
> much?). If you design it well, you can plug in a block-reading input
> layer and measure the improvement.
>
> In short, do you really know there is a problem?

Yes I know there is one.

>
>> I read some time ago that using fread() a complete file can be
>> read into memory without knowing its size. help me with this. Also,
>> suppose I am reading say pages 1024 chars at a time into memory; the
>> last page is quite linkely to have the EOF somewhere in between. How
>> can I detect that.
>
> Are you working in an odd environment? A 1024 char buffer is tiny by
> today's standards (heck, it was small on a PDP-11). I would not be
> surprised to find that fread using that size is slower than
> char-by-char using the library's own buffering.
>

I will have to look into that then. Thank you.

James Kuyper

Aug 9, 2009, 7:41:33 AM

Gene wrote:
> On Aug 8, 1:58 pm, "bartc" <ba...@freeuk.com> wrote:

...

>> I've tried getc (that's the macro one?) and yes it was about as fast as
>> using fread+1KB buffer, when using gcc. With the other two compilers it made
>> little difference.
>>
>> But I don't know what the problem is with using fread(), which can if
>> necessary read the whole file at once. No need to mess about with the
>> system's xgetc() functions which may be a bottleneck, or maybe not; why take
>> the risk?
>>
>
> A couple of comments:
>
> Are you running Windows? Fread and getc mechanisms on Windows text
> files tend to be slowed down by the need to expand \n to \r\n.

I think you mean "compress \r\n to \n"?
It doesn't matter; the fact that such compression needs to be done is
determined by the fact that the file was opened in text mode; it doesn't
matter whether you use getc(), fgetc() or fread() to perform the reads.

Cross

Aug 9, 2009, 7:46:44 AM

Well Ben, my parser prototype is ready. I am gradually adding functionality.
The code is at code.google.com/p/ertf. It works well for small rtf files. Now, I
have some work to do on book-keeping and I/O. Reviews of my code are welcome.

Cross

Aug 9, 2009, 7:49:03 AM

Thank you Gene. I had got the idea after bartc's initial reply. I shall soon put
new code at the site I mentioned in an earlier reply.

Ben Bacarisse

Aug 9, 2009, 10:08:44 AM

Cross <X...@X.tv> writes:

> Ben Bacarisse wrote:
>> Cross <X...@X.tv> writes:
>>
>>> I am trying to parse rtf files. The specification says that an
>>> reader is a character reader. However, we know that character read
>>> is low. So I want to read blocks at a time into memory and scan
>>> them.

<snip>

>> In short, do you really know there is a problem?
> Yes I know there is one.

Combined with your original post, this suggests you know that the
problem is in getc. In a way this is good, in that there is then a
relatively simple route to getting better performance. Before you do
that, you might like to see if you are being hit particularly badly by
the locked getc issues described elsewhere.

<snip>
--
Ben.

Gene

Aug 9, 2009, 8:47:28 PM

> matter whether you use getc(), fgetc() or fread() to perform the reads.

Of course you're right. Thanks. I wasn't very clear. My experience
was that in a scanner I implemented (a long time ago) under Windows,
it turned out to be quicker to read binary and handle the \r myself
rather than read text mode and have the library do it. This was the
case with fread() as well as fgetc().
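
Handling the \r yourself after a binary-mode read can be as small as the sketch below. The function name is hypothetical, and the sketch only drops a '\r' that is directly followed by '\n' within the same block.

```c
#include <assert.h>
#include <stddef.h>

/* Removes, in place, each '\r' that is immediately followed by '\n'.
   Returns the new length. A '\r' that is the last byte of the block is
   kept, because its '\n' may be the first byte of the next block. */
size_t strip_cr(unsigned char *buf, size_t len)
{
    size_t in, out = 0;

    assert(buf != NULL || len == 0);
    for (in = 0; in < len; in++) {
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
            continue;               /* skip the '\r' of a "\r\n" pair */
        buf[out++] = buf[in];
    }
    return out;
}
```

A block reader that uses this has to carry a trailing '\r' over to the next block before deciding what to do with it.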

Cross

Aug 10, 2009, 5:33:12 AM

Gene wrote:
> My experience
> was that in a scanner I implemented (a long time ago) under Windows,
> it turned out to be quicker to read binary and handle the \r myself
> rather than read text mode and have the library do it. This was the
> case with fread() as well as fgetc().
>

That reminds me of a discussion I was having recently with a friend. I proposed
reading binary using fread() and using sscanf() on the read char buffer. He told
me to avoid it because sscanf() has loopholes and is unsafe to use.

Cross

Aug 10, 2009, 5:37:28 AM

Yeah I noted that. I shall benchmark with both and see what suits me.

James Kuyper

Aug 10, 2009, 5:52:46 AM

There are problems with using sscanf(), but they can be dealt with or
avoided. Any alternative to sscanf() is going to be pretty complicated,
essentially duplicating some of the functionality of sscanf() while
avoiding the problems. I've never had that much free time on my hands,
but it might be the way to go for someone who does.

Ben Bacarisse

Aug 10, 2009, 10:34:50 AM

Cross <X...@X.tv> writes:

> Ben Bacarisse wrote:
<snip>

>> Anyway, even if it turns out be the only thing lowing down an
>> otherwise super-fast program, I'd *still* do it only *after* the code
>> is written and tested. For one thing you'll have a solid set of test
>> result to do regression testing against -- buffer filling is just that
>> little bit easier to get wrong that char-at-a-time IO.
>>
>> <snip>
> Well Ben, my parser prototype is ready. I am gradually adding
> functionalities. The code is at code.google.com/p/ertf. It works
> well for small rtf files. Now, I have some work to be done in
> book-keeping and I/O. Reviews on my code are welcome.

A full review is beyond me at the moment and it is not easy when the
code can't be quoted but I will say one thing: you are not using getc
alone to do the input so I don't know why you are so sure that that is
the problem. You also use fscanf with a %[...] format. This might be
a performance hit too -- I don't know.

You seem to use this format with no buffer size (no maximum field width)
and this is like using gets: nothing stops the input overrunning the
buffer. I would change that as soon as it is practical to do so. It
may be that you are safe because the parts I did not look at restrict
the files passed to fscanf, but that seems unlikely.
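
To show what a bounded %[ conversion looks like, here is a sketch that is not taken from the ertf code. The helper name, the 15-character limit and the letter set are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define WORD_MAX 15

/* Hypothetical helper: reads a run of lowercase letters from f into word,
   which must have room for WORD_MAX + 1 bytes. Returns 1 if at least one
   letter was stored, otherwise 0. */
int read_word(FILE *f, char word[WORD_MAX + 1])
{
    assert(f != NULL && word != NULL);
    /* The field width 15 is written literally and must match WORD_MAX.
       Without a width, %[ stores every matching character, however small
       the destination buffer is. */
    return fscanf(f, "%15[abcdefghijklmnopqrstuvwxyz]", word) == 1;
}
```

A longer run of letters is not lost: the conversion stops after 15 characters and the rest stays on the stream for the next call.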

--
Ben.

Flash Gordon

Aug 10, 2009, 2:10:40 PM

You can generally write a custom-built scanner for a specific task
that is a *lot* simpler than *scanf. In fact, in my experience, it is
often easier to write a custom scanner than to use *scanf.
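
As an example of how small such a scanner can be, the sketch below parses one RTF-style control word (a backslash, letters, an optional signed decimal parameter, and an optional single space as delimiter) from a memory buffer. It is a simplified reading of the format with hypothetical names; control symbols, hex escapes and the rest of RTF are not handled.

```c
#include <assert.h>
#include <stddef.h>

#define NAME_MAX_LEN 32

struct control_word {
    char name[NAME_MAX_LEN + 1];
    int has_param;
    long param;
};

static int is_letter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static int is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Parses one control word from s[0 .. len-1]. Returns the number of bytes
   consumed, or 0 if s does not start with a control word, the name is too
   long, or the parameter has more than 9 digits. */
size_t scan_control_word(const char *s, size_t len, struct control_word *cw)
{
    size_t i = 1, n = 0;
    int negative = 0, digits = 0;

    assert(s != NULL && cw != NULL);
    if (len == 0 || s[0] != '\\')
        return 0;
    while (i < len && is_letter(s[i])) {
        if (n == NAME_MAX_LEN)
            return 0;
        cw->name[n++] = s[i++];
    }
    if (n == 0)
        return 0;
    cw->name[n] = '\0';
    cw->has_param = 0;
    cw->param = 0;

    /* A '-' belongs to the parameter only if a digit follows it. */
    if (i + 1 < len && s[i] == '-' && is_digit(s[i + 1])) {
        negative = 1;
        i++;
    }
    while (i < len && is_digit(s[i])) {
        if (++digits > 9)
            return 0;               /* avoids overflowing a 32-bit long */
        cw->param = cw->param * 10 + (s[i++] - '0');
        cw->has_param = 1;
    }
    if (negative)
        cw->param = -cw->param;

    /* One space after a control word is a delimiter and is consumed. */
    if (i < len && s[i] == ' ')
        i++;
    return i;
}
```

Because it works on a buffer and a length, it fits the block-read approach directly and never reads past the end of the data.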
--
Flash Gordon

Cross

Aug 11, 2009, 12:08:39 PM

Interesting. Well, how is *scanf()'s presence in standard libraries justified then?

Flash Gordon

Aug 11, 2009, 4:00:58 PM

Because some people disagree with me ;-)

It is a very useful tool, and very powerful, and like a lot of very
useful and powerful tools it can also be hard to use correctly. However
it can be used correctly. Also it was, I believe, common well before C
was standardised!
--
Flash Gordon

Cross

Aug 12, 2009, 3:34:54 PM

Ben Bacarisse wrote:
> Cross <X...@X.tv> writes:
>
>> Ben Bacarisse wrote:
> <snip>
>>> Anyway, even if it turns out to be the only thing slowing down an
>>> otherwise super-fast program, I'd *still* do it only *after* the code
>>> is written and tested. For one thing you'll have a solid set of test
>>> results to do regression testing against -- buffer filling is just that
>>> little bit easier to get wrong than char-at-a-time IO.
>>>
>>> <snip>
>> Well Ben, my parser prototype is ready. I am gradually adding
>> functionalities. The code is at code.google.com/p/ertf. It works
>> well for small rtf files. Now, I have some work to be done in
>> book-keeping and I/O. Reviews on my code are welcome.
>
> A full review is beyond me at the moment, and it is not easy when the
> code can't be quoted, but I will say one thing: you are not using getc
> alone to do the input, so I don't know why you are so sure that that is
> the problem. You also use fscanf with a %[...] format. This might be
> a performance hit too -- I don't know.

I am investigating the speed bottleneck due to use of getc and fgetc.

>
> You seem to use this format with no buffer size, and this is like using
> gets. I would change that as soon as it is practical to do so. It
> may be that you are safe because the parts I did not look at restrict
> the files passed to fscanf, but that seems unlikely.
>

Apart from *getc functions, I am using scanf functions for input. I intended
to replace all i/o by block i/o and replace fscanf with sscanf. However, I am
unsure about it now, as stated in an earlier post, due to security holes in sscanf.

Cross

Aug 12, 2009, 3:35:49 PM

I intended to use setvbuf but was advised to let the library do the buffering
instead.

Ben Bacarisse

Aug 12, 2009, 5:44:36 PM

Cross <X...@X.tv> writes:
<snip>

> Apart from *getc functions, I am using scanf functions for
> input. I intended to replace all i/o by block i/o and replace
> fscanf with sscanf. However, I am unsure about it now, as stated in
> an earlier post, due to security holes in sscanf.

I am not sure what you mean. Is there a specific concern you have
about sscanf? Unless I've misunderstood your code, you have a
security hole now with fscanf. sscanf won't make it any worse.

--
Ben.

Cross

Aug 12, 2009, 10:47:36 PM

No, I was advised against sscanf(). I don't know what security holes it has, I
just took the suggestion.

About the problem with fscanf(), I shall look into it.
Thank you

Keith Thompson

Aug 12, 2009, 11:05:37 PM

All the *scanf() functions have a problem with numeric overflow.
If you try to read a number that can't be represented in the target
object, the behavior is undefined. (Most implementations probably
just store some value in the object, but you can't be sure of that
if you want 100% portability.)
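
One common way around the overflow problem is to convert numbers with strtol, whose out-of-range behaviour is defined (it sets errno to ERANGE and returns LONG_MAX or LONG_MIN). A minimal sketch, with parse_int as a hypothetical helper name:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Parses a decimal int from the start of s. On success stores the
 * value in *out and returns 1. Returns 0 if s holds no digits or the
 * value does not fit in an int. If end is non-NULL it receives the
 * position just past the text that strtol examined. */
int parse_int(const char *s, int *out, char **end)
{
    char *stop;
    long v;

    assert(s != NULL && out != NULL);
    errno = 0;
    v = strtol(s, &stop, 10);
    if (end != NULL)
        *end = stop;
    if (stop == s)
        return 0;                     /* no conversion was performed */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;                     /* out of range for an int */
    *out = (int)v;
    return 1;
}
```

Unlike a %d conversion, every failure case here has a defined result that the caller can test.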

Ben Bacarisse

Aug 13, 2009, 7:58:41 AM

Keith Thompson <ks...@mib.org> writes:

> Cross <X...@X.tv> writes:
>> Ben Bacarisse wrote:
>>> Cross <X...@X.tv> writes:
>>> <snip>
>>>> Apart from *getc functions, I am using scanf functions for
>>>> input. I intended to replace all i/o by block i/o and replace
>>>> fscanf with sscanf. However, I am unsure about it now, as stated in
>>>> an earlier post, due to security holes in sscanf.
>>>
>>> I am not sure what you mean. Is there a specific concern you have
>>> about sscanf? Unless I've misunderstood your code, you have a
>>> security hole now with fscanf. sscanf won't make it any worse.
>>>
>> No, I was advised against sscanf(). I don't know what security holes
>> it has, I just took the suggestion.
>>
>> About the problem with fscanf(), I shall look into it.
>
> All the *scanf() functions have a problem with numeric overflow.
> If you try to read a number that can't be represented in the target
> object, the behavior is undefined. (Most implementations probably
> just store some value in the object, but you can't be sure of that
> if you want 100% portability.)

Just to be clear: I was more concerned by the unconstrained string
scans, though there are numerical scans as well.

--
Ben.

Chris M. Thomasson

Aug 13, 2009, 8:48:48 AM

"Cross" <X...@X.tv> wrote in message news:h5jh29$uuo$1...@adenine.netfront.net...
> Hello
>

> I am trying to parse rtf files. The specification says that an reader is a
> character reader. However, we know that character read is low. So I want
> to read blocks at a time into memory and scan them. I read some time ago
> that using fread() a complete file can be read into memory without knowing
> its size. help me with this. Also, suppose I am reading say pages 1024
> chars at a time into memory; the last page is quite linkely to have the
> EOF somewhere in between. How can I detect that.

Are you looking for something like this:
________________________________________________________________
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
read_entire_file(
FILE* file,
char* buffer,
size_t size,
void* state,
int (*fp_on_read) (void*, char*, size_t)
) {
size_t read;
while ((read = fread(buffer, 1, size, file))) {
if (! fp_on_read(state, buffer, read)) break;
if (read < size) break;
}
return ferror(file) ? 0 : 1;
}

int
on_read(
void* state,
char* buffer,
size_t size
) {
unsigned long int* pcount = state;
*pcount += size;
printf("\ron_read(%p, %p, %lu) has read (%lu) bytes",
state, (void*)buffer, (unsigned long int)size, *pcount);
buffer[size] = '\0';
if (strchr(buffer, EOF)) {
puts("\ndetected an `EOF' in the buffer");
return 0;
}
return 1;
}

int main(void) {
int status = EXIT_FAILURE;
FILE* file = fopen("data.txt", "rb");
if (file) {
char buffer[16384];
unsigned long int bytes = 0;
status = EXIT_SUCCESS;
if (! read_entire_file(
file,
buffer,
sizeof(buffer) - 1,
&bytes,
on_read)) {
status = EXIT_FAILURE;
}
if (fclose(file)) {
status = EXIT_FAILURE;
}
putchar('\n');
}
return status;
}
________________________________________________________________

This crude little program will detect if a file happens to have an `EOF'
character located within it.

Ben Bacarisse

Aug 13, 2009, 9:53:57 AM

"Chris M. Thomasson" <n...@spam.invalid> writes:

> "Cross" <X...@X.tv> wrote in message news:h5jh29$uuo$1...@adenine.netfront.net...
>> Hello
>>
>> I am trying to parse rtf files. The specification says that an
>> reader is a character reader. However, we know that character read
>> is low. So I want to read blocks at a time into memory and scan
>> them. I read some time ago that using fread() a complete file can
>> be read into memory without knowing its size. help me with
>> this. Also, suppose I am reading say pages 1024 chars at a time
>> into memory; the last page is quite linkely to have the EOF
>> somewhere in between. How can I detect that.
>
> Are you looking for something like this:

I don't think so. The OP appears to be talking about the actual end
of the file, which one detects by looking at the size returned by fread
(so in that sense your code is an example), but your test looks for
something else altogether.

This is a strange non-portable test. It asks if the buffer contains a
byte whose value happens to be the same as the implementation's
definition of the EOF macro (often -1 but all that we can be sure of
is that it is a constant expression of type int with a negative
value).

On some systems there may be no possible way this can match any
character. On others it may match some character that has very little to
do with any end of file marker (either the real end of the file or
the ill-conceived character that supposedly marks it on some systems).

> puts("\ndetected an `EOF' in the buffer");
> return 0;
> }
> return 1;
> }
>
>
> int main(void) {
> int status = EXIT_FAILURE;
> FILE* file = fopen("data.txt", "rb");
> if (file) {
> char buffer[16384];
> unsigned long int bytes = 0;
> status = EXIT_SUCCESS;
> if (! read_entire_file(
> file,
> buffer,
> sizeof(buffer) - 1,
> &bytes,
> on_read)) {
> status = EXIT_FAILURE;
> }
> if (fclose(file)) {
> status = EXIT_FAILURE;
> }
> putchar('\n');
> }
> return status;
> }
> ________________________________________________________________
>
> This crude little program will detect if a file happens to have an
> `EOF' character located within it.

Sort of, but not in the way most people would interpret that phrase.

BTW, I find your style very hard to read and almost did not bother to
look at this code at all. I have no desire to get you to change, but
you may think that you have arrived at a style that is a model of
clarity, so this remark is intended simply as a data point for your
consideration.

--
Ben.

Chris M. Thomasson

Aug 21, 2009, 11:17:49 AM

"Ben Bacarisse" <ben.u...@bsb.me.uk> wrote in message

news:0.aca2127d6f448c22e96e.2009...@bsb.me.uk...

> "Chris M. Thomasson" <n...@spam.invalid> writes:
>
>> "Cross" <X...@X.tv> wrote in message
>> news:h5jh29$uuo$1...@adenine.netfront.net...
>>> Hello
>>>
>>> I am trying to parse rtf files. The specification says that an
>>> reader is a character reader. However, we know that character read
>>> is low. So I want to read blocks at a time into memory and scan
>>> them. I read some time ago that using fread() a complete file can
>>> be read into memory without knowing its size. help me with
>>> this. Also, suppose I am reading say pages 1024 chars at a time
>>> into memory; the last page is quite linkely to have the EOF
>>> somewhere in between. How can I detect that.
>>
>> Are you looking for something like this:
>
> I don't think so. The OP appears to be talking about the actual end
> of the file, which one detects by looking at the size returned by fread
> (so in that sense your code is an example), but your test looks for
> something else altogether.

Ahhh. I was confused by the following sentence in the OP's initial post:

"Also, suppose I am reading say pages 1024 chars at a time into
memory; the last page is quite linkely to have the EOF somewhere in
between."

I thought that an EOF character for his particular platform was located
somewhere in between the last 1024 characters of the data and he wanted to
detect that.

[...]

>> This crude little program will detect if a file happens to have an
>> `EOF' character located within it.
>
> Sort of, but not in the way most people would interpret that phrase.

Agreed.

> BTW, I find your style very hard to read and almost did not bother to
> look at this code at all. I have no desire to get you to change, but
> you may think that you have arrived at a style that is a model of
> clarity, so this remark is intended simply as a data point for your
> consideration.

Thank you. Humm... Perhaps I need to add some more spaces. Is this any
better?
_____________________________________________________________________

size_t read;

if (read < size) break;
}

*pcount += size;

buffer[size] = '\0';

if (strchr(buffer, EOF)) {

puts("\ndetected an `EOF' in the buffer");

return 0;
}

return 1;
}

int main(void) {

int status = EXIT_FAILURE;

FILE* file = fopen("data.txt", "rb");

if (file) {

char buffer[16384];

unsigned long int bytes = 0;

status = EXIT_SUCCESS;

if (! read_entire_file(
file,
buffer,
sizeof(buffer) - 1,
&bytes,
on_read)) {

status = EXIT_FAILURE;
}

if (fclose(file)) {

status = EXIT_FAILURE;
}

putchar('\n');
}

return status;
}
_____________________________________________________________________

Or perhaps something like:
_____________________________________________________________________

size_t read;

if (read < size) break;
}

*pcount += size;

buffer[size] = '\0';

if (strchr(buffer, EOF))
{

puts("\ndetected an `EOF' in the buffer");

return 0;
}

return 1;
}

int main(void)
{

int status = EXIT_FAILURE;

FILE* file = fopen("data.txt", "rb");

if (file)
{

char buffer[16384];

unsigned long int bytes = 0;

status = EXIT_SUCCESS;

if (! read_entire_file(
file,
buffer,
sizeof(buffer) - 1,
&bytes,
on_read))
{

status = EXIT_FAILURE;
}

if (fclose(file))
{

status = EXIT_FAILURE;
}

putchar('\n');
}

return status;
}
_____________________________________________________________________

What don't you like about the original style? Do you have any suggestions?

again, thanks!

:^)

Ben Bacarisse

Aug 21, 2009, 7:25:44 PM

"Chris M. Thomasson" <n...@spam.invalid> writes:

<snip>

> Thank you. Humm... Perhaps I need to add some more spaces. Is this any
> better?

<snip>

> What don't you like about the original style? Do you have any suggestions?

It is interesting that you added lots of vertical space to "open
things up" when one of the things I was having trouble with was the
very short indent. I would not dream of suggesting a number, but zero
is known to be too little and 1 is perilously close to zero :-)

The other thing that was not to my taste is:

int
function(
void *arg,
int another,
double more
) {

It took me a while to pin down why I prefer:

int function(void *arg, int another, double more)
{

and

int
function(void *arg, int another, double more)
{

and

int
function(void *arg,
         int another,
         double more)
{

and it relates to your indent. I like to run my eye down the margin to
find patterns that denote the functions/macros/declarations and so on,
so I can see the overall shape of what I am looking at in one go
("three macros, a typedef and two functions") and there was just too
much noise down the left side of your examples.

I suspect that there was also an unwelcome echo of the bad old days in
which un-type-checked functions looked similar:

int
function(arg, another, more)
void *arg;
int another;
double more;
{

Anyway, it really is no big deal, but I ventured to reply because you
ventured to ask.

--
Ben.

Keith Thompson

Aug 21, 2009, 8:52:01 PM

"Chris M. Thomasson" <n...@spam.invalid> writes:

[...]

> Thank you. Humm... Perhaps I need to add some more spaces. Is this any
> better?
> _____________________________________________________________________

[88 lines deleted]

The above has way too much vertical whitespace and not enough
horizontal whitespace for my personal taste. You have a blank line
after each opening '{', and you're using a mix of 1-column and
2-column indentation; personally I like 4.

See below for how I might write it. Most of the changes relative to
your code are just tweaking whitespace; in addition, I've added some
braces (I like to use braces on all conditional statements) and made
some comparisons more explicit, such as (blah != NULL) or (blah != 0)
rather than just (blah).

A few more comments on things I haven't changed:

Searching for EOF in buffer is almost certainly the wrong thing to do.

Your input file is named "data.txt", but you open it in binary mode.

16384 is a magic number; you should declare it as a constant.

I wouldn't use "read" as a variable name. I might use "bytes_read".
("read" is the name of a POSIX function; you're free to use the same
name yourself, but it might cause problems later.)

In main(), you initialize status to EXIT_FAILURE, then, in some
circumstances, set it to EXIT_SUCCESS and back to EXIT_FAILURE again.
IMHO it would be cleaner to initialize it to EXIT_SUCCESS and set it
to EXIT_FAILURE when something goes wrong.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int read_entire_file( FILE* file,
                      char* buffer,
                      size_t size,
                      void* state,
                      int (*fp_on_read) (void*, char*, size_t) )
{
    size_t read;

    while ((read = fread(buffer, 1, size, file)) != 0) {

        if (! fp_on_read(state, buffer, read)) {
            break;
        }
        if (read < size) {
            break;
        }
    }

    return ! ferror(file);
}

int on_read( void* state,
             char* buffer,
             size_t size )
{
    unsigned long int* pcount = state;
    *pcount += size;
    printf("\ron_read(%p, %p, %lu) has read (%lu) bytes",
           state,
           (void*)buffer,
           (unsigned long int)size,
           *pcount);
    buffer[size] = '\0';

    if (strchr(buffer, EOF) != NULL) {

        puts("\ndetected an `EOF' in the buffer");
        return 0;
    }
    return 1;
}

int main(void)
{
    int status = EXIT_FAILURE;
    FILE* file = fopen("data.txt", "rb");

    if (file != NULL) {

        char buffer[16384];
        unsigned long int bytes = 0;
        status = EXIT_SUCCESS;
        if (! read_entire_file( file,
                                buffer,
                                sizeof(buffer) - 1,
                                &bytes,
                                on_read ))
        {
            status = EXIT_FAILURE;
        }

        if (fclose(file) != 0) {

            status = EXIT_FAILURE;
        }
        putchar('\n');
    }
    return status;
}

--

Chris M. Thomasson

Aug 22, 2009, 6:14:36 AM

"Keith Thompson" <ks...@mib.org> wrote in message
news:ln63cgr...@nuthaus.mib.org...

> "Chris M. Thomasson" <n...@spam.invalid> writes:
> [...]
>> Thank you. Humm... Perhaps I need to add some more spaces. Is this any
>> better?
>> _____________________________________________________________________

[...]

>> _____________________________________________________________________
>>
>> What don't you like about the original style? Do you have any
>> suggestions?
>
> The above has way too much vertical whitespace and not enough
> horizontal whitespace for my personal taste. You have a blank line
> after each opening '{', and you're using a mix of 1-column and
> 2-column indentation; personally I like 4.
>
> See below for how I might write it. Most of the changes relative to
> your code are just tweaking whitespace; in addition, I've added some
> braces (I like to use braces on all conditional statements) and made
> some comparisons more explicit, such as (blah != NULL) or (blah != 0)
> rather than just (blah).
>
> A few more comments on things I haven't changed:

[...]

Thank you Keith and Ben for all of your suggestions. I appreciate and
respect them all and will try to incorporate them in any future code that I
post.

:^)

David Thompson

Aug 25, 2009, 12:22:55 AM

On Thu, 13 Aug 2009 14:53:57 +0100, Ben Bacarisse
<ben.u...@bsb.me.uk> wrote:

> "Chris M. Thomasson" <n...@spam.invalid> writes:
>
> > "Cross" <X...@X.tv> wrote in message news:h5jh29$uuo$1...@adenine.netfront.net...
> >> Hello
> >>
> >> I am trying to parse rtf files. The specification says that an
> >> reader is a character reader. However, we know that character read
> >> is low. So I want to read blocks at a time into memory and scan

OP apparently meant 'slow'. Actually we don't know that in all cases;
on many systems getc() inlines code that is amortized almost as fast
as bulk copy (taking advantage of the standard's dispensation to
multiply evaluate the fp argument) and so does getchar() (since its fp
argument is idempotent). (And similarly putc() and putchar().) But
assuming for the sake of example it is too slow on a given system,
which is quite possible ...

> >> them. I read some time ago that using fread() a complete file can
> >> be read into memory without knowing its size. help me with

fread() can read variable amounts of data _up to the buffer size_
without knowing in advance how much is there. That may not be the
complete file, depending on the buffer size and file size.

> >> this. Also, suppose I am reading say pages 1024 chars at a time
> >> into memory; the last page is quite linkely to have the EOF
> >> somewhere in between. How can I detect that.
> >

And that is the solution to the previous point; given a size-X buffer,
you can fread() chunks of X, with the last one usually short. Actually
on most modern systems 1K is tiny; I would go at least 16K as Mr.
Thomasson did, and probably 64K or 256K or even 1M.
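
A minimal sketch of that chunked loop, assuming a 64K buffer; count_bytes and CHUNK_SIZE are hypothetical names, and a real parser would scan each chunk where the comment indicates:

```c
#include <assert.h>
#include <stdio.h>

#define CHUNK_SIZE 65536  /* assumed size; any of the sizes above would do */

/* Reads fp to the end in CHUNK_SIZE pieces and stores the total number
 * of bytes seen in *total. Returns 1 on success, 0 on a read error. */
int count_bytes(FILE *fp, unsigned long *total)
{
    static unsigned char buf[CHUNK_SIZE];
    size_t n;

    assert(fp != NULL && total != NULL);
    *total = 0;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        /* buf[0] .. buf[n-1] are valid here; only the last chunk is
         * normally shorter than CHUNK_SIZE. */
        *total += (unsigned long)n;
    }
    /* fread returned 0: that is either end of file or an error, and
     * ferror tells the two apart. */
    return !ferror(fp);
}
```

The end of the file is never looked for inside the buffer; it shows up only as a short (or zero) count from fread.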

> > Are you looking for something like this:
>
> I don't think so. The OP appears to be talking about the actual end
> of the file, which one detects by looking at the size returned by fread
> (so in that sense your code is an example), but your test looks for
> something else altogether.

Right.

<snip>

> > int
> > on_read(
> > void* state,
> > char* buffer,
> > size_t size
> > ) {
> > unsigned long int* pcount = state;
> > *pcount += size;
> > printf("\ron_read(%p, %p, %lu) has read (%lu) bytes",
> > state, (void*)buffer, (unsigned long int)size, *pcount);
> > buffer[size] = '\0';
> > if (strchr(buffer, EOF)) {
>
> This is a strange non-portable test. It asks if the buffer contains a
> byte whose value happens to be the same as the implementation's
> definition of the EOF macro (often -1 but all that we can be sure of
> is that it is a constant expression of type int with a negative
> value).
>
> On some systems there may be no possible way this can match any

That shouldn't happen. I presume you're thinking of systems where
plain char is unsigned -- but on those systems, the argument to strchr
will be converted to that (unsigned) type and thus a possible value.

(It's the _assignment_ in while( (a_plain_char = getchar()) == EOF )
or similar that does have this problem of failing to match ever.)
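
For contrast, the usual idiom keeps the result of getc in an int, so that EOF stays distinct from every byte value. A minimal sketch; count_chars is a hypothetical name:

```c
#include <assert.h>
#include <stdio.h>

/* Counts the bytes in fp one character at a time. c must be an int:
 * getc returns either a byte as an unsigned char value or the negative
 * EOF, and a plain char cannot keep those two cases apart. */
unsigned long count_chars(FILE *fp)
{
    unsigned long n = 0;
    int c;

    assert(fp != NULL);
    while ((c = getc(fp)) != EOF)
        n++;
    return n;
}
```

With c declared as a plain char, a 0xFF byte can compare equal to EOF where char is signed, and nothing ever compares equal to EOF where char is unsigned.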

> character. On others it may match some character that has very little to
> do with any end of file marker (either the real end of the file or
> the ill-conceived character that supposedly marks it on some systems).
>

Right. Further, even correcting it to search for the right value --
which is '\x1A' in the most common situation this is desirable -- it
doesn't find it when a null character '\0' occurs (earlier) in the
same chunk. Better to use memchr() -- and then you don't need to make
the chunks sizeof(buffer)-1 to leave room for a null.
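
A minimal sketch of the memchr variant, assuming the marker of interest is the '\x1A' byte mentioned above; find_ctrl_z and CTRL_Z are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CTRL_Z 0x1A  /* the marker value discussed above */

/* Returns the offset of the first CTRL_Z byte in buf[0..len-1], or len
 * if there is none. memchr takes an explicit length, so an embedded
 * '\0' does not stop the search and no terminator has to be stored. */
size_t find_ctrl_z(const char *buf, size_t len)
{
    const char *hit;

    assert(buf != NULL);
    hit = memchr(buf, CTRL_Z, len);
    return hit != NULL ? (size_t)(hit - buf) : len;
}
```

The callback can then pass the count returned by fread straight through as len, and the buffer can be filled completely.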

> BTW, I find your style very hard to read and almost did not bother to
> look at this code at all. I have no desire to get you to change, but
> you may think that you have arrived at a style that is a model of
> clarity, so this remark is intended simply as a data point for your
> consideration.

It is a bit unusual, but not too different from what I have sometimes
done (I think for good reason) so once I recognized it I didn't have
much trouble. Although I would choose some names differently.
De gustibus.