Storing variable length data in file -- Paging

Luiz Antonio Gomes Pican?o

unread,

Feb 25, 2005, 5:35:30 PM2/25/05

to

How i can store a variable length data in file ?
I want to do it using pure C, without existing databases.

I'm thinking to use pages to store data.

Anyone has idea for the file format ?

I want to store data like a database:
----------------------------------
Custumer:
id - integer
name - varchar(50)
email - varchar(45)
----------------------------------

___________________________________________________________
Luiz Antonio Gomes Picanço
Software Developer
www.luizantonio.com
lu...@luizantonio.com
___________________________________________________________

Walter Roberson

unread,

Feb 25, 2005, 6:08:09 PM2/25/05

to

In article <dde295bf.05022...@posting.google.com>,
Luiz Antonio Gomes Pican?o <lu...@luizantonio.com> wrote:
:How i can store a variable length data in file ?

:I want to do it using pure C, without existing databases.

There must be dozens of ways.

:I'm thinking to use pages to store data.

You could do that, but what happens if the variable length data
exceeds the page size?

:Anyone has idea for the file format ?

:I want to store data like a database:
:----------------------------------
:Custumer:
:id - integer
:name - varchar(50)
:email - varchar(45)

How would you like the program to operate in each of these cases?

1) froup - varchar(-42)
2) froup - varchar(0)
3) froup - varchar(4294967295)
4) froup - varchar(6.4)
5) froup - varchar(id)
6) ThisIsAVeryLongIdentifierThatIsLongerThanThe5CharacterPositionsYouImplyInYourExample - varchar(100)
7) - varchar(10)
8) varchar(20) - froup
9) .dot - varchar(17)
10) _foo - varchar(8)
11) - - varchar(21)
12) ------- varchar(83)
13) froup -
14) froup - varchar[71]
15) biff! - varchar(4)
16)
17) semi - varchar(23);
18) froup varchar(102)
19) froup - varchar(5:10)
20) froup - varchar(6,11)
21) froup - varchar(6)[11]

--
"Mathematics? I speak it like a native." -- Spike Milligan

jacob navia

unread,

Feb 25, 2005, 7:25:43 PM2/25/05

to

To store variable of any length in a file you do this:

You store the length of the data, then the data itself.
This supposes you have a maximum length, say what an
integer can hold (2^31 in most 32 bit systems, 2GB)
or 4GB if you go for unsigned int to store the length.

To read such a file you read the length, by reading an integer
from the file, then skip that number of bytes to access the
next record.

Many popular files formats are done that way, for instance
the object file format under windows (.obj).

For instance in your case suppose you have the record:
id 69988 (1 integer)
name Joe Smith (9 chars)
email j...@joe.com (11 chars)

You would store, supposing sizeof(int)=4
4 // The length of the record
4 // id of record, always fixed
4 // Length of name
9 // chars of name
4 // length of email
11 // chars of email.

11+4+9+4+4 == 32

Then you store:
fwrite(&len,1,4,file);
fwrite(&id,1,4,file);
fwrite(&namelen,1,4,file);
fwrite(name,1,namelen,file);
fwrite(&email_len,1,4,file);
fwrite(email,1,email_len,file);

You read it using the opposite function, fread().

This is a very compact way of storing the information.

jacob

Walter Roberson

unread,

Feb 25, 2005, 8:35:48 PM2/25/05

to

In article <421fc203$0$3106$8fcf...@news.wanadoo.fr>,
jacob navia <ja...@jacob.remcomp.fr> wrote:

:Luiz Antonio Gomes Pican?o wrote:

:> How i can store a variable length data in file ?
:> I want to do it using pure C, without existing databases.

:To store variable of any length in a file you do this:

:You store the length of the data, then the data itself.
:This supposes you have a maximum length, say what an
:integer can hold (2^31 in most 32 bit systems, 2GB)
:or 4GB if you go for unsigned int to store the length.

The OP asked for "pure C". Assuming 4 byte ints is not completely
portable, and hence not entirely "pure". ;-)

:To read such a file you read the length, by reading an integer

:from the file, then skip that number of bytes to access the
:next record.

Well, you wouldn't read an integer from the file, you would read
4 bytes from the file and convert it to sufficiently large integer type,
taking into account whatever ended-ness you had decreed for the file
type.

:For instance in your case suppose you have the record:

:id 69988 (1 integer)
:name Joe Smith (9 chars)
:email j...@joe.com (11 chars)

:Then you store:

: fwrite(&len,1,4,file);
: fwrite(&id,1,4,file);
: fwrite(&namelen,1,4,file);
: fwrite(name,1,namelen,file);
: fwrite(&email_len,1,4,file);
: fwrite(email,1,email_len,file);

:You read it using the opposite function, fread().

Though if you naively just do the opposite, then you run into the
small matter of null termination. Usually a 'varchar' size excludes
any required string terminator. There is no need to store the null
terminator when you store the length, and the code above would not
do so -- but when you read the string values back in, chances are that
you would want to null terminate them.

Of course a more generalized program would not assume a fixed-length
representation for the -size- of the strings. For example, using a
4 byte unsigned number as the size field artificially limits the
maximum field length to only (4 gigabytes minus 1), which is entirely
unsatisfactory if one wishes to store even moderately sized strings
such as an entire day's comp.lang.c flames and off-topic posts.
--
Positrons can be described as electrons traveling backwards in time.
Certainly many Usenet arguments about the past become clearer when they
are re-interpreted as uncertainty about the future.
-- Walter Roberson

Emmanuel Delahaye

unread,

Feb 26, 2005, 4:44:39 AM2/26/05

to

Luiz Antonio Gomes Pican?o wrote on 25/02/05 :

> How i can store a variable length data in file ?
> I want to do it using pure C, without existing databases.
>
> I'm thinking to use pages to store data.
>
> Anyone has idea for the file format ?
>
> I want to store data like a database:
> ----------------------------------
> Custumer:
> id - integer
> name - varchar(50)
> email - varchar(45)
> ----------------------------------

This is not a C question. It's a design issue.

<OT>
Use the Comma Separated Values (CVS) format, for example... Easy to
implement and to parse...
</OT>

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html
The C-library: http://www.dinkumware.com/refxc.html

"Mal nommer les choses c'est ajouter du malheur au
monde." -- Albert Camus.

jacob navia

unread,

Feb 26, 2005, 5:50:51 AM2/26/05

to

Walter Roberson wrote:
> In article <421fc203$0$3106$8fcf...@news.wanadoo.fr>,
> jacob navia <ja...@jacob.remcomp.fr> wrote:
> :Luiz Antonio Gomes Pican?o wrote:
>
> :> How i can store a variable length data in file ?
> :> I want to do it using pure C, without existing databases.
>
> :To store variable of any length in a file you do this:
>
> :You store the length of the data, then the data itself.
> :This supposes you have a maximum length, say what an
> :integer can hold (2^31 in most 32 bit systems, 2GB)
> :or 4GB if you go for unsigned int to store the length.
>
> The OP asked for "pure C". Assuming 4 byte ints is not completely
> portable, and hence not entirely "pure". ;-)
>

I said: "WHAT AN INTEGER CAN HOLD", and *then* I explained as an
example 4 byte ints, the most common case. Anyway you *have* to
assume some length. A 4GB limit for names, email-addresses looks
reasonable to me.

More interesting would be a limitation to 255 for both. This would
save 6 bytes/record.

More clever schemas can be used:
The length is 1-254 in one byte. If the length is more than 254
then a 255 is written, and the next two bytes are read. If they
are between 1-65534 that is the length. If the value read is
65535 then you read the next 4 bytes, and so on. This is a
virtually ilimited format.

This way you can store very efficiently the data since most names
and emails will be under 254.

This schema is used for instance in the encoding of the numeric
constants in Microsoft's debug info.

It supposes a slightly more complex decoder.

> :To read such a file you read the length, by reading an integer
> :from the file, then skip that number of bytes to access the
> :next record.
>
> Well, you wouldn't read an integer from the file, you would read
> 4 bytes from the file and convert it to sufficiently large integer type,
> taking into account whatever ended-ness you had decreed for the file
> type.
>

Correct.

>
> :For instance in your case suppose you have the record:
> :id 69988 (1 integer)
> :name Joe Smith (9 chars)
> :email j...@joe.com (11 chars)
>
> :Then you store:
> : fwrite(&len,1,4,file);
> : fwrite(&id,1,4,file);
> : fwrite(&namelen,1,4,file);
> : fwrite(name,1,namelen,file);
> : fwrite(&email_len,1,4,file);
> : fwrite(email,1,email_len,file);
>
> :You read it using the opposite function, fread().
>
> Though if you naively just do the opposite, then you run into the
> small matter of null termination. Usually a 'varchar' size excludes
> any required string terminator. There is no need to store the null
> terminator when you store the length, and the code above would not
> do so -- but when you read the string values back in, chances are that
> you would want to null terminate them.
>

Correct. You should allocate 1 byte more than you will read
and then zero terminate the string when reading.

>
> Of course a more generalized program would not assume a fixed-length
> representation for the -size- of the strings. For example, using a
> 4 byte unsigned number as the size field artificially limits the
> maximum field length to only (4 gigabytes minus 1),

Well, this is a limitation of the format as described.
An easy way to overcome this and make it more efficient is the
algorithm outlined above, or you can go to long long
for the length, allowing to store strings up to
18 446 744 073 709 551 616 bytes.

> which is entirely
> unsatisfactory if one wishes to store even moderately sized strings
> such as an entire day's comp.lang.c flames and off-topic posts.

Well, my flames folder *is* overflowing...

flames >/dev/null

CBFalconer

unread,

Feb 26, 2005, 10:43:43 AM2/26/05

to

jacob navia wrote:
>
... snip ...

>
> I said: "WHAT AN INTEGER CAN HOLD", and *then* I explained as an
> example 4 byte ints, the most common case. Anyway you *have* to
> assume some length. A 4GB limit for names, email-addresses looks
> reasonable to me.
>
> More interesting would be a limitation to 255 for both. This would
> save 6 bytes/record.
>
> More clever schemas can be used:
> The length is 1-254 in one byte. If the length is more than 254
> then a 255 is written, and the next two bytes are read. If they
> are between 1-65534 that is the length. If the value read is
> 65535 then you read the next 4 bytes, and so on. This is a
> virtually ilimited format.
>
> This way you can store very efficiently the data since most names
> and emails will be under 254.
>
> This schema is used for instance in the encoding of the numeric
> constants in Microsoft's debug info.

A simpler scheme is to assign the high bit of each byte to mean
'more', and extract 7 bits per byte. This would be something like:

#define HIBIT 128

ch = getc(f); value = 0;
while (ch & HIBIT) {
value = HIBIT * value + (ch & (HIBIT - 1));
ch = getc(f);
}
value = HIBIT * value + (ch & (HIBIT - 1));

which is independant of CHAR_BIT and works nicely with streams.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Walter Roberson

unread,

Feb 26, 2005, 12:13:40 PM2/26/05

to

In article <42208920...@yahoo.com>,
CBFalconer <cbfal...@worldnet.att.net> wrote:
:A simpler scheme is to assign the high bit of each byte to mean

:'more', and extract 7 bits per byte. This would be something like:

: #define HIBIT 128

Minor phrasing nit. For this scheme, you do not want to assign
the "high" bit of each byte to mean 'more', you want to assign
a -fixed- platform-independant bit for that purpose. Which is
what your code does.

: ch = getc(f); value = 0;

This discussion leads me to ponder platform independance of the
database a bit more.

I do not have my copy of the C89 standard with me and cannot get
it today (building construction work), so perhaps someone could
answer to these points?

a) getc() is defined in terms of 'characters'. For this purpose,
is 'character' equivilent to 'byte' (that is, basic fetchable
units), or is 'character' potentially a multi-byte entity
(e.g., unicode). [I've probably been reading too much about
perl streams...]

b) Are fread() and fwrite() defined in terms of specific numbers
of bits per byte? If not, then if a file is written using
fwrite() on one platform, and is then taken to another platform,
the data that is read in may not have the same byte or character
boundaries. The result would then depend upon the nature of the
process of "taking it to another platform" (e.g., if I recall
correctly, ftp would normally do bytewise renormalization). This,
of course, in addition to the usual problem that
fwrite( (char *)&value, sizeof(value), 1, STREAM) is going to
write in a platform-dependant byte order.

I guess all this was why xdr was invented...
--
Admit it -- you peeked ahead to find out how this message ends!

Walter Roberson

unread,

Feb 26, 2005, 12:48:25 PM2/26/05

to

In article <42205486$0$3111$8fcf...@news.wanadoo.fr>,

jacob navia <ja...@jacob.remcomp.fr> wrote:
|Walter Roberson wrote:
|> In article <421fc203$0$3106$8fcf...@news.wanadoo.fr>,
|> jacob navia <ja...@jacob.remcomp.fr> wrote:

|> :You store the length of the data, then the data itself.
|> :This supposes you have a maximum length, say what an
|> :integer can hold (2^31 in most 32 bit systems, 2GB)
|> :or 4GB if you go for unsigned int to store the length.

|I said: "WHAT AN INTEGER CAN HOLD", and *then* I explained as an

|example 4 byte ints, the most common case.

You closed the () a bit early. If you remove the part in () from
your sentance, your sentance implies that if you use an unsigned
int to hold the length then the limit will be 4 GB no matter what
the platform.

In any case, the point is that you should be chosing the limit
first and -then- choosing the representation. Choosing a platform-
dependant limit is not the best of coding practices. There is
more excuse for chosing the limit of 2^32-1, as one can be
sure that one's unsigned long will hold at least that much. That is
a different choice than chosing "what an integer can hold".

:An easy way to overcome this and make it more efficient is the

:algorithm outlined above, or you can go to long long
:for the length,

long long is not part of C89.
--
So you found your solution
What will be your last contribution?
-- Supertramp (Fool's Overture)

Walter Roberson

unread,

Feb 26, 2005, 12:51:19 PM2/26/05

to

In article <421fc203$0$3106$8fcf...@news.wanadoo.fr>,
jacob navia <ja...@jacob.remcomp.fr> wrote:

:Then you store:
: fwrite(&len,1,4,file);

The byte order that results is not fixed in the standard.
You would not use this method of writing the file if you
are trying to write portable files.
--
I predict that you will not trust this prediction.

Joe Wright

unread,

Feb 26, 2005, 2:02:17 PM2/26/05

to

I don't think the 7-bit thing works today. Everybody 'extends' the set
to eight bits. Many years ago I learned that Pascal 'strings' were
arrays of characters, the first of which was the length of the string.
As above, this allowed short strings of 0..255 characters. Then someone
(I don't remember who) 'extended' the system. The short string was
limited to 254 characters. If the 'length' byte (s[0]) was 255 it
indicated that the string continued and that the length of the second
string could be found at s[255] and the first character of the second
string is at s[256]. If s[255] == 255 we just keep going (and going).

--
Joe Wright mailto:joeww...@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---

Luiz Antonio Gomes Pican?o

unread,

Feb 26, 2005, 6:26:20 PM2/26/05

to

jacob navia <ja...@jacob.remcomp.fr> wrote in message news:<421fc203$0$3106$8fcf...@news.wanadoo.fr>...

But, when the record grow, i should rewrite the all data to file. I'm
thinking in do this:

struct page
{
unsigned int nextpage;
unsigned int priorpage;
unsigned int length;
char pageoverflow;
}

If the record becomes larger then page, a overflow page will be used.

jacob navia

unread,

Feb 26, 2005, 6:45:40 PM2/26/05

to

Luiz Antonio Gomes Pican?o wrote:
> But, when the record grow, i should rewrite the all data to file.

True, but I thought that people's names/emails address do not change.
For growing a record the best strategy within my schema is to leave
a record marked as empty and then add the bigger record at the
end of the file.

This pressumes periodic "garbage collection" where all the empty
records are erased in a copy operation. Some years ago, when
I used Outlook, I had to periodcally tell it to "compress folder"
because probably a similar schema was used.

Your page schema is better when records grow by moderate
amounts but wastes a lot of place in the not 100% filled
pages. Besides, your schema doesn't organize the records within a
page.

Each possible schema has drawbacks. It depends on the data and
whether it changes often or not, whether space is at a premium
or not, whether reading speed is important, whether software
complexity is an issue, etc etc.

jacob

Richard Bos

unread,

Feb 28, 2005, 3:40:58 AM2/28/05

to

jacob navia <ja...@jacob.remcomp.fr> wrote:

> Luiz Antonio Gomes Pican?o wrote:
> > But, when the record grow, i should rewrite the all data to file.
>
> True, but I thought that people's names/emails address do not change.

You're wrong. People not only _do_ change their names occasionally (for
example, Reginald Kenneth Dwight officially changed his name to the one
we all know him by anyway), but they change their e-mail addresses quite
often. I've posted to this group under at least three mail addresses,
and now have yet another (which I don't use here). Then there's the case
that your database is used for the contact information of your customers
- and one of them quits, to be replaced by someone with a longer name.
I'd expect this to happen regularly to any company with even a moderate
number of business partners.
Besides, in general, it is always better to plan for what could, but
might not change, than to be caught unawares when something that should
not have changed, does.

> For growing a record the best strategy within my schema is to leave
> a record marked as empty and then add the bigger record at the
> end of the file.

Or to assign fields not in bytes, but in "natural-sized" chunks,
allowing you to replace their contents if they don't grow _too_ much.
You'd still need to move the data to a new, larger record if your data
grows by too much, but for regular but small changes you might save a
lot of re-writing. The hardest part is, of course, to determine what
"natural-sized" is. (And remember to use binary modes!)

> This pressumes periodic "garbage collection"

All workable database schemes presume this anyway, either in the
background or triggerable by the user.

> Your page schema is better when records grow by moderate
> amounts but wastes a lot of place in the not 100% filled
> pages.

In the general case, a bit of space lost is well worth a decrease in
re-indexing; and with a decent organisation, you can reduce the amount
of slack to an acceptable minimum.

> Each possible schema has drawbacks. It depends on the data and
> whether it changes often or not, whether space is at a premium
> or not, whether reading speed is important, whether software
> complexity is an issue, etc etc.

This is entirely true.

Richard

Lawrence Kirby

unread,

Mar 1, 2005, 8:56:06 AM3/1/05

to

On Sat, 26 Feb 2005 17:13:40 +0000, Walter Roberson wrote:

...

> : ch = getc(f); value = 0;
>
> This discussion leads me to ponder platform independance of the
> database a bit more.
>
> I do not have my copy of the C89 standard with me and cannot get
> it today (building construction work), so perhaps someone could
> answer to these points?
>
> a) getc() is defined in terms of 'characters'. For this purpose,
> is 'character' equivilent to 'byte' (that is, basic fetchable
> units), or is 'character' potentially a multi-byte entity
> (e.g., unicode). [I've probably been reading too much about
> perl streams...]

Yes, it reads the next byte from the file. It reads a value as an unsigned
char and converts that to int for the return value. It is the wide
character functions that interpret multibyte sequences.

> b) Are fread() and fwrite() defined in terms of specific numbers
> of bits per byte?

No.

> If not, then if a file is written using
> fwrite() on one platform, and is then taken to another platform,
> the data that is read in may not have the same byte or character
> boundaries.

That is a possibility irrespective of what the standard C library does.

> The result would then depend upon the nature of the
> process of "taking it to another platform" (e.g., if I recall
> correctly, ftp would normally do bytewise renormalization).

E.g. should a 16 bit platform pack 2 octets per byte or just one?

> This,
> of course, in addition to the usual problem that
> fwrite( (char *)&value, sizeof(value), 1, STREAM) is going to
> write in a platform-dependant byte order.

But that problem is easily avoided.

Lawrence

Dave Thompson

unread,

Mar 7, 2005, 3:44:33 AM3/7/05

to

On 26 Feb 2005 17:13:40 GMT, robe...@ibd.nrc-cnrc.gc.ca (Walter
Roberson) wrote:
<snip>

> a) getc() is defined in terms of 'characters'. For this purpose,
> is 'character' equivilent to 'byte' (that is, basic fetchable
> units), or is 'character' potentially a multi-byte entity
> (e.g., unicode). [I've probably been reading too much about
> perl streams...]
>

getc, fgetc, getchar do a byte, as do putc, fputc, putchar.
getwc, fgetwc, getwchar and putwc, fputwc, getwchar do a wide
character to/from multibyte in a 'wide-oriented' file (if supported).

> b) Are fread() and fwrite() defined in terms of specific numbers
> of bits per byte? If not, then if a file is written using
> fwrite() on one platform, and is then taken to another platform,
> the data that is read in may not have the same byte or character
> boundaries. The result would then depend upon the nature of the
> process of "taking it to another platform" (e.g., if I recall
> correctly, ftp would normally do bytewise renormalization). This,

Exactly. Actually, FTP has several options that mattered in the days
when there were significant 36-bitters on the 'net like PDP-10's.
Nowadays people forget to do even ASCII/BINARY, or with (typical?)
browsers don't even have the choice.

> of course, in addition to the usual problem that
> fwrite( (char *)&value, sizeof(value), 1, STREAM) is going to
> write in a platform-dependant byte order.
>

The cast isn't needed; fwrite is declared to take const void*.
(Well, unless you need to cast away volatile from your buffer, and
even then it's not really the right way.)

> I guess all this was why xdr was invented...

And ASN.1. And for simple cases network byte order.
And one reason many Internet protocols use text.

- David.Thompson1 at worldnet.att.net