
file sizes


Bill Cunningham

Aug 10, 2012, 3:24:31 PM
Does the system call data from the stat struct defined in stat.h in the
sys directory allow for the various sizes that a file will have in memory
and the size on the storage media like a HD? stat seems to get file data
that is needed by the *nix OS.

Bill


Jorgen Grahn

Aug 10, 2012, 3:41:51 PM
Do you mean st_size? An ordinary file contains N bytes of data, and
nothing else, irrespective of the storage media or anything else.
It's this number N that is reported.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Barry Margolin

Aug 10, 2012, 3:43:02 PM
In article <k03n52$o1f$1...@speranza.aioe.org>,
There are two size-related fields.

st_size: This is the number of bytes in the file. If you were to call
read(file, buffer, 1) starting at the beginning, this is the number of
times you could call it before getting EOF (assuming the file doesn't
get modified while your loop is running). It includes all-zero pages
that the filesystem might optimize away.

st_blocks: This is the number of blocks used to store the file. If the
filesystem is able to optimize away some parts of the file (e.g.
all-zero blocks), it won't include them. The size of blocks is
implementation-dependent, and may differ from filesystem to filesystem,
but 512 bytes is common. It should be at most the smallest granularity
of file storage available.

--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***

Bill Cunningham

Aug 10, 2012, 3:52:36 PM
Jorgen Grahn wrote:

> Do you mean st_size? An ordinary file contains N bytes of data, and
> nothing else, irrespective of the storage media or anything else.
> It's this number N that is reported.

I know about the st_size member of the struct. I thought a file, depending
on how the system used it, might have one number of bytes in memory at one
point and a different number at another time.

Bill


Bill Cunningham

Aug 10, 2012, 3:55:16 PM
Barry Margolin wrote:
> There are two size-related fields.
>
> st_size: This is the number of bytes in the file. If you were to call
> read(file, buffer, 1) starting at the beginning, this is the number of
> times you could call it before getting EOF (assuming the file doesn't
> get modified while your loop is running). It includes all-zero pages
> that the filesystem might optimize away.
>
> st_blocks: This is the number of blocks used to store the file. If the
> filesystem is able to optimize away some parts of the file (e.g.
> all-zero blocks), it won't include them. The size of blocks is
> implementation-dependent, and may differ from filesystem to
> filesystem, but 512 bytes is common. It should be at most the
> smallest granularity of file storage available.

So when queried, these two members might differ. The next question is
which function(s) call stat and its two related functions, such as lstat.

Bill


Scott Lurndal

Aug 10, 2012, 4:01:55 PM
The st_size field in struct stat indicates the offset of the last readable byte
in the file, plus one[*]. It doesn't matter if the file is partially or fully
contained within memory, the st_size field will describe the largest offset
value (plus one) that can be used with lseek(2)/read(2) or pread(2) to read
a byte from the file without receiving an end-of-file indication.

Note that st_size does _not_ indicate how many bytes are present either in
memory or on the storage media, just the highest addressable byte[**].

[*] if st_size == 4096, then you can issue pread(fd, buf, 1, 4095) without
error, but pread(fd, buf, 1, 4096) will return an End-Of-File indication.

[**] if you:

    fd = creat("file", 0660);
    pwrite(fd, "A", 1, 1048575);
    close(fd);

after this, st_size will be == 1048576, but only one filesystem block
(typically 1024, 2048 or 4096 bytes) will have been allocated on the
storage media. So a file that appears to be 1 megabyte will only consume
4096 bytes (or less depending on how the filesystem was created).

see "ls -lhs"

If an application attempts to read from the beginning of the file, the read
will be satisfied by storing zeros into the read buffer by the read system
call - the zeros don't take up space on the media.

Bill Cunningham

Aug 10, 2012, 4:19:34 PM
So if I use ls -la and the file size is reported as 8226 that's not how
many bytes are in the file? 8225 is the offset of the last byte. Hum. I
guess you learn something new every day. I'm using ext4. Will it contain what
I'm looking for? I want to know how many bytes are composing the file and
how the OS treats the file in RAM. That might be the kernel's memory
management job rather than sys calls. But don't all system calls call on
kernel functions?

Bill


Scott Lurndal

Aug 10, 2012, 5:05:18 PM
Jorgen Grahn <grahn...@snipabacken.se> writes:
>On Fri, 2012-08-10, Bill Cunningham wrote:
>> Does the system call data from the stat struct defined in stat.h in the
>> sys directory allow for the various sizes that a file will have in memory
>> and the size on the storage media like a HD? stat seems to get file data
>> that is needed by the *nix OS.
>
>Do you mean st_size? An ordinary file contains N bytes of data, and
>nothing else, irrespective of the storage media or anything else.
>It's this number N that is reported.

Actually st_size reports the last addressable byte (+1) of the file. It doesn't
report how many bytes are allocated to the file.

scott

Scott Lurndal

Aug 10, 2012, 5:21:43 PM
There may or may not be 8226 bytes allocated from the filesystem for the
file, depending on how it was written.

>8225 is the offset of the last byte.

Yes.

> Hum. I
>guess you learn something new every day.

>I'm using ext4. Will it contain what I'm looking for?

Yes. ext2/3/4 all use sparse allocation.

> I want to know how many bytes are composing the file and
>how the OS treats the file in RAM.

When an application accesses a portion of a file, that portion
(rounded down to the nearest page boundary and rounded up to include
one or more whole pages) is loaded into memory by the operating system.

The read system call will access the in-core inode for the file, to
which the in-memory copies of the file content are linked, and will copy
the data from the kernel buffer to the application buffer (which if you're
using stdio, will then get copied again to the programmer defined buffer).

The OS may flush the data back out to disk (or discard it if it was never
modified) to make room for application allocations (e.g. malloc/sbrk/mmap)
or to make room for data from other disk files, network packets, etc.

> That might be the kernel's memory
>management job rather than sys calls.

>But don't all system calls call on
>kernel functions?

Not necessarily. gettimeofday(2) is a good example of a "system call" that
doesn't always call a kernel function.

scott

Bill Cunningham

Aug 10, 2012, 5:50:29 PM
Scott Lurndal wrote:
> "Bill Cunningham" <nos...@nspam.invalid> writes:
>> Scott Lurndal wrote:

[snip]

>> So if I use ls -la and the file size is reported as 8226 that's
>> not how many bytes are in the file?
>
> There may or may not be 8226 bytes allocated from the filesystem for
> the file, depending on how it was written.
>
>> 8225 is the offset of the last byte.
>
> Yes.
>
>> Hum. I
>> guess you learn something new every day.
>
>> I'm using ext4. Will it contain what I'm looking for?
>
> Yes. ext2/3/4 all use sparse allocation.

OK. <OT> What development headers would I use for ext4 functions? elf.h?
Maybe I need to look at ext4 instead of the unix system API. </OT>

>> I want to know how many bytes are composing the file and
>> how the OS treats the file in RAM.
>
> When an application accesses a portion of a file, that portion
> (rounded down to the nearest page boundary and rounded up to include
> one or more whole pages) is loaded into memory by the operating
> system.

I was not aware of the above. All I've ever heard about paging has to do
with swapping out of memory.

> The read system call will access the in-core inode for the file, to
> which the in-memory copies of the file content are linked, and will
> copy the data from the kernel buffer to the application buffer (which
> if you're using stdio, will then get copied again to the programmer
> defined buffer).

Are there different inode numbers? Here's what I want: the actual size of
a file in bytes as stored on the media. Then I'll worry about in-memory paging.

Barry Margolin

Aug 10, 2012, 6:00:10 PM
In article <k03qc9$vm4$1...@speranza.aioe.org>,
"Bill Cunningham" <nos...@nspam.invalid> wrote:

> So if I use ls -la and the file size is reported as 8226 that's not how
> many bytes are in the file? 8225 is the offset of the last byte.

Offsets start at zero. So if there's 1 byte in the file, you read it by
seeking to offset 0.

Jorgen Grahn

Aug 10, 2012, 7:02:08 PM
I don't understand what "bytes allocated to the file" means, but
as I understood it, the former was what the OP asked. For example:

    % echo -n 'foobar' > /tmp/foo
    % stat /tmp/foo
      File: `/tmp/foo'
      Size: 6 Blocks: 8 IO Block: 4096 regular file
    ...

That size will never be reported as anything else but 6: the file
contains the six bytes representing f, o, o, b, a and r.

Jorgen Grahn

Aug 10, 2012, 7:07:41 PM
On Fri, 2012-08-10, Bill Cunningham wrote:
Depends on exactly what you mean -- you are terribly vague. For
example, if the system never used the file at all, it probably takes
up no bytes in memory. If you modify the file, its size may obviously
change ... and so on.

Rainer Weikusat

Aug 10, 2012, 7:08:01 PM
Jorgen Grahn <grahn...@snipabacken.se> writes:
> On Fri, 2012-08-10, Scott Lurndal wrote:
>> Jorgen Grahn <grahn...@snipabacken.se> writes:
>>>On Fri, 2012-08-10, Bill Cunningham wrote:
>>>> Does the system call data from the stat struct defined in stat.h in the
>>>> sys directory allow for the various sizes that a file will have in memory
>>>> and the size on the storage media like a HD? stat seems to get file data
>>>> that is needed by the *nix OS.
>>>
>>>Do you mean st_size? An ordinary file contains N bytes of data, and
>>>nothing else, irrespective of the storage media or anything else.
>>>It's this number N that is reported.
>>
>> Actually st_size reports the last addressable byte (+1) of the file. It doesn't
>> report how many bytes are allocated to the file.
>
> I don't understand what "bytes allocated to the file" means,

It is (usually/ often) possible to create a so-called 'sparse file' by
using lseek to move the current I/O position to some location beyond
the end of the file and then write some data. The 'uninitialized'
intermediate bytes lseek skipped over will read as 0 but until
something is written to these locations, no actual 'disk space' will
be allocated to them.

Gordon Burditt

Aug 11, 2012, 4:14:56 AM
> So if I use ls -la and the file size is reported as 8226 that's not how
> many bytes are in the file? 8225 is the offset of the last byte.

Logically, there are 8226 bytes in the file (with offsets 0 thru
8225, inclusive, for a total of 8226 bytes).

Physically, there may be more than that because disk space is
allocated in blocks and any unused fractional disk block is wasted.
Some file systems use a dual-blocksize scheme: if a file is smaller
than one BigBlock, disk allocated is a multiple of SmallBlock, but
if the file is larger than one BigBlock, disk is allocated in
multiples of BigBlock. Typical sizes for a modern desktop might
be SmallBlock = 2048 bytes and BigBlock = 16384 bytes.

Unless you have to micromanage disk space because you're always
critically short, you really don't care about the physical details.

Physically, there may be less than that in a "sparse file" because
disk blocks may not have been allocated for unwritten portions of
a file. (This enables such oddities as "terabyte-long" files on a
1.44MB floppy disk, at least according to "ls -l".)

A (larger) file may require "indirect blocks" which keep track of
blocks that are part of the file. Some of these are stored in an
inode. Back in the UNIX V7 days, the block size was 512 bytes, and
you got 10 block numbers in an inode, so if a file was larger than
5,120 bytes, it needed an indirect block.

The st_blocks value reported times the block size may not equal the
st_size value rounded up to the next higher block size due to
unwritten data blocks and indirect blocks.

It is not meaningful to talk about a "file size in system memory"
since a sufficiently large file will not *FIT* in memory, and even
if it does, the system will balance the needs for this file and
other files in use at the same time, and you can expect that amount
to change without notice. For example, if the program is spending
a long time waiting for console input, its resident memory usage
may go to nearly zero.

A C text file when read into memory may differ from the file size
reported by a stat() equivalent on Windows because C translates
\r\n line endings to \n line endings on reading, and the reverse
on writing.


A File System Accountant can use all sorts of measures to properly
bill you for the file according to policy, which is likely to have
all the convenience and understandability of IRS tax forms. That
might include billing for the size of the inode (Unix) and the size
of the directory entry (which depends on the length of the *name*
of the file), and sharing the cost of the inode between users with
different links to the same file.



> Hum. I
> guess you learn something new every day. I'm using ext4. Will it contain what
> I'm looking for? I want to know how many bytes are composing the file and

*Logically*, the number of bytes in the file is given by "ls -l".
The amount of disk space used to store it may not be the same value.

> how the OS treats the file in RAM. That might be the kernel's memory
> management job rather than sys calls. But don't all system calls call on
> kernel functions?

How much RAM is used for a particular file is subject to constant change,
and if you REALLY, REALLY need to know this EXACTLY, you're in deep,
deep trouble, because you will be wrong.

