
Why does FILE-LENGTH take a stream rather than a pathname?


Peter Seibel

Feb 28, 2004, 8:44:39 PM
I'm guessing that on some historically important filesystem the
natural way to get the length of a file was the Lisp way: opening it
and getting the information from the stream. If so, what was that OS?

I only ask because it may seem strange to folks used to Unix-centric
languages where that information is available without opening the
file.

-Peter

--
Peter Seibel pe...@javamonkey.com

Lisp is the red pill. -- John Fraser, comp.lang.lisp

Artie Gold

Feb 28, 2004, 9:01:16 PM
Peter Seibel wrote:
> I'm guessing that on some historically important filesystem the
> natural way to get the length of a file was the Lisp way: opening it
> and getting the information from the stream. If so, what was that OS?
>
> I only ask becuase it may seem strange to folks used to Unix-centric
> languages where that information is available without opening the
> file.
>

Actually, that's a misconception; in standard C (a language whose roots
are undoubtedly `unix-centric') the only way to get the length of a file
(without using some platform-specific library) is to `read it and count'.

HTH,
--ag

--
Artie Gold -- Austin, Texas

"Yeah. It's an urban legend. But it's a *great* urban legend!"

Steven M. Haflich

Feb 28, 2004, 10:40:42 PM
Peter Seibel wrote:
> I'm guessing that on some historically important filesystem the
> natural way to get the length of a file was the Lisp way: opening it
> and getting the information from the stream. If so, what was that OS?

Think about what the integer returned by file-length means. Then think
about what the integer printed by Unix "ls -l" means. They mean quite
different things.

Peter Seibel

Feb 28, 2004, 11:15:06 PM
"Steven M. Haflich" <smh_no_s...@alum.mit.edu> writes:

> Peter Seibel wrote:
> > I'm guessing that on some historically important filesystem the
> > natural way to get the length of a file was the Lisp way: opening it
> > and getting the information from the stream. If so, what was that OS?
>
> Think about what the integer returned by file-length means. Then think
> about what the integer printed by Unix "ls -l" means. They mean quite
> different things.

I don't follow you. At least on GNU/Linux the integer printed in the
fourth column of the output of "ls -l" is the size of the file in
bytes. Based on previous discussions of FILE-LENGTH here, my
understanding is that is what FILE-LENGTH returns.

Unless you are telling me that there actually are Lisp implementations
where FILE-LENGTH measures the length of the file in the element-type of
the stream (which, as I'm sure you know, would require reading the
whole file for some character encodings), I don't see how those numbers
mean anything different at all. At any rate, here are the numbers I get
from Allegro:

CL-USER> (with-open-file (out "/tmp/utf8test.txt"
                              :direction :output
                              :element-type 'character
                              :external-format :utf8)
           (loop for code from 0 to char-code-limit
                 when (code-char code)
                 do (format out "~c" (code-char code)) and count t))

65536
CL-USER> (with-open-file (out "/tmp/utf8test.txt" :element-type 'character :external-format :utf8) (file-length out))
194434
CL-USER> (with-open-file (out "/tmp/utf8test.txt" :element-type '(unsigned-byte 8)) (file-length out))
194434

And here's what I get from ls -l:


[peter@xeon lisp-book]$ ls -l /tmp/utf8test.txt
-rw-rw-r-- 1 peter peter 194434 Feb 28 20:07 /tmp/utf8test.txt

Pascal Bourguignon

Feb 29, 2004, 1:03:07 AM

Artie Gold <arti...@austin.rr.com> writes:

> Peter Seibel wrote:
> > I'm guessing that on some historically important filesystem the
> > natural way to get the length of a file was the Lisp way: opening it
> > and getting the information from the stream. If so, what was that OS?
> > I only ask becuase it may seem strange to folks used to Unix-centric
> > languages where that information is available without opening the
> > file.
> >
>
> Actually that's a misconception; in standard C (a language whose roots
> are undoubtedly `unix-centric') the only way to get the length of a
> file (without using some platform specific library) is to `read it and
> count'.

Peter did not ask about languages (like C), but about systems (like
unix). In unix, the file size is an attribute of the file (inode) and
is accessible without having to open the file. You don't even need
read or search access rights on the file whose size you want to know!
You just need read access on a directory where an entry for that
file (inode) is stored.


FILE-LENGTH:

file-length returns the length of stream, or nil if the length
cannot be determined.

For a binary file, the length is measured in units of the element
type of the stream.


The standard is totally uninformative about the definition of a non-binary
file length. Is that the number of bytes? Is that the number
of characters? Is that the number of records? Are there non-binary
files that are not character files? We only know that, when applied to
binary files, the length has a unit. But would that be the number of
elements one can read from the file? Or would that be the size
allocated to the file? Or some other random value?


The page on FILE-POSITION gives some constraints on file positions, but
nothing really relates file positions to file lengths. For example,
when you open a file, you could very well have:

(with-open-file (in "sample-file-name" :direction :input)
(assert (= (file-position in) (length "sample-file-name"))))

or:

(with-open-file (in "sample-file-name" :direction :input)
(assert (= (file-position in) (* 3 (file-length in)))))

While the differences in file positions are specified (= 1 for binary
byte, >= 1 for character), there is not enough information to infer
anything about file-length and file-position at the end of the file.
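
For what it's worth, this is the kind of empirical check one can run on a
given implementation, using only standard functions; nothing in the
standard promises that the three numbers agree, even for a byte stream:

(with-open-file (in "sample-file-name" :element-type '(unsigned-byte 8))
  (let ((counted (loop while (read-byte in nil nil) count t)))
    ;; Compare what the stream reports with what was actually read.
    (list (file-length in)      ; length according to FILE-LENGTH
          counted               ; number of (unsigned-byte 8) elements read
          (file-position in)))) ; position after reading to end of file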

People like me have a length. When people like me have peanuts in
their pockets, their length is measured in units of apple. For
example, I've got two peanuts in my left pocket, and my length is 6
apples. But my friend has no peanuts, and his length (he still has
one), is only 5 nano seconds.

--
__Pascal_Bourguignon__ http://www.informatimago.com/
There is no worse tyranny than to force a man to pay for what he doesn't
want merely because you think it would be good for him.--Robert Heinlein
http://www.theadvocates.org/

David Steuber

Feb 29, 2004, 1:55:30 AM
Artie Gold <arti...@austin.rr.com> writes:

> Peter Seibel wrote:
> > I'm guessing that on some historically important filesystem the
> > natural way to get the length of a file was the Lisp way: opening it
> > and getting the information from the stream. If so, what was that OS?
> > I only ask becuase it may seem strange to folks used to Unix-centric
> > languages where that information is available without opening the
> > file.
> >
>
> Actually that's a misconception; in standard C (a language whose roots
> are undoubtedly `unix-centric') the only way to get the length of a
> file (without using some platform specific library) is to `read it and
> count'.

It sure doesn't look that way to the user. In C, the stat function
does the job. Admittedly it is defined in sys/stat.h.

--
One Emacs to rule them all. One Emacs to find them,
One Emacs to take commands and to the keystrokes bind them,

All other programming languages wish they were Lisp.

Artie Gold

Feb 29, 2004, 11:41:48 AM
Pascal Bourguignon wrote:
> Artie Gold <arti...@austin.rr.com> writes:
>
>
>>Peter Seibel wrote:
>>
>>>I'm guessing that on some historically important filesystem the
>>>natural way to get the length of a file was the Lisp way: opening it
>>>and getting the information from the stream. If so, what was that OS?
>>>I only ask becuase it may seem strange to folks used to Unix-centric
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>languages where that information is available without opening the
   ^^^^^^^^^
>>>file.
>>>
>>
>>Actually that's a misconception; in standard C (a language whose roots
>>are undoubtedly `unix-centric') the only way to get the length of a
>>file (without using some platform specific library) is to `read it and
>>count'.
>
>
> Peter did not ask about languages (like C), but about systems (like
> unix). In unix, the file size is an attribute of the file (inode) and
> is accessible without having to open the file. You don't even have to
> have read or search access rights on the file you want to know the
> size! You just have to have read access on a directory where an entry
> for that file (inode) is stored.
>
[snip]

See above; he *did* ask about languages. And my response was directed
toward that.

Your response, on the other hand, was merely informative and
enlightening, covering the entire subject.

;-)

Cheers,

Artie Gold

Feb 29, 2004, 11:52:49 AM
David Steuber wrote:
> Artie Gold <arti...@austin.rr.com> writes:
>
>
>>Peter Seibel wrote:
>>
>>>I'm guessing that on some historically important filesystem the
>>>natural way to get the length of a file was the Lisp way: opening it
>>>and getting the information from the stream. If so, what was that OS?
>>>I only ask becuase it may seem strange to folks used to Unix-centric
>>>languages where that information is available without opening the
>>>file.
>>>
>>
>>Actually that's a misconception; in standard C (a language whose roots
>>are undoubtedly `unix-centric') the only way to get the length of a
>>file (without using some platform specific library) is to `read it and
>>count'.
>
>
> It sure doesn't look that way to the user. In C, the stat function
> does the job. Admittedly it is defined in sys/stat.h.
>
Right. It's *not* part of C. It's part of the interface supplied by a
certain class of operating systems. What it *looks* like is irrelevant.

<OT>
Does anyone know if there was ever a Lisp machine that hosted a C
implementation <gasp>?
</OT>

Cheers,

Martti Halminen

Feb 29, 2004, 12:30:03 PM
On Sun, 29 Feb 2004 10:52:49 -0600, Artie Gold wrote:

> <OT>
> Does anyone know if there was ever a Lisp machine that hosted a C
> implementation <gasp>?
> </OT>

Don't know about the others, but at least Symbolics did. Pascal, Fortran,
Ada and Prolog, too, IIRC.

--

Bruno Haible

Feb 29, 2004, 12:24:59 PM
Peter Seibel <pe...@javamonkey.com> wrote:
>> Think about what the integer returned by file-length means. Then think
>> about what the integer printed by Unix "ls -l" means. They mean quite
>> different things.
>
> I don't follow you. At least on GNU/Linux the integer printed in the
> fourth column of the output of "ls -l" is the size of the file in
> bytes.

"Size" can mean either
- N: the number of bytes you can read() from a file,
- M: the maximum allowed file position to which you can lseek(),
- L: the st_size value, shown by "ls -l".

On DOS and Woe32 systems, you are accustomed to N < M = L, due to the fact
that the OS may convert CR/LF to LF when reading bytes.

But even on Linux you have files where N < M. The /proc/<pid>/maps files
are an example. There is no requirement that a read() call which returns
'count' bytes of data increases the file position by exactly 'count'.
I'm not sure whether L should be = N or = M in this case; for
/proc/<pid>/maps Linus decided to set it to 0.

Bruno

Peter Seibel

Feb 29, 2004, 2:22:30 PM
br...@clisp.org (Bruno Haible) writes:

> Peter Seibel <pe...@javamonkey.com> wrote:
> >> Think about what the integer returned by file-length means. Then think
> >> about what the integer printed by Unix "ls -l" means. They mean quite
> >> different things.
> >
> > I don't follow you. At least on GNU/Linux the integer printed in the
> > fourth column of the output of "ls -l" is the size of the file in
> > bytes.
>
> "Size" can mean either
> - N: the number of bytes you can read() from a file,
> - M: the maximum allowed file position to which you can lseek(),
> - L: the st_size value, shown by "ls -l".

Okay, but doesn't FILE-LENGTH still return L? By experimentation I see
that using different sizes of unsigned-byte causes it to return
different answers, at least in Allegro. But as my previous post
showed, it doesn't do that with characters, which is the case where
there is another meaning of "length" that is not a simple matter of
arithmetic. When I open a stream with '(unsigned-byte 32) instead of
'(unsigned-byte 8), the length is indeed 1/4 of the value returned by
ls -l, but it appears to return just as fast regardless of the length
of the file, so I suspect it is doing a stat and dividing st_size by 4,
not actually reading the file.
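
Roughly, that comparison looks like this (a sketch using the file from
the earlier post; how implementations round when the element type does
not evenly divide the octet count appears to be unspecified):

(let ((path "/tmp/utf8test.txt"))
  (list (with-open-file (in path :element-type '(unsigned-byte 8))
          (file-length in))    ; the octet count, same as ls -l reports
        (with-open-file (in path :element-type '(unsigned-byte 32))
          (file-length in))))  ; about a quarter of that, returned instantly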

Tim Bradshaw

Feb 29, 2004, 2:39:43 PM
* Pascal Bourguignon wrote:

> Peter did not ask about languages (like C), but about systems (like
> unix). In unix, the file size is an attribute of the file (inode)
> and is accessible without having to open the file. You don't even
> have to have read or search access rights on the file you want to
> know the size! You just have to have read access on a directory
> where an entry for that file (inode) is stored.

This is wrong. You need execute permission on the directory: read
permission will let you have the names of the files but no other
information about them.

--tim

Tim Bradshaw

Feb 29, 2004, 2:53:16 PM
* Peter Seibel wrote:

> Okay, but doesn't FILE-LENGTH still return L? By experimentation I
> see that using different sizes of unsigned-byte causes it to return
> different answers, at least in Allegro. But as my previous post
> showed, it doesn't do that with characters which is the case where
> there is another meaning of "length" that is not a simple matter of
> arithmetic.

Because the OS makes it very expensive to compute the real length in
characters. Essentially the only way you can do so on a Unix or
Windows system is to read the whole file and see how long it is.
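
Schematically, that full read is just the following (a sketch; the
function name is made up):

;; Count characters the only portable way: by reading every one of them.
(defun count-characters (path)
  (with-open-file (in path :direction :input :element-type 'character)
    (loop while (read-char in nil nil) count t)))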

--tim

Tim Bradshaw

Feb 29, 2004, 2:50:11 PM
* Peter Seibel wrote:

> I don't follow you. At least on GNU/Linux the integer printed in the
> fourth column of the output of "ls -l" is the size of the file in
> bytes. Based on previous discussions of FILE-LENGTH here, my
> understanding is that is what FILE-LENGTH returns.

ls -l does indeed tell you the length in bytes. This is very often
not useful. If you've ever used an obscure OS called `Windows' (or
its predecessor, `MSDOS'), you'll know that the byte length of a text
file generally does not correspond to the length of the file in
characters as read by most Lisp implementations, because the
end-of-line sequence, which reads as one character, is represented in
the file as two.

Even stranger: there are countries (yes, really) which not only need
more than 7 bit characters to represent their alphabets, but need more
than *8*. Worse, there's more than one of these countries, with
mutually incompatible alphabets. You might need 16 or 32 bits to
represent a character. Of course, you might then encode data in files
to keep it reasonably compact. The length of the file in bytes then
bears almost no relation to its length in characters.

Of course, FILE-LENGTH generally doesn't solve this problem either,
but it has a better chance of providing the right answer than ls -l:
on a system which kept better metadata about files (lengths in octets,
characters &c), then once the file is opened there is probably enough
information to return a meaningful length.

--tim

Peter Seibel

Feb 29, 2004, 8:31:20 PM
Tim Bradshaw <t...@cley.com> writes:

> Of course, FILE-LENGTH generally doesn't solve this problem either,
> but it has a better chance of providing the right answer than ls -l:
> on a system which kept better metadata about files (lengths in
> octets, characters &c), then once the file is opened there is
> probably enough information to return a meaningful length.

Which brings me back to my original question: is there or was there
some operating system which worked this way--where the natural way to
get the length of a file required performing the same operation one
performed to read data from the file?

Pascal Bourguignon

Feb 29, 2004, 8:33:59 PM
Peter Seibel <pe...@javamonkey.com> writes:


On CLISP (which is good European, universalist software and knows
about UTF-8, Chinese, Ethiopian, Korean, and all kinds of scripts ;-):

[65]> (let ((path "/local/users/pascal/src/miscellaneous/tests/misc/UTF-8-demo.utf-8"))
        (with-open-file (in path :direction :input)
          (do ((i 0 (1+ i))
               (ch (read-char in nil nil) (read-char in nil nil)))
              ((null ch)
               (format t "~&clisp says it read ~6D characters~%" i)))
          (format t "~&clisp says last file position is ~6D~%" (file-position in))
          (format t "~&clisp says file length is ~6D~%" (file-length in)))
        (with-open-file (in path :direction :input
                                 :element-type '(unsigned-byte 32))
          (format t "~&clisp says binary file length is ~6D~%" (file-length in)))
        (format t "~&unix says file size is ~6D~%"
                (multiple-value-bind (res stat) (linux:|stat| path)
                  (if (zerop res)
                      (linux:|stat-st_size| stat)
                      0))))
clisp says it read 7627 characters
clisp says last file position is 14058
clisp says file length is 14058
clisp says binary file length is 3514
unix says file size is 14058
NIL

On SBCL (which is bad American, closed-minded, imperialist, pure-ASCII
software and knows only about 8-bit characters :-)):

* (let ((path "/local/users/pascal/src/miscellaneous/tests/misc/UTF-8-demo.utf-8"))
    (with-open-file (in path :direction :input)
      (do ((i 0 (1+ i))
           (ch (read-char in nil nil) (read-char in nil nil)))
          ((null ch)
           (format t "~&sbcl says it read ~6D characters~%" i)))
      (format t "~&sbcl says last file position is ~6D~%" (file-position in))
      (format t "~&sbcl says file length is ~6D~%" (file-length in)))
    (with-open-file (in path :direction :input
                             :element-type '(unsigned-byte 32))
      (format t "~&sbcl says binary file length is ~6D~%" (file-length in)))
    (format t "~&unix says file size is ~6D~%"
            (multiple-value-bind (s a b c d u g f size) (sb-unix:unix-stat path)
              (declare (ignore a b c d u g f))
              (if s size 0))))
sbcl says it read 14058 characters
sbcl says last file position is 14058
sbcl says file length is 14058
sbcl says binary file length is 3514
unix says file size is 14058


Since the standard does not say what a file length is, any answer is good.
Note that (mod 14058 4) = 2.


I'd say that the safest bet would be to use:

(with-open-file (in path :direction :input :element-type '(unsigned-byte 8))
  (file-length in))

to get the unix file size, at least on any sane implementation. But if
you want any guarantee, you'd better use FFI and stat(2).

Pascal Bourguignon

Feb 29, 2004, 8:40:19 PM
Tim Bradshaw <t...@cley.com> writes:

> * Peter Seibel wrote:
>
> > I don't follow you. At least on GNU/Linux the integer printed in the
> > fourth column of the output of "ls -l" is the size of the file in
> > bytes. Based on previous discussions of FILE-LENGTH here, my
> > understanding is that is what FILE-LENGTH returns.
>
> ls -l does indeed tell you the length in bytes. This is very often
> not useful.

NO. It is very useful. The need to open the file to get a size is
harmful because it builds in a specific assumption about what is wanted.
Sorry, but I NEVER need to know how many characters are in a file.
Even journalists generally don't need to know how many characters
they've typed, they just need to know how many WORDS they've got.

But, I often need to know how many BYTES a file takes, because I often
need to copy files to make backups and it's important to be able to
efficiently sum up the byte sizes of several files.

The way the file system API is defined in Common-Lisp implies that you
cannot program system tools such as a simple backup program in
Common-Lisp. You must rely on an OS-specific layer.
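
To make that concrete, the portable best one can do today is something
like the sketch below, which trusts the common (but not guaranteed)
reading of FILE-LENGTH on an (unsigned-byte 8) stream as an octet count:

;; Sum the sizes, in octets, of several files by opening each one as a
;; byte stream.  TOTAL-OCTETS is an illustrative name, not a standard one.
(defun total-octets (paths)
  (loop for path in paths
        sum (with-open-file (in path :element-type '(unsigned-byte 8))
              (or (file-length in) 0))))   ; FILE-LENGTH may return NIL

;; For example: (total-octets (directory "/tmp/*.txt"))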

Christopher C. Stacy

Feb 29, 2004, 10:22:23 PM
>>>>> On Mon, 01 Mar 2004 01:31:20 GMT, Peter Seibel ("Peter") writes:

Peter> Tim Bradshaw <t...@cley.com> writes:
>> Of course, FILE-LENGTH generally doesn't solve this problem either,
>> but it has a better chance of providing the right answer than ls -l:
>> on a system which kept better metadata about files (lengths in
>> octets, characters &c), then once the file is opened there is
>> probably enough information to return a meaningful length.

Peter> Which brings me back to my original question: is there or was
Peter> there some operating system which worked this way--where the
Peter> natural way to get the length of a file required performing
Peter> the same operation one performed to read data from the file?

The operating systems of the 1970/80s often had file systems whose
directories knew a file's character size. Just not Unix and DOS.
You did not need to open the file and read it in order to guess.
But the Lisp implementation might want to open up the file and peek
at the first few bytes in order to discover that it was using some
particular character encoding (of known uniform size characters).

I think the LispM might have done that under some circumstances,
but I can't really remember. (I was thinking of the extended
character set with font encodings understood by ZMACS, but hmmm,
weren't those variable? So I don't think that's right...)

However, you could also place arbitrary metadata in the LispM's native
filesystem directories. (Directory entries could have a plist).

Obviously once you're talking about newline encoding or other
variable-length, randomly occurring characters, you just have to
count them up unless the file system already counted them up.
I think the file system (aka "Record Management System") on
VMS would keep track for you, if you told it that each line
was a "record".

Pascal Bourguignon

Feb 29, 2004, 10:50:48 PM
Peter Seibel <pe...@javamonkey.com> writes:

> Tim Bradshaw <t...@cley.com> writes:
>
> > Of course, FILE-LENGTH generally doesn't solve this problem either,
> > but it has a better chance of providing the right answer than ls -l:
> > on a system which kept better metadata about files (lengths in
> > octets, characters &c), then once the file is opened there is
> > probably enough information to return a meaningful length.
>
> Which brings me back to my original question: is there or was there
> some operating system which worked this way--where the natural way to
> get the length of a file required performing the same operation one
> performed to read data from the file?

That's the wrong question. (The obvious answer being yes.)

The right question is: what the hell does Common-Lisp actually
mean by FILE-LENGTH???

Now, on old systems where there is no notion of files as streams of
bytes, you could have, for example, 80-column Hollerith card-image
files. On sectors of 512 bytes (not _so_ old systems, just systems
with an _old_ heritage; I'm thinking in this case of the ICL S25),
they would store 6 records of 80 characters (and leave 32 bytes per
sector lost, or used for file system bookkeeping: file-ID, deleted
flags, next/previous pointers, for example). So, when you consider
such a mess^W file, what do you call its size?

- the number of 512-byte sectors allocated?
- the number of non-deleted "cards"?
- the number of characters?

How do you count these characters? On a perforated card, when there
are no perforations you have a space character (a null character =
space). So you usually have 80 characters per line/card. But a
program that wanted to process variable-sized lines could write a
line-termination character and ignore all characters after it (a
character that might well not appear if all 80 columns were used).

Obviously, to count the characters, or even to count the cards, you'd
have to open the file and read it.

AFAIK, unix was the first system to have file systems with a notion of
inode. I may be wrong here, but I've got the impression that before
unix, you could have directories with some meta data, but usually you
had to "open" the files to get their properties. On unix you can read
all the properties, the whole inode, with stat(2) without having to
open the file.


Christopher C. Stacy

Feb 29, 2004, 11:16:20 PM
>>>>> On 01 Mar 2004 04:50:48 +0100, Pascal Bourguignon ("Pascal") writes:

Pascal> AFAIK, unix was the first system to have file systems with a notion of
Pascal> inode. I may be wrong here, but I've got the impression that before
Pascal> unix, you could have directories with some meta data, but usually you
Pascal> had to "open" the files to get their properties. On unix you can read
Pascal> all the properties, the whole inode, with stat(2) without having to
Pascal> open the file.

Where in the hell do people get such crazy ideas?

Pierpaolo BERNARDI

Mar 1, 2004, 3:12:16 AM

"Artie Gold" <arti...@austin.rr.com> ha scritto nel messaggio news:c1t5d4$1kj0oj$1...@ID-219787.news.uni-berlin.de...

The Zeta C compiler, C on Symbolics, written by Scott L. Burson,
has recently been put in the public domain by its author.

I don't have a URL handy; the distribution is called ZETA-C-PD.tgz.

P.

Rahul Jain

unread,
Mar 1, 2004, 9:58:05 AM3/1/04
to
"Pierpaolo BERNARDI" <pierpaolo...@hotmail.com> writes:

> The Zeta C compiler, C on symbolics, written by Scott L. Burson,
> has been recently put in the public domain by its author.

Wow. That's gotta be some cool lisp code to look at.

> I don't have an url handy, the distribution is called ZETA-C-PD.tgz.

It's at www.spies.com/~aek/explorer/zeta-c/ (no surprise there :).

--
Rahul Jain
rj...@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist

Pascal Bourguignon

Mar 1, 2004, 3:28:22 PM
cst...@news.dtpq.com (Christopher C. Stacy) writes:

> >>>>> On 01 Mar 2004 04:50:48 +0100, Pascal Bourguignon ("Pascal") writes:
>
> Pascal> AFAIK, unix was the first system to have file systems with
> Pascal> a notion of inode. I may be wrong here, but I've got the
> Pascal> impression that before unix, you could have directories
> Pascal> with some meta data, but usually you had to "open" the
> Pascal> files to get their properties. On unix you can read all
> Pascal> the properties, the whole inode, with stat(2) without
> Pascal> having to open the file.


>
> Where in the hell do people get such crazy ideas?

AFAIK = As Far As I Know.

Perhaps I don't know enough. For sure, I don't know everything.

Ray Dillinger

Mar 2, 2004, 9:14:42 PM
Tim Bradshaw wrote:
>

> Even stranger: there are countries (yes, really) which not only need
> more than 7 bit characters to represent their alphabets, but need more
> than *8*. Worse, there's more than one of these countries, with
> mutually incompatible alphabets. You might need 16 or 32 bits to
> represent a character. Of course, you might then encode data in files
> to keep it reasonably compact. The length of the file in bytes then
> bears almost no relation to its length in characters.
>
> Of course, FILE-LENGTH generally doesn't solve this problem either,
> but it has a better chance of providing the right answer than ls -l:
> on a system which kept better metadata about files (lengths in octets,
> characters &c), then once the file is opened there is probably enough
> information to return a meaningful length.

Oh, it's even stranger than that; recently I've been involved in a
spirited debate as to whether a "character" is actually a single
unicode codepoint, or a base codepoint plus a nondefective sequence
of zero or more combining codepoints.

And my conclusion is that when you ask "read-char" for a character
on a unicode-enabled system, it ought to give you the latter, not
the former. Likewise characters in strings and character values
should be unicode sequences.
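
A sketch of what such a reader could look like; COMBINING-CODEPOINT-P is
purely hypothetical here and would in practice have to consult the
Unicode character database:

;; Read one "character" as a base codepoint plus any immediately following
;; combining codepoints, returned as a string (or NIL at end of file).
;; COMBINING-CODEPOINT-P is an assumed predicate, not part of any standard.
(defun read-grapheme (stream)
  (let ((base (read-char stream nil nil)))
    (when base
      (with-output-to-string (out)
        (write-char base out)
        (loop for next = (peek-char nil stream nil nil)
              while (and next (combining-codepoint-p next))
              do (write-char (read-char stream) out))))))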

And FILE-LENGTH? If you want the length in readable characters you
have to open it and read it. If you want the length in octets, take
what the operating system gives you.

Bear

Pascal Bourguignon

Mar 3, 2004, 8:14:54 PM
Ray Dillinger <be...@sonic.net> writes:
> Oh, it's even stranger than that; recently I've been involved in a
> spirited debate as to whether a "character" is actually a single
> unicode codepoint, or a base codepoint plus nondefective sequence
> of zero or more combining codepoints.
>
> And my conclusion is that when you ask "read-char" for a character
> on a unicode-enabled system, it ought to give you the latter, not
> the former. Likewise characters in strings and character values
> should be unicode sequences.

Indeed.



> And FILE-LENGTH? If you want the length in readable characters you
> have to open it and read it. If you want the length in octets, take
> what the operating system gives you.

And what's worse, I don't think it's prudent (portable) (ie, I think
there exists some system where it's not possible) to open a text file
as binary or a binary file as a text file. That is, when you have a
text file you cannot use FILE-LENGTH to get the number of bytes in it.
Ok, I'm not concerned since I use unix and don't even hope to get
anything better before long, if ever.

Tim Bradshaw

Mar 4, 2004, 7:59:45 PM
* Ray Dillinger wrote:

> And FILE-LENGTH? If you want the length in readable characters you
> have to open it and read it. If you want the length in octets, take
> what the operating system gives you.

Only on systems which don't know this information. For systems which
do, it may well be enough to tell it how you want to read the file (or
let the system tell you how the file is encoded), and then just ask.
Progress is possible.

--tim

Tim Bradshaw

Mar 4, 2004, 7:43:27 PM
* Pascal Bourguignon wrote:
> NO. It is very useful. The need to open the file to get a size is
> nefast because it implies a specific assumption on what is wanted.
> Sorry, but I NEVER need to know how many characters are in a file.
> Even journalists generally don't need to know how many characters
> they've typed, they just need to know how many WORDS they've got.

So, for instance, you never want to efficiently read a file into a
string. OK.

--tim

Ray Dillinger

Mar 6, 2004, 1:54:12 PM
Are there any filesystems that allow us to "decorate" file entries
with information like unicode grapheme (base char plus combining sequence)
counts?

Bear

Pascal Bourguignon

Mar 6, 2004, 5:50:48 PM

Ray Dillinger <be...@sonic.net> writes:
> Are there any filesystems that allow us to "decorate" file entries
> with information like unicode grapheme (base char plus combining sequence)
> counts?

The details of the mechanism don't matter. (You could use resource
forks on MacHFS, the attributes that I believe some MS-Windows-NT
file systems have, or just store it in a file named
".${filename}.attributes" or whatever.)

The problem is to keep the consistency of this attribute with the
contents of the file.

Either the system maintains it itself automatically and
authoritatively, or you (the applications) will have to read the file
anyway to count the characters or the graphemes.
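
For instance, an application-side cache has to be tied to something like
the file's write date, as in the sketch below (the names are
illustrative; it still pays for a full read whenever the file changes):

;; Cache a computed attribute (here the character count) together with the
;; file's write date, and trust it only while that date is unchanged.
(defvar *character-count-cache* (make-hash-table :test #'equal))

(defun cached-character-count (path)
  (let* ((key   (namestring (truename path)))
         (date  (file-write-date path))
         (entry (gethash key *character-count-cache*)))
    (if (and entry (eql (car entry) date))
        (cdr entry)                      ; cached value is still consistent
        (let ((count (with-open-file (in path :element-type 'character)
                       (loop while (read-char in nil nil) count t))))
          (setf (gethash key *character-count-cache*) (cons date count))
          count))))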


Note that nowadays there is so much RAM that we usually load whole
files (or even whole databases!) into RAM before working on them,
so it should not be too costly to count the characters each time we
"open" a file. That's when you realize that Multics segments were
really what we need (I've known that since 1984/MacOS), instead of
open/read/write/close. Happily, we can use memory mapping on modern
unix systems. Unfortunately, there's not much support for memory
mapping in Lisp. How do you memory-map and garbage-collect "file"
segments?

This should be an area of some experimentation and of standardization
for Common-Lisp-2010!

Pascal Bourguignon

Mar 9, 2004, 8:14:55 AM
Tim Bradshaw <t...@cley.com> writes:

When I want to load a whole file in memory, either I memory map it, or
I gather it as a list of lines. But this is not the point.

We could define abstract attributes for files. Perhaps we need
high-level attributes such as the number of characters, but then we
should specify them explicitly and precisely:


#+COMMON-LISP-2010 "

FILE-SIZE: return the number of bytes in the file.

(file-size path) === (with-open-file (in path :direction :input
                                              :element-type 'unsigned-byte)
                       (do ((size 0 (1+ size))
                            (byte (read-byte in nil nil)
                                  (read-byte in nil nil)))
                           ((null byte) size)))


FILE-CHARACTER-COUNT: return the number of characters in the file.

(file-character-count path) === (with-open-file (in path :direction :input
                                                         :element-type 'character)
                                  (do ((size 0 (1+ size))
                                       (char (read-char in nil nil)
                                             (read-char in nil nil)))
                                      ((null char) size)))


FILE-LINE-COUNT: return the number of lines in the file.

(file-line-count path) === (with-open-file (in path :direction :input
                                                    :element-type 'character)
                             (do ((size 0 (1+ size))
                                  (line (read-line in nil nil)
                                        (read-line in nil nil)))
                                 ((null line) size)))

These functions would be defined on the PATH of the files because the
file system may very well keep their values as meta data. If not, the
implementation would open and read the file.

"

To finish with your question:

> So, for instance, you never want to efficiently read a file into a
> string. OK.

anyway, the current COMMON-LISP standard DOES NOT SPECIFY ANYTHING
WORTHWHILE as a result for FILE-LENGTH when the file has been opened
with a binary element-type.


file-length returns the length of stream, or nil if the length
cannot be determined.

For a binary file, the length is measured in units of the element
type of the stream.

so if you want to "efficiently" read a file into a string, you will
in any case have to read it twice:

(defun efficient-read-file-into-a-string (path)
  (let ((string (make-string
                 (with-open-file (in path :direction :input
                                          :element-type 'character)
                   (do ((size 0 (1+ size))
                        (char (read-char in nil nil)
                              (read-char in nil nil)))
                       ((null char) size))))))
    (with-open-file (in path :direction :input :element-type 'character)
      (read-sequence string in))
    string))


But I bet that it will be more efficient to read the file only once
and extend the string as needed. I could even apply this _heuristic_:
allocate a string that can hold as many characters as there are bytes
in the file, read the characters from the file, extending the string
if ever needed. When the file is read, reduce the size of the string
to fit.
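
A sketch of that heuristic (it assumes, as discussed earlier in the
thread, that FILE-LENGTH on an (unsigned-byte 8) stream gives the octet
count; the function name is made up):

;; Allocate a buffer sized from the octet count, fill it with the characters
;; actually read (extending only if ever needed), then trim it to fit.
(defun read-file-heuristically (path)
  (let* ((octets (with-open-file (in path :element-type '(unsigned-byte 8))
                   (or (file-length in) 0)))
         (buffer (make-array octets :element-type 'character
                                    :adjustable t :fill-pointer 0)))
    (with-open-file (in path :direction :input :element-type 'character)
      (loop for char = (read-char in nil nil)
            while char
            do (vector-push-extend char buffer)))
    (coerce buffer 'simple-string)))   ; trim: copy only the characters read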

Tim Bradshaw

Mar 10, 2004, 5:26:05 AM
Pascal Bourguignon <sp...@thalassa.informatimago.com> wrote in message news:<87ad2q2...@thalassa.informatimago.com>...

> anyway the current COMMON-LISP standard DOES NOT SPECIFY ANYTHING
> WORTHWHILE as result for FILE-LENGTH, when the file has been open with
> a binary element-type.
>

It's good that it doesn't specify anything useful, because if it did
this would make it a very expensive operation on many current OSs for
character streams. However it leaves open the *possibility* of
returning a useful result if, for instance, OSs which keep this kind
of data were ever to appear. I kind of like that, but I guess that's
because I'm a human being, not a computer scientist.

Pascal Bourguignon

Mar 10, 2004, 1:06:22 PM

tfb+g...@tfeb.org (Tim Bradshaw) writes:

Yep. If you were a computer programmer, you'd be lazy.

The problem with such kinds of specification is that it gives you (a
little) more work. Since you cannot count on conforming
implementations to provide the service, you have to implement it
yourself anyway. The little extra work is the time you spend reading
the feature specification and realizing that it's useless.
