
Fast CSV reading in Common Lisp


Alexandre

Jun 3, 2015, 5:29:44 AM
What is the fastest way to read a csv file in CL in a way such that:
1) all fields in the first line go into one array called column-names
2) the first field of each of all following lines goes into another array called row-names
3) all other fields go into another array called values ?

My file has the following form, just with a lot more columns and rows:

"";"ES1 Index";"VG1 Index";"TY1 Comdty";"RX1 Comdty";"GC1 Comdty"
"1999-01-04";1391.12;3034.53;66.515625;86.2;441.39
"1999-01-05";1404.86;3072.41;66.3125;86.17;440.63
"1999-01-06";1435.12;3156.59;66.4375;86.32;441.7
"1999-01-07";1432.32;3106.08;66.25;86.22;447.67

And the result I would like is:

#("1999-01-04" "1999-01-05" "1999-01-06" "1999-01-07" )
#("" "ES1 Index" "VG1 Index" "TY1 Comdty" "RX1 Comdty" "GC1 Comdty")
#(1391.12 3034.53 66.515625 86.2 441.39 1404.86 3072.41 66.3125 86.17 440.63
1435.12 3156.59 66.4375 86.32 441.7 1432.32 3106.08 66.25 86.22 447.67)

Are you aware of some CL library that does so already (fast)?
Are there any general issues regarding I/O performance, maybe compiler-specific, that I should be aware of?

Here is the way I am doing it now:

(with-open-file (stream "my-file.csv" :direction :input)
  (let* ((header (read-line stream nil))
         (columns-list (mapcar #'read-from-string
                               (cl-ppcre:split ";" header)))
         (number-of-columns (length columns-list))
         (column-names (make-array number-of-columns
                                   :initial-contents columns-list))
         (row-names (make-array 1 :adjustable t :fill-pointer 0))
         (values (make-array 1 :adjustable t :fill-pointer 0)))
    (set-syntax-from-char #\; #\Space)
    (loop
      :for reader = (read stream nil stream)
      :until (eq reader stream)
      :do (progn (vector-push-extend reader row-names)
                 (loop
                   :for count :from 2 :upto number-of-columns
                   :do (vector-push-extend (read stream nil)
                                           values)))
      :finally (return (values row-names
                               column-names
                               values)))))

Note: I wouldn't use set-syntax-from-char in real code, I am using it just for the sake of this example.

CL-USER> (lisp-implementation-type)
"SBCL"
CL-USER> (lisp-implementation-version)
"1.2.4.debian"

Running the same test with equivalent code in JavaScript is a lot faster at the moment.

Dimitri Fontaine

Jun 3, 2015, 5:34:34 AM
Alexandre <ala...@lrde.epita.fr> writes:
> Are you aware of some CL library that does so already (fast)?

https://github.com/AccelerationNet/cl-csv

I'm a happy user of it in pgloader, and have contributed some code,
because CSV is not so much a file format as a can of worms of slightly
different and incompatible formats, more or less convoluted, and full of
corner cases.
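For the shape the OP asked for, a minimal sketch on top of cl-csv could look like this (untested; it assumes the parse-number library for the numeric fields and passes the semicolon separator explicitly):

(let* ((rows (cl-csv:read-csv #P"my-file.csv" :separator #\;))
       ;; the first row holds the column names; the first field of each
       ;; remaining row holds the row name
       (column-names (coerce (first rows) 'vector))
       (row-names (map 'vector #'first (rest rows)))
       (values (coerce (loop :for row :in (rest rows)
                             :append (mapcar #'parse-number:parse-number
                                             (rest row)))
                       'vector)))
  (values column-names row-names values))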

--
dim

Pascal J. Bourguignon

Jun 3, 2015, 9:40:16 AM
Alexandre <ala...@lrde.epita.fr> writes:

> What is the fastest way to read a csv file in CL in a way such that:
> 1) all fields in the first line go into one array called column-names
> 2) the first field of each of all following lines goes into another array called row-names
> 3) all other fields go into another array called values ?
>
> My file has the following form, just with a lot more columns and rows:
>
> "";"ES1 Index";"VG1 Index";"TY1 Comdty";"RX1 Comdty";"GC1 Comdty"
> "1999-01-04";1391.12;3034.53;66.515625;86.2;441.39
> "1999-01-05";1404.86;3072.41;66.3125;86.17;440.63
> "1999-01-06";1435.12;3156.59;66.4375;86.32;441.7
> "1999-01-07";1432.32;3106.08;66.25;86.22;447.67
>
> And the result I would like is:
>
> #("1999-01-04" "1999-01-05" "1999-01-06" "1999-01-07" )
> #("" "ES1 Index" "VG1 Index" "TY1 Comdty" "RX1 Comdty" "GC1 Comdty")
> #(1391.12 3034.53 66.515625 86.2 441.39 1404.86 3072.41 66.3125 86.17 440.63
> 1435.12 3156.59 66.4375 86.32 441.7 1432.32 3106.08 66.25 86.22 447.67)

(let* ((data (com.informatimago.common-lisp.csv.csv:load-records
              #P"/tmp/test.csv" :field-separator #\;))
       (colhead (rest (first data)))
       (rowhead (mapcar (function first) (rest data))))
  (values colhead rowhead (mapcar (function rest) (rest data))))
--> ("ES1 Index" "VG1 Index" "TY1 Comdty" "RX1 Comdty" "GC1 Comdty")
("1999-01-04" "1999-01-05" "1999-01-06" "1999-01-07")
(("1391.12" "3034.53" "66.515625" "86.2" "441.39")
("1404.86" "3072.41" "66.3125" "86.17" "440.63")
("1435.12" "3156.59" "66.4375" "86.32" "441.7")
("1432.32" "3106.08" "66.25" "86.22" "447.67"))


> Are you aware of some CL library that does so already (fast)?

It was fast: 2 minutes to type it, no more.


> Are there any general issues regarding I/O performance, maybe
> compiler-specific, that I should be aware of?

On the data you provided it's instantaneous.
There are several other CSV libraries.
Perhaps you'd want to write a proper benchmark?



--
__Pascal Bourguignon__ http://www.informatimago.com/
“The factory of the future will have only two employees, a man and a
dog. The man will be there to feed the dog. The dog will be there to
keep the man from touching the equipment.” -- Carl Bass CEO Autodesk

WJ

Jun 3, 2015, 2:25:36 PM
Gauche Scheme:

(define (parse-csv-line line :optional (sep #\,))
  (map
   (lambda (s)
     (if (rxmatch #/^".*"$/ s)
         (substring s 1 (- (string-length s) 1))
         (or (string->number s) s)))
   (string-split line sep)))

(define data
  (with-input-from-file "data.txt"
    (cut generator-map (cut parse-csv-line <> #\;) read-line)))

(values (car data) (map car (cdr data)) (map cdr (cdr data)))

===>
("" "ES1 Index" "VG1 Index" "TY1 Comdty" "RX1 Comdty" "GC1 Comdty")
("1999-01-04" "1999-01-05" "1999-01-06" "1999-01-07")
((1391.12 3034.53 66.515625 86.2 441.39)
(1404.86 3072.41 66.3125 86.17 440.63)
(1435.12 3156.59 66.4375 86.32 441.7)
(1432.32 3106.08 66.25 86.22 447.67))


Alberto Riva

Jun 3, 2015, 6:56:36 PM
Note that this is not exactly what the OP asked for: he wanted the data
in a 1-dimensional array, apparently.

But of course it's a lot easier to read the data this way, and then copy
it to a 1-dimensional array, if really necessary, once the total
number of values is known.
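A minimal sketch of that final copy, assuming the rows have already been read as lists of parsed numbers:

(defun rows->flat-vector (rows)
  ;; Pre-size the result once the total count is known, then fill it.
  (let* ((n (reduce #'+ rows :key #'length))
         (result (make-array n))
         (i 0))
    (dolist (row rows result)
      (dolist (x row)
        (setf (aref result i) x)
        (incf i)))))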

Alberto

Alexandre Landi

Jun 5, 2015, 2:27:11 PM
On Wednesday, June 3, 2015 at 3:40:16 PM UTC+2, informatimago wrote:

> On the data you provided it's instantaneous.
> There are several other CSV libraries.
> Perhaps you'd want to write a proper benchmark?

On an 80MB file of the same form, this is what happens with SBCL:

Evaluation took:
20.715 seconds of real time
15.620000 seconds of total run time (15.420000 user, 0.200000 system)
[ Run times consist of 2.344 seconds GC time, and 13.276 seconds non-GC time. ]
75.40% CPU
12 lambdas converted
62,145,390,456 processor cycles

With equivalent code in JavaScript, the very same result takes:

real 0m0.475s, user 0m0.436s, sys 0m0.056s

Here's the script in JS:

var lazy = require('lazy'), fs = require('fs'),
    values = [], rownames = [], colnames = [], fl = 'Ct.csv', i = 0;
new lazy(fs.createReadStream(fl)).lines.forEach(function (line) {
  var tmp = line.toString().split(';');
  i++;
  if (i == 1) colnames = tmp;
  else {
    rownames.push(tmp[0]);
    for (var k = 1; k < tmp.length; k++) values.push(tmp[k]);
  }
}).on('pipe', function () {
  console.log(rownames.length + ":" + colnames.length + ":" + values.length);
});

Did you mean something like this, Pascal?

I find it hard to believe there is no way around CL running 40 times slower than JS. My intuition suggests there must be some way to close (or at least reduce) this gap, but I am not sure I am proficient enough in CL to know how.

Any insight?

Dimitri Fontaine

Jun 5, 2015, 2:43:57 PM
Alexandre Landi <ala...@lrde.epita.fr> writes:
> I find it hard to believe there is no way around CL running 40
> times slower than JS.

What results do you get with cl-csv?

--
dim

Zach Beane

Jun 5, 2015, 3:37:00 PM
Alexandre Landi <ala...@lrde.epita.fr> writes:

> I find it hard to believe there is no way around CL running 40
> times slower than JS. My intuition suggests there must be some way
> to close (or at least reduce) this gap, but I am not sure I am
> proficient enough in CL to know how.

It's possible that SBCL's internal representation of characters as
32-bit values, and the conversion time that entails, is causing some of
the increased runtime.

Zach

Pascal J. Bourguignon

Jun 5, 2015, 5:02:44 PM
Alexandre Landi <ala...@lrde.epita.fr> writes:

> On Wednesday, June 3, 2015 at 3:40:16 PM UTC+2, informatimago wrote:
>
>> On the data you provided it's instantaneous.
>> There are several other CSV libraries.
>> Perhaps you'd want to write a proper benchmark?
>
> On an 80MB file of the same form, this is what happens with SBCL:
>
> Evaluation took:
> 20.715 seconds of real time
> 15.620000 seconds of total run time (15.420000 user, 0.200000 system)
> [ Run times consist of 2.344 seconds GC time, and 13.276 seconds non-GC time. ]
> 75.40% CPU
> 12 lambdas converted
> 62,145,390,456 processor cycles

What are you measuring?


> With equivalent code in JavaScript, the very same result takes:
>
> real:0m0.475s,user:0m0.436s,sys:0m0.056s *
>
> Here's the script in JS:
>
> var lazy = require('lazy'), fs = require('fs'),
>     values = [], rownames = [], colnames = [], fl = 'Ct.csv', i = 0;
> new lazy(fs.createReadStream(fl)).lines.forEach(function (line) {
>   var tmp = line.toString().split(';');
>   i++;
>   if (i == 1) colnames = tmp;
>   else {
>     rownames.push(tmp[0]);
>     for (var k = 1; k < tmp.length; k++) values.push(tmp[k]);
>   }
> }).on('pipe', function () {
>   console.log(rownames.length + ":" + colnames.length + ":" + values.length);
> });
>
> Did you mean something like this, Pascal?
>
> I find it hard to believe there is no way around CL running 40
> times slower than JS. My intuition suggests there must be some way
> to close (or at least reduce) this gap, but I am not sure I am
> proficient enough in CL to know how.
>
> Any insight?

1- For text files, lisp will _decode_ bytes into characters, and store
characters. Most other languages will work with bytes, and store
bytes (possibly utf-8 byte sequences).

2- All the CSV lisp libraries I've seen (and other similar ones) work
with text files, without caring about any particular optimization aspects.

One way to achieve similar I/O performance is to implement similar
algorithms using binary files, processing vectors of bytes.


You may have a look at com.informatimago.common-lisp.cesarum.ascii
to see how you could do that without too much tedium.


(ql:quickload :com.informatimago.common-lisp.cesarum)
(use-package '(:com.informatimago.common-lisp.cesarum.ascii
               :com.informatimago.common-lisp.cesarum.file))

(set-dispatch-macro-character #\# #\"
                              (function ascii-dispatch-macro) *readtable*)

#"HELLO"
--> #(72 69 76 76 79)

(bytes= #"HELLO" #(72 69 76 76 79))
--> T

(bytes= #"\"\"" #(34 34))
--> T

(with-open-file (bin "/tmp/test.csv" :element-type '(unsigned-byte 8))
  (read-ascii-line bin t nil :lf))
--> #(34 34 59 34 69 83 49 32 73 110 100 101 120 34 59 34 86 … 121 34 10)


If the file fits in your RAM, it might be better to read it all at once
into a buffer, and then process it from there:

(binary-file-contents "/tmp/test.csv")
--> #(34 34 59 34 69 83 49 32 73 110 … 54 55 10)

(mapcar (lambda (line)
          ;; Proper parsing of the line is required to deal correctly
          ;; with ";";"," but this will do for the demo:
          (split-sequence (aref #";" 0) line))
        (split-sequence LF (binary-file-contents "/tmp/test.csv")
                        :remove-empty-subseqs t))
--> ((#(34 34) #(34 69 83 49 32 73 110 100 101 120 34)
#(34 86 71 49 32 73 110 100 101 120 34)
#(34 84 89 49 32 67 111 109 100 116 121 34)
#(34 82 88 49 32 67 111 109 100 116 121 34)
#(34 71 67 49 32 67 111 109 100 116 121 34))
(#(34 49 57 57 57 45 48 49 45 48 52 34) #(49 51 57 49 46 49 50)
#(51 48 51 52 46 53 51) #(54 54 46 53 49 53 54 50 53) #(56 54 46 50)
#(52 52 49 46 51 57))
(#(34 49 57 57 57 45 48 49 45 48 53 34) #(49 52 48 52 46 56 54)
#(51 48 55 50 46 52 49) #(54 54 46 51 49 50 53) #(56 54 46 49 55)
#(52 52 48 46 54 51))
(#(34 49 57 57 57 45 48 49 45 48 54 34) #(49 52 51 53 46 49 50)
#(51 49 53 54 46 53 57) #(54 54 46 52 51 55 53) #(56 54 46 51 50)
#(52 52 49 46 55))
(#(34 49 57 57 57 45 48 49 45 48 55 34) #(49 52 51 50 46 51 50)
#(51 49 48 54 46 48 56) #(54 54 46 50 53) #(56 54 46 50 50)
#(52 52 55 46 54 55)))


(split-sequence (aref #";" 0)
                (with-open-file (bin "/tmp/test.csv"
                                     :element-type '(unsigned-byte 8))
                  (read-ascii-line bin t nil :lf)))
--> (#(34 34)
#(34 69 83 49 32 73 110 100 101 120 34)
#(34 86 71 49 32 73 110 100 101 120 34)
#(34 84 89 49 32 67 111 109 100 116 121 34)
#(34 82 88 49 32 67 111 109 100 116 121 34)
#(34 71 67 49 32 67 111 109 100 116 121 34 10))
66

Of course, if you need to convert to numbers without going through
characters, you will have to implement the parsing and conversion
routines yourself.
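For instance, a sketch of such a routine (untested; it assumes ASCII digits and the [0-9]*.[0-9]* syntax used elsewhere in this thread):

(defun parse-float-bytes (bytes start end)
  ;; Parse a float directly from a vector of ASCII bytes,
  ;; without going through characters.
  (let ((dot  (char-code #\.))
        (zero (char-code #\0))
        (int 0) (frac 0.0) (scale 0.1) (i start))
    (loop :while (and (< i end) (/= (aref bytes i) dot))
          :do (setf int (+ (* 10 int) (- (aref bytes i) zero)))
              (incf i))
    (incf i)                            ; skip the dot
    (loop :while (< i end)
          :do (incf frac (* scale (- (aref bytes i) zero)))
              (setf scale (/ scale 10.0))
              (incf i))
    (+ int frac)))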


With ccl:

cl-user> (time (length (binary-file-contents "/tmp/80G.data")))
(length (binary-file-contents "/tmp/80G.data"))
took 4,734,542 microseconds (4.734542 seconds) to run.
295,313 microseconds (0.295313 seconds, 6.24%) of which was spent in GC.
During that period, and with 1 available CPU core,
3,048,190 microseconds (3.048190 seconds) were spent in user mode
12,001 microseconds (0.012001 seconds) were spent in system mode
244,810,928 bytes of memory allocated.
90,272 minor page faults, 0 major page faults, 0 swaps.
--> 81602538


with bash:

% time (while read line ; do true ; done) </tmp/80G.data

real 0m8.432s
user 0m5.212s
sys 0m0.360s

WJ

Jun 6, 2015, 4:28:07 AM
Pascal J. Bourguignon wrote:

> ("1432.32" "3106.08" "66.25" "86.22" "447.67"))

Wrong.

Those objects should be numbers, not strings.


Alexandre Landi

Jun 6, 2015, 9:11:56 AM
On Friday, June 5, 2015 at 8:43:57 PM UTC+2, Dimitri Fontaine wrote:

> What results do you get with cl-csv?

I am getting this error for now:

We finished reading a quoted value and got more characters before a separator or EOL 5
"";""@AALBE4";...;"YUM";"ZION";"ZMRH"
[Condition of type CL-CSV:CSV-PARSE-ERROR]

Struggling a bit to figure out what the problem is, as the file looks very much like the piece of csv file I enclosed in the original post...

Alexandre Landi

Jun 6, 2015, 9:33:49 AM
On Friday, June 5, 2015 at 11:02:44 PM UTC+2, informatimago wrote:

> > On an 80MB file of the same form, this is what happens with SBCL:
> >
> > Evaluation took:
> > 20.715 seconds of real time
> > [...]
>
> What are you measuring?

I was measuring how long it takes to run the CL script I wrote in the original post on an 80MB csv file with 3836 rows and 1550 columns, instead of the 5 rows and 5 columns of the original post.

> 1- For text files, lisp will _decode_ bytes into characters, and store
> characters. Most other languages will work with bytes, and store
> bytes (possibly utf-8 byte sequences).
>
> 2- All the CSV lisp libraries I've seen (and other similar ones) work
> with text files, without caring about any particular optimization aspects.
>
> One way to achieve similar I/O performance is to implement similar
> algorithms using binary files, processing vectors of bytes.
>
> [...]

Thanks a lot Pascal, I will go through your post over and over again and try to implement this solution on the file I am working with.

Raymond Wiker

Jun 6, 2015, 9:43:11 AM
Probably because that is not a valid csv file... it uses semicolon
instead of comma as a separator. Further, the sequence ""@ is not valid:
if the idea is to have a value that contains a double-quote followed by
an at-sign, it should be represented as """@...

A correct (according to RFC4180) representation of the line above should
be something like

"","""@AALBE4",...,"YUM","ZION","ZMRH"

or even

,"""@AALBE4",...,YUM,ZION,ZMRH

This particular kind of idiocy is commonly found in "csv" exported by
Microsoft Excel, which alters the separator character according to the
locale settings, and is even unable to reliably read back files
generated by itself.

Alexandre Landi

Jun 6, 2015, 10:58:30 AM
On Saturday, June 6, 2015 at 3:43:11 PM UTC+2, Raymond Wiker wrote:

> > Struggling a bit to figure out what the problem is, as the file looks
> > very much like the piece of csv file I enclosed in the original
> > post...
>
> Probably because that is not a valid csv file... it uses semicolon
> instead of comma as a separator.

Actually, that should not be an issue as long as you specify the separator:

(cl-csv:read-csv #P"your-file.csv" :separator #\;)

> Further, the sequence ""@ is not valid:

I realized that, for some reason, the very second field contained two double quotes (""@AALBE4"). If I remove one, the error no longer shows up. However, if I compile the following expression:

(time (progn (cl-csv:read-csv #P"big-prices.csv" :separator #\;) t))

it says ; compiling (TIME (PROGN # ...))(:CL-CSV) but goes on and on until I kill the buffer.

The same happens if I substitute all @ characters in my file with any ordinary character. So I am not sure the issue is the @ character.

Pascal J. Bourguignon

Jun 6, 2015, 2:26:01 PM
This is why CSV is a bad choice of file format: there is too much
idiocy surrounding it, and thus those errors.

Basically, your data is written wrong. You will need to perform ad-hoc
cleanup on its format (such as removing stray double quotes as in
""@AALBE4" -> "@AALBE4"). Normally, one would patch the program that
produces this data, but since this program is often proprietary, and it
would cost you several hundred million dollars to get the sources to be
able to correct it, or to have Microsoft correct it for you, it's hopeless.

There is no issue with the @ character, just with the rest of the file
format.
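A possible sketch of such an ad-hoc cleanup, assuming the stray quote always shows up as a doubled quote at the start of a quoted field (untested; it reuses the cl-ppcre library the OP already loads):

(defun clean-line (line)
  ;; Rewrite a stray doubled quote before a field value, ""X -> "X,
  ;; leaving genuinely empty fields ("") alone.
  (cl-ppcre:regex-replace-all "\"\"(?=[^;\"])" line "\""))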

WJ

Jun 7, 2015, 4:32:04 PM
Gauche Scheme:

(use text.csv)

(define reader (make-csv-reader #\,))
(time
 (car
  (call-with-input-file "standard-data-big"
    (lambda (port) (port->list reader port)))))

Alexandre Landi

Jun 22, 2015, 10:47:22 AM
On Friday, June 5, 2015 at 11:02:44 PM UTC+2, informatimago wrote:

> One way to achieve similar I/O performance is to implement similar
> algorithms using binary files, processing vectors of bytes.

I am wondering whether there exists some "higher-level" solution to this problem.

Given a file of this form,

"1999-01-04";1391.12;3034.53;66.515625;86.2;441.39
"1999-01-05";1404.86;3072.41;66.3125;86.17;440.63
"1999-01-06";1435.12;3156.59;66.4375;86.32;441.7
"1999-01-07";1432.32;3106.08;66.25;86.22;447.67

we know the first field on each line is a string and all subsequent fields are floating-point numbers. I am guessing that much of the slowness I am observing may be due to the fact that the read function is generic and infers the type of what is being read.

It would be delightful if there existed two functions read-string and read-float such that we could write something like the following:

(with-open-file (stream "my-file.csv" :direction :input)
  (loop :for count = 0 :then (1+ count)
        :do (if (zerop (mod count number-of-columns))
                (read-string stream nil)
                (read-float stream nil))))

My imagination suggests that it should run a lot faster than the generic read. Are you aware of any existing functions equivalent to the read-string and read-float in my wishful script?

Pascal J. Bourguignon

Jun 22, 2015, 2:13:22 PM
If your floats always have this syntax [0-9]*.[0-9]* (or
[0-9]+.[0-9]+) then it seems that your imagination is not wrong.




(defun read-float (stream)
  "Assumes [0-9]*.[0-9]*/[^0-9]"
  (let ((float 0.0))
    (loop
      :for ch = (read-char stream)
      :until (char= #\. ch)
      :do (let ((digit (digit-char-p ch)))
            (setf float (+ (* 10.0 float) digit))))
    (loop
      :for digit = (digit-char-p (read-char stream nil #\;))
      :for order = 0.1 :then (/ order 10.0)
      :while digit
      :do (incf float (* digit order)))
    float))


(defun generate-data (n)
  (format nil "~{~A;~}"
          (loop
            :repeat n
            :collect (format nil "~A.~A" (random 10000) (random 10000)))))

(loop :repeat 3
      :for data = (generate-data 100000)
      :collect (list
                (with-input-from-string (in data) ; use the same data
                  (com.informatimago.common-lisp.cesarum.utility:chrono
                   (loop :while (peek-char nil in nil nil) ; measure the same loop
                         :do (read-float in))))

                (with-input-from-string (in data) ; use the same data
                  (let ((*readtable* (copy-readtable nil)))
                    (set-syntax-from-char #\; #\space)
                    (com.informatimago.common-lisp.cesarum.utility:chrono
                     (loop :while (peek-char nil in nil nil) ; measure the same loop
                           :do (read in)))))))
--> ((0.116007D0 0.276017D0) (0.116008D0 0.276017D0) (0.116007D0 0.276017D0))

tar...@google.com

Jun 22, 2015, 2:13:51 PM
Well, I would think that if speed is your primary concern and you have
such a carefully fixed format, you could write something along the lines
of what you have above, but with a few minor modifications.

First, you could get each line using READ-LINE, then loop through the
positions of the delimiter character #\; to find the field positions.
Unfortunately, Common Lisp doesn't have a PARSE-FLOAT function, so you
would have to import a library to do that.
See PARSE-NUMBER from http://cliki.net/parse-number or the older PARSE-FLOAT
from http://www.cs.cmu.edu/Groups/AI/util/lang/lisp/code/math/atof/0.html
(the latter doesn't have the nice floating format support, though).

So, you could do something along the lines of (untested):

(with-open-file (stream "my-file.csv" :direction :input)
  (loop :for line = (read-line stream nil nil)
        :while line
        :nconc (loop :for column-number :upfrom 0
                     :as start = 0 :then end
                     :as end = (position #\; line :start (1+ start))
                     :collect (if (zerop column-number)
                                  (subseq line (1+ start) (1- end))
                                  (parse-number line (1+ start) end))
                     :when (null end) :do (loop-finish))))



One could also use the SPLIT-SEQUENCE function from
http://cliki.net/SPLIT-SEQUENCE, but if you really want speed, using offsets
into the first string will be better than
creating new substrings using SPLIT-SEQUENCE.

Teemu Likonen

Jun 22, 2015, 3:31:41 PM
Alexandre Landi [2015-06-22 07:47:18-07] wrote:

> On Friday, June 5, 2015 at 11:02:44 PM UTC+2, informatimago wrote:
>> One way to achieve similar I/O performance is to implement similar
>> algorithms using binary files, processing vectors of bytes.
>
> I am wondering whether there exists some "higher-level" solution to
> this problem.

I don't know where your program's bottleneck is, but if it's in disk IO
or character conversions then maybe switch to binary processing and use
READ-SEQUENCE, as suggested. Read something like 10000 bytes into a
buffer and parse the buffer. Here's one example:


(defparameter *field-sep* (char-code #\;))
(defparameter *line-sep* (char-code #\newline))
(defparameter *quote* (char-code #\"))
(defparameter *buffer* (make-array 10000 :element-type '(unsigned-byte 8)))

(defun main ()
  (flet ((make-line ()
           (make-array 1600 :element-type 'vector
                            :adjustable t :fill-pointer 0))
         (make-field ()
           (make-array 15 :element-type 'character
                          :adjustable t :fill-pointer 0)))

    (with-open-file (file "~/file.csv"
                          :direction :input
                          :element-type '(unsigned-byte 8))

      (loop :with line := (make-line)
            :with field := (make-field)
            :with string := nil
            :with first-line := t
            :for end := (read-sequence *buffer* file)

            :do (loop :for i :from 0 :below end
                      :for byte := (aref *buffer* i)
                      :do (cond ((= byte *quote*)
                                 (setf string (not string)))
                                ((and (not string)
                                      (= byte *field-sep*))
                                 ;; Process input field and push it to
                                 ;; current LINE array.
                                 (vector-push-extend field line)
                                 (setf field (make-field)))

                                ((= byte *line-sep*)
                                 ;; Process line's last input field,
                                 ;; push it to the current LINE array...
                                 (vector-push-extend field line)
                                 ;; ...and process the line array.
                                 (if first-line
                                     (progn
                                       ;; First line. Push LINE array to
                                       ;; COLUMN-NAMES.
                                       )
                                     (progn
                                       ;; Other lines. Push first element
                                       ;; of LINE array to ROW-NAMES and
                                       ;; other elements to VALUES. Maybe
                                       ;; parse floats before that.
                                       ))
                                 ;; This line is now finished. Reset.
                                 (setf field (make-field)
                                       line (make-line)
                                       string nil
                                       first-line nil))

                                (t
                                 ;; Push the byte as character to FIELD
                                 ;; array (string).
                                 (vector-push-extend (code-char byte) field))))

            :if (< end (length *buffer*)) :do (loop-finish)))))

Pascal J. Bourguignon

Jun 22, 2015, 3:49:48 PM
Congratulations on your "optimization" job!


(defun read-float (stream)
  "Assumes [0-9]*.[0-9]*/[^0-9]"
  (let ((float 0.0))
    (loop
      :for ch = (read-char stream)
      :until (char= #\. ch)
      :do (let ((digit (digit-char-p ch)))
            (setf float (+ (* 10.0 float) digit))))
    (loop
      :for digit = (digit-char-p (read-char stream nil #\;))
      :for order = 0.1 :then (/ order 10.0)
      :while digit
      :do (incf float (* digit order)))
    float))


(defun generate-data (n)
  (format nil "~{~A;~}"
          (loop
            :repeat n
            :collect (format nil "~A.~A" (random 10000) (random 10000)))))


(ql:quickload :parse-number)
(import 'parse-number:parse-number)

(defun tarrus (line)
  (loop :for column-number :upfrom 0
        :as start = 0 :then end
        :as end = (position #\; line :start (1+ start))
        :collect (if (zerop column-number)
                     (subseq line (1+ start) (1- end))
                     (parse-number line :start (1+ start) :end end))
        :when (null end) :do (loop-finish)))

(loop :repeat 3
      :for data = (generate-data 10000)
      :collect (list
                (with-input-from-string (in data)
                  (com.informatimago.common-lisp.cesarum.utility:chrono
                   (loop :while (peek-char nil in nil nil)
                         :do (read-float in))))
                (with-input-from-string (in (subseq data 0 (1- (length data))))
                  (com.informatimago.common-lisp.cesarum.utility:chrono
                   (loop :while (peek-char nil in nil nil)
                         :do (tarrus (read-line in)))))
                (with-input-from-string (in data)
                  (let ((*readtable* (copy-readtable nil)))
                    (set-syntax-from-char #\; #\space)
                    (com.informatimago.common-lisp.cesarum.utility:chrono
                     (loop :while (peek-char nil in nil nil)
                           :do (read in)))))))
--> ((0.028002D0 0.248016D0 0.048003D0)
     (0.024001D0 0.232015D0 0.048003D0)
     (0.024002D0 0.228014D0 0.048003D0))

Kaz Kylheku

Jun 22, 2015, 4:04:27 PM
There will be overhead in the reader in scanning [0-9]+, because
of symbol syntax. A sequence of digits like 123 could be the prefix
of an integer, floating-point value or a symbol like 123\45 or 123ABC
or 123|45|. So escapes have to be recognized and treated. If any digit is
escaped, then a symbol is interned, and that is probably done by collecting the
entire token (with the escapes removed) and keeping track of escaped/unescaped
state. It's also worth noting that the reader is actually collecting token
constituents (as programmed in the current read table) until a terminating
character (pardon me for not looking up the details and citing).

I think this can all be fast for extracting floats from a stream, but someone
has to put the effort into it, after first caring about that use case.
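For instance (illustrative, with a default readtable):

(read-from-string "123")     ; => 123      (an integer)
(read-from-string "123ABC")  ; => |123ABC| (a symbol)
(read-from-string "123\\45") ; => |12345|  (the escape forces a symbol)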

knobo

Jun 25, 2015, 10:42:44 AM
On Wednesday, June 3, 2015 at 11:29:44 AM UTC+2, Alexandre Landi wrote:
> What is the fastest way to read a csv file in CL in a way such that:

I did not read the whole thread, but you should read the whole file into memory before you start parsing. If it is many hundred MB, then maybe you could split the job into chunks.

Then your program will be as fast as whatever other language you are comparing it to :)
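A minimal sketch of that whole-file read (note that FILE-LENGTH counts stream elements, which is why the SUBSEQ trims to what READ-SEQUENCE actually filled):

(defun slurp-file (path)
  (with-open-file (in path :direction :input)
    (let ((buffer (make-string (file-length in))))
      (subseq buffer 0 (read-sequence buffer in)))))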

Regards
Knut.

Matthew Carter

Jun 25, 2015, 11:41:41 AM
I think he was testing on a 100G+ file to begin with. We may be a bit
off until computers have that much memory on a typical desktop.

--
Matthew Carter (m...@ahungry.com)
http://ahungry.com

knobo

Jun 25, 2015, 4:28:24 PM
Ok, so read 50 or 100MB at a time, then. My guess is that this solves the whole issue.
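A sketch of that chunked reading; PROCESS-CHUNK is a hypothetical placeholder for the actual parsing:

(with-open-file (in "data.csv" :element-type '(unsigned-byte 8))
  (let ((buffer (make-array (* 50 1024 1024)
                            :element-type '(unsigned-byte 8))))
    (loop :for end = (read-sequence buffer in)
          :while (plusp end)
          :do (process-chunk buffer end))))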

Best regards
Knut

Alexandre Landi

Jun 26, 2015, 8:43:04 AM
On Monday, June 22, 2015 at 8:13:22 PM UTC+2, informatimago wrote:

> If your floats always have this syntax [0-9]*.[0-9]* (or
> [0-9]+.[0-9]+) then it seems that your imagination is not wrong.

In this test, I am using the generic read for strings but Pascal's read-float for numeric values. Here's the difference so far: using read-float instead of the generic read, the gain is a factor of 3+.

With generic read everywhere:

; compiling (TIME (PROGN # ...))
Evaluation took:
32.841 seconds of real time
32.924000 seconds of total run time (32.520000 user, 0.404000 system)
[ Run times consist of 1.572 seconds GC time, and 31.352 seconds non-GC time. ]
100.25% CPU
62,253,767,593 processor cycles
10,394,274,496 bytes consed

With read-float for numeric values:

; compiling (TIME (PROGN # ...))
Evaluation took:
10.677 seconds of real time
10.712000 seconds of total run time (10.476000 user, 0.236000 system)
[ Run times consist of 0.640 seconds GC time, and 10.072 seconds non-GC time. ]
100.33% CPU
20,239,041,659 processor cycles
3,796,872,552 bytes consed

Alexandre Landi

Jun 26, 2015, 8:45:05 AM
On Thursday, June 25, 2015 at 10:28:24 PM UTC+2, knobo wrote:

> Ok, so read 50 or 100MB at a time, then. My guess is that this solves the whole issue.

I am actually testing on some 80MB file at the moment.

jimka...@gmail.com

Jun 26, 2015, 9:15:30 AM
Can you also post the profiling output?
Where is the time being spent?
It is really curious that JavaScript does this in half a second.

Alexandre Landi

Jun 26, 2015, 9:43:15 AM
This is the output of the most performant version so far:

(sb-sprof:with-profiling (:max-samples 1000
                          :report :flat
                          :loop t
                          :show-progress t)
  *)


===> 0 of 1000 samples taken.
Profiler sample vector full (269 traces / 10000 samples), doubling the size
Profiler sample vector full (536 traces / 20000 samples), doubling the size
===> 651 of 1000 samples taken.

Number of samples: 1000
Sample interval: 0.01 seconds
Total sampling time: 10.0 seconds
Number of cycles: 0
Sampled threads:
#<SB-THREAD:THREAD "worker" RUNNING {1003C46913}>

Self Total Cumul
Nr Count % Count % Count % Calls Function
------------------------------------------------------------------------
1 119 11.9 128 12.8 119 11.9 - SB-IMPL::FD-STREAM-READ-N-CHARACTERS/ASCII
2 108 10.8 207 20.7 227 22.7 - READ-CHAR
3 97 9.7 526 52.6 324 32.4 - READ-FLOAT
4 71 7.1 214 21.4 395 39.5 - SB-IMPL::%READ-PRESERVING-WHITESPACE
5 70 7.0 114 11.4 465 46.5 - SB-IMPL::READ-STRING
6 67 6.7 67 6.7 532 53.2 - SB-KERNEL:TWO-ARG-*
7 56 5.6 56 5.6 588 58.8 - DIGIT-CHAR-P
8 53 5.3 53 5.3 641 64.1 - SB-KERNEL:TWO-ARG-+
9 44 4.4 52 5.2 685 68.5 - SB-VM::GENERIC-+
10 44 4.4 44 4.4 729 72.9 - SB-KERNEL:TWO-ARG-/
11 33 3.3 33 3.3 762 76.2 - TRUNCATE
12 26 2.6 131 13.1 788 78.8 - VECTOR-PUSH-EXTEND
13 25 2.5 981 98.1 813 81.3 - (LET* ((SB-SPROF:*SAMPLE-INTERVAL* SB-SPROF:*S\
AMPLE-INTERVAL*) (SB-SPROF:*ALLOC-INTERVAL* SB-SPROF:*ALLOC-INTERVAL*) (SB-SPROF::*SAMPLING* NIL) (S\
B-SPROF::*SAMPLING-MODE* SB-SPROF::*SAMPLING-MODE*) (SB-SPROF:*MAX-SAMPLES* 1000)) (UNWIND-PROTECT (\
PROGN (SB-SPROF:START-PROFILING :MAX-DEPTH 4611686018427387903 :THREADS ???) (LOOP ??? ??? ??? ???))\
(SB-SPROF:STOP-PROFILING)) (SB-SPROF:REPORT :TYPE :FLAT))
14 18 1.8 18 1.8 831 83.1 - SB-IMPL::GET-CAT-ENTRY
15 16 1.6 43 4.3 847 84.7 - REPLACE
16 16 1.6 16 1.6 863 86.3 - SB-IMPL::READ-BUFFER-TO-STRING
17 15 1.5 15 1.5 878 87.8 - SB-KERNEL:HAIRY-DATA-VECTOR-SET
18 15 1.5 15 1.5 893 89.3 - SB-IMPL::SLOW-HAIRY-DATA-VECTOR-SET
19 14 1.4 23 2.3 907 90.7 - LENGTH
20 12 1.2 12 1.2 919 91.9 - MAKE-ARRAY
21 10 1.0 10 1.0 929 92.9 - SB-KERNEL:SEQUENCEP
22 9 0.9 9 0.9 938 93.8 - EQL
23 8 0.8 8 0.8 946 94.6 - SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
24 8 0.8 8 0.8 954 95.4 - SB-IMPL::OPTIMIZED-DATA-VECTOR-SET
25 8 0.8 8 0.8 962 96.2 - (FLET #:CLEANUP-FUN-48866 :IN SB-IMPL::REFILL-\
INPUT-BUFFER)
26 7 0.7 7 0.7 969 96.9 - SB-KERNEL:HAIRY-DATA-VECTOR-REF
27 7 0.7 7 0.7 976 97.6 - LIST
28 5 0.5 224 22.4 981 98.1 - READ
29 2 0.2 2 0.2 983 98.3 - SB-IMPL::CHECK-FOR-RECURSIVE-READ
30 1 0.1 129 12.9 984 98.4 - SB-INT:FAST-READ-CHAR-REFILL
31 1 0.1 59 5.9 985 98.5 - ADJUST-ARRAY
32 1 0.1 2 0.2 986 98.6 - SB-IMPL::%READ-FROM-STRING
33 1 0.1 1 0.1 987 98.7 - SB-ALIEN::C-STRING-EXTERNAL-FORMAT
34 1 0.1 1 0.1 988 98.8 - "foreign function __lxstat"
35 1 0.1 1 0.1 989 98.9 - SB-IMPL::%MAKE-PATHNAME
36 1 0.1 1 0.1 990 99.0 - SB-KERNEL:VALUES-SPECIFIER-TYPE
37 1 0.1 1 0.1 991 99.1 - (FLET SB-FORMAT::GET-CHAR :IN SB-FORMAT::PARSE\
-DIRECTIVE)
38 1 0.1 1 0.1 992 99.2 - SB-IMPL::GET-COERCED-CMT-ENTRY
39 1 0.1 1 0.1 993 99.3 - SB-IMPL::%MAKE-FD-STREAM
40 1 0.1 1 0.1 994 99.4 - SB-IMPL::OUTPUT-TO-C-STRING/ASCII
41 1 0.1 1 0.1 995 99.5 - (FLET #:WITHOUT-INTERRUPTS-BODY-48841 :IN SB-I\
MPL::REFILL-INPUT-BUFFER)
42 1 0.1 1 0.1 996 99.6 - SB-IMPL::%WALK-DISPLACED-ARRAY-BACKPOINTERS
43 1 0.1 1 0.1 997 99.7 - PATHNAME-TYPE
44 1 0.1 1 0.1 998 99.8 - (FLET #:CLEANUP-FUN-49651 :IN SB-IMPL::RELEASE\
-FD-STREAM-RESOURCES)
45 1 0.1 1 0.1 999 99.9 - CHAR-UPCASE
46 1 0.1 1 0.1 1000 100.0 - (LAMBDA (STRING CL-PPCRE::START CL-PPCRE::END)\
:IN CL-PPCRE::CREATE-SCANNER-AUX)
47 0 0.0 997 99.7 1000 100.0 - SB-FASL::LOAD-FASL-GROUP
48 0 0.0 997 99.7 1000 100.0 - (FLET SB-THREAD::WITH-RECURSIVE-LOCK-THUNK :IN\
SB-FASL::LOAD-AS-FASL)
49 0 0.0 997 99.7 1000 100.0 - (FLET #:WITHOUT-INTERRUPTS-BODY-88874 :IN SB-T\
HREAD::CALL-WITH-RECURSIVE-LOCK)
50 0 0.0 997 99.7 1000 100.0 - SB-THREAD::CALL-WITH-RECURSIVE-LOCK
51 0 0.0 997 99.7 1000 100.0 - SB-FASL::LOAD-AS-FASL
52 0 0.0 997 99.7 1000 100.0 - (FLET SB-FASL::LOAD-STREAM :IN LOAD)
53 0 0.0 997 99.7 1000 100.0 - LOAD
54 0 0.0 997 99.7 1000 100.0 - (FLET SB-THREAD::WITH-RECURSIVE-LOCK-THUNK :IN\
SB-C::%WITH-COMPILATION-UNIT)
55 0 0.0 997 99.7 1000 100.0 - (FLET SB-C::WITH-IT :IN SB-C::%WITH-COMPILATIO\
N-UNIT)
56 0 0.0 997 99.7 1000 100.0 - (FLET SWANK-BACKEND:CALL-WITH-COMPILATION-HOOK\
S :IN "/usr/share/common-lisp/source/slime/swank-sbcl.lisp")
57 0 0.0 997 99.7 1000 100.0 - (FLET SWANK-BACKEND:SWANK-COMPILE-STRING :IN "\
/usr/share/common-lisp/source/slime/swank-sbcl.lisp")
58 0 0.0 997 99.7 1000 100.0 - (LAMBDA NIL :IN SWANK:COMPILE-STRING-FOR-EMACS\
)
59 0 0.0 997 99.7 1000 100.0 - (LAMBDA NIL :IN SWANK::COLLECT-NOTES)
60 0 0.0 997 99.7 1000 100.0 - SWANK::MEASURE-TIME-INTERVAL
61 0 0.0 997 99.7 1000 100.0 - SWANK::COLLECT-NOTES
62 0 0.0 997 99.7 1000 100.0 - SWANK::CALL-WITH-BUFFER-SYNTAX
63 0 0.0 997 99.7 1000 100.0 - SB-INT:SIMPLE-EVAL-IN-LEXENV
64 0 0.0 997 99.7 1000 100.0 - EVAL
65 0 0.0 997 99.7 1000 100.0 - SWANK:EVAL-FOR-EMACS
66 0 0.0 997 99.7 1000 100.0 - (LAMBDA NIL :IN SWANK::SPAWN-WORKER-THREAD)
67 0 0.0 997 99.7 1000 100.0 - SWANK-BACKEND::CALL-WITH-BREAK-HOOK
68 0 0.0 997 99.7 1000 100.0 - (FLET SWANK-BACKEND:CALL-WITH-DEBUGGER-HOOK :I\
N "/usr/share/common-lisp/source/slime/swank-sbcl.lisp")
69 0 0.0 997 99.7 1000 100.0 - SWANK::CALL-WITH-BINDINGS
70 0 0.0 997 99.7 1000 100.0 - (FLET #:WITHOUT-INTERRUPTS-BODY-237377 :IN SB-\
THREAD:MAKE-THREAD)
71 0 0.0 997 99.7 1000 100.0 - (FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREA\
D:MAKE-THREAD)
72 0 0.0 997 99.7 1000 100.0 - (FLET #:WITHOUT-INTERRUPTS-BODY-88845 :IN SB-T\
HREAD::CALL-WITH-MUTEX)
73 0 0.0 997 99.7 1000 100.0 - SB-THREAD::CALL-WITH-MUTEX
74 0 0.0 997 99.7 1000 100.0 - SB-THREAD::INITIAL-THREAD-FUNCTION
75 0 0.0 997 99.7 1000 100.0 - "foreign function call_into_lisp"
76 0 0.0 997 99.7 1000 100.0 - "foreign function new_thread_trampoline"
77 0 0.0 12 1.2 1000 100.0 - SB-IMPL::DATA-VECTOR-FROM-INITS
78 0 0.0 9 0.9 1000 100.0 - SB-IMPL::REFILL-INPUT-BUFFER
79 0 0.0 5 0.5 1000 100.0 - OPEN
80 0 0.0 3 0.3 1000 100.0 - SB-IMPL::QUERY-FILE-SYSTEM
81 0 0.0 3 0.3 1000 100.0 - PROBE-FILE
82 0 0.0 3 0.3 1000 100.0 - SB-SYS:MAKE-FD-STREAM
83 0 0.0 2 0.2 1000 100.0 - SB-FORMAT::PARSE-DIRECTIVE
84 0 0.0 2 0.2 1000 100.0 - SB-FORMAT::TOKENIZE-CONTROL-STRING
85 0 0.0 2 0.2 1000 100.0 - SB-FORMAT::%FORMAT
86 0 0.0 2 0.2 1000 100.0 - FORMAT
87 0 0.0 1 0.1 1000 100.0 - SB-UNIX:UNIX-STAT
88 0 0.0 1 0.1 1000 100.0 - "foreign function sb_realpath"
89 0 0.0 1 0.1 1000 100.0 - SB-UNIX:UNIX-REALPATH
90 0 0.0 1 0.1 1000 100.0 - SB-IMPL::%PARSE-NAMESTRING
91 0 0.0 1 0.1 1000 100.0 - PARSE-NAMESTRING
92 0 0.0 1 0.1 1000 100.0 - PATHNAME
93 0 0.0 1 0.1 1000 100.0 - MERGE-PATHNAMES
94 0 0.0 1 0.1 1000 100.0 - SB-KERNEL:SPECIFIER-TYPE
95 0 0.0 1 0.1 1000 100.0 - SUBTYPEP
96 0 0.0 1 0.1 1000 100.0 - SB-UNIX:UNIX-OPEN
97 0 0.0 1 0.1 1000 100.0 - SB-KERNEL:SET-ARRAY-HEADER
98 0 0.0 1 0.1 1000 100.0 - SB-IMPL::UNPARSE-NATIVE-UNIX-NAMESTRING
99 0 0.0 1 0.1 1000 100.0 - SB-IMPL::RELEASE-FD-STREAM-RESOURCES
100 0 0.0 1 0.1 1000 100.0 - (SB-PCL::FAST-METHOD SB-GRAY::PCL-CLOSE (SB-KE\
RNEL:ANSI-STREAM))
101 0 0.0 1 0.1 1000 100.0 - (FLET #:CLEANUP-FUN-13 :IN #:DROP-THRU-TAG-1)
102 0 0.0 1 0.1 1000 100.0 - (SB-PCL::FAST-METHOD CL-PPCRE:SCAN (FUNCTION T\
))
103 0 0.0 1 0.1 1000 100.0 - CL-PPCRE:SPLIT
------------------------------------------------------------------------
0 0.0 elsewhere

Jim Newton

Jun 26, 2015, 9:53:03 AM
If I'm reading this correctly, it looks like 25% is taken up in math
operations: +, *, truncate. I'm not sure whether this is in addition to
the 9.7% taken up in read-float. It also looks like 22% of the time is
taken up simply reading characters.

How does JavaScript read this so fast?

Matthew Carter

Jun 26, 2015, 10:29:53 AM
A lot of node.js is asynchronous; maybe it's reading multiple lines at
once (non-blocking) and then combining them?

Like, spawn 20 threads, each one reading a line and then jumping 20
lines down the list.

tar...@google.com

Jun 26, 2015, 2:47:55 PM
On Monday, June 22, 2015 at 12:49:48 PM UTC-7, informatimago wrote:
I will point out that this isn't comparing exactly the same things.
The middle value is collecting the read floats on a line-by-line basis,
while the 1st and 3rd are reading and then discarding the values.

Perhaps just calling parse-number would be more illuminating.
Or just using parse-float inside the collection function.


(BTW, thanks for fixing the PARSE-NUMBER call by adding the keywords)

Pascal J. Bourguignon

Jun 26, 2015, 4:32:44 PM
Alexandre Landi <ala...@lrde.epita.fr> writes:

> On Friday, June 26, 2015 at 3:15:30 PM UTC+2, jimka...@gmail.com wrote:
>> Can you also post the profiling output?
>> Where is the time being spent?
>> It is really curious that JavaScript does this in half a second.
>
> This is the output of the most performant so far:
>
> (sb-sprof:with-profiling (:max-samples 1000
> :report :flat
> :loop t
> :show-progress t)
> *)
>
>
> Self Total Cumul
> Nr Count % Count % Count % Calls Function
> ------------------------------------------------------------------------
> 1 119 11.9 128 12.8 119 11.9 - SB-IMPL::FD-STREAM-READ-N-CHARACTERS/ASCII
> 2 108 10.8 207 20.7 227 22.7 - READ-CHAR
> 3 97 9.7 526 52.6 324 32.4 - READ-FLOAT


Now, you could add type declarations in read-float to switch to
specialized float operations.
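For instance, a declared variant might look like this (untested sketch; it assumes single-float data):

(defun read-float (stream)
  "Assumes [0-9]*.[0-9]*/[^0-9]"
  (declare (optimize (speed 3) (safety 1)))
  (let ((float 0.0f0))
    (declare (type single-float float))
    (loop
      :for ch :of-type character = (read-char stream)
      :until (char= #\. ch)
      :do (let ((digit (digit-char-p ch)))
            (setf float (+ (* 10.0f0 float) (float digit 1.0f0)))))
    (loop
      :for digit = (digit-char-p (read-char stream nil #\;))
      :for order :of-type single-float = 0.1f0 :then (/ order 10.0f0)
      :while digit
      :do (incf float (* (float digit 1.0f0) order)))
    float))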

Pascal J. Bourguignon

Jun 26, 2015, 4:43:20 PM
There's a single input line, so it should be the same.
You can hook it to a file, if you'd prefer to test it there.

But we can easily pass directly the line to tarrus, and you can see that
it doesn't make a big difference:

(loop :repeat 3
      :for data = (generate-data 10000)
      :collect (list
                (with-input-from-string (in data)
                  (com.informatimago.common-lisp.cesarum.utility:chrono
                   (loop :while (peek-char nil in nil nil)
                         :do (read-float in))))
                (let ((line (subseq data 0 (1- (length data)))))
                  (com.informatimago.common-lisp.cesarum.utility:chrono
                   (tarrus line)))
                (with-input-from-string (in (subseq data 0 (1- (length data))))
                  (com.informatimago.common-lisp.cesarum.utility:chrono
                   (loop :while (peek-char nil in nil nil)
                         :do (tarrus (read-line in)))))
                (with-input-from-string (in data)
                  (let ((*readtable* (copy-readtable nil)))
                    (set-syntax-from-char #\; #\space)
                    (com.informatimago.common-lisp.cesarum.utility:chrono
                     (loop :while (peek-char nil in nil nil)
                           :do (read in)))))))

--> ((0.016001D0 0.148009D0 0.16401D0 0.036002D0)
     (0.016001D0 0.180011D0 0.180011D0 0.040003D0)
     (0.020001D0 0.192012D0 0.200013D0 0.040002D0))



Now, where you might complain, perhaps, is that since I wanted to compare
the speed of reading floats, I made one big line of floats; the actual
file may well contain shorter lines, and then the line-processing
overhead will be more important.

It's up to the OP to do his benchmarking…


> Perhaps just calling parse-number would be more illuminating.
> Or just using parse-float inside the collection function.
>
>
> (BTW, thanks for fixing the PARSE-NUMBER call by adding the keywords)

Madhu

Jun 27, 2015, 9:43:02 AM

* Matthew Carter <87616a4...@ahungry.com> :
Wrote on Fri, 26 Jun 2015 10:29:50 -0400:
On a 40MB file, on a 1GHz machine, an equivalent basic C program with
getline, sscanf, and realloc takes about 9 seconds. The Lisp version
(either split-string with line positions, or with Alain's csv-parser)
takes about 50 seconds, without significant difference (note: TAR). I
can't test a JavaScript version, but I have difficulty believing it is
an order of magnitude faster than C and Lisp.

If it is indeed that much faster, it would be worth understanding
exactly how the node.js solution works and how their infrastructure
helps. The input is line-oriented, so I imagine that parallelizing
input and recombining would only increase the overhead in any
traditional scheme. ---Madhu

Marco Antoniotti

Jun 27, 2015, 1:06:23 PM
Maybe it mmap's the file first?

--
MA

Pascal J. Bourguignon

Jun 27, 2015, 2:49:32 PM
Marco Antoniotti <mar...@gmail.com> writes:

> Maybe it mmap's the file first?

mmap doesn't magically copy the file contents into memory.

Marco Antoniotti

Jun 29, 2015, 2:58:59 PM
On Saturday, June 27, 2015 at 2:49:32 PM UTC-4, informatimago wrote:
> Marco Antoniotti <mar...@gmail.com> writes:
>
> > Maybe it mmap's the file first?
>
> mmap doesn't magically copy the file contents into memory.
>

Meaning?

--
MA

Pascal J. Bourguignon

Jun 29, 2015, 4:20:24 PM
Meaning that it takes the same time to access the file data using read
as using mmap.

tar...@google.com

Jun 29, 2015, 5:36:29 PM
On Friday, June 26, 2015 at 1:43:20 PM UTC-7, informatimago wrote:
Actually, where I complain is in comparing
collect (parse-number)
vs
do (read-float)

since I would expect there to be overhead from collecting a list of the floats.

>
> It's up to the OP to do his benchmarking...

Stefan Monnier

Jun 30, 2015, 9:28:17 AM
> Meaning that it takes the same time to access the file data using read
> as using mmap.

Not necessarily, since the OS's prefetching mechanism may work
differently for the two cases.
In some cases, mmap may save the OS the trouble of performing an extra
copy; OTOH the "simple sequential read" is the most common use case for
"read", so the prefetching is likely very finely tuned.


Stefan

Pascal J. Bourguignon

Jun 30, 2015, 10:08:54 AM

You need to read
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
Marco Antoniotti

Jun 30, 2015, 10:12:00 AM
On Tuesday, June 30, 2015 at 10:08:54 AM UTC-4, informatimago wrote:
> Stefan Monnier <mon...@iro.umontreal.ca> writes:
>
> >> Meaning that it takes the same time to access the file data using read
> >> as using mmap.
> >
> > Not necessarily, since the OS's prefetching mechanism may work
> > differently for the two cases.
> > In some cases, mmap may save the OS the trouble of performing an extra
> > copy; OTOH the "simple sequential read" is the most common use case for
> > "read", so the prefetching is likely very finely tuned.
>
> You need to read http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>

Very cool link. Thanks.

MA

Stefan Monnier

Jun 30, 2015, 10:24:36 AM
This supports my argument that prefetching can make a big difference.


Stefan

Pascal J. Bourguignon

Jun 30, 2015, 12:02:13 PM
Stefan Monnier <mon...@iro.umontreal.ca> writes:

>> You need to read http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>
> This supports my argument that prefetching can make a big difference.

I fail to see how: the I/O time is several orders of magnitude higher
than the time needed to call the read syscall. And by the way, the page
miss trap handler is certainly not faster than a syscall.

Also, the open/read/write/close API doesn't prevent prefetching. If the
kernel does it for open/mmap/close, it can do it for the others too.

Alexandre Landi

Jul 1, 2015, 5:28:47 AM
On Friday, June 5, 2015 at 9:37:00 PM UTC+2, Zach Beane wrote:

> It's possible that SBCL's internal representation of characters as
> 32-bit values, and the conversion time that entails, is causing some of
> the increased runtime.

Given what I found here: http://www.sbcl.org/manual/#External-Formats and since I know all of my data is encoded as ASCII, I tried wrapping my script into

(with-open-file (stream "my-file.csv" :external-format :ascii)
  body)

The difference between the two looks negligible to my eyes:

without external-format option:

Evaluation took:
40.848 seconds of real time
40.880000 seconds of total run time (40.532000 user, 0.348000 system)
[ Run times consist of 4.508 seconds GC time, and 36.372 seconds non-GC time. ]
100.08% CPU
16 lambdas converted
77,431,263,665 processor cycles
7,104,200,096 bytes consed

with external-format option:

; compiling (TIME (WITH-OPEN-FILE # ...))
Evaluation took:
40.128 seconds of real time
40.148000 seconds of total run time (39.848000 user, 0.300000 system)
[ Run times consist of 4.400 seconds GC time, and 35.748 seconds non-GC time. ]
100.05% CPU
76,066,872,855 processor cycles
7,103,713,384 bytes consed

Any insight on whether I should conclude that this option does not make much difference, or whether I am simply missing something?

Pascal J. Bourguignon

Jul 1, 2015, 7:22:13 AM
Alexandre Landi <ala...@lrde.epita.fr> writes:

> Any insight on whether I should conclude that this option does not make
> much difference, or whether I am simply missing something?

The character type hierarchy in Common Lisp is as follows:

character
|
+--- extended-char
|
+--- base-char
|
+--- standard-char

But when storing standard-char in arrays (strings), the upgraded element
type is base-char, so base-char is the smallest character type you may
use that could make a performance (time or space) difference.

Given that you only have ascii characters in your file, you could keep
them in a base-string of base-char, instead of using a string of
character.

Perhaps you should tell your CL to use :element-type 'base-char along
with the :external-format :ascii (but I assume that once you've given
:element-type 'base-char, the ascii code is the default external format,
since base-char is specified to hold at least the printable ascii
characters). Actually, specifying :ascii could be counterproductive,
since ASCII is a 7-bit encoding, so some filtering of the high bit could
occur, as well as some mapping to 8-bit.

From this:

$ clall -r '(with-open-file (stream "/tmp/test.csv"
                                    :element-type (quote standard-char))
              (type-of (read-line stream)))'

Armed Bear Common Lisp Unsupported element type STANDARD-CHAR.
Clozure Common Lisp --> (SIMPLE-BASE-STRING 65)
CLISP --> (SIMPLE-BASE-STRING 65)
CMU Common Lisp --> (SIMPLE-BASE-STRING 65)
ECL --> (SIMPLE-ARRAY CHARACTER (65))
SBCL --> (SIMPLE-ARRAY CHARACTER (65))

you can see that in ECL and SBCL, read-line still allocates a string
containing characters (i.e. 32 bits per character) instead of base-char
(which could be 8 bits per character). And ABCL cannot deal with
base-char (since it's based on the JVM, which only has unicode
characters).


On the other hand, this:

$ clall -r '(subtypep (quote character) (quote base-char))'

Armed Bear Common Lisp --> T, T
Clozure Common Lisp --> T, T
CLISP --> T, T
CMU Common Lisp --> T, T
ECL --> NIL, T
SBCL --> NIL, T

shows that it wouldn't make a difference to use BASE-CHAR instead of
CHARACTER in any implementation but ECL and SBCL.

This seems contradictory to the previous result, unless no
implementation actually has character types of different sizes.


In any case, you can try to allocate the STANDARD-CHAR buffer yourself
and read it with READ-SEQUENCE (which, incidentally, is the operator
we've been telling you to use since the beginning). This could remove the
overhead of parsing UTF-8 data and converting bytes into 32-bit
characters.

$ clall -r '(with-open-file (stream "/tmp/test.csv"
                                    :element-type (quote standard-char))
              (let ((buffer (make-array 80
                                        :element-type (quote standard-char)
                                        :initial-element #\space
                                        :fill-pointer t
                                        :adjustable t)))
                (list (read-sequence buffer stream) (type-of buffer))))'
Armed Bear Common Lisp Unsupported element type STANDARD-CHAR.
Clozure Common Lisp --> (80 (BASE-STRING 80))
CLISP --> (80 (BASE-STRING 80))
CMU Common Lisp --> (80 (BASE-STRING 80))
ECL --> (80 (ARRAY BASE-CHAR (80)))
SBCL --> (80 (AND (BASE-STRING 80) (NOT SIMPLE-ARRAY)))


Also, IIRC there's the option of compiling cmucl with 8-bit characters
instead of unicode.

Teemu Likonen

Jul 1, 2015, 8:05:11 AM
Alexandre Landi [2015-07-01 02:28:39-07] wrote:

> Given what I found here: http://www.sbcl.org/manual/#External-Formats
> and since I know all of my data is encoded as ASCII, I tried wrapping
> my script into

> Any insight on whether I should conclude that this option does not make
> much difference, or whether I am simply missing something?

This is just a guess, but it is probably because you're using the
character type and reading in small buffers (lines). For an ASCII file I
suggest trying a binary byte stream, i.e. element-type (unsigned-byte 8),
and reading the content with READ-SEQUENCE into a large byte array (not
the whole file at a time, though). Then parse the array into fields and
lines, and repeat READ-SEQUENCE. I gave a quick example before. It's not
a complete or best-possible solution, but you'll get the idea.

paul wallich

Jul 1, 2015, 1:05:15 PM
On 7/1/15 5:28 AM, Alexandre Landi wrote:
> On Friday, June 5, 2015 at 9:37:00 PM UTC+2, Zach Beane wrote:
>
>> It's possible that SBCL's internal representation of characters as
>> 32-bit values, and the conversion time that entails, is causing some of
>> the increased runtime.
>
> Given what I found here: http://www.sbcl.org/manual/#External-Formats and since I know all of my data is encoded as ASCII, I tried wrapping my script into
>
> (with-open-file (stream "my-file.csv" :external-format :ascii)

[...]
> ; compiling (TIME (WITH-OPEN-FILE # ...))
> Evaluation took:
> 40.128 seconds of real time
> 40.148000 seconds of total run time (39.848000 user, 0.300000 system)
> [ Run times consist of 4.400 seconds GC time, and 35.748 seconds non-GC time. ]
> 100.05% CPU
> 76,066,872,855 processor cycles
> 7,103,713,384 bytes consed

This is almost certainly a stupid question, but I think it needs to be
asked, especially with SBCL:

Exactly what are you compiling in the above? Because if you're not
timing the execution of already-compiled code, the overhead might swamp
whatever changes you're making.
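For instance (illustrative; RUN-TEST stands for whatever body is being measured):

(defun run-test ()
  ;; ... the body being measured ...
  )
(compile 'run-test) ; make sure the definition is compiled, not interpreted
(time (run-test))   ; TIME now measures only the execution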

paul

Matthew Carter

Jul 1, 2015, 2:28:30 PM
This has got to be one of the largest threads for something in a while.

Anyway, to me, it seems the real bottleneck in reading a large file
(testing locally with a 103MB file) is in the typically used
with-open-file / read loop code segment.

Using my horribly race-condition-prone code of reading the file in
10 threaded segments and recombining at the end is giving me a result in
less than 1 second consistently (a much bigger speed boost than any
combination of toggling byte/line/character reading, all of which
complete the file in 40 seconds).

Reading it un-threaded and my SBCL can't even finish without the 'heap
exhausted, game over' error (just using standard sbcl via slime, no
memory CLI options).

Someone could probably extend this into a more useful solution for a
threaded csv reader (since reading very large csv's very quickly is
apparently big business these days :p )

Also, due to laziness, I hardcoded the line count; for a real solution
you'd want to stop at a set byte in the 'until' loop clause.

(defparameter *lines* '())

;; File size is 107100000 (103M) with 2100000 lines
(defun get-file-contents (segment divisor lines)
  (with-open-file (s "~/src/lisp/csv-powah/data.csv")
    (file-position s (* segment (/ (file-length s) divisor)))
    (let ((lines
            (loop for line = (read-line s nil 'eof)
                  for c from 0
                  until (eq c (/ lines divisor))
                  collect line)))
      (push lines *lines*))))

(defun threaded-read ()
  (setf *lines* '())
  (dotimes (x 10)
    (make-thread (lambda () (get-file-contents x 10 2100000))))
  (sleep 1)
  (format t "Read ~a lines~%" (reduce #'+ (mapcar #'length *lines*))))

Dimitri Fontaine

Jul 1, 2015, 2:44:45 PM
Matthew Carter <m...@ahungry.com> writes:
> Someone could probably extend this into a more useful solution for a
> threaded csv reader (since reading very large csv's very quickly is
> apparently big business these days :p )

See pgloader about that.

http://pgloader.io
https://github.com/dimitri/pgloader

It's CL code and it's able to process CSV files. It's multi-threaded
with a single reader thread and a single writer thread, because its job
is to load the CSV content into a PostgreSQL database.

Finally, any “interesting” CSV file will present itself with single
logical lines expanding over more than a single physical line.

So the idea I have to increase how many workers might work together on
loading a single CSV source file in pgloader is to have a single-reader
multiple-writers (and multiple queues) implementation.

As I'm using the lparallel lib to help me handle parallelism, it will be
easy enough to implement; I just haven't had the occasion yet.

http://lparallel.org
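A minimal sketch of that single-reader / multiple-writers layout, assuming the bordeaux-threads and lparallel systems are loaded (the function name and the line-level granularity are illustrative, not pgloader's actual code):

(defun parallel-process (path n-workers process-fn)
  (let ((queue (lparallel.queue:make-queue :fixed-capacity 1024))
        (workers '()))
    ;; Writers: pop lines off the shared queue until a sentinel arrives.
    (dotimes (i n-workers)
      (push (bt:make-thread
             (lambda ()
               (loop :for line = (lparallel.queue:pop-queue queue)
                     :until (eq line :done)
                     :do (funcall process-fn line))))
            workers))
    ;; Single reader: feed the queue.
    (with-open-file (in path)
      (loop :for line = (read-line in nil)
            :while line
            :do (lparallel.queue:push-queue line queue)))
    ;; One sentinel per worker, then wait for them all.
    (dotimes (i n-workers)
      (lparallel.queue:push-queue :done queue))
    (mapc #'bt:join-thread workers)))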

> Also, due to laziness, I hardcoded the line count; for a real solution
> you'd want to stop at a set byte in the 'until' loop clause.

My guess is that a real solution will be nowhere near as simple as what
you offered here (data processing, error handling, etc).

Regards,
--
dim

Stefan Monnier

Jul 3, 2015, 10:55:07 AM
>> This supports my argument that prefetching can make a big difference.
> I fail to see how: the I/O time is several orders of magnitude higher
> than the time needed to call the read syscall.

If your file is laid out perfectly on disk, and the time between
reception of one block and sending the request for the other is short
enough (shorter than the time it takes for the disk's head to fly over
the separation between those blocks, more or less), and there are no
other interfering accesses to the disk, it indeed makes no difference.

In other cases, it can make the difference between streaming and
non-streaming access.

> Also, the open/read/write/close API doesn't prevent prefetching. If the
> kernel does it for open/mmap/close, it can do it for the others too.

My point was that prefetching is *easier* for read/write (where 99% of
the use-cases are sequential, so the risk of prefetching too
aggressively is low) than for mmap (where many use cases will do
non-sequential accesses, so the risk of prefetching too aggressively is
much higher).


Stefan

Madhu

Jul 4, 2015, 2:52:39 PM

* Matthew Carter <877fqjk...@ahungry.com> :
Wrote on Wed, 01 Jul 2015 14:28:22 -0400:

| Using my horribly race-condition-prone code of reading the file in
| 10 threaded segments and recombining at the end is giving me a result in
| less than 1 second consistently (a much bigger speed boost than any
| combination of toggling byte/line/character reading, all of which
| complete the file in 40 seconds).

You are almost surely measuring the wrong thing because of the race
conditions and are being misled. Did you verify your program worked?

| Reading it un-threaded and my SBCL can't even finish without the 'heap
| exhausted, game over' error (just using standard sbcl via slime, no
| memory CLI options).
|
| Someone could probably extend this into a more useful solution for a
| threaded csv reader (since reading very large csv's very quickly is
| apparently big business these days :p )
|
| Also due to laziness I hardcoded the line count, for a real solution
| you'd want to stop on a set byte in the 'until' loop clause.
|
| (defparameter *lines* '())
|
| ;; File size is 107100000 (103M) with 2100000 lines
| (defun get-file-contents (segment divisor lines)
| (with-open-file (s "~/src/lisp/csv-powah/data.csv")
| (file-position s (* segment (/ (file-length s) divisor)))
| (let ((lines
| (loop for line = (read-line s nil 'eof)
| for c from 0
| until (eq c (/ lines divisor))

You can't use EQ to compare an integer with a rational, if divisor
doesn't divide lines exactly.

| collect line)))
| (push lines *lines*))))
|
| (defun threaded-read ()
| (setf *lines* '())
| (dotimes (x 10)
| (make-thread (lambda () (get-file-contents x 10 2100000))))

This is wrong. All the threads capture the same DOTIMES variable X,
which is modified in THREADED-READ but used in GET-FILE-CONTENTS, so the
value actually used is almost surely wrong: you end up seeking to the
wrong position in the file. You must pass X as an argument to the
worker lambda.

| (sleep 1)

This is why you are getting 1-second times: you do not wait for the
threads to finish. Even if you did, the threads would asynchronously
seek to arbitrary positions and finish in random order.
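
A sketch of both fixes together, assuming bordeaux-threads
(BT:MAKE-THREAD, BT:JOIN-THREAD), since the original didn't say which
thread library it used; THREADED-READ-FIXED is a made-up name, and the
pushes to *LINES* would still need a lock to be safe:

(defun threaded-read-fixed ()
  (setf *lines* '())
  (let ((threads
          (loop :for x :below 10
                :collect (let ((segment x)) ; fresh binding per thread
                           (bt:make-thread
                            (lambda ()
                              (get-file-contents segment 10 2100000)))))))
    ;; Wait for every worker instead of sleeping an arbitrary second.
    (mapc #'bt:join-thread threads))
  (format t "Read ~a lines~%" (reduce #'+ (mapcar #'length *lines*))))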

| (format t "Read ~a lines~%" (reduce #'+ (mapcar #'length *lines*))))


There are other major problems with your approach: the input is
line-oriented, yet you assumed you already know the number of lines.
And even if you split on byte offsets, you read incomplete lines.


Here is a lispworks sketch of your idea which just reads the lines and
counts the bytes. It is pretty useless as a CSV reader, because
READ-LINE invariably reads incomplete lines in the subsequent worker
segments (the first line read by those workers starts in the middle of
some line), and there is overlap between lines.

So all this example does is count the number of bytes read by READ-LINE.

NON-THREADED-READ doesn't use threads and is used as a control;
THREADED-READ uses condition variables to wait for the threads to
finish. Perhaps there is a bug here, because I'm seeing absolutely 0
SPEEDUP between NON-THREADED-READ and THREADED-READ after increasing the
number of threads NPROCS from 1 to 10 (say), and the idea of speeding up
READ-LINE by reading segments of the file in parallel just does not seem
to work (with LW). ---Madhu



(defvar $sum 0)

(defun non-threaded-read (file)
  (declare (optimize (speed 3) (space 0) (debug 0)))
  (setq $sum 0)
  (with-open-file (stream file :external-format :utf-8)
    (let* ((line nil)
           (bytes-read 0))
      (loop (unless (and (setq line (read-line stream nil))
                         (incf bytes-read (length line)))
              (return (SYSTEM:ATOMIC-INCF $sum bytes-read)))))))


;; wc -c "file.dat"
;; (time (progn (non-threaded-read "file.dat") $sum))

(defun get-file-contents (file segment divisor)
  "Misnamed. Modifies $sum to count the number of bytes this worker reads
calling READ-LINE on a segment of FILE..."
  (declare (optimize (speed 3) (space 0) (debug 0)))
  (with-open-file (stream file :external-format :utf-8)
    (let* ((file-size (file-length stream))
           (bytes-to-read (truncate file-size divisor))
           (line nil)
           (bytes-read 0))
      (assert (file-position stream (* segment bytes-to-read)))
      (loop (unless (and (setq line (read-line stream nil))
                         (< (incf bytes-read (length line)) bytes-to-read))
              (format t "bar[~D]: returning ~D bytes-read of ~D (size=~D position=~D)~&"
                      segment bytes-read bytes-to-read file-size (file-position stream))
              (return (SYSTEM:ATOMIC-INCF $sum bytes-read)))))))

(defun threaded-read (file nprocs)
  (setq $sum 0)
  (let ((ncond (mp:make-condition-variable))
        (lock (mp:make-lock :recursivep nil))
        (ncount 0))
    (flet ((fn (x)
             (prog1 (get-file-contents file x nprocs)
               (MP:WITH-LOCK (lock)
                 (if (= (incf ncount) nprocs)
                     (MP:CONDITION-VARIABLE-SIGNAL ncond))))))
      (MP:WITH-LOCK (lock) ; move this outside
        (dotimes (x nprocs)
          (MP:PROCESS-RUN-FUNCTION "mcarter read" () #'fn x))
        (MP:CONDITION-VARIABLE-WAIT ncond lock)))))

;; (progn (time (threaded-read "file.dat" 10)) $sum)
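
One hedged guess about the waiting logic (untested): condition-variable
waits can wake spuriously, so the usual pattern re-checks the predicate
in a loop rather than waiting exactly once:

(MP:WITH-LOCK (lock)
  (dotimes (x nprocs)
    (MP:PROCESS-RUN-FUNCTION "mcarter read" () #'fn x))
  ;; Re-check NCOUNT after every wakeup; the workers increment it
  ;; under the same LOCK before signalling.
  (loop :until (= ncount nprocs)
        :do (MP:CONDITION-VARIABLE-WAIT ncond lock)))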

Pascal J. Bourguignon

Jul 4, 2015, 7:27:19 PM
Madhu <eno...@meer.net> writes:

> * Matthew Carter <877fqjk...@ahungry.com> :
> Wrote on Wed, 01 Jul 2015 14:28:22 -0400:
> You can't use EQ to compare an integer with a rational, if divisor
> doesn't divide lines exactly.

No if. Even (EQ 1 1) can be false.


> | collect line)))
> | (push lines *lines*))))
> |
> | (defun threaded-read ()
> | (setf *lines* '())
> | (dotimes (x 10)
> | (make-thread (lambda () (get-file-contents x 10 2100000))))
>
> This is wrong. All the threads capture the same DOTIMES variable X -
> which is modified in THREAD-READ but used in GET-FILE-CONTENTS, the
> value which is used is likely to be almost always surely wrong. you end
> up seeking to a wrong position in the file. You must pass X as an
> argument to the worker lambda.

No. The standard explicitly says:

  It is implementation-dependent whether dotimes establishes a new
  binding of var on each iteration or whether it establishes a binding
  for var once at the beginning and then assigns it on any subsequent
  iterations.

The code above is not conforming: MAKE-THREAD will receive
semantically different closures in different implementations.

Matthew Carter

Jul 5, 2015, 12:04:59 AM
Madhu <eno...@meer.net> writes:

> * Matthew Carter <877fqjk...@ahungry.com> :
> Wrote on Wed, 01 Jul 2015 14:28:22 -0400:
>
> | Using my horribly race condition prone code of reading the file in
> | 10 threaded segments and recombining at the end is giving me a result in
> | less than 1 seconds consistently (much more speed boost than any
> | combination of toggling byte/line/chacter reading and completing the
> | file in 40 seconds).
>
> You are almost surely measuring the wrong thing because of the race
> conditions and are being mislead. Did you verify your program worked?

Thanks for the exhaustive breakdown. It was a couple-of-minutes hack.

I did verify the data was read: I printed the sum of the contents and
inspected them manually via the REPL, after the (sleep 1) call. The
sleep was simply there to confirm all the data was done within a second;
the time would have been less if I had waited on the threads I was
timing, I'd guess, but I didn't recall the wait-for-thread syntax off
the top of my head (and wanted to make sure all 10 were running
asynchronously).

I think there is still some unusual bottleneck with either:

1. Reading the full data set in a loop macro, or
2. Reading the full data set through a single with-open-file

As this (just reading and ignoring the file contents) takes 40+ seconds
in SBCL/Common Lisp, while it takes ~3 seconds at most in every other
language I read the file contents in (even one as terrible as PHP).

I'd expect it to scale linearly: if reading a tenth of the data in a
non-threaded approach takes 0.3 seconds, why does reading all of the
data take 30+ seconds and not 3 seconds?

Madhu

Jul 5, 2015, 12:15:13 AM

* "Pascal J. Bourguignon" <87k2ufv...@kuiper.lan.informatimago.com> :
Wrote on Sun, 05 Jul 2015 01:27:12 +0200:

| Madhu <eno...@meer.net> writes:
|
|> * Matthew Carter <877fqjk...@ahungry.com> :
|> Wrote on Wed, 01 Jul 2015 14:28:22 -0400:
|> You can't use EQ to compare an integer with a rational, if divisor
|> doesn't divide lines exactly.
|
| No if. Even (EQ 1 1) can be false.
|
|> This is wrong. All the threads capture the same DOTIMES variable X -
|> which is modified in THREAD-READ but used in GET-FILE-CONTENTS, the
|> value which is used is likely to be almost always surely wrong. you end
|> up seeking to a wrong position in the file. You must pass X as an
|> argument to the worker lambda.
|
| No. The standard explicitely says:
|
| It is implementation-dependent whether dotimes establishes a new
| binding of var on each iteration or whether it establishes a binding
| for var once at the beginning and then assigns it on any subsequent
| iterations.
|
| The code above is not conforming: the function make thread will receive
| semantically different closures in different implementations.

Thanks Pascal

[BTW: In the (elided) code I posted I named certain variables
bytes-read etc.; the byte naming is misleading unless the stream
element type is BASE-CHAR. However, the code I posted used a UTF-8
external format, which means FILE-LENGTH counted characters and not
bytes. On LW I noticed a slowdown from 1 to 5 seconds on a 50MB file
between :latin-1 (:default) and :utf-8 external formats. Although the
intended purpose of the code is nonsense, maybe the other Pascal can
check the MP constructs and post any errors in them] ---Madhu

Madhu

Jul 5, 2015, 5:45:22 AM

* Madhu <m3r3onq...@leonis4.robolove.meer.net> :
Wrote on Sun, 05 Jul 2015 09:44:50 +0530:

| [BTW: In the (elided) code I posted I named certain variables as
| bytes-read etc., the byte-naming is misleading unless the stream
| element type is BASE-CHAR. However the code I posted showed an utf-8
| external format. which means FILE-LENGTH counted characters and not

This is incorrect again; I missed it the first time. On LW, FILE-LENGTH
counts bytes, but READ-LINE on the UTF-8 stream reads multi-byte
characters, so adding up the characters won't give the file length; the
program was doomed to failure.

Alexandre Landi

Jul 17, 2015, 3:33:23 AM
On Wednesday, July 1, 2015 at 7:05:15 PM UTC+2, paul wallich wrote:

> This is almost certainly a stupid question, but I think it needs to be
> asked, especially with SBCL:
>
> Exactly what are you compiling in the above? Because if you're not
> timing the execution of already-compiled code, the overhead might swamp
> whatever changes you're making.

I am not sure I am getting your point. What do you mean?

My understanding is that with-open-file is a macro which expands before the body gets executed.

This was the code including the body (without the adjustments subsequently suggested):

(with-open-file (stream "my-file.csv" :direction :input
                                      :external-format :ascii)
  (let* ((header (read-line stream nil))
         (columns-list (mapcar #'read-from-string
                               (cl-ppcre:split ";" header)))
         (number-of-columns (length columns-list))
         (column-names (make-array number-of-columns
                                   :initial-contents columns-list))
         (row-names (make-array 1 :adjustable t :fill-pointer 0))
         (values (make-array 1 :adjustable t :fill-pointer 0)))
    (set-syntax-from-char #\; #\ )
    (loop
      :for reader = (read stream nil stream)
      :until (eq reader stream)
      :for count = 0 :then (1+ count)
      :do (if (zerop (mod count number-of-columns))
              (vector-push-extend reader row-names)
              (vector-push-extend reader values))
      :finally (return (values column-names
                               row-names
                               values))))
  nil)

paul wallich

Jul 19, 2015, 3:02:17 PM
On 7/17/15 3:33 AM, Alexandre Landi wrote:
> On Wednesday, July 1, 2015 at 7:05:15 PM UTC+2, paul wallich wrote:
>
>> This is almost certainly a stupid question, but I think it needs to be
>> asked, especially with SBCL:
>>
>> Exactly what are you compiling in the above? Because if you're not
>> timing the execution of already-compiled code, the overhead might swamp
>> whatever changes you're making.
>
> I am not sure I am getting your point. What do you mean?

My question was whether you were timing code that had already been
compiled, or whether your timing included running the compiler on the
body of the macro as part of the total runtime. That would be a little
like timing the execution of a C program starting with "make".
>
> My understanding is that with-open-file is a macro which expands before the body gets executed.
>
> This was the code including the body (without the adjustments subsequently suggested):

[...]

I see. And yeah, gosh, that's more than a touch of overhead, especially
for something that's supposed to eventually run on enormous files.

WJ

Jul 19, 2015, 11:27:32 PM
tar...@google.com wrote:

> First, you could get each line using
> READ-LINE, then use a loop through the
> positions of the delimiter character #\; to
> find the field positions. Now,
> unfortunately, Common Lisp doesn't have a
> PARSE-FLOAT function, so you would have to
> import a library to do that.
> See PARSE-NUMBER from
> http://cliki.net/parse-number or the older
> PARSE-FLOAT from
> http://www.cs.cmu.edu/Groups/AI/util/lang/lisp/code/math/atof/0.html
> (The latter
> doesn't have the nice floating format
> support, though).
>
> So, you could do something along the lines of (untested):
>
> (with-open-file (stream "my-file.csv" :direction :input)
>   (loop :for line = (read-line stream nil nil)
>         :while line
>         :nconc (loop :for column-number :upfrom 0
>                      :as start = 0 :then end
>                      :as end = (position #\; line :start (1+ start))
>                      :collect (if (zerop column-number)
>                                   (subseq line (1+ start) (1- end))
>                                   (parse-number line :start (1+ start) :end end))
>                      :when (null end) :do (loop-finish))))

Gauche Scheme:

(define csv
"1999-01-04;1391.12;3034.53;66.515625;86.2;441.39
1999-01-05;1404.86;3072.41;66.3125;86.17;440.63
1999-01-06;1435.12;3156.59;66.4375;86.32;441.7
1999-01-07;1432.32;3106.08;66.25;86.22;447.67")

(with-input-from-string csv
(cut generator-map
(lambda (line)
(let1 fields (string-split line #\;)
(cons (car fields) (map string->number (cdr fields)))))
read-line))

===>
(("1999-01-04" 1391.12 3034.53 66.515625 86.2 441.39)
("1999-01-05" 1404.86 3072.41 66.3125 86.17 440.63)
("1999-01-06" 1435.12 3156.59 66.4375 86.32 441.7)
("1999-01-07" 1432.32 3106.08 66.25 86.22 447.67))


tar...@google.com

Jul 20, 2015, 1:13:54 PM
On Sunday, July 19, 2015 at 8:27:32 PM UTC-7, WJ wrote:
The whole point was to avoid having to create NEW strings in order to increase performance.

Marco Antoniotti

Jul 21, 2015, 5:36:33 AM
> The whole point was to avoid having to create NEW strings in order to increase performance.
>

Don't expect the WJ-bot to understand such subtleties :)

Having said that, the lack of a STRING-BUFFER abstraction in CL is an annoyance. It is easy to build one on top of string/array with a fill pointer coupled with READ-SEQUENCE, but having a proper interface for such a thing would have given implementors leeway to provide a specialized (and portable) fast version.
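
As a minimal sketch of that do-it-yourself version (MAP-CHUNKS is a
made-up name): a preallocated string refilled with READ-SEQUENCE, so no
per-line string is consed; splitting lines across chunk boundaries is
left to the caller:

(defun map-chunks (fn stream &key (chunk-size 65536))
  (let ((buffer (make-string chunk-size)))
    (loop :for end := (read-sequence buffer stream)
          :while (plusp end)
          ;; Only BUFFER[0..END) holds fresh data on the last chunk.
          :do (funcall fn buffer end))))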

Cheers
--
MA

Raymond Wiker

Jul 21, 2015, 12:44:20 PM
What, exactly, do you want from a STRING-BUFFER that is not covered by
STRING-INPUT-STREAM, STRING-OUTPUT-STREAM and (possibly) displaced
arrays?


Marco Antoniotti

Jul 22, 2015, 6:06:22 AM
On Tuesday, July 21, 2015 at 6:44:20 PM UTC+2, Raymond Wiker wrote:
The use case is the following.

(with-open-file (f "foo.txt" :direction :input)
  (loop with line-buffer = (make-string-buffer 42)
        for line = (read-into-line line-buffer f :eof-error nil :eof-value nil) ; Making this up.
        while line
        do (do-stuff line-buffer)))

READ-SEQUENCE does not quite cut it.

Maybe a CDR would be worthwhile.

Cheers
--
MA

PS Sorry for the LOOP WJ-bait. I should have written a recursive version 3:) 3:)

Pascal J. Bourguignon

Jul 22, 2015, 6:50:01 AM
Marco Antoniotti <mar...@gmail.com> writes:

> The use case is the following.
>
> (with-open-file (f "foo.txt" :direction :input)
> (loop with line-buffer = (make-string-buffer 42)
> for line = (read-into-line line-buffer f :eof-error nil :eof-value nil) ; Making this up.
> while line
> do (do-stuff line-buffer)))
>
> READ-SEQUENCE does not quite cut it.

Do you mean READ-LINE-INTO-BUFFER?

This is lacking. But if you use READ-SEQUENCE to read buffers,
generally you want to split lines virtually inside the buffer.
I guess one would need an extension to Gray streams to implement
READ-LINE-INTO-BUFFER efficiently.


Why the :eof-error parameters?


I often use something like:

(setf current-position (read-sequence buffer stream :start current-position))


> Maybe a CDR would be worthwhile.

Maybe, but as mentioned above, perhaps it may be done at the Gray
stream level.

Marco Antoniotti

Jul 22, 2015, 10:49:47 AM
On Wednesday, July 22, 2015 at 12:50:01 PM UTC+2, informatimago wrote:
> Marco Antoniotti <marcoxa@> writes:
>
> > The use case is the following.
> >
> > (with-open-file (f "foo.txt" :direction :input)
> > (loop with line-buffer = (make-string-buffer 42)
> > for line = (read-into-line line-buffer f :eof-error nil :eof-value nil) ; Making this up.
> > while line
> > do (do-stuff line-buffer)))
> >
> > READ-SEQUENCE does not quite cut it.
>
> Do you mean READ-LINE-INTO-BUFFER?

Probably a better name.

> This is lacking. but if you use read-sequence to read buffers,
> generally you want to virtually split lines in the buffer.
> I guess one would need an extension to Gray streams to implement
> efficiently READ-LINE-INTO-BUFFER.
>

The issue is that once the spec is missing, no implementor can play around with alternatives (portable ones; my pet peeve).

>
> Why the :eof-error parameters?
>

To make it similar in use to READ-LINE. I think a signature could be

read-line-into-buffer buffer stream &key eof-error-p eof-value recursivep ...
==>
buffer
N-read
newline-seen-p

The second value would be like the value returned by READ-SEQUENCE (if no error signaled). In any case, this is just brainstorming.

>
> I often use something like:
>
> (setf current-position (read-sequence buffer stream :start current-position))
>

Yes. But the point of a STRING-BUFFER (or a BUFFER) abstraction would be to allow for more "dynamic behavior".

>
> > Maybe a CDR would be worthwhile.
>
> Maybe, but as mentionned above, perhaps it may be done at the Gray
> stream level.
>

It would still be nice to have a CDR for it.

--
MA

Pascal J. Bourguignon

Jul 22, 2015, 10:59:04 AM
Marco Antoniotti <mar...@gmail.com> writes:

>> Do you mean READ-LINE-INTO-BUFFER?
>
> Probably a better name.
>
>> This is lacking. but if you use read-sequence to read buffers,
>> generally you want to virtually split lines in the buffer.
>> I guess one would need an extension to Gray streams to implement
>> efficiently READ-LINE-INTO-BUFFER.
>>
>
> The issue is that once the spec is missing, no implementor can play
> around with alternatives (portable ones; my pet peeve).
>
>>
>> Why the :eof-error parameters?
>>
>
> To make it similar in use to READ-LINE. I think a signature could be
>
> read-line-into-buffer buffer stream &key eof-error-p of-value recursivep ....
> ==>
> buffer
> N-read
> newline-seen-p
>
> The second value would be like the value returned by READ-SEQUENCE (if
> no error signaled). In any case, this is just brainstorming.

We also want the READ-SEQUENCE parameters:

read-line-into-buffer buffer stream &key eof-error-p eof-value
recursivep start end

then NEWLINE-SEEN-P takes its full meaning, since you could be
reading a line longer than the available space in the buffer between
start and end.

It would be specified that the implementation should try to make it more
efficient than merely reading character by character.


>> I often use something like:
>>
>> (setf current-position (read-sequence buffer stream :start current-position))
>>
>
> Yes. But the point of a STRING-BUFFER (or a BUFFER) abstraction would
> be to allow for more "dynamic behavior".
>
>>
>> > Maybe a CDR would be worthwhile.
>>
>> Maybe, but as mentionned above, perhaps it may be done at the Gray
>> stream level.
>>
>
> It would still be nice to have a CDR for it.

Yes.

paul wallich

Jul 22, 2015, 11:38:14 AM
How exactly would start and end interact with the buffer length? Do you
signal a continuable error as soon as you know the distance between
start and end is too great? Do you wait until you've actually read too
much?

If you're going to do this, would it make sense to have instead or also
something like read-chunk-into-buffer where the function isn't hardcoded
for #\newline?

paul

Pascal J. Bourguignon

Jul 22, 2015, 1:05:17 PM
Yes, you can signal the error early, because START and END would be
specified to be /bounding index designators/.

start, end---/bounding index designators/ of sequence. The defaults for
start and end are 0 and nil, respectively.


> If you're going to do this, would it make sense to have instead or
> also something like read-chunk-into-buffer where the function isn't
> hardcoded for #\newline?

Is it necessary?
How many times do we read binary files up until some byte?

So, something like:

read-line-into-buffer buffer stream &key eof-error-p of-value
recursivep start end
(terminator-character #\newline)

read-bytes-into-buffer buffer stream &key eof-error-p of-value
recursivep start end
(terminator-byte 0)

?

Madhu

Jul 23, 2015, 12:28:58 AM

* Marco Antoniotti <fbc0719f-5db5-4145...@googlegroups.com> :
Wrote on Wed, 22 Jul 2015 03:06:18 -0700 (PDT):

|> > Having said that, the lack of a STRING-BUFFER abstraction in CL is
|> > an annoyance. It is easy to build one on top of string/array with
|> > a fill pointer coupled with READ-SEQUENCE, but having a proper
|> > interface for such a thing would have given implementors leeway to
|> > provide a specialized (and portable) fast version.
|>
|> What, exactly, do you want from a STRING-BUFFER that is not covered
|> by STRING-INPUT-STREAM, STRING-OUTPUT-STREAM and (possibly) displaced
|> arrays?
|
| The use case is the following.
|
| (with-open-file (f "foo.txt" :direction :input)
| (loop with line-buffer = (make-string-buffer 42)
| for line = (read-into-line line-buffer f :eof-error nil :eof-value nil) ; Making this up.
| while line
| do (do-stuff line-buffer)))
|
| READ-SEQUENCE does not quite cut it.
| Maybe a CDR would be worthwhile.

But this use-case does not illustrate a use-case for a STRINGBUFFER.
The issues concerning READ-LINE (and READ-LINE-INTO-BUFFER) have to do
with newline recognition, i.e. handling the end-of-line convention, and
with buffering issues involved in READING the input stream. But the API
use of the hypothetical `stringbuffer' (i.e. WRITING to the buffer) is
already covered by the WRITE-CHAR, STRING-OUTPUT-STREAM and
GET-OUTPUT-STREAM-STRING abstractions over VECTOR-PUSH-EXTEND and
FILL-POINTER manipulations. So a spec of READ-LINE-INTO-BUFFER does not
address any `stringbuffer' API issues; it addresses stream copying
issues (in the implementation) --- unless I'm missing something about
`stringbuffer'. --- Madhu

Dimitri Fontaine

Jul 23, 2015, 5:20:14 AM
"Pascal J. Bourguignon" <p...@informatimago.com> writes:
>> If you're going to do this, would it make sense to have instead or
>> also something like read-chunk-into-buffer where the function isn't
>> hardcoded for #\newline?
>
> Is it necessary?
> How many times do we read binary files up until some byte?

See cl-csv and this file:

https://github.com/AccelerationNet/cl-csv/blob/master/read-until.lisp

CSV is such a poorly defined format that sometimes the newline character
turns out to be a pseudo-random non-printable byte in the ASCII range
rather than an actual newline.
--
dim

Marco Antoniotti

Jul 23, 2015, 7:44:24 AM
I think you are right in all your observations. But let's just take a step back and say the following:

READ-LINE allocates a new string object per call (YMMV)

You want to be able to "read" a "line" into a (user level) "buffer" like READ-SEQUENCE does.

READ-SEQUENCE does not cut it because it will "barf" at least if your current line is longer than the "buffer".

Every quotation mark above indicates a term that is amenable to interpretation.

So, I'd like an API which can be limited to "lines" that combined the "good parts" of READ-LINE and READ-SEQUENCE.

I am not so stuck up on the actual signature of the function.

Cheers
--
MA

Pascal J. Bourguignon

Jul 23, 2015, 7:45:12 AM
If you read a CSV file in binary, you will still have to parse it
character by character because, as you note, a newline can be found
in the data (between double quotes or prefixed by a backslash).

Hence you won't read a "line" at once.
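
To illustrate, a deliberately simplified sketch: it tracks double quotes
only, ignoring doubled-quote and backslash escapes, and READ-CSV-RECORD
is a made-up name:

(defun read-csv-record (stream &optional (buffer (make-array 0 :element-type 'character
                                                               :adjustable t :fill-pointer 0)))
  ;; Collect one *logical* CSV record into BUFFER, keeping quoted
  ;; newlines as data.  Returns NIL at end of file when nothing was read.
  (setf (fill-pointer buffer) 0)
  (loop :with in-quotes := nil
        :for ch := (read-char stream nil nil)
        :do (cond ((null ch)
                   (return (when (plusp (fill-pointer buffer)) buffer)))
                  ((char= ch #\")
                   (setf in-quotes (not in-quotes))
                   (vector-push-extend ch buffer))
                  ((and (char= ch #\Newline) (not in-quotes))
                   (return buffer))
                  (t (vector-push-extend ch buffer)))))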

Madhu

Jul 23, 2015, 11:47:47 AM

* Marco Antoniotti <99648d15-5119-411d...@googlegroups.com> :
Wrote on Thu, 23 Jul 2015 04:44:17 -0700 (PDT):
Maybe READ-LINE's API is enough, without the READ-SEQUENCE parts, which
are confusing.

Here is a sample implementation that follows exactly the API of
READ-LINE, except for the extra first `string-buffer' argument which is
just an adjustable character array with a fill-pointer. (see below)

(defun read-line-into-buffer
    (buffer input-stream &optional eof-error-p eof-value recursive-p)
  (setf (fill-pointer buffer) 0)
  (let ((c (read-char input-stream eof-error-p nil recursive-p)))
    (if (null c)
        (values eof-value t)
        (loop (if (and c (not (eql c #\Newline)))
                  (vector-push-extend c buffer)
                  (return (values buffer (null c))))
              (setq c (read-char input-stream nil nil recursive-p))))))

(time
 (with-open-file (f "file" :direction :input)
   (loop with line-buffer = (make-array 0 :adjustable t :fill-pointer t
                                          :element-type (stream-element-type f))
         for line = (read-line-into-buffer line-buffer f nil nil)
         while line
         sum (length line))))

I haven't tested the EOF corner cases, but you can see how this
allocates less than the version with READ-LINE. (I did NOT see this
saving in allocation with STRING-OUTPUT-STREAMs in CCL and LW.) Plain
old READ-LINE + GC can still be orders of magnitude (!) faster than
reading character by character, despite more space being allocated. As
Pascal noted elsewhere, this function can be written efficiently only if
it has access to the stream code. (And if line-buffer should be a
string-output-stream which reuses its underlying storage, the function's
code should have access to that storage.) ---Madhu

Pascal J. Bourguignon

Jul 23, 2015, 1:18:33 PM
Madhu <eno...@meer.net> writes:

> Maybe READ-LINE's API is enough without the READ-SEQUENCE parts which
> are confusing

They're not confusing to me; I've used them a lot.

It's perfectly natural to use parts of a buffer with READ-SEQUENCE and
other operations, when you are using a buffer.
I'd propose the following reference implementation. An actual
implementation can be more optimized, because it can have knowledge of
the length of the next line to read from the kernel buffers, or from
its own buffers (in buffered streams).

While READ-LINE-INTO-SEQUENCE wants to avoid copying data (e.g. an
implementation could call the kernel with a pointer into the buffer
itself), the unix API doesn't let the userspace process know how many
bytes have been buffered for a line input (definitely not for a file,
and not even from the tty device driver, check man 4 tty_ioctl; for a
tty in line discipline the information would be present in the driver,
but in this, the only case where we could optimize the thing, there's
no point, since 1- it's slow human input, 2- it's small input: humans
don't usually type megacharacter lines). So in any case we will have
at least one internal buffer (kernel or userspace), and one copy into
our buffer. The kernel doesn't allow us to scan its buffer, so an
implementation had better have its own userspace buffers, instruct the
kernel to fill them (so, direct from the device to the userspace buffer),
and then the implementation can scan them for the newline and copy the
line into our buffer.


Finally, notice that if you want performance with line input from
files, you can always implement a record structure on a unix file
(be it variable or fixed record length), and then there's no need to
scan for a newline anymore.
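
For instance, with fixed-length records (a sketch; it assumes BUFFER is
at least RECORD-LENGTH long and a fixed-width external format, since
FILE-POSITION arithmetic breaks down with multi-byte encodings, as noted
upthread):

(defun read-record (stream n record-length buffer)
  ;; Record N starts at N * RECORD-LENGTH: no newline scanning needed.
  (file-position stream (* n record-length))
  (let ((end (read-sequence buffer stream :end record-length)))
    (when (= end record-length)
      buffer)))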




------------------------------------------------------------------------

(eval-when (:compile-toplevel :load-toplevel :execute)
(pushnew :costly-assert *features*))

(declaim (inline %read-char-until %finish-read-line-into-sequence))

(defun %read-char-until (stream recursivep store)
(loop
:for ch := (read-char stream nil nil recursivep)
:while (and ch (funcall store ch))
:finally (return ch)))

(defun %finish-read-line-into-sequence (ch buffer stream eof-error-p eof-value start)
(if (null ch)
(if eof-error-p
(error 'end-of-file :stream stream)
(values eof-value start nil))
(values buffer start (eql ch #\Newline))))

(defgeneric read-line-into-sequence (sequence input-stream
&key
eof-error-p eof-value recursivep
start end)

(:documentation "

Reads characters from the INPUT-STREAM until a #\\Newline is found, and
store the characters read into the SEQUENCE, from START, up to below
END. If END is reached before the #\\Newline character is read, then
reading stops there and the third result value is NIL. The #\Newline
character is not stored. No other slot of the SEQUENCE is modified
than those between START and POSITION.

RETURN: VALUE, POSITION, NEWLINE-SEEN-P

VALUE: Either SEQUENCE or EOF-VALUE depending on whether an
end-of-file has been seen.

SEQUENCE: A sequence (OR LIST VECTOR). If specialized, the vector
must have an element-type that is a supertype of the
stream element-type. If a fill-pointer is present, it
is ignored.

POSITION: The index in the SEQUENCE of the first element not
written. (<= START POSITION (OR END (LENGTH BUFFER)))


NEWLINE-SEEN-P: Whether a #\\Newline has been read.

INPUT-STREAM: an input stream. The element-type of the INPUT-STREAM
must be a subtype of CHARACTER.

EOF-ERROR-P: a generalized boolean. The default is true. If true,
then an END-OF-FILE error is signaled upon end of file.

EOF-VALUE: an object. The default is NIL.

RECURSIVE-P: a generalized boolean. The default is NIL. If
RECURSIVE-P is true, this call is expected to be
embedded in a higher-level call to read or a similar
function used by the Lisp reader.

START, END: bounding index designators of SEQUENCE.
The defaults for START and END are 0 and NIL, respectively.

")

(:method ((buffer vector) (stream stream) &key
(eof-error-p t)
(eof-value nil)
(recursivep nil)
(start 0)
(end nil))
(let ((end (or end (length buffer))))
(check-type start (and fixnum (integer 0)))
(check-type end (or null (and fixnum (integer 0))))
(cond
((and (= end start) (<= start (length buffer)))
(values buffer start nil))
((or (< end start) (< (length buffer) start))
(error "Bad interval for sequence operation on ~S: start = ~A, end = ~A"
buffer start end))
(t
(%finish-read-line-into-sequence
(%read-char-until stream recursivep
(lambda (ch)
(if (char= #\Newline ch)
nil
(progn
(setf (aref buffer start) ch)
(incf start)
(< start end)))))
buffer stream eof-error-p eof-value start)))))

(:method ((buffer list) (stream stream) &key
(eof-error-p t)
(eof-value nil)
(recursivep nil)
(start 0)
(end nil))
(check-type start (and fixnum (integer 0)))
(check-type end (or null (and fixnum (integer 0))))
(let ((current buffer))
(loop
:repeat start
:do (if (null current)
(error "Bad interval for sequence operation on ~S: start = ~A, end = ~A"
buffer start end)
(pop current)))
#+costly-assert (assert (<= start (length buffer)))
(cond
((if end
(= start end)
(null current))
(values buffer start nil))
((or (null current) (and end (< end start)))
(error "Bad interval for sequence operation on ~S: start = ~A, end = ~A"
buffer start end))
(t
#+costly-assert (assert (and (or (null end) (<= start end))
(< start (length buffer))))
(%finish-read-line-into-sequence
(%read-char-until stream recursivep
(lambda (ch)
(if (char= #\Newline ch)
nil
(progn
(setf (car current) ch
current (cdr current))
(incf start)
(if end
(< start end)
current)))))
buffer stream eof-error-p eof-value start))))))



(defun test/read-line-into-sequence ()
(let ((buffer (make-array 40 :element-type 'character :initial-element #\space)))
(with-input-from-string (input "Hello
World
Howdy")
(assert (equal
(multiple-value-list
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 1 :end 1))
'(" "
1 nil)))
(assert (handler-case
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 2 :end 1)
(:no-error (&rest results)
(declare (ignore results))
nil)
(error (err)
(declare (ignore err))
t)))
(assert (handler-case
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 41)
(:no-error (&rest results)
(declare (ignore results))
nil)
(error (err)
(declare (ignore err))
t)))
(assert (equal
(multiple-value-list
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 0))
'("Hello "
5 t)))
(assert (equal
(multiple-value-list
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 10 :end 13))
'("Hello Wor "
13 nil)))
(assert (equal
(multiple-value-list
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 20))
'("Hello Wor ld "
22 t)))
(assert (equal
(multiple-value-list
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 37))
'("Hello Wor ld How"
40 nil)))
(assert (equal
(multiple-value-list
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 30))
'(:eof 32 nil)))))

(let ((buffer (make-list 40 :initial-element 0)))
(with-input-from-string (input "Hello
World
Howdy")
(assert (equal
(multiple-value-list
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 1 :end 1))
'((0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0) 1 nil)))
(assert (handler-case
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 2 :end 1)
(:no-error (&rest results)
(declare (ignore results))
nil)
(error (err)
(declare (ignore err))
t)))
(assert (handler-case
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 41)
(:no-error (&rest results)
(declare (ignore results))
nil)
(error (err)
(declare (ignore err))
t)))
(assert (equal
(multiple-value-list
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 0))
'((#\H #\e #\l #\l #\o 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0) 5 t)))
(assert (equal
(multiple-value-list
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 10 :end 13))
'((#\H #\e #\l #\l #\o 0 0 0 0 0 #\W #\o #\r 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0) 13 nil)))
(assert (equal
(multiple-value-list
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 20))
'((#\H #\e #\l #\l #\o 0 0 0 0 0 #\W #\o #\r 0 0 0 0 0 0 0
#\l #\d 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0) 22 t) ))
(assert (equal
(multiple-value-list
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 37))
'((#\H #\e #\l #\l #\o 0 0 0 0 0 #\W #\o #\r 0 0 0 0 0 0 0
#\l #\d 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 #\H #\o #\w) 40
nil)))
(assert (equal
(multiple-value-list
(read-line-into-sequence buffer input
:eof-error-p nil :eof-value :eof :recursivep nil
:start 30))
'(:eof 32 nil)))))
:success)

(test/read-line-into-sequence)
------------------------------------------------------------------------

Madhu

Jul 23, 2015, 9:31:11 PM

* "Pascal J. Bourguignon" <87y4i6z...@kuiper.lan.informatimago.com> :
Wrote on Thu, 23 Jul 2015 19:18:26 +0200:

| (defgeneric read-line-into-sequence (sequence input-stream
| &key
| eof-error-p eof-value recursivep
| start end)
|
| Reads characters from the INPUT-STREAM until a #\\Newline is found, and
| store the characters read into the SEQUENCE, from START, up to below
| END. If END is reached before the #\\Newline character is read, then
| reading stops there and the third result value is NIL. The #\Newline
| character is not stored. No other slot of the SEQUENCE is modified
| than those between START and POSITION.

I think this does something slightly different and is one level lower,
as it does not allocate storage. I understood the stringbuffer
"dynamic" requirement to be that the I/O function would allocate more
memory if required; e.g. READ-LINE-INTO-BUFFER upthread adjusted the
given buffer to fit the line being read, without allocating a new
buffer for each line.

I.e., Marco wanted the equivalent functionality of getline(3):

ssize_t getline(char **lineptr, size_t *n, FILE *stream);

If *lineptr is NULL, then getline() will allocate a buffer for
storing the line, which should be freed by the user program. (In
this case, the value in *n is ignored.)

Alternatively, before calling getline(), *lineptr can contain a
pointer to a malloc(3)-allocated buffer *n bytes in size. If the
buffer is not large enough to hold the line, getline() resizes it
with realloc(3), updating *lineptr and *n as necessary.

--- Madhu

Vityok

Jul 24, 2015, 4:27:53 AM
On Wednesday, July 1, 2015 at 8:05:15 PM UTC+3, paul wallich wrote:
> On 7/1/15 5:28 AM, Alexandre Landi wrote:
> > Le vendredi 5 juin 2015 21:37:00 UTC+2, Zach Beane a écrit :
> >
> >> It's possible that SBCL's internal representation of characters as
> >> 32-bit values, and the conversion time that entails, is causing some of
> >> the increased runtime.
> >
> > Given what I found here: http://www.sbcl.org/manual/#External-Formats and since I know all of my data is encoded as ASCII, I tried wrapping my script into
> >
> > (with-open-file (stream "my-file.csv" :external-format :ascii)
>
> [...]
> > ; compiling (TIME (WITH-OPEN-FILE # ...))
> > Evaluation took:
> > 40.128 seconds of real time
> > 40.148000 seconds of total run time (39.848000 user, 0.300000 system)
> > [ Run times consist of 4.400 seconds GC time, and 35.748 seconds non-GC time. ]
> > 100.05% CPU
> > 76,066,872,855 processor cycles
> > 7,103,713,384 bytes consed
>
> This is almost certainly a stupid question, but I think it needs to be
> asked, especially with SBCL:
>
> Exactly what are you compiling in the above? Because if you're not
> timing the execution of already-compiled code, the overhead might swamp
> whatever changes you're making.
>
> paul

Hi all,

by the way, I once tried to benchmark similar operations and even
collected and published some results:

https://bitbucket.org/vityok/cl-faster-input/wiki/Benchmark%20results

Benchmarked functions either read the file line-by-line or in
chunks of a pre-defined byte-array buffer:

https://bitbucket.org/vityok/cl-faster-input/src/db98cd726be7f2c2bf1cda70aad63ffd7ae8c9ec/benchmark-read-line.lisp?at=default#benchmark-read-line.lisp-93

My aim was to speed up parsing logs, where over 90% of the lines
are of no interest (they don't match the pre-defined problem
patterns), so it might not be so useful for parsing CSV, where all
the data is processed.

I did these tests long ago, and they are very far from perfect,
but hopefully somebody will find them useful.

Thanks,
Victor

Marco Antoniotti

Jul 25, 2015, 5:42:07 PM
Yep. Pretty much what I had implemented a long time ago. It is less "heavy" than the combination with READ-SEQUENCE and it works well for the specific idiomatic case.

Alas, we have neither "user level" string buffers (the adjustable character or (unsigned-byte x) arrays), nor the READ-LINE-INTO-BUFFER.

Cheers
--
MA

Madhu

Jul 27, 2015, 1:41:38 PM

* Marco Antoniotti <bff9d233-7c57-484b...@googlegroups.com> :
Wrote on Sat, 25 Jul 2015 14:41:59 -0700 (PDT):

| Yep. Pretty much what I had implemented a long time ago. It is less
| "heavy" than the combination with READ-SEQUENCE and it works well for
| the specific idiomatic case.
|
| Alas, we have neither "user level" string buffers (the adjustable
| character or (unsigned-byte x) arrays), nor the READ-LINE-INTO-BUFFER.

Heh, this is the full circle when going around in circles. I can't
believe this is an honest statement, after having had the discussion
above, after having understood the issues with the array and stream
API, and after having written READ-LINE-INTO-BUFFER with CL's adjustable
arrays.

Marco Antoniotti

Jul 28, 2015, 5:12:26 AM
Madhu, I had an implementation similar to yours years ago. My point is that we do not have a reliable spec "out there" for this (it is not in the ANSI standard).
Because of this, this thread is bound to reappear, as people do not have an "obvious" way to read lines of variable length from a character stream without consing up strings.

Cheers
--
MA

Pascal J. Bourguignon

Jul 28, 2015, 10:19:11 AM
Marco Antoniotti <mar...@gmail.com> writes:

> Alas, we have neither "user level" string buffers (the adjustable
> character or (unsigned-byte x) arrays), nor the READ-LINE-INTO-BUFFER.

What are you talking about?

(adjust-array "Hello world" 42 :initial-element #\_)
--> "Hello world_______________________________"

(adjust-array (coerce #(1 2 3) '(vector (unsigned-byte 8))) 10 :initial-element 0)
--> #(1 2 3 0 0 0 0 0 0 0)

Pascal J. Bourguignon

Jul 28, 2015, 10:21:02 AM
But we can hardly write a CDR to have some "reliable" spec be integrated
in implementations, if we can't agree on a function and its API…

I proposed my documented "reference implementation".
Apparently you and Madhu have something else in mind.

Marco Antoniotti

Jul 29, 2015, 4:33:56 AM
On Tuesday, July 28, 2015 at 4:19:11 PM UTC+2, informatimago wrote:
> Marco Antoniotti <mar...@gmail.com> writes:
>
> > Alas, we have neither "user level" string buffers (the adjustable
> > character or (unsigned-byte x) arrays), nor the READ-LINE-INTO-BUFFER.
>
> What are you talking about?
>
> (adjust-array "Hello world" 42 :initial-element #\_)
> --> "Hello world_______________________________"
>
> (adjust-array (coerce #(1 2 3) '(vector (unsigned-byte 8))) 10 :initial-element 0)
> --> #(1 2 3 0 0 0 0 0 0 0)
>

Of course. But this should be "hidden" by the API. Which is what Madhu (and I) did.

--
MA

Marco Antoniotti

Jul 29, 2015, 5:04:13 AM
On Tuesday, July 28, 2015 at 4:21:02 PM UTC+2, informatimago wrote:
> Marco Antoniotti <mar...@gmail.com> writes:
>
> > On Monday, July 27, 2015 at 7:41:38 PM UTC+2, Madhu wrote:
> >> * Marco Antoniotti <bff9d233-7c57-484b...@googlegroups.com> :
> >> Wrote on Sat, 25 Jul 2015 14:41:59 -0700 (PDT):
> >>
> >> | Yep. Pretty much what I had implemented a long time ago. It is less
> >> | "heavy" than the combination with READ-SEQUENCE and it works well for
> >> | the specific idiomatic case.
> >> |
> >> | Alas, we have neither "user level" string buffers (the adjustable
> >> | character or (unsigned-byte x) arrays), nor the READ-LINE-INTO-BUFFER.
> >>
> >> Heh this is the full circle when going around in circles. I can't
> >> believe this is an honest statement, after going had the discussion
> >> above and after having understood the issues with the array and stream
> >> API, and after having written READ-LINE-INTO-BUFFER with CL's adjustable
> >> arrays.
> >
> > Madhu. I had an implementation similar to yours years ago. My point
> > is that we do not have a reliable spec "out there" for this (it is not
> > in the ANSI).
> > Because of this, this thread is bound to reappear as people do not
> > have the "obvious" way to read lines of variable length from a
> > character stream without consing up strings.
>
> But we can hardly write a CDR to have some "reliable" spec be integrated
> in implementations, if we can't agree on a function and its API...
>
> I proposed my documented "reference implementation".
> Apparently you and Madhu have something else in mind.
>

I think your "reference implementation" is good (as usual). But I'd like a separate "class/type" of adjustable strings. This is not something that can be done easily at the user level. Of course you can have a function like:

(defun make-string-buffer (size &key (initial-element #\Null) (element-type 'character))
  (assert (subtypep element-type 'character))
  (assert (typep initial-element 'character))
  (assert (typep size '(and fixnum (integer 0)))) ; FIXNUMP is not standard CL
  (make-array size
              :initial-element initial-element
              :element-type element-type
              :adjustable t
              :fill-pointer 0))

But the result is not a STRING (or at least, it is a "confusing" one). Cf. (on LWM)

CL-USER 2 > (make-string-buffer 42)
""

CL-USER 3 > (vector-push-extend #\a * 10)
0

CL-USER 4 > **
"a"

CL-USER 5 > (stringp ***)
T

CL-USER 6 > (type-of **)
(ARRAY CHARACTER (42))

CL-USER 7 > (type-of "foo")
SIMPLE-BASE-STRING

CL-USER 8 > (subtypep * '(ARRAY CHARACTER))
NIL
T

CL-USER 9 > (subtypep '(ARRAY CHARACTER) **)
NIL
T

CL-USER 10 > (subtypep '(ARRAY CHARACTER) 'string) ; ... and viceversa.
NIL
T

This is because the relationships between arrays, their "features" or "traits" (or whatever the fashionable term is) like "fill pointer", and the type system are fuzzy, not to speak of the relations between BASE-CHAR and CHARACTER.

BTW. Note that in order to make a good CDR out of your ref implementation you will have to - e.g. - carefully write down all the "Exceptional situations:".

Cheers
--
MA

Pascal J. Bourguignon

Jul 29, 2015, 11:44:28 AM
Perhaps there's some confusion here.

java.lang.StringBuffer is merely a mutable java.lang.String.

Perhaps some of us were thinking about java.lang.StringBuilder.

In the builder pattern there's an assumption of optimization and
possibly laziness: the messages sent to a builder object are recorded,
and not necessarily effected immediately; instead the object is built
lazily when requested. This allows for the optimization of
operations, notably by applying algebraic transformations and
simplifications (thereby dropping useless operations).

In all cases, in Java those are classes, with defined sets of methods,
and full data hiding: we don't know the implementation, and we can even
obtain instances of subclasses with different implementations without
our knowledge. (Something similar occurs also in Objective-C (Cocoa)
with class clusters).


Following this model, you might want to define a string CLOS class, and
a mutable-string CLOS class? I don't feel that it would be
advantageous, over having strings being vectors and sequences.


However, the notion of string builder can be useful.

It's the reason for a function such as:
com.informatimago.common-lisp.cesarum.string:concatenate-strings
which allocates the result string once, instead of adjusting a mutable
string over and over while it's increasing in size…
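
In miniature, the allocate-once strategy looks like this (a sketch, not
the actual cesarum implementation; CONCAT-STRINGS is a made-up name):

(defun concat-strings (strings)
  (let* ((total (reduce #'+ strings :key #'length))
         (result (make-string total))
         (pos 0))
    ;; One allocation up front, then REPLACE each piece into place.
    (dolist (s strings result)
      (replace result s :start1 pos)
      (incf pos (length s)))))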

And even if it's done in a way that amortizes the cost, under an API
such as WITH-OUTPUT-TO-STRING, it still remains costly, because it will
eventually need to allocate much more memory than needed (it cannot
know the final length).

On the other hand, in the string builder pattern, and in Lisp, given the
impossibility of distinguishing immutable strings from mutable strings,
the builder (or WITH-OUTPUT-TO-STRING) would have to collect copies of
all the input strings in a list! Again, this duplicates memory usage
and is not as efficient as something like concatenate-strings above.

But still, a string builder with the assumption (requirement) of
immutability of the given data until the lazy construction of the
result would be the most efficient way to do it. It would be a more
complex data structure and set of algorithms than mere vectors of
characters with CONCATENATE, but in complex and big cases it may be
worth it.


Finally, notice that adjust-array returns a new array when the original
is not adjustable. So you can use it (hidden in some abstraction if you
want) transparently on all strings, as long as you use the result
instead of expecting mutation.

Vityok

Jul 31, 2015, 8:54:45 AM
On Friday, July 24, 2015 at 11:27:53 AM UTC+3, Vityok wrote:
I think that this discussion is lacking measurements and numbers.

So, I've tried to benchmark different approaches, including the ones
suggested in this thread. So far, the standard read-line function
performs best of all the alternatives.

Please take a look at the benchmark code:

https://bitbucket.org/vityok/cl-faster-input/src/d436fc5a40a8e0caacb6da405daeef8dd0690350/benchmark-read-line.lisp?at=default

Feedback and suggestions for improvements are very welcome.

Thanks,
Victor

Marco Antoniotti

Aug 1, 2015, 6:54:25 AM
It is not clear what you are measuring from the code; in any case, I am not surprised that READ-LINE performed best. One point of the discussion is that in the absence of an API for "adjustable strings or buffers", implementors don't have anything to go on to provide optimized code.

Cheers
--
MA