Demo on parsing text file, please

Jinsong Zhao

Apr 7, 2014, 2:25:39 AM

Hi there,

I hope to find a demo of Common Lisp parsing a text file. Or, if
possible, would you please show me how to parse the following text:

;;;; start here ;;;;

NET ATOMIC CHARGES AND DIPOLE CONTRIBUTIONS

ATOM NO. TYPE CHARGE ATOM ELECTRON DENSITY
1 C 0.0699 3.9301
2 O -0.3088 6.3088
3 H 0.0085 0.9915
4 H 0.0410 0.9590
5 H 0.0085 0.9915
6 H 0.1809 0.8191
DIPOLE X Y Z TOTAL
POINT-CHG. -0.697 0.295 0.542 0.932
HYBRID -0.149 0.289 0.532 0.624
SUM -0.847 0.585 1.074 1.488

;;;; end here ;;;;

I hope to assign "TYPE" and "CHARGE" to two variables, e.g., type and
charge.

I tried to read the text file using a loop. I can find the proper position.
However, I don't know how to continue parsing the lines that follow.

(with-open-file (stream "file.txt")
  (do ((line (read-line stream nil)
             (read-line stream nil)))
      ((null line))
    (if (search "ATOM ELECTRON DENSITY" line) (;; lost here...
        ))))

Any suggestions will be appreciated.

Regards,
Jinsong

Pascal J. Bourguignon

Apr 7, 2014, 6:00:06 AM

Jinsong Zhao <jsz...@yeah.net> writes:

> Hi there,
>
> I hope to find a demo of common lisp on parsing a text file. Or, if
> possible, would your please to show me how to parse the following
> text:
>
> ;;;; start here ;;;;
>
> NET ATOMIC CHARGES AND DIPOLE CONTRIBUTIONS
>
> ATOM NO. TYPE CHARGE ATOM ELECTRON DENSITY
> 1 C 0.0699 3.9301
> 2 O -0.3088 6.3088
> 3 H 0.0085 0.9915
> 4 H 0.0410 0.9590
> 5 H 0.0085 0.9915
> 6 H 0.1809 0.8191
> DIPOLE X Y Z TOTAL
> POINT-CHG. -0.697 0.295 0.542 0.932
> HYBRID -0.149 0.289 0.532 0.624
> SUM -0.847 0.585 1.074 1.488
>
> ;;;; end here ;;;;
>
> I hope to assign "TYPE" and "CHARGE" into two variable, e.g., type and
> change.

It looks like you have fixed size records, so you could read

https://groups.google.com/forum/#!original/comp.lang.lisp/S2aqG-UPhe8/MTIYx8VNArgJ

adding some code to detect what kind of record you're going to read.

On the other hand, if the data is formatted in a subset of the lisp
syntax, you could also just read it using the lisp reader (easily enough
if there's no optional data).
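
As a minimal sketch of the reader approach, assuming every field on a line is a valid Lisp token (an integer, a symbol, or a float in Lisp syntax): READ-FIELDS is a hypothetical helper name, not something from the linked posts, and it binds *READ-EVAL* to NIL so that "#." forms in the input cannot run code.

```lisp
;; READ-FIELDS is a hypothetical helper, named here for illustration only.
(defun read-fields (line)
  "Return the list of Lisp objects that READ finds in LINE."
  (let ((*read-eval* nil))              ; refuse #. forms in untrusted input
    (with-input-from-string (stream line)
      ;; The stream object itself serves as a unique end-of-input marker.
      (loop :for object = (read stream nil stream)
            :until (eq object stream)
            :collect object))))

(read-fields "2 O -0.3088 6.3088")
;; => (2 O -0.3088 6.3088)
```

Note that the symbols are interned in the current package, and the floats are read according to *READ-DEFAULT-FLOAT-FORMAT*.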

> I try to read the text file using loop. I can find the proper
> position. However I don't know how to continue to parse the following
> lines.
>
> (with-open-file (stream "file.txt")
> (do ((line (read-line stream nil)
> (read-line stream nil)))
> ((null line))
> (if (search "ATOM ELECTRON DENSITY" line) (;; lost here...
> ))))
>
> Any suggestions will be appreciated.

https://groups.google.com/forum/#!original/comp.lang.lisp/gw1t5lTvu1k/CXQ-UCxZtDwJ

What you need to do is to analyse the structure of your file and give a
description of it. For example, you could come up with this grammar:

file ::= header atom-table dipole-table trailer .
header ::= { empty-line } 'NET' 'ATOMIC' 'CHARGES' 'AND' 'DIPOLE' 'CONTRIBUTIONS' { empty-line } .
atom-table ::= atom-table-header { atom-table-line } .
atom-table-header ::= 'ATOM' 'NO.' 'TYPE' 'CHARGE' 'ATOM' 'ELECTRON' 'DENSITY' .
atom-table-line ::= atomic-number type charge atom-electron-density .
atomic-number ::= integer .
type ::= symbol .
charge ::= floating-point-number .
atom-electron-density ::= floating-point-number .
dipole-table ::= dipole-table-header { dipole-table-line } .
dipole-table-header ::= 'DIPOLE' 'X' 'Y' 'Z' 'TOTAL'.
dipole-table-line ::= dipole-title x y z total .
dipole-title ::= symbol .
x ::= floating-point-number .
y ::= floating-point-number .
z ::= floating-point-number .
total ::= floating-point-number .

Then you can write a parser for it.

But the important thing here is that you have a data structure that is
composed of a sequence of two different repetitions.

        /
        | file-header
        |
        | atom-header
        |
        |           /
        |           | atom-no
        |           |
        | atom     <  type
        |           |
        |           | charge
        |           |
        |           | electron-density
 file  <            \
        |
        | dipole-header
        |
        |           /
        |           | title
        |           |
        |           | x
        | dipole   <
        |           | y
        |           |
        |           | z
        |           |
        |           | total
        \           \

That means that the program to process this file will consist of a
sequence of two loops:

(progn
  (read-file-header)
  (read-atom-header)
  (loop
    :named read-atom-lines
    :do …)
  (read-dipole-header)
  (loop
    :named read-dipole-lines
    :do …))

So you're starting on the wrong foot, by writing a single outer loop:
the file structure is NOT a repetition of things, it's a sequence of
different things!

Actually, we can skip the read-dipole-header phase, since it has only
one line which will be read by the read-atom-lines loop (if we had to
process it, we could have this loop return it or save it for further
processing).

(defun read-atomic-charges-and-dipole-contributions-file (path)
  (let ((atomic-charges '())
        (dipole-contributions '()))
    (with-open-file (stream path)
      (read-file-header stream)
      (read-atom-header stream)
      (loop
        :named read-atom-lines
        :for line = (read-line stream nil nil)
        :while (atomic-charge-line-p line)
        :do (push (parse-atomic-charge-line line) atomic-charges))
      (loop
        :named read-dipole-lines
        :for line = (read-line stream nil nil)
        :while (dipole-contribution-p line)
        :do (push (parse-dipole-contribution-line line) dipole-contributions)))
    (list atomic-charges dipole-contributions)))

;; Here we just read the number of lines, without any check. You could
;; also parse them, cf. the BNF above.

(defun read-file-header (stream)
  (read-line stream nil)
  (read-line stream nil)
  (read-line stream nil))

(defun read-atom-header (stream)
  (read-line stream nil))

;; Similarly, the detection of type of lines is primitive, but if we
;; assume the input file is always correct, sufficient.

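(defun atomic-charge-line-p (line)
  ;; LINE is NIL at end of file; treat that as "not an atom line"
  ;; instead of passing NIL to WITH-INPUT-FROM-STRING.
  (and line
       (with-input-from-string (stream line)
         (integerp (ignore-errors (read stream))))))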

;; For the parsing functions, we could build structures instead of
;; returning lists.

(defun parse-atomic-charge-line (line)
  (with-input-from-string (stream line)
    (list (read stream) (read stream) (read stream) (read stream))))
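
To illustrate the remark about building structures instead of lists, here is one possible sketch. The structure name ATOMIC-CHARGE, its slot names, and the function PARSE-ATOMIC-CHARGE-RECORD are all made up for this example; the only thing taken from the thread is the order of the four fields on an atom line.

```lisp
;; Hypothetical structure; slot names are illustrative, not from the thread.
(defstruct atomic-charge
  index              ; atom number, e.g. 1
  element            ; atom type symbol, e.g. C
  charge             ; net atomic charge
  electron-density)  ; atom electron density

(defun parse-atomic-charge-record (line)
  "Read the four fields of an atom LINE into an ATOMIC-CHARGE structure."
  (let ((*read-eval* nil))
    (with-input-from-string (stream line)
      ;; Keyword arguments are evaluated left to right, so the four
      ;; READ calls consume the fields in file order.
      (make-atomic-charge :index (read stream)
                          :element (read stream)
                          :charge (read stream)
                          :electron-density (read stream)))))

(atomic-charge-charge (parse-atomic-charge-record "1 C 0.0699 3.9301"))
;; => 0.0699
```

Named accessors such as ATOMIC-CHARGE-CHARGE make later code easier to read than SECOND and THIRD on a plain list.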

(defun dipole-contribution-p (line)
  (and line (< 1 (length (string-trim " " line)))))

;; Be careful that the syntax of floating point numbers is not
;; universal. Lisp has its own syntax, and it is different from the syntax
;; issued by Fortran or C programs! So READ may not be suitable: you may
;; have to write your own scanner for those data items.

(defun parse-dipole-contribution-line (line)
  (with-input-from-string (stream line)
    (list (read stream) (read stream) (read stream) (read stream) (read stream))))
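
A rough sketch of what such a hand-written scanner could look like, using only PARSE-INTEGER and POSITION. PARSE-DECIMAL is a hypothetical name, and this version only covers plain decimals of the form [sign] digits [. digits]; it assumes well-formed input and does not handle exponents (E or D notation), which a real Fortran or C output file may well contain.

```lisp
;; PARSE-DECIMAL is a hypothetical helper; it assumes well-formed input.
(defun parse-decimal (string &key (start 0) (end (length string)))
  "Parse [sign] digits [. digits] in STRING between START and END.
Returns a double-float. Exponents are not handled."
  (let* ((sign (case (char string start)
                 (#\- (incf start) -1)
                 (#\+ (incf start) 1)
                 (t 1)))
         ;; Index of the decimal point, or END if there is none.
         (dot (or (position #\. string :start start :end end) end))
         (int-part (if (< start dot)
                       (parse-integer string :start start :end dot)
                       0))
         ;; Number of digits after the decimal point.
         (frac-len (max 0 (- end dot 1)))
         (frac-part (if (plusp frac-len)
                        (parse-integer string :start (1+ dot) :end end)
                        0)))
    ;; Build an exact rational first, then convert once to a double.
    (* sign (float (+ int-part (/ frac-part (expt 10 frac-len))) 1d0))))

(parse-decimal "-0.697")
(parse-decimal ".932")
```

Going through an exact rational keeps the conversion independent of *READ-DEFAULT-FLOAT-FORMAT*, which is one of the things that makes READ awkward for foreign number formats.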

(read-atomic-charges-and-dipole-contributions-file "/tmp/file.txt")
--> (((6 h 0.1809 0.8191)
      (5 h 0.0085 0.9915)
      (4 h 0.041 0.959)
      (3 h 0.0085 0.9915)
      (2 o -0.3088 6.3088)
      (1 c 0.0699 3.9301))
     ((sum -0.847 0.585 1.074 1.488)
      (hybrid -0.149 0.289 0.532 0.624)
      (point-chg. -0.697 0.295 0.542 0.932)))
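
To get from this result to the two variables asked for in the question, something along these lines might do. TYPES-AND-CHARGES is a hypothetical helper; it assumes the atom list has the shape shown in the result, with the last line of the file first because PUSH builds the list in reverse.

```lisp
;; Hypothetical helper: ATOMS is a list of (number type charge density)
;; entries in reverse file order, as in the result shown in this post.
(defun types-and-charges (atoms)
  "Return two values: the list of types and the list of charges, in file order."
  (let ((in-file-order (reverse atoms)))
    (values (mapcar #'second in-file-order)
            (mapcar #'third in-file-order))))

(multiple-value-bind (types charges)
    (types-and-charges '((2 o -0.3088 6.3088) (1 c 0.0699 3.9301)))
  (list types charges))
;; => ((C O) (0.0699 -0.3088))
```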

--
__Pascal Bourguignon__
http://www.informatimago.com/
"Le mercure monte ? C'est le moment d'acheter !"

Max Rottenkolber

Apr 7, 2014, 10:39:44 AM

I like to use MPC ( http://mr.gy/maintenance/mpc ) to parse all kinds of
stuff. It's not very efficient but very powerful. A possible solution
could look like this (this will parse two lists, atoms and dipoles, and
ignore lines that don't match the patterns):

(defpackage parse-test
  (:use :cl :mpc :mpc.numerals :mpc.characters))

(in-package :parse-test)

(defun =ignore-unless (unless)
  (=unless unless (=line)))

;; This is just for parsing floats, there are other/better ways to do it.
(defun =float ()
  (=let* ((s (=or (=character #\-)
                  (=character #\+)
                  (=result #\+)))
          (i (=natural-number))
          (f (=maybe (=and (=character #\.)
                           (=string-of (=digit))))))
    (=result
     (* (if f
            (float (+ i (/ (parse-integer f) (expt 10 (length f)))))
            i)
        (case s (#\+ 1) (#\- -1))))))

(defun =atom-line ()
  (=list (=skip-whitespace (=integer-number))
         (=skip-whitespace (=item))
         (=skip-whitespace (=float))
         (=skip-whitespace (=float))))

(defun =dipole-line ()
  (=list (=skip-whitespace (=string-of (=not (=whitespace))))
         (=skip-whitespace (=float))
         (=skip-whitespace (=float))
         (=skip-whitespace (=float))
         (=skip-whitespace (=float))))

(defun =file ()
  (=list (=and (=zero-or-more (=ignore-unless (=atom-line)))
               (=zero-or-more (=atom-line)))
         (=and (=zero-or-more (=ignore-unless (=dipole-line)))
               (=zero-or-more (=dipole-line)))))

;; Assuming #p"/tmp/test" is your example file.
(with-open-file (in #p"/tmp/test")
  (run (=file) in))

Jinsong Zhao

Apr 8, 2014, 2:19:41 AM

Thank you very much for the solution and the package. I am new to Lisp.
For now I just want to practice with the built-in features of Common Lisp
itself so that I can learn the language quickly. In fact, however, my
progress is very slow. :-(

Anyway, thanks a lot.

Regards,
Jinsong

Jinsong Zhao

Apr 8, 2014, 2:30:54 AM

It was my question, and you gave a detailed answer. Thanks a lot. I
didn't see the connection between it and this question when I posted
this one. Now I understand it better.

You always give me detailed answers and kind help. Now I know how to
deal with the text file. Although it's slow, I think I can do what I
want to do. I appreciate all your help.

I am going to parse a text file to extract some information, something I
used to do in Fortran, which is the only language I can use in practice.
So I think I was influenced by it when I tried to do the same thing in
Lisp. Your code gives me a new viewpoint on this kind of problem. Thanks!

Best wishes,
Jinsong

Max Rottenkolber

Apr 12, 2014, 8:35:18 PM

> Now I just want to practice with built-in feature of common lisp itself
> so that I can learn this language quickly.

Fair enough. But be warned that parsing anything beyond regular grammars
is a pretty hard problem. Parsing *efficiently* is hard even for regular
grammars.

Generally, you wouldn't want to solve non-trivial parsing problems
bare-handed.

Rob Warnock

Apr 12, 2014, 11:59:51 PM

Max Rottenkolber <m...@mr.gy> wrote:
+---------------

+---------------

True, but... For simpler problems, Common Lisp *does* have a number of
useful (and in some cases, underappreciated) built-in functions, e.g.:

POSITION
SEARCH
MISMATCH
PARSE-INTEGER
SUBSEQ
REPLACE
CONCATENATE

To get the most from these, you will need to read & understand
the sections in the CLHS about "bounding index designators"
and the :START and :END [and sometimes :START2 and :END2]
keyword arguments which nearly all sequence functions take.
Also learn about the :KEY and :TEST keyword arguments which,
again, nearly all sequence functions take. [Oh, and :FROM-END, too.]
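
As a small illustration of :START and :TEST on a line like the ones in this thread, here is a sketch that splits a line into fields using only POSITION and SUBSEQ. SPLIT-FIELDS is a hypothetical name, and it assumes fields are separated by space characters only (no tabs).

```lisp
;; SPLIT-FIELDS is a hypothetical helper; it treats only #\Space as a separator.
(defun split-fields (line)
  "Return the space-separated fields of LINE as a list of strings."
  (let ((fields '())
        (start 0))
    (loop
      ;; With :TEST #'CHAR/=, POSITION finds the first non-space at or after START.
      (let ((begin (position #\Space line :start start :test #'char/=)))
        (when (null begin)
          (return (nreverse fields)))
        ;; The field ends at the next space, or at the end of the line.
        (let ((end (or (position #\Space line :start begin)
                       (length line))))
          (push (subseq line begin end) fields)
          (setf start end))))))

(split-fields "SUM  -0.847 0.585")
;; => ("SUM" "-0.847" "0.585")

;; PARSE-INTEGER takes the same bounding keywords, so no SUBSEQ is needed:
(parse-integer "  6 H 0.1809" :start 2 :end 3)
;; => 6, 3
```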

In particular, MISMATCH is one of the more underappreciated
string-bashing functions in CL, since it actually tells you
how much *was* matched. ;-} Very useful [especially with the
:START2/:END2 options] to tell whether a (possibly-abbreviated)
fixed substring exists at some specific location in a string,
*without* having to do a SUBSEQ first to extract the portion
to be tested. [Avoids unnecessary consing.]
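
A short sketch of that use of MISMATCH, with :START2/:END2 bounding the region of the line to compare. KEYWORD-AT-P is a hypothetical helper name, not a standard function.

```lisp
;; KEYWORD-AT-P is a hypothetical helper built on the standard MISMATCH.
(defun keyword-at-p (keyword line &key (start 0))
  "True if KEYWORD occurs in LINE beginning exactly at index START."
  (let ((end (+ start (length keyword))))
    (and (<= end (length line))
         ;; MISMATCH returns NIL when the two bounded regions are equal.
         (null (mismatch keyword line :start2 start :end2 end)))))

(keyword-at-p "DIPOLE" "DIPOLE X Y Z TOTAL")            ; => T
(keyword-at-p "TOTAL" "DIPOLE X Y Z TOTAL" :start 13)   ; => T

;; The return value also says how much matched: here all 3 characters
;; of the abbreviation agree with the start of the line.
(mismatch "DIP" "DIPOLE X Y Z TOTAL")                   ; => 3
```

No substring is copied in any of these calls; the comparison works directly on the original line.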

-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <http://rpw3.org/>
San Mateo, CA 94403