
Demo on parsing text file, please


Jinsong Zhao

Apr 7, 2014, 2:25:39 AM
Hi there,

I am looking for a demo of parsing a text file in Common Lisp. Or, if
possible, would you please show me how to parse the following text:

;;;; start here ;;;;

NET ATOMIC CHARGES AND DIPOLE CONTRIBUTIONS

ATOM NO.   TYPE    CHARGE    ATOM ELECTRON DENSITY
    1       C      0.0699        3.9301
    2       O     -0.3088        6.3088
    3       H      0.0085        0.9915
    4       H      0.0410        0.9590
    5       H      0.0085        0.9915
    6       H      0.1809        0.8191
DIPOLE        X         Y         Z       TOTAL
POINT-CHG.  -0.697     0.295     0.542     0.932
HYBRID      -0.149     0.289     0.532     0.624
SUM         -0.847     0.585     1.074     1.488

;;;; end here ;;;;

I hope to assign "TYPE" and "CHARGE" to two variables, e.g., type and
charge.

I tried to read the text file using a loop, and I can find the proper
position. However, I don't know how to go on to parse the lines that follow.

(with-open-file (stream "file.txt")
  (do ((line (read-line stream nil)
             (read-line stream nil)))
      ((null line))
    (if (search "ATOM ELECTRON DENSITY" line) (;; lost here...
        ))))

Any suggestions will be appreciated.

Regards,
Jinsong

Pascal J. Bourguignon

Apr 7, 2014, 6:00:06 AM
Jinsong Zhao <jsz...@yeah.net> writes:

> Hi there,
>
> I am looking for a demo of parsing a text file in Common Lisp. Or, if
> possible, would you please show me how to parse the following
> text:
>
> ;;;; start here ;;;;
>
> NET ATOMIC CHARGES AND DIPOLE CONTRIBUTIONS
>
> ATOM NO. TYPE CHARGE ATOM ELECTRON DENSITY
> 1 C 0.0699 3.9301
> 2 O -0.3088 6.3088
> 3 H 0.0085 0.9915
> 4 H 0.0410 0.9590
> 5 H 0.0085 0.9915
> 6 H 0.1809 0.8191
> DIPOLE X Y Z TOTAL
> POINT-CHG. -0.697 0.295 0.542 0.932
> HYBRID -0.149 0.289 0.532 0.624
> SUM -0.847 0.585 1.074 1.488
>
> ;;;; end here ;;;;
>
> I hope to assign "TYPE" and "CHARGE" to two variables, e.g., type and
> charge.

It looks like you have fixed size records, so you could read

https://groups.google.com/forum/#!original/comp.lang.lisp/S2aqG-UPhe8/MTIYx8VNArgJ

adding some code to detect what kind of record you're going to read.


On the other hand, if the data is formatted in a subset of the Lisp
syntax, you could also just read it using the Lisp reader (easily enough
if there's no optional data).
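
As a minimal sketch of that second option: the function below reads every token on one line with the Lisp reader. The name READ-FIELDS is hypothetical (it is not from this thread), and binding *READ-EVAL* to NIL is an added precaution for files you do not fully trust. It assumes every field happens to be valid Lisp syntax for an integer, a float or a symbol.

```lisp
;; Sketch only: READ-FIELDS is a hypothetical helper name.
(defun read-fields (line)
  "Return all objects the Lisp reader finds on LINE, as a list."
  (let ((*read-eval* nil))              ; refuse #. read-time evaluation
    (with-input-from-string (in line)
      ;; The stream object itself serves as a unique end-of-input marker.
      (loop for object = (read in nil in)
            until (eq object in)
            collect object))))
```

Called on an atom line such as "2 O -0.3088 6.3088" it returns a list of an integer, a symbol and two floats; a blank line gives the empty list.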


> I tried to read the text file using a loop, and I can find the proper
> position. However, I don't know how to go on to parse the lines that
> follow.
>
> (with-open-file (stream "file.txt")
>   (do ((line (read-line stream nil)
>              (read-line stream nil)))
>       ((null line))
>     (if (search "ATOM ELECTRON DENSITY" line) (;; lost here...
>         ))))
>
> Any suggestions will be appreciated.

https://groups.google.com/forum/#!original/comp.lang.lisp/gw1t5lTvu1k/CXQ-UCxZtDwJ


What you need to do is to analyse the structure of your file and give a
description of it. For example, you could come up with this grammar:

file                  ::= header atom-table dipole-table trailer .
header                ::= { empty-line } 'NET' 'ATOMIC' 'CHARGES' 'AND' 'DIPOLE' 'CONTRIBUTIONS' { empty-line } .
atom-table            ::= atom-table-header { atom-table-line } .
atom-table-header     ::= 'ATOM' 'NO.' 'TYPE' 'CHARGE' 'ATOM' 'ELECTRON' 'DENSITY' .
atom-table-line       ::= atom-number type charge atom-electron-density .
atom-number           ::= integer .
type                  ::= symbol .
charge                ::= floating-point-number .
atom-electron-density ::= floating-point-number .
dipole-table          ::= dipole-table-header { dipole-table-line } .
dipole-table-header   ::= 'DIPOLE' 'X' 'Y' 'Z' 'TOTAL' .
dipole-table-line     ::= dipole-title x y z total .
dipole-title          ::= symbol .
x                     ::= floating-point-number .
y                     ::= floating-point-number .
z                     ::= floating-point-number .
total                 ::= floating-point-number .
trailer               ::= { empty-line } .


Then you can write a parser for it.

But the important thing here is that you have a data structure that is
composed of a sequence of two different repetitions.


        /
        | file-header
        |
        | atom-header
        |
        |           /
        |           |
        |           | atom-no
        |           |
        | atom     <  type
        |           |
        |           | charge
        |           |
        |           | electron-density
 file  <            \
        |
        | dipole-header
        |
        |           /
        |           |
        |           | title
        |           |
        |           | x
        | dipole   <
        |           | y
        |           |
        |           | z
        |           |
        |           | total
        \           \


That means that the program to process this file will consist of a
sequence of two loops:


(progn
  (read-file-header)
  (read-atom-header)
  (loop
    :named read-atom-lines
    :do …)
  (read-dipole-header)
  (loop
    :named read-dipole-lines
    :do …))


So you're starting on the wrong foot by writing a single outer loop:
the file structure is NOT a repetition of things, it's a sequence of
different things!

Actually, we can skip the read-dipole-header phase: the dipole header
is a single line, and the read-atom-lines loop already consumes it as
the first line that is not an atom line (if we had to process it, we
could have this loop return it or save it for further processing).


(defun read-atomic-charges-and-dipole-contributions-file (path)
  (let ((atomic-charges '())
        (dipole-contributions '()))
    (with-open-file (stream path)
      (read-file-header stream)
      (read-atom-header stream)
      (loop
        :named read-atom-lines
        :for line = (read-line stream nil nil)
        :while (atomic-charge-line-p line)
        :do (push (parse-atomic-charge-line line) atomic-charges))
      (loop
        :named read-dipole-lines
        :for line = (read-line stream nil nil)
        :while (dipole-contribution-p line)
        :do (push (parse-dipole-contribution-line line) dipole-contributions)))
    (list atomic-charges dipole-contributions)))



;; Here we just skip a fixed number of lines, without any check.  You
;; could also parse them, cf. the BNF above.

(defun read-file-header (stream)
  (read-line stream nil)
  (read-line stream nil)
  (read-line stream nil))

(defun read-atom-header (stream)
  (read-line stream nil))

;; Similarly, the detection of the type of a line is primitive, but
;; sufficient if we assume the input file is always correct.

(defun atomic-charge-line-p (line)
  ;; LINE is NIL at end of file; treat that as "not an atom line".
  (and line
       (with-input-from-string (stream line)
         (integerp (ignore-errors (read stream))))))

;; For the parsing functions, we could build structures instead of
;; returning lists.

(defun parse-atomic-charge-line (line)
  (with-input-from-string (stream line)
    (list (read stream) (read stream) (read stream) (read stream))))

(defun dipole-contribution-p (line)
  (and line (< 1 (length (string-trim " " line)))))

;; Be careful: the syntax of floating point numbers is not universal.
;; Lisp has its own syntax, and it is different from the syntax
;; produced by Fortran or C programs!  So READ may not be suitable: you
;; may have to write your own scanner for those data items.

(defun parse-dipole-contribution-line (line)
  (with-input-from-string (stream line)
    (list (read stream) (read stream) (read stream) (read stream) (read stream))))


(read-atomic-charges-and-dipole-contributions-file "/tmp/file.txt")
--> (((6 h 0.1809 0.8191)
      (5 h 0.0085 0.9915)
      (4 h 0.041 0.959)
      (3 h 0.0085 0.9915)
      (2 o -0.3088 6.3088)
      (1 c 0.0699 3.9301))
     ((sum -0.847 0.585 1.074 1.488)
      (hybrid -0.149 0.289 0.532 0.624)
      (point-chg. -0.697 0.295 0.542 0.932)))
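
Two remarks in the comments above (build structures rather than lists, and write your own scanner rather than trust READ) can be combined in one sketch. One concrete difference in number syntax: in Common Lisp the token "1." reads as the integer 1, not as a float. All names below (ATOM-CHARGE, SPLIT-FIELDS, PARSE-DECIMAL, PARSE-ATOM-CHARGE-LINE) are hypothetical, not part of the code above. PARSE-DECIMAL is an assumption-laden scanner that only accepts an optional sign, at least one digit, and an optional fraction; it has no exponent support.

```lisp
;; Sketch only; every name here is a hypothetical illustration.
(defstruct atom-charge
  index element charge density)

(defun blank-p (char)
  (member char '(#\Space #\Tab)))

(defun split-fields (line)
  "Return the whitespace-separated fields of LINE as a list of strings."
  (let ((fields '())
        (start 0)
        (end (length line)))
    (loop
      ;; Skip blanks up to the start of the next field, if any.
      (setf start (position-if (complement #'blank-p) line :start start))
      (when (null start)
        (return (nreverse fields)))
      (let ((stop (or (position-if #'blank-p line :start start) end)))
        (push (subseq line start stop) fields)
        (setf start stop)))))

(defun parse-decimal (string)
  "Parse [sign]digits[.digits] into a double-float.  No exponents."
  (let* ((sign (if (char= (char string 0) #\-) -1 1))
         (start (if (find (char string 0) "+-") 1 0))
         (dot (position #\. string))
         (int-part (parse-integer string :start start :end dot))
         (frac-digits (if dot (- (length string) dot 1) 0))
         (frac-part (if (plusp frac-digits)
                        (parse-integer string :start (1+ dot))
                        0)))
    ;; Build an exact rational first, then convert once to a float.
    (* sign
       (float (+ int-part (/ frac-part (expt 10 frac-digits))) 1d0))))

(defun parse-atom-charge-line (line)
  "Parse one atom line into an ATOM-CHARGE structure."
  (destructuring-bind (index element charge density) (split-fields line)
    (make-atom-charge :index (parse-integer index)
                      :element element
                      :charge (parse-decimal charge)
                      :density (parse-decimal density))))
```

With this variant the element type stays a string, so nothing is interned, and (atom-charge-charge (parse-atom-charge-line line)) gives the charge of an atom line as a double-float.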


--
__Pascal Bourguignon__
http://www.informatimago.com/
"Le mercure monte ? C'est le moment d'acheter !"

Max Rottenkolber

Apr 7, 2014, 10:39:44 AM
I like to use MPC ( http://mr.gy/maintenance/mpc ) to parse all kinds of
stuff. It's not very efficient but very powerful. A possible solution
could look like this (this will parse two lists, atoms and dipoles, and
ignore lines that don't match the patterns):

(defpackage parse-test
  (:use :cl :mpc :mpc.numerals :mpc.characters))

(in-package :parse-test)

(defun =ignore-unless (unless)
  (=unless unless (=line)))

;; This is just for parsing floats, there are other/better ways to do it.
(defun =float ()
  (=let* ((s (=or (=character #\-)
                  (=character #\+)
                  (=result #\+)))
          (i (=natural-number))
          (f (=maybe (=and (=character #\.)
                           (=string-of (=digit))))))
    (=result
     (* (if f
            (float (+ i (/ (parse-integer f) (expt 10 (length f)))))
            i)
        (case s (#\+ 1) (#\- -1))))))

(defun =atom-line ()
  (=list (=skip-whitespace (=integer-number))
         (=skip-whitespace (=item))
         (=skip-whitespace (=float))
         (=skip-whitespace (=float))))

(defun =dipole-line ()
  (=list (=skip-whitespace (=string-of (=not (=whitespace))))
         (=skip-whitespace (=float))
         (=skip-whitespace (=float))
         (=skip-whitespace (=float))
         (=skip-whitespace (=float))))

(defun =file ()
  (=list (=and (=zero-or-more (=ignore-unless (=atom-line)))
               (=zero-or-more (=atom-line)))
         (=and (=zero-or-more (=ignore-unless (=dipole-line)))
               (=zero-or-more (=dipole-line)))))

;; Assuming #p"/tmp/test" is your example file.
(with-open-file (in #p"/tmp/test")
  (run (=file) in))

Jinsong Zhao

Apr 8, 2014, 2:19:41 AM
Thank you very much for the solution and the package. I am new to Lisp.
For now I just want to practice with the built-in features of Common Lisp
itself so that I can learn the language quickly. In fact, however, my
progress is very slow. :-(

Anyway, thanks a lot.

Regards,
Jinsong

Jinsong Zhao

Apr 8, 2014, 2:30:54 AM
That was my own question, and you gave a detailed answer to it. Thanks a
lot. I did not see the connection between it and this question when I
posted; now I understand it better.
You always give me detailed answers and kind help. Now I know how to
deal with the text file. Although my progress is slow, I think I can do
what I want to do. I appreciate all your help.

I am going to parse a text file to obtain some information, a task I used
to do in Fortran, which is the only language I can use in practice. So I
think I was influenced by Fortran when I tried to do the same thing in
Lisp. Your code gives me a new viewpoint on such things. Thanks!

Best wishes,
Jinsong

Max Rottenkolber

Apr 12, 2014, 8:35:18 PM
> Now I just want to practice with built-in feature of common lisp itself
> so that I can learn this language quickly.

Fair enough. But be warned that parsing anything beyond regular grammars
is a pretty hard problem, and parsing *efficiently* is hard even for
regular grammars.

Generally, you wouldn't want to solve non-trivial parsing problems
bare-handed.

Rob Warnock

Apr 12, 2014, 11:59:51 PM
Max Rottenkolber <m...@mr.gy> wrote:
+---------------
+---------------

True, but... For simpler problems, Common Lisp *does* have a number of
useful (and in some cases, underappreciated) built-in functions, e.g.:

POSITION
SEARCH
MISMATCH
PARSE-INTEGER
SUBSEQ
REPLACE
CONCATENATE

To get the most from these, you will need to read & understand
the sections in the CLHS about "bounding index designators"
and the :START and :END [and sometimes :START2 and :END2]
keyword arguments which nearly all sequence functions take.
Also learn about the :KEY and :TEST keyword arguments which,
again, nearly all sequence functions take. [Oh, and :FROM-END, too.]
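
A minimal sketch of what those keywords buy you, assuming a well-formed atom line from the file in this thread. The function names ATOM-NUMBER-AND-TYPE and LAST-FIELD are hypothetical. The first uses the second return value of PARSE-INTEGER as the :START for the next search; the second uses :FROM-END and :END to pick out the last field without scanning from the left.

```lisp
;; Sketch only; both function names are hypothetical.
(defun atom-number-and-type (line)
  "Return the leading integer of LINE and the field following it."
  (multiple-value-bind (number pos)
      ;; POS is the index where integer parsing stopped.
      (parse-integer line :junk-allowed t)
    (let* ((start (position #\Space line :start pos :test #'char/=))
           (end (or (position #\Space line :start start)
                    (length line))))
      (values number (subseq line start end)))))

(defun last-field (line)
  "Return the last space-delimited field of LINE."
  (let* (;; One past the last non-space character.
         (end (1+ (position #\Space line :from-end t :test #'char/=)))
         ;; One past the last space before that field, or 0 if none.
         (start (1+ (or (position #\Space line :from-end t :end end)
                        -1))))
    (subseq line start end)))
```

Both functions signal an error on lines that do not have the assumed shape (for example a blank line), so a real program would check the line type first.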

In particular, MISMATCH is one of the more underappreciated
string-bashing functions in CL, since it actually tells you
how much *was* matched. ;-} Very useful [especially with the
:START2/:END2 options] to tell whether a (possibly-abbreviated)
fixed substring exists at some specific location in a string,
*without* having to do a SUBSEQ first to extract the portion
to be tested. [Avoids unnecessary consing.]
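
A small sketch of that use of MISMATCH, with hypothetical function names KEYWORD-AT-P and MATCHED-LENGTH. MISMATCH returns NIL when the two bounded regions are identical, and otherwise the index in the first sequence where they stop agreeing, which is exactly the number of characters matched.

```lisp
;; Sketch only; both function names are hypothetical.
(defun keyword-at-p (keyword line &key (start 0))
  "True if KEYWORD occurs in LINE beginning exactly at index START."
  (and (<= start (length line))
       (let ((end (min (length line) (+ start (length keyword)))))
         ;; Compare KEYWORD against LINE[START, END) in place, no SUBSEQ.
         (null (mismatch keyword line :start2 start :end2 end)))))

(defun matched-length (keyword line &key (start 0))
  "Number of leading characters of KEYWORD found in LINE at START."
  (or (mismatch keyword line :start2 start :test #'char-equal)
      (length keyword)))
```

For the header line of the atom table, (keyword-at-p "DENSITY" "ATOM ELECTRON DENSITY" :start 14) is true, and MATCHED-LENGTH lets a caller accept an abbreviation by requiring, say, at least three matched characters.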


-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <http://rpw3.org/>
San Mateo, CA 94403