Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Derive file CSV file parser based on eBNF

162 views
Skip to first unread message

Mirko Vukovic

unread,
Aug 19, 2016, 9:24:54 AM8/19/16
to
Hello,

I want to write a file parser based on a BNF specification.
Right now it is CSV files, but eventually I will also do it for
Excel spreadsheets (using Java POI library + ABCL)

My head is spinning a little from reading about lexers and parsers,
and libraries mentioned on Cliki.net

I foresee the following flow:
0. Define data structure that will store file contents
1. Use a CSV reader to read a line and return a list of fields
2. Pass that to ??? to do ???.
3. Obtain a data structure (tree?) with CSV file contents.

I could manually write code for step 2. But for many cases where
the EBNF specification is easy to derive, I would like to automate
the process, and learn a bit about lexers and parsers (I just finished
perusing my daughter's undergraduate textbook on Computer Science
and found it fascinating).

Thanks,

Mirko

Dimitri Fontaine

unread,
Aug 19, 2016, 3:23:10 PM8/19/16
to
Mirko Vukovic <mirko....@gmail.com> writes:
> I want to write a file parser based on a BNF specification.

Have a look at

https://github.com/dimitri/cl-abnf

> Right now it is CSV files, but eventually I will also do it for
> Excel spreadsheets (using Java POI library + ABCL)

Have a look at

https://github.com/AccelerationNet/cl-csv
https://github.com/dimitri/pgloader

> My head is spinning a little from reading about lexers and parsers,
> and libraries mentioned on Cliki.net
>
> I foresee the following flow:
> 0. Define data structure that will store file contents

Do you need to manage cases where the file content won't fit in memory?

Regards,
--
dim

Mirko Vukovic

unread,
Aug 19, 2016, 5:07:30 PM8/19/16
to
On Friday, August 19, 2016 at 3:23:10 PM UTC-4, Dimitri Fontaine wrote:
The files to fit in memory.

Mirko

Pascal J. Bourguignon

unread,
Aug 20, 2016, 2:54:08 PM8/20/16
to
Mirko Vukovic <mirko....@gmail.com> writes:

> Hello,
>
> I want to write a file parser based on a BNF specification.

Good.

What's your BNF specification?


> Right now it is CSV files,

As I said (and as Stefan mentionned), what's your BNF specification?


> but eventually I will also do it for
> Excel spreadsheets (using Java POI library + ABCL)
>
> My head is spinning a little from reading about lexers and parsers,
> and libraries mentioned on Cliki.net
>
> I foresee the following flow:
> 0. Define data structure that will store file contents

Like, a file? Accessed from Common Lisp thru a STREAM object?



> 1. Use a CSV reader to read a line and return a list of fields

Well, already you fail here.


Of course, I could just defer to Stefan (CSV is a can of worms), but one
fundamental feature is that CSV allows for multi-line values!

So you don't read a line, you PARSE a CSV record!


> 2. Pass that to ??? to do ???.

The parsing theory is that you read your input (a stream of characters
or bytes), and perform a lexical analysis, providing a stream of tokens.

Then you pass this stream of tokens to a parser, and you get a parse
tree.


> 3. Obtain a data structure (tree?) with CSV file contents.
>
> I could manually write code for step 2. But for many cases where
> the EBNF specification is easy to derive, I would like to automate
> the process, and learn a bit about lexers and parsers (I just finished
> perusing my daughter's undergraduate textbook on Computer Science
> and found it fascinating).

Yes, parsing is fascinating.

On the other hand, as often with ill-defined languages (including CSV),
the lexical analysis is not trivial either!


Have fun.

https://gitlab.com/com-informatimago/com-informatimago/tree/master/common-lisp/csv

--
__Pascal Bourguignon__ http://www.informatimago.com/
“The factory of the future will have only two employees, a man and a
dog. The man will be there to feed the dog. The dog will be there to
keep the man from touching the equipment.” -- Carl Bass CEO Autodesk

Mirko Vukovic

unread,
Aug 21, 2016, 6:14:30 PM8/21/16
to
On Saturday, August 20, 2016 at 2:54:08 PM UTC-4, informatimago wrote:
> Mirko Vukovic writes:
>
> > Hello,
> >
> > I want to write a file parser based on a BNF specification.
>
> Good.
>
> What's your BNF specification?

I am including the BNF that I derived from the files, followed by an example CSV


eBNF specification (in Emacs' Org & ebnf modes), not 100% complete

#+BEGIN_SRC ebnf
FILE = FILE-PREAMBLE, DATA+
DATA = DATA-PREAMBLE, STATISTICS, SCAN-DATA, BLANK-LINE+
#+END_SRC

General definitions
#+BEGIN_SRC ebnf
DATE+TIME = MONTH, '/', DAY, '/', YEAR, ' ',
HOUR, ':', MINUTE, ' ', 'AM' | 'PM'
MONTH = %d1-12
DAY = %d1-31
YEAR = %d2000-3000
HOUR = %d0-24
MINUTE = %d0-60
SITE-VALUE-MAYBE = SITE-VALUE-N/A | REAL-NUMBER
SITE-VALUE-N/A = 'site ', INTEGER, 'n/a'
#+END_SRC

#+BEGIN_SRC ebnf
FILE-PREAMBLE = CASETTE, COLLECTION-DATE/TIME, PROCESS-DATE/TIME,
EQUIPMENT, OPERATOR, SHIFT, MACHINE-TYPE, BLANK-LINE+
CASSETTE = 'Cassette:', literal
COLLECTION-DATE/TIME = 'COLLECTION DATE/TIME:', DATE+TIME
PROCESS-DATE/TIME = 'PROCESS DATE/TIME:', DATE+TIME
EQUIPMENT = 'EQUIPMENT', [literal]
OPERATOR = 'OPERATOR', [literal]
SHIFT = 'SHIFT:', [literal]
MACHINE-TYPE = 'MACHINE TYPE:', literal
#+END_SRC

#+BEGIN_SRC ebnf
DATA-PREAMBLE = WAFER-ID, LOT-ID, CASSETTE, SLOT, STATUS,
DATA-TYPE, RECIPE, RCP-CNT, MEAS-STE, MATERIAL,
SITE
WAFER-ID = 'WAFER-ID', literal
LOT-ID = 'LOT ID', literal
CASSETTE = 'Cassette', literal
SLOT = 'SLOT', literal
STATUS = 'STATUS', literal
DATA-TYPE = 'DATA TYPE', literal
RECIPE = 'RECIPE', literal
RCP-CNT = 'RCP CNT', literal
MEAS-SET = 'MEAS SET', literal
MATERIAL = 'MATERIAL', literal
SITE = 'SITE', literal
#+END_SRC

#+BEGIN_SRC ebnf
(* Defines a table of N+2 columns where N is the number of films At
parse time, N is unknown. It can be deduced after reading the table
header (STATISTICS-HEADER & FILM-THICKNESS-COLUMN-HEADER), assigning
the first N columns to the films, and the last two to RI and GOF.

BNF cannot handle that as far as I know *)
STATISCS = STATISTICS-HEADER, MEAN, MIN, MAX, %STDDEV, STDDEV,
3%STDDEV, 3STDDEV, RANGE
STATISTICS-HEADER = FILM-THICKNESSES-COLUMN-HEADER, '2nd RI @ 633.0nm'
'GOF'
FILM-THICKNESS-COLUMN-HEADER = literal+
MEAN = NUMBER-MAYBE+
MIN = NUMBER-MAYBE+
MAX = NUMBER-MAYBE+
%STDDEV = NUMBER-MAYBE+
STDDEV = NUMBER-MAYBE+
3%STDDEV = NUMBER-MAYBE+
3STDDEV = NUMBER-MAYBE+
RANGE = NUMBER-MAYBE+
#+END_SRC

#+BEGIN_SRC ebnf
(* Defines a table of N+4 columns where N is the number of films At
parse time, N is unknown at file parse time.

It can be deduced after reading the statistics table (see comment
therein) or the SCAN-DATA-HEADER-LINE header (STATISTICS-HEADER &
FILM-THICKNESS-COLUMN-HEADER), assigning the first N columns to the
films, and the last two to RI and GOF.

BNF cannot handle that as far as I know *)
SCAN-DATA = SCAN-DATA-HEADER-LINE, DATA-LINE+
SCAN-DATA-HEADER-LINE = 'Site #', 'Value'+, [X], [Y]
DATA-LINE = INDEX, THICKNESS-MAYBE+, RI, GOF, [X], [Y]
THICKNESS = "site ", SITE-INDEX, " n/a", REAL-NUMBER
SITE-INDEX = %d1-99999
#+END_SRC


The CSV file (in table format):

| Cassette: | TELOxHfO2OxBLKT | | | | | | |
| COLLECTION DATE/TIME: | 9/14/2015 15:40 | | | | | | |
| PROCESS DATE/TIME: | 9/14/2015 15:40 | | | | | | |
| EQUIPMENT: | | | | | | | |
| OPERATOR: | | | | | | | |
| SHIFT: | | | | | | | |
| MACHINE TYPE: | SpectraFx 100 | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| WAFER ID | 9/14/2015 15:41 | | | | | | |
| LOT ID | THICKNESS | | | | | | |
| Cassette | Left | | | | | | |
| SLOT | 1 | | | | | | |
| STATUS | | | | | | | |
| DATA TYPE | SE | | | | | | |
| RECIPE | BLKT | | | | | | |
| RCP CNT | 1 | | | | | | |
| MEAS SET | Film Stack | | | | | | |
| MATERIAL | ! Oxide | | | | | | |
| SITE | 1 | | | | | | |
| RESULT TYPE | 3rd Thickness | 2nd Thickness | 1st Thickness | 3rd RI @ 633.0nm | GOF | | |
| MEAN | 871.6986 | 21.761 | 443.9889 | 1.3301 | 0.6179 | | |
| MIN | 412.4353 | 0 | 419.6866 | 0.9571 | 0 | | |
| MAX | 1638.1409 | 43.1082 | 485.2459 | 1.5477 | 0.9933 | | |
| % STDDEV | 66.5296 | 77.0872 | 7.1018 | 17.0624 | 78.1924 | | |
| STDDEV | 579.9373 | 16.7749 | 31.5312 | 0.2269 | 0.4832 | | |
| 3 % STDDEV | 199.5887 | 231.2615 | 21.3054 | 51.1873 | 234.5771 | | |
| 3 STDDEV | 1739.8118 | 50.3248 | 94.5937 | 0.6808 | 1.4495 | | |
| RANGE | 1225.7055 | 43.1082 | 65.5593 | 0.5905 | 0.9933 | | |
| Site # | Value | Value | Value | Value | Value | X | Y |
| 1 | site 1 n/a | site 1 n/a | site 1 n/a | site 1 n/a | site 1 n/a | -141.0544 | -130.2041 |
| 2 | site 2 n/a | site 2 n/a | site 2 n/a | site 2 n/a | site 2 n/a | -141.0544 | -97.6531 |
| 3 | 1343.9272 | site 3 n/a | site 3 n/a | 0.9571 | 0 | -141.0544 | -65.102 |
| 4 | 413.8047 | 33.8049 | 420.047 | 1.5025 | 0.9932 | -141.0544 | -32.551 |
| 5 | 1601.4456 | 0 | 485.228 | 1.0437 | 0 | -141.0544 | 0 |
| | | | | | | | |
| WAFER ID | 9/14/2015 15:41 | | | | | | |
| LOT ID | THICKNESS | | | | | | |
| Cassette | Left | | | | | | |
| SLOT | 2 | | | | | | |
| STATUS | | | | | | | |
| DATA TYPE | SE | | | | | | |
| RECIPE | BLKT | | | | | | |
| RCP CNT | 1 | | | | | | |
| MEAS SET | Film Stack | | | | | | |
| MATERIAL | ! Oxide | | | | | | |
| SITE | 1 | | | | | | |
| RESULT TYPE | 3rd Thickness | 2nd Thickness | 1st Thickness | 3rd RI @ 633.0nm | GOF | | |
| MEAN | 871.4749 | 21.3048 | 444.296 | 1.3274 | 0.6107 | | |
| MIN | 412.3054 | 0 | 418.5304 | 0.9331 | 0 | | |
| MAX | 1638.9301 | 36.2416 | 485.3701 | 1.541 | 0.9933 | | |
| % STDDEV | 66.6998 | 78.2225 | 7.1253 | 17.1506 | 79.161 | | |
| STDDEV | 581.2721 | 16.6651 | 31.6573 | 0.2277 | 0.4834 | | |
| 3 % STDDEV | 200.0994 | 234.6676 | 21.3758 | 51.4519 | 237.4829 | | |
| 3 STDDEV | 1743.8162 | 49.9954 | 94.9719 | 0.683 | 1.4503 | | |
| RANGE | 1226.6247 | 36.2416 | 66.8397 | 0.6078 | 0.9933 | | |
| Site # | Value | Value | Value | Value | Value | X | Y |
| 1 | site 1 n/a | site 1 n/a | site 1 n/a | site 1 n/a | site 1 n/a | -141.0544 | -130.2041 |
| 2 | site 2 n/a | site 2 n/a | site 2 n/a | site 2 n/a | site 2 n/a | -141.0544 | -97.6531 |
| 3 | 1212.4539 | site 3 n/a | site 3 n/a | 0.9331 | 0 | -141.0544 | -65.102 |
| 4 | 413.421 | 33.7844 | 420.0446 | 1.5024 | 0.9932 | -141.0544 | -32.551 |
| 5 | 1603.8169 | 0 | 485.2838 | 1.0436 | 0 | -141.0544 | 0 |


Mirko

Pascal J. Bourguignon

unread,
Aug 21, 2016, 6:37:10 PM8/21/16
to
Mirko Vukovic <mirko....@gmail.com> writes:

> On Saturday, August 20, 2016 at 2:54:08 PM UTC-4, informatimago wrote:
>> Mirko Vukovic writes:
>>
>> > Hello,
>> >
>> > I want to write a file parser based on a BNF specification.
>>
>> Good.
>>
>> What's your BNF specification?
>
> I am including the BNF that I derived from the files, followed by an example CSV
>
>
> eBNF specification (in Emacs' Org & ebnf modes), not 100% complete
> […]

Now add parentheses, give it to some parser generator and parse your
files!

Mirko Vukovic

unread,
Sep 20, 2016, 9:37:23 PM9/20/16
to
On Sunday, August 21, 2016 at 6:37:10 PM UTC-4, informatimago wrote:
> Mirko Vukovic writes:
>
> > On Saturday, August 20, 2016 at 2:54:08 PM UTC-4, informatimago wrote:
> >> Mirko Vukovic writes:
> >>
> >> > Hello,
> >> >
> >> > I want to write a file parser based on a BNF specification.
> >>
> >> Good.
> >>
> >> What's your BNF specification?
> >
> > I am including the BNF that I derived from the files, followed by an example CSV
> >
> >
> > eBNF specification (in Emacs' Org & ebnf modes), not 100% complete
> > […]
>
> Now add parentheses, give it to some parser generator and parse your
> files!
>
>
>
>
> --
> __Pascal Bourguignon__ http://www.informatimago.com/
> “The factory of the future will have only two employees, a man and a
> dog. The man will be there to feed the dog. The dog will be there to
> keep the man from touching the equipment.” -- Carl Bass CEO Autodesk

To follow up.

I used cl-yacc to define a parser and cl-csv to read the csv-file:

(define-parser *csv-parser*
(:start-symbol all)
;; terminals must match those defined in CSV-FILE-LEXER
(:terminals (:dictionary :empty-line :stat-table-header :stat-data :site-table-header :site-data))

;; utilities
(dictionary-block (dictionary-entry dictionary-block #'cons-w-list))
(dictionary-block (dictionary-entry))
(dictionary-entry (:dictionary #'declaration))
(empty-line :empty-line)


;; Top-level file structure
(all (file-header empty-line empty-line data-blocks))

;; File header has several dictionary entries that are unique to the
;; file
(file-header dictionary-block)

;; Define multiple data blocks
(data-blocks (data-block data-blocks #'cons-w-list))
(data-blocks (data-block))

;; Data block begin with a single empty line, a header, statistics
;; data and site data
(data-block (empty-line data-block-header stat-table site-table))

;; Data block header has a dictionary. Entries are re-defined
;; for each data-block
(data-block-header dictionary-block)

;; Statistics is a table with header line and multiple data lines
(stat-table (stat-table-header stat-table-data #'cons-w-list))
(stat-table-header :stat-table-header)
(stat-table-data (stat-table-row stat-table-data #'cons-w-list))
(stat-table-data (stat-table-row))
(stat-table-row :stat-data)

;; Site data is a table with a single header line and multiple data
;; lines that end with an empty line
(site-table (site-table-header site-table-data #'cons-w-list))
(site-table-header :site-table-header)
(site-table-data (site-table-row site-table-data #'cons-w-list))
(site-table-data (site-table-row))
(site-table-row (:site-data #'normalize-site-data)))

The lexer uses cl-csv and some auxiliary functions to normalize the read data

(defun csv-file-lexer (row-list)
"Create a lexer that returns either the row contents as lists, one item/field

If the line is empty, the list contains a single element, :empty-line keyword
Other lines are prefixed by their type"
(lambda ()
(let ((csv-row (pop row-list)))
(if csv-row
(destructuring-bind (head &rest rest) csv-row
(declare (ignore rest))
(cond
((typep head 'integer) (values :site-data csv-row))
((equal head :result-type) (values :stat-table-header csv-row))
((equal head :site-#) (values :site-table-header csv-row))
((member head '(:mean :min :max :%-stddev :stddev :3-%-stddev :3-stddev :range))
(values :stat-data csv-row))
((equal head :empty-line) (values :empty-line nil))
((and (typep head 'symbol)
(equal "KEYWORD" (package-name (symbol-package head))))
(values :dictionary csv-row))
(t (error "Unrecognize row type"))))
(values nil nil)))))

0 new messages