[Apologies for the several week reply delay. Got started on a reply,
then got interrupted, then didn't get back to it until now...]
<kint...@gmail.com> wrote:
+---------------
| Thanks for your great post, Rob.
+---------------
You're welcome.
+---------------
| Some questions related to the process described by the code you included.
| Under that scheme, do the grouping constructs, for example ``(`` and
| ``)`` and ``[`` and ``]`` and ``{`` and ``}``, each get parsed
| as an operator?
+---------------
Yes, though the opening brackets are given a *very high* priority,
whereas the closing brackets are given a *very low* priority, as
are things like "," and ";" and so on. Such low-priority operators
are sometimes collectively called "stoppers", since they end
[or "stop"] the parsing of an expression [or subexpression].
[Note: EOF is also considered a "stopper" operator, of the
lowest possible priority.]
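To make the priority scheme concrete, here is a small sketch in Python
(the table entries and the numbers are my own illustration, *not* the
actual BLISS values):

```python
# Hypothetical priority table: opening brackets bind very high,
# closing brackets and other "stoppers" very low, EOF lowest of all.
PRIORITY = {
    "(": 100, "[": 100, "{": 100,    # openers: very high priority
    "*": 50, "/": 50,                # ordinary infix operators
    "+": 40, "-": 40,
    ")": 0, "]": 0, "}": 0,          # closers: "stoppers"
    ",": 0, ";": 0,                  # likewise "," and ";"
    "EOF": -1,                       # EOF: lowest possible priority
}

def is_stopper(op):
    # A "stopper" ends the parsing of an expression or subexpression.
    return PRIORITY[op] <= 0
```

Note that any operator at or below priority 0 ends the current
(sub)expression, which is all a "stopper" really is.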
+---------------
| In my opinion, btw, putting operator precedence activity too early in
| the parse chain is one of the biggest reasons for mediocre parser
| implementations with lots of problems.
+---------------
In this case there's no "early" or "late" to it: the simple operator
precedence parsing is *completely* interwoven with the recursive
descent parsing -- both are done during the same phase. This hybrid
approach permits the recursive descent parser to be *much* simpler,
since you don't have to have the traditional layers and layers of
intermediate parsing routines whose only purpose is to resolve the
operator precedence -- which is one of the reasons hand-coding "pure"
recursive descent parsers is usually such a pain!! ;-}
+---------------
| You also mentioned something like "a token might be a parse tree",
| isn't that putting the cart before the horse? How can a tokenizer
| receive a tree representation of prior parsing?
+---------------
Actually, I believe that what I said [in a footnote] was:
Note: In the BLISS parsers, a lexeme can be either a terminal
value/token [literal, identifier, delimiter, etc.] or the
result of a reduce operation representing a value -- in which
case it is called a "graph table lexeme", and points to a node
in the parse tree being built.
The "tokenizer" [the LEX phase, in BLISS-11] doesn't "receive a tree
representation of prior parsing". Rather, the tokenizer delivers
[via the RUND() function, see previous article] "lexemes" *to*
the parser [the SYN phase, in BLISS-11]:
Lexemes are the smallest units of program used by the rest
of the compiler, and consist of the internal representations
of identifiers, reserved words, special characters, and
literal values.
-- From Section IV.1.1 of "The Design of an Optimizing Compiler",
Wulf et al. (1975, Elsevier).
Identifier & literal lexemes are kinds of "values", as opposed
to "operators". For values resulting from the evaluation of
compile-time expressions, the result might be the same as
(or similar to) lexemes from literal values read in from
the source. But in the process of performing a "reduction"
operation, the syntax analyzer *also* creates "value" lexemes
[graph table lexemes a.k.a. IR nodes] to represent the results
of an operator that can only be executed at run-time. For example,
if the RUND buffer contains the following:
sym del futsym futdel
FOO * BAR +
then because the precedence of the operator "*" is higher than
the operator "+", the syntax parser can construct a "graph table
lexeme" representing the subexpression "FOO * BAR", stick that
into FUTSYM and then call RUND again, resulting in (perhaps):
sym del futsym futdel
[*] + BAZ ;
|
\---->+--------------+
| <op:arith:3> |
+--------------+
| * |
+--------------+
| FOO |
+--------------+
| BAR |
+--------------+
In other words, the syntax analyzer has (briefly) acted like a
lexer by forming an IR code node and overwriting FUTSYM with it.
So when the syntax analyser calls RUND [the lexer] to read the
next lexeme or two [in this case, "BAZ" & ";"], it automatically
shifts the result of the syntax reduction into SYM where it is
needed to continue the parse.
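As a sketch of just that reduction step, in Python, with a nested tuple
standing in for a graph table lexeme (the names and numeric priorities
are mine, not BLISS's):

```python
PRIO = {"*": 50, "/": 50, "+": 40, "-": 40, ";": 0}

def reduce_to_lexeme(sym, delim, futsym, futdel):
    # DEL binds tighter than FUTDEL, so fold {SYM, DEL, FUTSYM} into a
    # single "graph table lexeme" (here just a tuple); the caller then
    # overwrites FUTSYM with it, and the next call to RUND shifts it
    # into SYM, exactly as described above.
    assert PRIO[delim] > PRIO[futdel]
    return (delim, sym, futsym)

node = reduce_to_lexeme("FOO", "*", "BAR", "+")
# node == ("*", "FOO", "BAR")
```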
But what would happen if the current operator in DEL were of *lower*
precedence than the operator in FUTDEL? Then the current syntax parsing
routine would save SYM & DEL in local variables ["on the stack",
effectively] and loop calling RUND and the appropriate delimiter-
processing routine(s) until the original saved operator *was* of
higher precedence than the current DEL, then return a graph table
lexeme for {saved_DEL, saved_SYM, SYM}. That's where the "recursive
descent" part actually happens, by the way. ;-}
See the code for the EXPR-INFIX function in my previous message.
This single routine handles *all* binary infix operators, regardless
of the operator's precedence.
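I can't reproduce the BLISS code here, but the same idea -- one routine
handling *every* binary infix operator by comparing priorities and
recursing for the tighter-binding side -- can be sketched in Python like
so (the tokenizer and the priority table are my own stand-ins, and this
sketch ignores brackets entirely):

```python
import re

PRIO = {"*": 50, "/": 50, "+": 40, "-": 40, ";": 0, "EOF": -1}

def tokenize(text):
    return re.findall(r"[A-Za-z_]\w*|\d+|[*/+\-;]", text) + ["EOF"]

def expr_infix(toks, pos, min_prio):
    # One routine for *all* binary infix operators: save the current
    # value, then loop while the next operator binds tighter than the
    # saved one, recursing to build the right-hand subexpression.
    lhs = toks[pos]; pos += 1
    while PRIO.get(toks[pos], -1) > min_prio:
        op = toks[pos]; pos += 1
        rhs, pos = expr_infix(toks, pos, PRIO[op])
        lhs = (op, lhs, rhs)     # a "graph table lexeme" (IR node)
    return lhs, pos

tree, _ = expr_infix(tokenize("FOO * BAR + BAZ ;"), 0, 0)
# tree == ("+", ("*", "FOO", "BAR"), "BAZ")
```

Because "*" outranks "+", the FOO * BAR reduction happens inside the
recursive call, just as in the buffer example above.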
+---------------
| In a general way, where would you put the code you gave about shifting
| around symbols and operators in the chain "the text through the scanner
| through the tokenizer through the lexer through the parser".
+---------------
I'm not *quite* sure what you're asking here, but I think I've
already shown you in detail in my previous reply & above.
But to recap, there's a function [subroutine, really, if you want to
be pedantic about "functions"] named RUND -- Read Until Next Delimiter
(a.k.a. operator) -- that is the primary interface to the lexer
from the syntax analyzer. RUND reads in the next one or two lexemes
[depending on whether the first is an "operator" lexeme or not], shifts
FUTSYM into SYM & FUTDEL into DEL, and then fills FUTSYM with the first
new lexeme [or a dummy NOVALUE, if the first lexeme was an operator],
and fills FUTDEL with the second new lexeme [or the first, if the first
lexeme was an operator]. This must *always* leave SYM & FUTSYM containing
"value" lexemes or graph table lexemes or NOVALUE, and must *always*
leave DEL & FUTDEL containing operator lexemes. [If not, there is a
syntax error. But that's another story...]
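That buffer discipline might be sketched in Python as follows (the
attribute names follow the text above; the lexeme representation --
("val"/"op", text) pairs -- is my own, and the sketch assumes the input
is syntactically well-formed):

```python
NOVALUE = "<novalue>"

class Rund:
    # SYM & FUTSYM always end up holding "value" lexemes (or NOVALUE);
    # DEL & FUTDEL always end up holding operator lexemes.
    def __init__(self, lexemes):
        self._lexemes = iter(lexemes)
        self.sym = self.delim = self.futsym = self.futdel = None
        self.rund(); self.rund()          # prime the two-lexeme window

    def _next(self):
        # EOF is itself an operator lexeme, of lowest priority.
        return next(self._lexemes, ("op", "EOF"))

    def rund(self):
        # Shift FUTSYM into SYM and FUTDEL into DEL...
        self.sym, self.delim = self.futsym, self.futdel
        kind, text = self._next()
        if kind == "op":        # first new lexeme is an operator:
            self.futsym, self.futdel = NOVALUE, text
        else:                   # a value: read the operator after it
            self.futsym, self.futdel = text, self._next()[1]
```

Feeding it the lexemes of "FOO * BAR + ..." reproduces the buffer state
shown earlier: SYM=FOO, DEL=*, FUTSYM=BAR, FUTDEL=+.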
+---------------
| Do you have notions about the functionality that could be
| isolated and specified as happening in the lexer?
+---------------
Basically, the *only* thing that happens in the lexer proper is that
one or more characters of input are read until a complete lexical token
is found, and then that token is analyzed (just a little) into "values"
[identifiers or literals such as numbers & strings] or "delimiters"
["operators" such as "IF,THEN,WHILE" or "+-*/({[=" or "stoppers" such
as "]}),;"] and then packaged up into a tagged "lexeme" object [one of
the tags being delimiter/operator or not].
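A minimal sketch of such a lexer in Python (the token pattern and the
category sets are my own illustration, not any particular language's):

```python
import re

STOPPERS = set("]}),;")
DELIMITERS = set("+-*/({[=") | {"IF", "THEN", "WHILE"} | STOPPERS

def lex(text):
    # Read characters until a complete token is found, then tag it as
    # a "delimiter"/operator lexeme or a "value" lexeme.
    for tok in re.findall(r'[A-Za-z_]\w*|\d+|"[^"]*"|\S', text):
        yield ("op" if tok in DELIMITERS else "val", tok)

list(lex("IF x + 1 ;"))
# [("op", "IF"), ("val", "x"), ("op", "+"), ("val", "1"), ("op", ";")]
```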
+---------------
| I see all kinds of madness in the code I review, if actual practice
| is the criterion of evaluation then it is bullshit to pretend that for
| example scanning/tokenizing/lexing/parsing has all been figured out...
+---------------
I think you're being a little too skeptical. With a little more experience
in different languages/compilers you might learn that indeed scanning/
tokenizing/lexing/parsing *HAS* been pretty much all figured out, at least
for languages with a relatively straightforward syntax.
+---------------
| and bullshit too to insist that the solution is to do it the
| correct way according to the instructions of the old masters.
+---------------
Well, I'm not trying to push any particular design style as any kind
of "correct way" -- far from it!! I'm just sharing one particular
little parsing "trick" [that I learned from the BLISS-10 & BLISS-11
compilers decades ago] that I have personally found quite useful when
implementing small (*non*-Lisp-like) languages that need to be embedded
in other software.
The biggest thing I ever personally used the above for was writing
an infix dialect of Scheme [called "Parentheses-Light Scheme", or
shorter, "P'Lite Scheme", or just "plite"]. It was successfully
used by a networking hardware development group at SGI to code
user-mode hardware debugging programs.
+---------------
| Now that being said if I wasn't being so lazy I'd be reading the
| dragon book and answering these questions for myself instead of
| wasting your time with something that is so off-topic.
+---------------
Well, I didn't learn about the above in the Dragon Book myself,
either, but from the BLISS-10 compiler source and the BLISS-11
compiler design book [mentioned above].
And it's not *too* far off-topic, since Lispers *do* occasionally
have to be able to parse non-sexpr code or config file formats, and
the above parsing style is easy to implement in Lisp. So there. ;-}
-Rob