[Haskell-cafe] Lazy Parsing

GüŸnther Schmidt

unread,

May 26, 2009, 7:52:38 PM5/26/09

to haskel...@haskell.org

Hi all,

is it possible to do lazy parsing with Parsec? I understand that one can
do that with polyparse, don't know about uulib, but I happen to be
already somewhat familiar with Parsec, so before I do switch to
polyparse I rather make sure I actually have to.

The files it has to parse is anywhere from 500 MB to 5 GB.

Günther

_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Henning Thielemann

unread,

May 27, 2009, 11:23:02 AM5/27/09

to GüŸnther Schmidt, haskel...@haskell.org

GüŸnther Schmidt schrieb:

> Hi all,
>
> is it possible to do lazy parsing with Parsec? I understand that one can
> do that with polyparse, don't know about uulib, but I happen to be
> already somewhat familiar with Parsec, so before I do switch to
> polyparse I rather make sure I actually have to.
>
> The files it has to parse is anywhere from 500 MB to 5 GB.

I don't think that it is in general possible to use the same parser for
lazy and strict parsing, just because of the handling of parser failure.
If parser failure is denoted by a Left constructor in (Either Reason
Result) then the whole parsing process must be finished, before the
parser knows whether the answer is Left or Right.
I also used polyparse for lazy parsing, but I found it unintuitive how
to make a parser lazy. I tried to do better in tagchup, where I make
explicit in the type, whether a parser can fail or not. In the first
case in cannot be lazy, in the second case it can. I also did lazy
parsing in 'midi' package and in 'spreadsheet'. I liked to factor out a
lazy parser library from them, but I failed to unify all these
applications. At least I have factored out handling of lazy failure (aka
asnychronous exceptions) in explicit-exception package.

Btw. a good place to discuss such issues is our local Haskell meeting
that takes place on 2009-06-12:
http://iba-cg.de/hal4.html

Henning Thielemann

unread,

May 27, 2009, 5:23:31 PM5/27/09

to Gü nther Schmidt, haskel...@haskell.org

On Wed, 27 May 2009, Gü?nther Schmidt wrote:

> is it possible to do lazy parsing with Parsec? I understand that one can do
> that with polyparse, don't know about uulib, but I happen to be already
> somewhat familiar with Parsec, so before I do switch to polyparse I rather
> make sure I actually have to.

Also see
http://www.haskell.org/haskellwiki/Maintaining_laziness

Malcolm Wallace

unread,

May 28, 2009, 5:44:50 AM5/28/09

to haskel...@haskell.org

Henning Thielemann <schle...@henning-thielemann.de> wrote:

> I don't think that it is in general possible to use the same parser
> for lazy and strict parsing, just because of the handling of parser
> failure.

Polyparse demonstrates that you can mix-and-match lazy parsers with
strict parsers in the different parts of a grammar (by choosing whether
to use applicative or monadic style). You can also switch between lazy
or strict interpretations of the applicative parts of your grammar (by
changing the import that decides which version of the parser primitives
is in scope).

> I also used polyparse for lazy parsing, but I found it unintuitive how
> to make a parser lazy.

It can certainly be tricky, and requires a certain amount of
experimentation. I think the difficulties are mainly due to the mix of
lazy (applicative) and strict (monadic) styles in different
non-terminals. A parser that you intend to be lazy, may turn out to be
stricter than you hope, because of the strictness of another parser that
it depends upon.

Regards,
Malcolm

Penz

unread,

May 28, 2009, 4:08:38 PM5/28/09

to

There is also lazyMany, that I think already appeared on the list some
times, hope it helps:

lazyMany :: Parser () -> Parser a -> SourceName -> String -> [a]
lazyMany p0 p filename contents = lm
state0
where
localp0 = p0 >> getParserState
state0 = case parse localp0 filename contents of -- get an
initial state
Right s -> s
Left e -> error $ show e
lm state = either (error . show) id (parse p' "" "")
where
p' = setParserState state >> choice [eof >> return [],
do
x <- p
state' <- getParserState
return (x:lm state')]

Leandro Penz

S. Doaitse Swierstra

unread,

May 29, 2009, 4:20:26 AM5/29/09

to GüŸnther Schmidt, haskel...@haskell.org

Lazy parsing has been the default for the last ten years in uulib, and
is now available in the simple uu-parsinglib (http://hackage.haskell.org/cgi-bin/hackage-scripts/package/uu-parsinglib
). The whole design of the latter in described in a technical report
to which references are given on the web page. It provides also error
correction, the ability to use several different kinds of input
tokens, and (with some help) ambiguities. If speed is an issue you can
insert extra hints which locally change the breadth-first parsing
process locally into a somewhat more depth-first form. When compared
with Parsec the good news is that usually you do not have to put
annotations to get nice results.

The older uulib version also performs an abstract interpretation which
basically changes the search for which alternative to take from a
linear to a logarithmic complexity, but does not provide a monadic
structure, in which you use results recognised thus far to construct
new parsers.

Both the old uulib version and the new version have always had an
applicative interface.

In the near future elements of the abstract interpretation of the old
uulib version will migrate into the new version. It is the advent of
GADT's which made this new version feasable.

An example of the error correction at work at the following example
code:

pa, pb, paz :: P_m (Str Char) [Char]
pa = lift <$> pSym 'a'
pb = lift <$> pSym 'b'
p <++> q = (++) <$> p <*> q
pa2 = pa <++> pa
pa3 = pa <++> pa2

pCount p = (\ a b -> b+1) <$> p <*> pCount p <<|> pReturn 0
pExact 0 p = pReturn []
pExact n p = (:) <$> p <*> pExact (n-1) p

paz = pMany (pSym ('a', 'z'))

paz' = pSym (\t -> 'a' <= t && t <= 'z', "a .. z", 'k')

main :: IO ()
main = do print (test pa "a")
print (test pa "b")
print (test pa2 "bbab")
print (test pa "ba")
print (test pa "aa")
print (test (do l <- pCount pa
pExact l pb) "aaacabbb")
print (test (amb ( (++) <$> pa2 <*> pa3 <|> (++) <$> pa3
<*> pa2)) "aaabaa")
print (test paz "ab1z7")
print (test paz' "m")
print (test paz' "")

is

loeki:~ doaitse$ ghci -package uu-parsinglib
GHCi, version 6.10.1: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer ... linking ... done.
Loading package base ... linking ... done.
Loading package syb ... linking ... done.
Loading package array-0.2.0.0 ... linking ... done.
Loading package filepath-1.1.0.1 ... linking ... done.
Loading package old-locale-1.0.0.1 ... linking ... done.
Loading package old-time-1.0.0.1 ... linking ... done.
Loading package unix-2.3.1.0 ... linking ... done.
Loading package directory-1.0.0.2 ... linking ... done.
Loading package process-1.0.1.1 ... linking ... done.
Loading package random-1.0.0.1 ... linking ... done.
Loading package haskell98 ... linking ... done.
Loading package uu-parsinglib-2.0.0 ... linking ... done.
Prelude> :m Text.ParserCombinators.UU.Examples
Prelude Text.ParserCombinators.UU.Examples> main
("a",[])
("a",[
Deleted 'b' at position 0 expecting one of ["'a'"],
Inserted 'a' at position 1 expecting one of ["'a'"]])
("aa",[
Deleted 'b' at position 0 expecting one of ["'a'"],
Deleted 'b' at position 1 expecting one of ["'a'"],
Deleted 'b' at position 3 expecting one of ["'a'"],
Inserted 'a' at position 4 expecting one of ["'a'"]])
("a",[
Deleted 'b' at position 0 expecting one of ["'a'"]])
("a",[
The token 'a'was not consumed by the parsing process.])
(["b","b","b","b"],[
Deleted 'c' at position 3 expecting one of ["'a'","'b'"],
Inserted 'b' at position 8 expecting one of ["'b'"]])
(["aaaaa"],[
Deleted 'b' at position 3 expecting one of ["'a'","'a'"]])
("abz",[
Deleted '1' at position 2 expecting one of ["'a'..'z'"],
The token '7'was not consumed by the parsing process.])
('m',[])
('k',[
Inserted 'k' at position 0 expecting one of ["a .. z"]])
Prelude Text.ParserCombinators.UU.Examples>

Doaitse Swierstra

On 27 mei 2009, at 01:52, G�nther Schmidt wrote:

> Hi all,
>
> is it possible to do lazy parsing with Parsec? I understand that one
> can do that with polyparse, don't know about uulib, but I happen to
> be already somewhat familiar with Parsec, so before I do switch to
> polyparse I rather make sure I actually have to.
>
> The files it has to parse is anywhere from 500 MB to 5 GB.
>
>

> G�nther

S. Doaitse Swierstra

unread,

May 29, 2009, 4:49:41 AM5/29/09

to Haskell Cafe

In the uu-parsinglib we actually have two versions of parsers: lazy
ones and strict ones, which have different types. So by giving a type
annotation you can select the one you want. Notice that in the left-
hand side of a monadic construct it does not make sense to use a lazy
parser, since its result will be used as a parameter to the right-hand
side operator, so in case of a monad our library system automagically
selects the strict version for the left hand side. For the right hand
side it depends on the type of the overall expression. Unfortunately
in Haskell both the left and right hand side of a bind need the to be
elements of the same monad, whereas in the case of a lazy oevrall
parser this is not the case. We solve this problem by tupling the two
parsers (NOT the parsing results), so still the do-notation can be used.

The use of the library is free of any trickery!

Doaitse Swierstra

Guenther Schmidt

unread,

May 29, 2009, 7:39:23 AM5/29/09

to haskel...@haskell.org

Dear Doaitse,

In the days since my original post I had already come to favor the
uu-parsing package. I have printed the report and read it every day to
figure out how to use it. I cannot follow everything yet, and also hope
that won't be necessary in order to use it. :-)

My progress is a bit slow, but I'm not giving up. What I do like most,
over the other combinatory packages, is the approach of using
"breadth-first" when it comes to choice, the idea is certainly
enlightening. The packages capability to do "online- / partial parsing"
is essential for me.

I am a bit surprised about it's "raw" state. The basic combinators and
primitives are there but combinators like pChain, pDigit etc. are not
predefined and merely present in the examples package.

I had gotten quite comfortable with parsec and need to find the right
way to "translate" my parsec code to your package.

Anyway let me thank you for your work, I really appreciate it very much.

Gï¿½nther

S. Doaitse Swierstra schrieb:

>> Gï¿½nther

GüŸnther Schmidt

unread,

May 30, 2009, 6:42:52 PM5/30/09

to haskel...@haskell.org

Dear Doaitse,

It is my pleasure to announce that after 5 days of experimenting with
uu-parsinglib I have absolutely no clue, whatsoever, on how to use it.

Period.

I do not even manage to write a parser for even a mere digit or a simple
character. I have read the tutorial from a to a to z and from z to a and
there were a few words I recognized.

I mean I'd like to be able to turn "12.05.2009" into something like (12,
5, 2009) and got no clue what the code would have to look like. I do
know almost every variation what the code must not look like :).

I am guessing here that when one does define a parsing function, since
all the parser combinators aren't function but methods, one *must* also
provide a type signature so that the compiler knows the actual
*instance* method?

Günther

Stephen Tetley

unread,

May 31, 2009, 6:55:37 AM5/31/09

to GüŸnther Schmidt, haskel...@haskell.org

Hi G�nther

The code below should work for your simple example, provided it hasn't
lost formatting when I pasted it in to the email.

I was a bit surprised that there is no pSatisfy in this library, but
there are parsers for digits, lower case, upper case letters etc. in
the Examples module that would otherwise be achieved with pSatisfy.

Best wishes

Stephen

{-# LANGUAGE FlexibleContexts #-}

module Demo1 where

import Text.ParserCombinators.UU.Examples
import Text.ParserCombinators.UU.Parsing

-- here's a simple character '@' parser
pAtSym :: Symbol p Char Char => p Char
pAtSym = pSym '@'

test_simple_char = test pAtSym "@"
test_simple_char2 = test pAtSym "@@@@@"

-- pDigit is supplied in Text.ParserCombinators.UU.Examples
test_any_digit = test pDigit "6"

-- pNatural is supplied in Text.ParserCombinators.UU.Examples
-- It looks like the most likely candidate to parse a
-- sequence of digits...

test_natural = test pNatural "1234"

-- ... and it is!

-- parse a date "12.05.2009" as a triple (Int,Int,Int)
pDateTriple :: (Symbol p (Char,Char) Char, Applicative p, ExtApplicative p st,
Provides st Char Char)
=> p (Int,Int,Int)
pDateTriple = (,,) <$> pNatural <* pDot <*> pNatural <* pDot <*> pNatural

pDot :: (Symbol p Char Char, Applicative p) => p [Char]
pDot = lift <$> pSym '.'

test_date = test pDateTriple "12.05.2009"

Malcolm Wallace

unread,

May 31, 2009, 7:42:08 AM5/31/09

to Haskell Cafe mailing list

> It is my pleasure to announce that after 5 days of experimenting
> with uu-parsinglib I have absolutely no clue, whatsoever, on how to
> use it.
>

> I do not even manage to write a parser for even a mere digit or a
> simple character.

I don't know whether you will be willing to change over to polyparse
library, but here are some hints about how you might use it.

Given that you want the input to be a simple character stream, rather
than use a more elaborate lexer, the first thing to do is to
specialise the parser type for your purposes:

> type TextParser a = Parser Char a

Now, to recognise a "mere digit",

> digit :: TextParser Char
> digit = satisfy Char.isDigit

and for a sequence of digits forming an unsigned integer:

> integer :: TextParser Integer
> integer = do ds <- many1 digit
> return (foldl1 (\n d-> n*10+d)
> (map (fromIntegral.digitToInt) ds))
> `adjustErr` (++("expected one or more digits"))

> I mean I'd like to be able to turn "12.05.2009" into something like
> (12, 5, 2009) and got no clue what the code would have to look like.
> I do know almost every variation what the code must not look like :).

> date = do a <- integer
> satisfy (=='.')
> b <- integer
> satisfy (=='.')
> c <- integer
> return (a,b,c)

Of course, that is just the standard (strict) monadic interface used
by many combinator libraries. Your original desire was for lazy
parsing, and to achieve that, you must move over to the applicative
interface. The key difference is that you cannot name intermediate
values, but must construct larger values directly from smaller ones by
something like function application.

> lazydate = return (,,) `apply` integer `discard` dot
> `apply` integer `discard` dot
> `apply` integer
> where dot = satisfy (=='.')

The (,,) is the constructor function for triples. The `discard`
combinator ensures that its second argument parses OK, but throws away
its result, keeping only the result of its first argument.

Apart from lazy space behaviour, the main observable difference
between "date" and "lazydate" is when errors are reported on incorrect
input. For instance:

> fst $ runParser date "12.05..2009"
*** Exception: In a sequence:
Parse.satisfy: failed
expected one or more digits

> fst $ runParser lazydate "12.05..2009"
(12,5,*** Exception: In a sequence:
Parse.satisfy: failed
expected one or more digits

Notice how the lazy parser managed to build the first two elements of
the triple, whilst the strict parser gave no value at all.

I know that the error messages shown here are not entirely
satisfactory, but they can be improved significantly just by making
greater use of the `adjustErr` combinator in lots more places (it is
rather like Parsec's <?>). Errors containing positional information
about the input can be constructed by introducing a separate lexical
tokenizer, which is also not difficult.

Regards,
Malcolm

Guenther Schmidt

unread,

May 31, 2009, 7:45:35 AM5/31/09

to S. Doaitse Swierstra, Haskell Cafe

Dear Doaitse,

thank you very much for your help.

>
> I am curious to know what made you go wrong with the tutorial, and
> caused that you could not find the solution below?
>
Well let's first agree that I'm not very bright. I hate to admit it, but
it's a simple fact ;-).

Second let's agree that the uu-parsinglib is a *very* sophisticated
beast, I have not seen anything else like it out there, my sincere
congratulations for it. Thirdly the tutorial is also a very
sophisticated beast, and forthly, well just see point 1 :-).

And I just figured out why I was unable to write even that simple parser.

The code you sent me works just fine, I copied and pasted it, no problems.

But, as soon as I comment out the "main" function the type checker
complains, because now the ghci's type checker can no longer infer the
types of pDate or pDot. And this is exactly what happened. I kept
getting error messages from ghci, eventhough I had defined my parsers
possible correctly, but, *minus* the type signatures *and* minus any
main function that called it.

In hindsight I realize that this is a trap I have walked into many times
before, I guess I still have not acquired a Haskellers intuition.

I promise to do better next time :)

G�nther

Gü?nther Schmidt

unread,

May 31, 2009, 8:38:41 AM5/31/09

to haskel...@haskell.org

Dear Malcom,

thanks for helping.

I had actually come to Haskell originally because of a parsing problem.
I had been using Smalltalk until I started a project which required
parsing files. Until then I had not done any RW parsing.

Well the route was more a Parsec -> Haskell, wtf is Haskell? Anyway
eventually I dropped Smalltalk and got addicted to Haskell. And managed
familiarize myself with Haskell and Parsec, the latter as it turned
out I didn't even need to solve my original problem.

Anyway polyparse certainly is an option, but there are a few things that
despite my "list of failures" to use it give uu-parsinglib a special
appeal, the breadth-first approach with choice, I find that terrible
elegant. Due to some kicks in my behind it seems that I might be able to
use Doaitse's combinators now, some more details on that are in another
post.

Gï¿½nther

Malcolm Wallace schrieb:

Stephen Tetley

unread,

May 31, 2009, 8:42:06 AM5/31/09

to Guenther Schmidt, Haskel...@haskell.org

Hi G�nther

I suspect the problem you were having is that there are various
'parsers' (more correctly 'parser types') defined in
Text.ParserCombinators.UU.Parsing and the code you had in your running
example didn't always have enough information to allow GHC to pick a
particular one.

The /test/ function in Examples demands the parser to be of type 'P_m
state a' [1], so if you were running your parsers with /test/ in your
main function this would be give the parsers a concrete, inferable
type (if you hadn't given then a type signature). Once you comment out
main, the parsers have a more general type than 'P_m state a', which
can't be inferred due to class constraints.

Maybe the 'haskeller's inituition' in this instance is to define the
type signatures and the functions at the same time, admittedly this
can be difficult for functions with heavy use of type classes.

Best wishes

Stephen

[1] I'm afraid I don't know the intricacies of the particular types in
the new UU parsing library, until this morning I'd only used the
previous version in uulib.

2009/5/31 Guenther Schmidt <gue.s...@web.de>:

S.Doaitse Swierstra

unread,

May 31, 2009, 4:06:24 PM5/31/09

to Haskell Cafe, Guenther Schmidt

Dear Gunther,

I am providing my solution, on which one can of course specialise in
making sure that a valid date is parsed, which would be a bit more
cumbersome; how should e.g. error correction be done. I prefer to test
afterwards in such situations.

Best,
Doaitse

module Guenther where
import Text.ParserCombinators.UU.Parsing
import Text.ParserCombinators.UU.BasicInstances
import Text.ParserCombinators.UU.Examples hiding (main)
import Control.Applicative hiding ((<*), (*>), (<$))

{- The first decision we have to make is what kind of input we are
providing. The simplest case is just to assume simple characters,
hence for our input type we will use the standard provided stream of
Characters: Str Char, so we use the type of our parsers to be the type
used in the Examples module; since we do not know whether we wil be
using the parsers in a monadic mode too we stay on the safe side ans
use the type P_m -}

type GP a = P_m (Str Char) a -- GP stands for GuenterParser

{- Once we know that our input contains characters, but that in our
output we what to have integer values, we start out by building a
parser for a single integer , for which we use the function pNatural
form the examples-}

pDate = (,,) <$> pNatural <* pDot <*> pNatural <* pDot <*>
(pNatural ::GP Int)
pDot = pSym '.'
{-
main = do print (test pDate "3.4.1900")
print (test pDate "3 4 1900")
print (test pDate "..1900")-}

-- end of Module Guenther

By playing with insertion and deletion costs (e.g. by building a more
picky pNatural) one can control the error recovery. Another option to
get better error recovery would be to define a specialised instance of
Provides which removes spaces. You might even temporarily pSwitch to
the use of this state

>
> Period.
>
> I do not even manage to write a parser for even a mere digit or a
> simple character. I have read the tutorial from a to a to z and from
> z to a and there were a few words I recognized.
>
> I mean I'd like to be able to turn "12.05.2009" into something like
> (12, 5, 2009) and got no clue what the code would have to look like.
> I do know almost every variation what the code must not look like :).
>
> I am guessing here that when one does define a parsing function,
> since all the parser combinators aren't function but methods, one
> *must* also provide a type signature so that the compiler knows the
> actual *instance* method?
>
>

> G�nther