Beginner parsing question

71 views
Skip to first unread message

Dylan Tisdall

unread,
Sep 15, 2015, 3:17:10 PM9/15/15
to Haskell Pipes
Hi,

I'm new to pipes (and pretty new to Haskell), and so have a question that's probably quite simple, but has managed to stump me for the last day. Basically, I'm trying to parse a file that consists of the two "sub-files" (call them first and second), with the length of first prefixed. My goal is to write a parser that reads just the length of first, then skips over first without read it all into memory, and then reads in and prints out the contents of the second file. So far I can read the length, but I'm stumped at how to do the skip, and handle the error case where even reading the length failed. Basically, I'm looking for what should go into "parseRest" in the following code, or any suggestions for how to refactor this to make it more consistent with the design of pipes.

import           Prelude hiding (length, concat, splitAt)
import           Data.ByteString.Lazy.Char8 (pack)
import           Data.ByteString.Lazy (ByteString, length, copy, concat)
import           Data.Binary (Word32)
import           Data.Int (Int32)
import           Data.Binary.Put (runPut, putWord32le)
import           Data.Binary.Get (getWord32le)
import           Pipes
import           Pipes.ByteString (fromLazy, stdout, splitAt)
import qualified Pipes.ByteString as P (ByteString)
import           Pipes.Parse (runStateT, evalStateT, drawAll, Parser)
import           Pipes.Binary (decodeGet, DecodingError, decoded, decode)
import           Control.Lens (view, zoom)

first :: ByteString
first = pack "foo"

second :: ByteString
second = pack "bar"

merge :: ByteString -> ByteString -> ByteString
merge a b = concat [lengthBS a, a, b]
  where
    lengthBS = runPut . putWord32le . fromIntegral . length

split :: Monad m => Producer P.ByteString m r ->
      m (Either DecodingError
         (Producer P.ByteString m (Producer P.ByteString m r)))
split p = do
    (headerLen, p') <- runStateT (decodeGet getWord32le) p
    return $ case headerLen of
        Left err         -> Left err
        Right headerLen' -> Right $  
            view (splitAt  $ fromIntegral $ (headerLen' :: Word32)) p'

decoder :: Parser P.ByteString IO ()
decoder = do
    headerLen <- decode
    lift $ print (headerLen :: Either DecodingError Word32)
    case headerLen of
        Left err -> lift $ print err
        Right hLen -> parseRest

main :: IO ()
main = do
    evalStateT decoder $ fromLazy (merge first second)


Thanks,
Dylan

Gabriel Gonzalez

unread,
Sep 15, 2015, 3:24:31 PM9/15/15
to haskel...@googlegroups.com, dy...@geeky.net
Here's how you can skip N bytes in the context of a `Parser`.  I can add this to `Pipes.ByteString` in some form as a useful utility in the "Parsers" section if it solves your issue:

    import Lens.Micro (zoom)              -- from the `microlens` package
    import Pipes                          -- from the `pipes` package
    import Pipes.ByteString (splitAt)     -- from the `pipes-bytestring` package
    import Pipes.Parse (Parser, skipAll)  -- from the `pipes-parse` package

    skipNBytes :: Monad m => Int -> Parser ByteString m ()
    skipNBytes n = zoom (splitAt n) skipAll
--
You received this message because you are subscribed to the Google Groups "Haskell Pipes" group.
To unsubscribe from this group and stop receiving emails from it, send an email to haskell-pipe...@googlegroups.com.
To post to this group, send email to haskel...@googlegroups.com.

Dylan Tisdall

unread,
Sep 15, 2015, 4:03:54 PM9/15/15
to Haskell Pipes, dy...@geeky.net
Hi Gabriel,

Thanks for the quick reply. I had to modify the snippet a little bit so ghci find definitions for all the symbols (just changed the import lines a bit), but now I'm getting an error about illegal types. Below is the code, and the output from ghci. Any suggestions what I've done wrong?

--- code ---
import Prelude hiding (splitAt)
import Control.Lens (zoom)
import Pipes                          -- from the `pipes` package
import Pipes.ByteString (ByteString, splitAt)     -- from the `pipes-bytestring` package
import Pipes.Parse (Parser, skipAll)  -- from the `pipes-parse` package

skipNBytes :: Monad m => Int -> Parser ByteString m ()
skipNBytes n = zoom (splitAt n) skipAll

--- ghci 7.10.2 output --
Illegal polymorphic or qualified type: Parser ByteString m ()
Perhaps you intended to use RankNTypes or Rank2Types
In the type signature for ‘skipNBytes’:
  skipNBytes :: Monad m => Int -> Parser ByteString m ()


Thanks,
Dylan

Gabriel Gonzalez

unread,
Sep 15, 2015, 4:10:54 PM9/15/15
to haskel...@googlegroups.com, dy...@geeky.net
There are two solutions to this.

The first solution is to enable the `{-# LANGUAGE RankNTypes #-}` extension like the compiler suggests.

The reason why is that the `Parser` type synonym is:

    type Parser a m r = forall x . StateT (Producer a m x) m r

... which means that this type:


    Monad m => Int -> Parser ByteString m ()

... expands out to this type:

    Monad m => Int -> (forall x . StateT (Producer a m x) m r)

... which is a higher-rank type (because of the `forall` after the `(->)`).  This is normal in the case of `pipes-parse` and is used to simplify type signatures.

There's a second solution, which is to just use the type inferred by the compiler, which will be a little bit more verbose:

    Monad m => Int -> StateT (Producer a m x) m r

... which is almost the same as the previous type, except now that the `forall` has been implicitly been pulled out of the type like this:

    forall x . Monad m => Int -> StateT (Producer a m x) m r

This type doesn't require the `RankNTypes` language extension, but you can no longer simplify the type using the `Parser` type synonym.

It's up to you which solution you prefer.

Dylan Tisdall

unread,
Sep 15, 2015, 4:14:26 PM9/15/15
to Haskell Pipes, dy...@geeky.net
Great, thanks again!

Gabriel Gonzalez

unread,
Sep 15, 2015, 4:15:09 PM9/15/15
to haskel...@googlegroups.com, dy...@geeky.net
You're welcome!

Michael Thompson

unread,
Sep 15, 2015, 4:54:46 PM9/15/15
to Haskell Pipes, dy...@geeky.net
By the way, to forestall confusion, `zoom` is in `Lens.Micro.Mtl` in order to sort out the dependencies. `Lens.Micro.Mtl`  is also needed for `view`, but not for `(^.)`, because, following `Control.Lens`,  `view` is typed as 

     view :: MonadReader s m => Getting a s a -> m a

 The convenient way to get all the customary combinators for pipes use is thus

     import Lens.Micro.Platform

which just has a couple natural dependencies outside the boot libraries.  I haven't come across any difficulty with that.

Dylan Tisdall

unread,
Sep 15, 2015, 6:02:04 PM9/15/15
to Haskell Pipes, dy...@geeky.net
Thanks for the heads-up Michael; that makes more sense now.

However, I've tried to implement Gabriel's suggestion, and it's not behaving as I'd expect in my target application. I'm wondering if someone can help me figure out why. I've got the relevant code snipped below, with the suggested skipNBytes and then a decoder for my application.

skipNBytes :: Monad m => Int -> P.Parser P.ByteString m ()
skipNBytes n
= zoom (P.splitAt n) P.skipAll


decoder
:: P.Parser P.ByteString IO ()
decoder
= do
    headerLenRes
<- P.decodeGet getWord32le
    lift $
print (headerLenRes :: Either P.DecodingError Word32)
   
case headerLenRes of
       
Left err -> lift $ print err
       
Right hLenRes -> parseRest ((fromIntegral hLenRes) - 4)
 
where
    parseRest hLen
= do
        skipNBytes $ hLen
       
--P.decodeGet $ getByteString hLen
        mdh
<- P.decode
        lift $
print (mdh :: Either P.DecodingError MDH.MDH)


Leaving the details aside (e.g., MDH.MDH implements Binary so I can just call P.decode to parse it), my problem is that

   skipNBytes $hLen

doesn't seem to consume the same number of bytes as
 
  P.decodeGet $ getByteString hLen

If I use the second line, I parse a correct MDH.MDH out of my input stream. If I use the skipNBytes line, I end up in the wrong place in my input stream. I had assumed these two lines were identical, except for the fact that the second buffers all the data before throwing it away, with the first reads it in one byte at a time and throws it away. That's not what I'm seeing, though. Any suggestions for what I'm missing?


Thanks,
Dylan

Gabriel Gonzalez

unread,
Sep 15, 2015, 6:30:17 PM9/15/15
to haskel...@googlegroups.com, dy...@geeky.net
Are you sure you are using the correct `splitAt`?  From your latest code snippet it looks like you changed it to use `Pipes.Parse.splitAt` but you should be using `Pipes.ByteString.splitAt`.

The difference between the two is that:

* `Pipes.Parse.splitAt` will skip `n` elements (where the elements in this case are chunks) so you skip `n` chunks
* `Pipes.ByteString.splitAt` will skip `n` bytes (regardless of what the chunk sizes are)

However, you are right that `decodeGet` buffers the entire decoded bytestring in memory before discarding it at the end whereas `skipNBytes` will run in constant space.  The original rationale for this was that the author of `pipes-binary` wanted `decodeGet` to be able to recover from failures by rewinding the input stream to where it began.
--

Dylan Tisdall

unread,
Sep 15, 2015, 6:55:54 PM9/15/15
to Haskell Pipes, dy...@geeky.net
Thanks, that was it! I was wondering how it knew whether its elements were bytes or chunks, and that answer it.
Reply all
Reply to author
Forward
0 new messages