Streaming Unicode data decoding


Michael Snoyman

Feb 12, 2014, 6:49:01 AM
to John Millikin, John Lato, Gabriel Gonzalez, what_is_it_t...@yahoo.com, streamin...@googlegroups.com
As some people may know, the current state of Unicode decoding (e.g., UTF-8 or UTF-32BE decoding) in a Haskell streaming data framework is a bit tricky today. The problem arises from two facts:

1. The text package, until recently, provided no streaming API. Instead, it only provided a strict API (which assumes all data is in a single strict ByteString) or a lazy API, which would require using lazy I/O. Obviously streaming data fans would prefer to consume the bytes incrementally. (The recent text 1.1 adds a streaming API for UTF8.)
2. text generally treats errors in decoding via exceptions from pure code. In streaming data, we usually want to be able to decode as much of the stream as possible, and then do something intelligent with the remaining bytes.

The original solution (to my knowledge) to this problem came from John Millikin in Data.Enumerator.Text. This code has since been used in both conduit and pipes-text. However, there are some downsides: the code is complicated, it's now been copy-pasted into three different projects, and it has some subtle differences from text's own decoding, which can lead to bugs[1][2][3].

Michael Thompson raised issue #60 against the text package[4] about this problem. My most recent comment[5] on that issue points to some new code I've just written for conduit, which takes text's own decoding functions, and tweaks them to have the streaming data behavior we'd want. The code currently lives in conduit.

My questions for this list are:

* Are others interested in using this code outside of conduit?
* Are there issues with the API that I've provided? (Note that I've designed this as a low-level API, and have used the technique of using a null ByteString to indicate end-of-stream. It's a hack, but it avoids extra runtime checks and surface area of the API.)
* I've run a battery of tests on this code, but I don't feel 100% comfortable with it yet. If others want to try and break it, I'd appreciate it.
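To make the empty-ByteString end-of-stream convention concrete, here is a rough sketch of what such a low-level API could look like. The `DecodeResult` type and the toy `decodeUtf8Chunks` decoder below are illustrations of the general design, not the actual conduit code; a real implementation would hook into text's own decoder rather than re-splitting chunks and calling `decodeUtf8'`.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical result type for a low-level streaming decoder. On
-- success you get the decoded text plus a continuation for the next
-- chunk; feeding the continuation an empty ByteString signals
-- end-of-stream.
data DecodeResult
  = DecodeResultSuccess !T.Text !(B.ByteString -> DecodeResult)
  | DecodeResultFailure !T.Text !B.ByteString

-- A toy incremental UTF-8 decoder illustrating the convention:
-- incomplete trailing byte sequences are buffered between chunks.
decodeUtf8Chunks :: B.ByteString -> DecodeResult
decodeUtf8Chunks = go B.empty
  where
    go leftover chunk
      | B.null chunk =  -- empty chunk marks end-of-stream
          if B.null leftover
            then DecodeResultSuccess T.empty (go B.empty)
            else DecodeResultFailure T.empty leftover
      | otherwise =
          let bs = B.append leftover chunk
              (complete, rest) = splitCompleteUtf8 bs
          in case TE.decodeUtf8' complete of
               Right t -> DecodeResultSuccess t (go rest)
               Left _  -> DecodeResultFailure T.empty bs

-- Split off any incomplete multi-byte sequence at the end of the input.
splitCompleteUtf8 :: B.ByteString -> (B.ByteString, B.ByteString)
splitCompleteUtf8 bs = B.splitAt (findSplit (B.length bs - 1)) bs
  where
    len = B.length bs
    findSplit i
      | i < 0 || len - i > 3 = len     -- nothing recoverable to trim
      | w >= 0xC0 =                    -- lead byte of a sequence
          let need | w >= 0xF0 = 4
                   | w >= 0xE0 = 3
                   | otherwise = 2
          in if len - i < need then i else len
      | w >= 0x80 = findSplit (i - 1)  -- continuation byte; keep looking
      | otherwise = len                -- plain ASCII byte
      where w = B.index bs i
```

The point of the empty-chunk hack is visible in the first guard: end-of-stream needs no separate constructor or wrapper type, at the cost of a documented convention the caller must follow.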

Michael

Michael Snoyman

Feb 12, 2014, 11:44:27 PM
to John Lato, John Millikin, Gabriel Gonzalez, what_is_it_t...@yahoo.com, streamin...@googlegroups.com



On Thu, Feb 13, 2014 at 1:19 AM, John Lato <jwl...@gmail.com> wrote:
On Wed, Feb 12, 2014 at 3:49 AM, Michael Snoyman <mic...@snoyman.com> wrote:

My questions for this list are:

* Are others interested in using this code outside of conduit?

In principle yes; I've been quite disappointed with the current text API WRT streaming. I don't deal with much textual data, though, so it hasn't really been a big issue for me, and in all likelihood I won't use it for some time even if you do break it out.
 
* Are there issues with the API that I've provided? (Note that I've designed this as a low-level API, and have used the technique of using a null ByteString to indicate end-of-stream. It's a hack, but it avoids extra runtime checks and surface area of the API.)

It seems adequate, although it might be nice if DecodeFailure reported a bit more information about the failure. Then again, perhaps it provides enough to build that on top.
 

What I've done in conduit is keep track of the total number of bytes I've passed into the decoding function. Then, combined with the leftover bytes returned from the decoder, I can create an error message with the byte offset into the stream where the error occurred, plus the next few bytes in the stream. Are there other things we'd want to have in the error message?
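As a sketch of that bookkeeping (the names below are made up for illustration, not conduit's actual API): if you track the total bytes fed to the decoder, the failure offset falls out by subtracting the length of the undecodable leftover.

```haskell
import qualified Data.ByteString as B

-- Hypothetical helper: given the total number of bytes fed to the
-- decoder so far and the undecodable leftover it handed back, report
-- where in the stream decoding failed and what comes next.
decodeErrorMessage :: Int -> B.ByteString -> String
decodeErrorMessage totalFed leftover =
  "decode failure at byte offset " ++ show offset
    ++ ", next bytes: " ++ show (B.unpack (B.take 4 leftover))
  where
    -- The failing byte sits wherever the successfully decoded
    -- prefix ended.
    offset = totalFed - B.length leftover
```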
 
* I've run a battery of tests on this code, but I don't feel 100% comfortable with it yet. If others want to try and break it, I'd appreciate it.

Not being a big text user, I'm probably not well-positioned to help much with this.

Cheers,
John L. 

Michael Snoyman

Feb 13, 2014, 2:19:53 AM
to John Millikin, John Lato, Gabriel Gonzalez, what_is_it_t...@yahoo.com, streamin...@googlegroups.com
I've ported the code out of conduit and put it into its own repo[1]. I've put together a simple character counting benchmark comparing text's lazy API and this streaming API, and have put the results in a Gist[2]. There are two versions of the UTF8 benchmark: the standard one uses the same C library implementation of UTF8 decoding as text itself does and, as a result, the runtimes are almost identical (115.35 vs 115.25). For the other benchmarks, lazy text wins every time, with the slowdowns being:

UTF8 pure: 2.56
UTF16: 1.23
UTF32: 1.77

In practice, there's no reason to use the UTF8-pure implementation; I just decided to leave it in since it was already written and tested.

I'm happy with the UTF16 and UTF32 numbers as they stand, given that they're relatively close to the text numbers and those encodings aren't used nearly as much as UTF8. If anyone wants to play around with optimizing them further, I'd be happy to accept pull requests.

I think I'm going to go ahead and release this to Hackage. I still think this kind of API would be better placed in text, but even if it's included upstream, there's an advantage to having this in a separate package: users constrained to existing versions of text by the Haskell Platform could use text-stream-decode as a compatibility shim.

Gabriel Gonzalez

Feb 13, 2014, 5:54:33 AM
to Michael Snoyman, John Millikin, John Lato, what_is_it_t...@yahoo.com, streamin...@googlegroups.com
Sounds good to me.  `pipes-text` will probably use this (I have to ask Michael Thompson, since he maintains it, but he'd probably agree).

Michael Snoyman

Feb 13, 2014, 7:18:11 AM
to streamin...@googlegroups.com, John Millikin, John Lato, what_is_it_t...@yahoo.com
Cool. By the way, it was very nice to be able to compare notes with the pipes-text implementation after I got the library polished.

The library is now officially available on Hackage[1].



--
You received this message because you are subscribed to the Google Groups "streaming-haskell" group.
To unsubscribe from this group and stop receiving emails from it, send an email to streaming-hask...@googlegroups.com.
To post to this group, send email to streamin...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/streaming-haskell/52FCA469.2030601%40gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

Michael Thompson

Feb 13, 2014, 4:52:26 PM
to streamin...@googlegroups.com, John Millikin, John Lato, what_is_it_t...@yahoo.com, mic...@snoyman.com
This seems like a really great idea.

I commented on the corresponding `text` issue with a little quickcheck test for utf8.
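The exact property from that comment isn't reproduced here, but the general shape of such a test is: re-chunking a valid UTF-8 stream at any byte boundary must not change the decoded result. Below is a hand-rolled, exhaustive version over a single input (checking every split point rather than QuickCheck-generated ones), exercising the streaming API that text 1.1 added.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding (streamDecodeUtf8, Decoding(..))

-- Splitting a valid UTF-8 stream at any byte boundary, including in
-- the middle of a multi-byte character, must not change the decoded
-- result relative to a one-shot strict decode.
chunkingIsTransparent :: B.ByteString -> Bool
chunkingIsTransparent bs = all check [0 .. B.length bs]
  where
    reference = TE.decodeUtf8 bs
    check i =
      let (a, b) = B.splitAt i bs
          -- Feed the two halves through text's streaming decoder;
          -- the continuation carries any buffered partial character.
          Some t1 _ continue = streamDecodeUtf8 a
          Some t2 _ _        = continue b
      in T.append t1 t2 == reference
```

A QuickCheck version would generate arbitrary chunkings of arbitrary valid text instead of one fixed input, but the invariant being tested is the same.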

I will try to rework pipes-text to use it this weekend. One thing I already noticed is that, as Michael suggests, it suffers for not using bos's fancy FFI functions. I revised a benchmark of Michael's, and the pipe definition that uses the new streamUtf8 is only half as fast. Maybe one could find a way to use bos's hidden C nonsense, or something like it? Anyway, I would much rather such a question were put to an independent library like this one than to something like pipes-text.


        benchmarking lazy text
        mean: 4.928981 ms, lb 4.903218 ms, ub 5.024673 ms, ci 0.950
        std dev: 225.9253 us, lb 55.82041 us, ub 522.7356 us, ci 0.950


        benchmarking new conduit
        mean: 10.72952 ms, lb 10.55484 ms, ub 10.97152 ms, ci 0.950
        std dev: 1.043854 ms, lb 815.5910 us, ub 1.342237 ms, ci 0.950


        benchmarking stream
        mean: 10.74237 ms, lb 10.60407 ms, ub 10.93885 ms, ci 0.950
        std dev: 834.0893 us, lb 627.3384 us, ub 1.150477 ms, ci 0.950


        benchmarking pipes-text-old -- with FFI'd function
        mean: 5.140928 ms, lb 5.116550 ms, ub 5.198108 ms, ci 0.950
        std dev: 181.4065 us, lb 94.51407 us, ub 361.4704 us, ci 0.950


        benchmarking pipes-text-new
        mean: 11.06904 ms, lb 10.92610 ms, ub 11.28362 ms, ci 0.950
        std dev: 886.1462 us, lb 647.3568 us, ub 1.243963 ms, ci 0.950

Michael Snoyman

Feb 13, 2014, 10:33:44 PM
to streamin...@googlegroups.com, John Millikin, John Lato, what_is_it_t...@yahoo.com
That's very strange; can you share your benchmark? The code that's on Hackage *does* use the underlying C library, and in my benchmarking the performance is almost identical to text's decoding.



Michael Thompson

Feb 13, 2014, 10:49:11 PM
to streamin...@googlegroups.com, John Millikin, John Lato, what_is_it_t...@yahoo.com, mic...@snoyman.com
Sorry Michael, I was just writing to correct this. I am now using the correct versions, and everything is basically equivalent (for utf8), which is really pleasing.

    benchmarking lazy text
    mean: 5.023918 ms, lb 4.982233 ms, ub 5.123620 ms, ci 0.950

    benchmarking new conduit
    mean: 5.581982 ms, lb 5.510454 ms, ub 5.722027 ms, ci 0.950

    benchmarking stream
    mean: 5.292396 ms, lb 5.264294 ms, ub 5.325694 ms, ci 0.950

    benchmarking pipes
    mean: 5.122821 ms, lb 5.088608 ms, ub 5.206244 ms, ci 0.950

    benchmarking pipesnew
    mean: 5.383861 ms, lb 5.353982 ms, ub 5.418508 ms, ci 0.950

Michael Snoyman

Feb 13, 2014, 10:53:04 PM
to streamin...@googlegroups.com, John Millikin, John Lato, what_is_it_t...@yahoo.com
Cool, thanks for the update. What does this benchmark check? The one I've been using when testing the library does character counting, but I'd like to increase the benchmark coverage. I played around with different combinations of INLINE pragmas, and what we have now seemed to be the best, but having a second benchmark to compare against would be beneficial.
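For reference, the character-counting workload is tiny: push each chunk through a streaming decoder and sum the lengths of the decoded pieces. A miniature version using text's own streaming API (this is the shape of the workload, not the criterion benchmark harness itself):

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (streamDecodeUtf8, Decoding(..))

-- Count the characters in a chunked UTF-8 byte stream without ever
-- materializing the whole input. Characters split across chunk
-- boundaries are handled by the decoder's continuation.
countChars :: [B.ByteString] -> Int
countChars = go streamDecodeUtf8 0
  where
    go _      !n []       = n
    go decode !n (c : cs) =
      let Some t _ decode' = decode c
      in go decode' (n + T.length t) cs
```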



Michael Thompson

Feb 13, 2014, 11:03:31 PM
to streamin...@googlegroups.com, John Millikin, John Lato, what_is_it_t...@yahoo.com, mic...@snoyman.com
The benchmark was just a replica of one you had yesterday: https://gist.github.com/michaelt/8995507

I had installed a pre-Hackage (and, I guess, pre-C) version in ~/.cabal yesterday; by a wrong incantation I compiled against it, rather than against the sandbox that includes the GitHub repo, which I hadn't yet started studying.
