Thank you for your quick answer.
While trying to come up with a small example I think I've discovered the
problem. I'm not 100% sure, but at least my smaller data sets seem to
pass. I did not implement the query part yet so maybe there is still
something broken. The problem seems to be line feed and carriage return
control characters in words. My WordReader does not filter them. Is it
expected behavior of a WordReader to not include these characters in
words? I did not find it in the documentation. Nevertheless, if it is,
then the same should apply to other Unicode line terminators. Though I
would prefer that words were interpreted as arrays of Unicode code points.
If you're still interested in a test case I'd be happy to provide one.
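
To make the two options concrete, here is a minimal sketch in Python. It is not taken from the library: the helper names strip_line_terminators and to_code_points are hypothetical, and the set of characters is my assumption, based on the line terminators named in the Unicode Standard's newline guidelines.

```python
# Hypothetical helpers for illustration only; not part of any WordReader API.

# Characters treated as line terminators here (an assumption, following the
# Unicode Standard's newline guidelines).
UNICODE_LINE_TERMINATORS = frozenset(
    "\n"      # U+000A LINE FEED
    "\r"      # U+000D CARRIAGE RETURN
    "\x0b"    # U+000B LINE TABULATION (vertical tab)
    "\x0c"    # U+000C FORM FEED
    "\x85"    # U+0085 NEXT LINE
    "\u2028"  # U+2028 LINE SEPARATOR
    "\u2029"  # U+2029 PARAGRAPH SEPARATOR
)


def strip_line_terminators(word):
    """Option 1: drop every line terminator from a word before emitting it."""
    return "".join(ch for ch in word if ch not in UNICODE_LINE_TERMINATORS)


def to_code_points(word):
    """Option 2: treat a word as a plain array of Unicode code points,
    so that no character has a special meaning."""
    return [ord(ch) for ch in word]
```

With the first option a reader would emit strip_line_terminators(word) instead of the raw word. With the second, control characters would simply be ordinary elements of the array.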