José Romildo Malaquias
Aug 22, 2012, 10:56:58 AM
to Chris Kuklewicz, José Romildo Malaquias, haskel...@haskell.org
On Tue, Aug 21, 2012 at 05:50:44PM -0300, José Romildo Malaquias wrote:
> On Tue, Aug 21, 2012 at 04:05:28PM +0100, Chris Kuklewicz wrote:
> > I do not have time to test this myself right now. But I will unravel my code a
> > bit for you.
> >
> > > By November 2011 it worked without problems in my application. Now that
> > > I have resumed developing the application, I have been faced with this
> > > behaviour. As it used to work before, I believe it is a bug in
> > > regex-pcre or libpcre.
> >
> > I believe it may be a problem in String <-> ByteString conversion. The "base"
> > library may have changed and your LOCALE information may be different or may be
> > being used differently by "base".
> >
> > > The (temporary) workaround I found is to convert the strings to
> > > byte-strings before matching, and then convert the results back to
> > > strings. With byte-strings it works well.
> >
> > That is an excellent sign that it is your LOCALE settings being picked up by
> > GHC's "base" package, see explanation below.
[...]
> I have written an application to test those things. There are 2 source
> files: test.hs and seestr.c, which are attached.
>
> The test does the following:
>
> 1. shows the getForeignEncoding
>
> 2. uses a C function to show the characters from a String (using
> withCString) and from a ByteString (using useAsCString)
>
> 3. matches a PCRE regular expression using String and ByteString
>
> The test is run twice, with different LANG settings, and its output
> follows.
[...]
> As can be seen, regular expression matching does not work with
> en_US.UTF-8. But it works with en_US.ISO-8859-1.
>
> The test shows that withCString is working as expected too. This
> may suggest the problem is really with regex-pcre.
The previous tests were run on a Gentoo Linux system with ghc-7.4.1.
I have also run the tests on Fedora 17 with ghc-7.0.4, which does not
have the bug. The sources are attached. The test output follows:
$ LANG=en_US.ISO-8859-1 && ./test
testing with String
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
testing with ByteString
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
regex : país:(.*)
text : país:Brasil
String match : [["pa\237s:Brasil","Brasil"]]
ByteString match : [["pa\237s:Brasil","Brasil"]]
$ LANG=en_US.UTF-8 && ./test
testing with String
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
testing with ByteString
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4
regex : país:(.*)
text : país:Brasil
String match : [["pa\237s:Brasil","Brasil"]]
ByteString match : [["pa\237s:Brasil","Brasil"]]
Clearly withCString has changed from ghc-7.0.4 to ghc-7.4.1. It seems
that with ghc-7.0.4, withCString does not obey the UTF-8 locale and
generates a Latin-1 C string.
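
To make the difference concrete, here is a minimal sketch (not part of the
attached test.hs) that marshals the same String with an explicitly chosen
encoding and reads back the resulting bytes. It assumes a "base" version that
provides the GHC.Foreign module, whose withCString takes a TextEncoding
argument; the helper name bytesWith is made up for this illustration.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray0)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign as GF
import GHC.IO.Encoding (TextEncoding, latin1, utf8)

-- Hypothetical helper (not from test.hs): marshal a String to a
-- NUL-terminated C string using the given encoding, then read back
-- the bytes up to (not including) the terminating NUL.
bytesWith :: TextEncoding -> String -> IO [Word8]
bytesWith enc s = GF.withCString enc s $ \p -> peekArray0 0 (castPtr p)

main :: IO ()
main = do
  -- "pa\237s" is "país"; '\237' is U+00ED.
  bytesWith latin1 "pa\237s" >>= print  -- [112,97,237,115]
  bytesWith utf8   "pa\237s" >>= print  -- [112,97,195,173,115]
```

The Latin-1 result has the single byte 0xED (237) for the accented character,
which matches the "code: ffffffed" lines in both runs above. Under UTF-8 the
same character would instead become the two bytes 0xC3 0xAD.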
Regards,
Romildo