Help needed for a new release of text-hyphen

Austin Ziegler

Jul 15, 2011, 12:45:39 AM
to ruby-talk ML
I've had folks asking me for a release of text-hyphen that works with
Ruby 1.9, and while I've got something that passes the tests that I've
created and added for MRI 1.9, it *loses* compatibility with Ruby
1.8.7 (and does so loudly in the tests) and JRuby (in either 1.8 or
1.9 mode, it appears). I need some help to get the last bits ready,
because I'm not ready to drop Ruby 1.8 entirely (at least one more
version).

- You can find the source on GitHub: https://github.com/halostatue/text-hyphen/
- You will need hoe as a development dependency to assist with this if
you want to use the Rakefile; otherwise, you can run the test files in
test/ directly.
- Only one of the tests fails, but new tests along the same lines
would likely fail too.

I have tested against most Ruby environments, and it only succeeds
against MRI 1.9.2; even JRuby in 1.9 mode fails in the same way as
JRuby in 1.8 mode.

This issue is blocking the next release of text-hyphen. If you can
help, I need it, as I don't have time to investigate and fix it myself
(I've got another project that's taking all of my time).

After this release, this project will probably be put into maintenance
mode (the hyphenation files, aside from an update to UTF-8 encoding
where they weren't already such, have not been updated since the
original release) and I will look at implementing a new version that
works only under Ruby 1.9 (probably under a new name) that will use
the same basic engine but can read .tex hyphenation files from the
texhyphen project rather than depending on the hand-converted
hyphenation files I have, which will also simplify the licensing of
this successor project.

-a
[1] No, I won't remove it as it helps with release management.
--
Austin Ziegler • halos...@gmail.com • aus...@halostatue.ca
http://www.halostatue.ca/ • http://twitter.com/halostatue

Michael Edgar

Jul 15, 2011, 1:46:13 AM
to ruby-talk ML
On Jul 15, 2011, at 12:45 AM, Austin Ziegler wrote:

> I've had folks asking me for a release of text-hyphen that works with
> Ruby 1.9, and while I've got something that passes the tests that I've
> created and added for MRI 1.9, it *loses* compatibility with Ruby
> 1.8.7 (and does so loudly in the tests) and JRuby (in either 1.8 or
> 1.9 mode, it appears). I need some help to get the last bits ready,
> because I'm not ready to drop Ruby 1.8 entirely (at least one more
> version).


Hi Austin,

Running with the debugger on for 1.8.7 brings up this discrepancy:

The "letters" array for 1.8.7 is this:
["d", "a", "m", "p", "f", "s", "c", "h", "i", "f", "f", "f", "a", "h", "r", "t", "s", "k", "a", "p", "i", "t", "\303", "\244", "n", "s", "m", "\303", "\274", "t", "z", "e", "n", "h", "a", "l", "t", "e", "r", "h", "e", "r", "s", "t", "e", "l", "l", "e", "r"]

Now, "\303", "\244" is a UTF-8 encoding of umlauts-over-a (ä). In your 1.8 german
hyphenation file, you encode the ä in itä with the latin-1 encoding \344.

Your input text is UTF-8, but the library searches for the latin1 encoding. Changing
the input to \344 for ä and \374 for ü made the test pass for me on 1.8.7.
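
The mismatch described above can be reproduced at the byte level. This is a minimal sketch, assuming Ruby 1.9 or later (it relies on String#encode); octal_bytes is a hypothetical helper written for this illustration and is not part of text-hyphen.

```ruby
# -*- coding: utf-8 -*-
# Illustration only; requires Ruby 1.9+ for String#encode.

# Format each byte of a string as an octal escape, similar to how
# Ruby 1.8 displayed non-ASCII bytes in the "letters" array.
def octal_bytes(string)
  string.bytes.map { |byte| format("\\%o", byte) }
end

a_umlaut = "\u00E4" # ä
u_umlaut = "\u00FC" # ü

[a_umlaut, u_umlaut].each do |char|
  utf8   = octal_bytes(char.encode("UTF-8")).join(" ")
  latin1 = octal_bytes(char.encode("ISO-8859-1")).join(" ")
  puts "#{char}  UTF-8: #{utf8}  Latin-1: #{latin1}"
end
```

A single Latin-1 byte in the hyphenation pattern can never equal either of the two UTF-8 bytes in the input, which is why the lookup misses.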

Michael Edgar
ad...@carboni.ca
http://carboni.ca/

Charles Oliver Nutter

Jul 15, 2011, 4:38:04 AM
to ruby-talk ML
Is this the error I should see for JRuby?

https://gist.github.com/1084324

If so...yes, it could be something simple, but there's obviously a bug
here. Perhaps I could bother you to formally file a bug at
http://bugs.jruby.org, so we can track it off-list?

- Charlie

Michael Edgar

Jul 15, 2011, 5:38:26 AM
to ruby-talk ML
On Jul 15, 2011, at 4:38 AM, Charles Oliver Nutter wrote:

> Is this the error I should see for JRuby?
>
> https://gist.github.com/1084324
>
> If so...yes, it could be something simple, but there's obviously a bug
> here. Perhaps I could bother you to formally file a bug at
> http://bugs.jruby.org, so we can track it off-list?
>
> - Charlie

That's the same error I saw, and I fixed it by using a Latin-1 input
case instead of a UTF-8 one.

Kaspar Schiess

Jul 15, 2011, 8:18:18 AM
to ruby-talk ML
> Running with the debugger on for 1.8.7 brings up this discrepancy:
>
> The "letters" array for 1.8.7 is this:
> ["d", "a", "m", "p", "f", "s", "c", "h", "i", "f", "f", "f", "a", "h", "r", "t", "s", "k", "a", "p", "i", "t", "\303", "\244", "n", "s", "m", "\303", "\274", "t", "z", "e", "n", "h", "a", "l", "t", "e", "r", "h", "e", "r", "s", "t", "e", "l", "l", "e", "r"]
>
> Now, "\303", "\244" is a UTF-8 encoding of umlauts-over-a (ä). In your 1.8 german
> hyphenation file, you encode the ä in itä with the latin-1 encoding \344.
>
> Your input text is UTF-8, but the library searches for the latin1 encoding. Changing
> the input to \344 for ä and \374 for ü made the test pass for me on 1.8.7.

I second that analysis. It seems that to use text-hyphen in Ruby 1.8
with languages other than English (any language that uses characters
outside ASCII), you have to make sure that your input is in the same
character encoding as the language file. In the case of German, this
is Latin-1. So opening and changing the test file in your text editor
has probably converted it to UTF-8, Austin.
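
A minimal sketch of matching the input to the language file's encoding, assuming Ruby 1.9 or later. The helper name match_pattern_encoding is hypothetical and not part of text-hyphen; the word is the one from the "letters" array quoted above. On Ruby 1.8 the same conversion would need something like Iconv.conv("ISO-8859-1", "UTF-8", text) instead.

```ruby
# -*- coding: utf-8 -*-
# Illustration only; requires Ruby 1.9+ for String#encode.

# Hypothetical helper: transcode the input so its bytes match the
# encoding of the hyphenation file (Latin-1 for the 1.8 German file).
# String#encode raises Encoding::UndefinedConversionError if the text
# contains characters the target encoding cannot represent.
def match_pattern_encoding(text, pattern_encoding = "ISO-8859-1")
  text.encode(pattern_encoding)
end

word = "dampfschifffahrtskapit\u00E4nsm\u00FCtzenhalterhersteller"
latin1_word = match_pattern_encoding(word)

puts "UTF-8 bytes:   #{word.bytesize}"
puts "Latin-1 bytes: #{latin1_word.bytesize}"
```

The two byte counts differ by exactly the number of umlauts, since each one takes two bytes in UTF-8 and one in Latin-1.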

Fixing the 1.8 version in the general case (any input, any language file
encoding) will be hard... and useless, since you would program towards a
use case that should go extinct.

More than one solution offers itself ;)

a) convert the file test_bugs.rb back to latin1 (-> bad, will break soon
again)

b) digging back through the old version history (I am sure you have it
;)) - trying to see if [1] was specifically about german umlauts or if
it was just the german and the size of the word that tripped the bug. If
it was one of the latter - then remove those damn umlauts from the word
(ä -> ae, ü -> ue) and use the new test expectations that derive from
that. This would make the file ASCII again, and less sensitive to editor
conversion.

c) The solution you say you don't want: Dropping 1.8 support from newer
gems. Since bundler & rvm this is increasingly simple to manage - I'll
just limit my old projects to use an old version of text-hyphen.

Considering the impossible (aka: very laborious and quite not to the
point) nature of the bug in 1.8, I would choose c) or (if must be) b).

best regards,
kaspar

[1]
http://rubyforge.org/tracker/index.php?func=detail&aid=9807&group_id=294&atid=1195


Austin Ziegler

Jul 15, 2011, 8:55:50 AM
to ruby-talk ML
On Fri, Jul 15, 2011 at 4:38 AM, Charles Oliver Nutter
<hea...@headius.com> wrote:
> Is this the error I should see for JRuby?
>
> https://gist.github.com/1084324
>
> If so...yes, it could be something simple, but there's obviously a bug
> here. Perhaps I could bother you to formally file a bug at
> http://bugs.jruby.org, so we can track it off-list?

Yes. But does jruby fake out mvm in this case? Because while Rake is
being run with 1.9, I'm not sure that the tests are:

~/projects/text-hyphen $ jruby --1.9 -S rake test
rake/rdoctask is deprecated. Use rdoc/task instead (in RDoc 2.4.2+)
Couldn't read /Users/headius/.rubyforge/user-config.yml. Run `rubyforge setup`.
/Users/headius/projects/jruby/bin/jruby -w -Ilib:bin:test:. -e
'require "rubygems"; require "test/unit"; require "test/test_bugs.rb";
require "test/test_text_hyphen.rb"' --

The tests claim to be running "jruby -w ..." and not "jruby --1.9 -w
...". It doesn't matter because of https://gist.github.com/1084614

I've filed JRUBY-5927 about this; if my interpretation of what's
happening with "jruby --1.9 -S rake test" is correct, I can file a
separate enhancement request about that (it's a problem, but not a bug
per se). I think Michael Edgar is correct about the other case.
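
One way to confirm which mode a spawned test process actually runs in is to have the test file report its own runtime. This is a sketch only; runtime_banner is a hypothetical helper, and the defined? guard is there because RUBY_ENGINE may not be defined on MRI 1.8.

```ruby
# Illustration only: report the interpreter and language version that
# the test process itself sees, regardless of how rake was started.
def runtime_banner
  engine = defined?(RUBY_ENGINE) ? RUBY_ENGINE : "ruby"
  "#{engine} #{RUBY_VERSION}"
end

# Printed to stderr so it does not mix with the test runner's output.
warn "Tests running under #{runtime_banner}"
```

Placed at the top of a test file, this would show a 1.8.x version if the child JRuby was started without --1.9.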

-a

Austin Ziegler

Jul 15, 2011, 8:57:25 AM
to ruby-talk ML
On Fri, Jul 15, 2011 at 1:46 AM, Michael Edgar <ad...@carboni.ca> wrote:
> On Jul 15, 2011, at 12:45 AM, Austin Ziegler wrote:
>> I've had folks asking me for a release of text-hyphen that works with
>> Ruby 1.9, and while I've got something that passes the tests that I've
>> created and added for MRI 1.9, it *loses* compatibility with Ruby
>> 1.8.7 (and does so loudly in the tests) and JRuby (in either 1.8 or
>> 1.9 mode, it appears). I need some help to get the last bits ready,
>> because I'm not ready to drop Ruby 1.8 entirely (at least one more
>> version).
> Running with the debugger on for 1.8.7 brings up this discrepancy:
>
> The "letters" array for 1.8.7 is this:
> ["d", "a", "m", "p", "f", "s", "c", "h", "i", "f", "f", "f", "a", "h", "r", "t", "s", "k", "a", "p", "i", "t", "\303", "\244", "n", "s", "m", "\303", "\274", "t", "z", "e", "n", "h", "a", "l", "t", "e", "r", "h", "e", "r", "s", "t", "e", "l", "l", "e", "r"]
>
> Now, "\303", "\244" is a UTF-8 encoding of umlauts-over-a (ä). In your 1.8 german
> hyphenation file, you encode the ä in itä with the latin-1 encoding \344.
>
> Your input text is UTF-8, but the library searches for the latin1 encoding. Changing
> the input to \344 for ä and \374 for ü made the test pass for me on 1.8.7.

I think you're right. Now to figure out how to fix it properly in this case.

-a

Austin Ziegler

Jul 15, 2011, 9:06:38 AM
to ruby-talk ML
On Fri, Jul 15, 2011 at 8:18 AM, Kaspar Schiess <eu...@space.ch> wrote:
>> Running with the debugger on for 1.8.7 brings up this discrepancy:
>>
>> The "letters" array for 1.8.7 is this:
>> ["d", "a", "m", "p", "f", "s", "c", "h", "i", "f", "f", "f", "a", "h",
>> "r", "t", "s", "k", "a", "p", "i", "t", "\303", "\244", "n", "s", "m",
>> "\303", "\274", "t", "z", "e", "n", "h", "a", "l", "t", "e", "r", "h",
>> "e", "r", "s", "t", "e", "l", "l", "e", "r"]
>>
>> Now, "\303", "\244" is a UTF-8 encoding of umlauts-over-a (ä). In your 1.8
>> german hyphenation file, you encode the ä in itä with the latin-1 encoding
>> \344.
>>
>> Your input text is UTF-8, but the library searches for the latin1
>> encoding. Changing the input to \344 for ä and \374 for ü made the test
>> pass for me on 1.8.7.
>
> I second that analysis. It seems to use text-hyphen in Ruby 1.8 with other
> languages than english (with any languages that use exotic characters not in
> ASCII), you will have to make sure that your input is in the same character
> encoding as the language file is. In the case of german, this is LATIN1. So
> opening and changing the file in your text editor has probably converted the
> file to utf8, Austin.
>
> Fixing the 1.8 version in the general case (any input, any language file
> encoding) will be hard... and useless, since you would program towards a use
> case that should go extinct.

I'm not so much looking for the general case, but this specific case,
since it's a bug about a word that you filed four years ago (yes, the
one you linked) ;)

Text::Hyphen under Ruby 1.8 has always said you need to match the
encoding of the input to the encoding of the hyphenation file (and
that'll still be true under Ruby 1.9, but at least there it'll be a
*consistent* UTF-8 encoding for all hyphenation files). I just forgot
that for this particular test.

> More than one solution offers itself ;)
>
> a) convert the file test_bugs.rb back to latin1 (-> bad, will break soon
> again)

Doing that would cause Ruby 1.9 to fail. If I'm willing to split the
test into 1.8 and 1.9 versions (and use load) for the specific failing
bug, then I can make this work for this release.
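
The split-and-load approach could look roughly like this. It is a sketch under stated assumptions: the file names are hypothetical (only the bug number 9807 comes from the tracker link in this thread), and the real test layout in the repository may differ.

```ruby
# Illustration only: hypothetical file names for the two encodings of
# the same test case.
LATIN1_CASE = "test/bug_9807_latin1.rb"
UTF8_CASE   = "test/bug_9807_utf8.rb"

# Pick the test case whose byte encoding matches the hyphenation file
# that the running Ruby version will use.
def bug_case_file(ruby_version = RUBY_VERSION)
  parts = ruby_version.split(".").map { |part| part.to_i }
  (parts <=> [1, 9]) < 0 ? LATIN1_CASE : UTF8_CASE
end

# A test file would then pull in the matching case with:
#   load bug_case_file
puts bug_case_file
```

Using load rather than require keeps the Latin-1 file from ever being parsed by Ruby 1.9, which would otherwise reject or misread its bytes.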

> b) digging back through the old version history (I am sure you have it ;)) -
> trying to see if [1] was specifically about german umlauts or if it was just
> the german and the size of the word that tripped the bug. If it was one of
> the latter - then remove those damn umlauts from the word (ä -> ae, ü -> ue)
> and use the new test expectations that derive from that. This would make the
> file ASCII again, and less sensible to editor conversion.

It was the umlauts, and (ahem) you filed the bug with the umlauts. ;)

> c) The solution you say you don't want: Dropping 1.8 support from newer
> gems. Since bundler & rvm this is increasingly simple to manage - I'll just
> limit my old projects to use an old version of text-hyphen.
>
> Considering the impossible (aka: very laborious and quite not to the point)
> nature of the bug in 1.8, I would choose c) or (if must be) b).

I'm trying to get out *one more* release of 1.8—this one—and then
Text::Hyphen (or its successor) will happily be 1.9 only. This is a
"final 1.8" release and then I'm going to bump the major version if I
keep the project name (which is a good one) and put "ruby >= 1.9.2" in
the gemspec. This is the transitional release only.
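
For reference, a "ruby >= 1.9.2" constraint is expressed in a gemspec through required_ruby_version. This is a minimal sketch, not the project's actual gemspec (which is generated through hoe); the version number and summary are placeholders.

```ruby
require "rubygems"

# Illustration only: the version and summary below are placeholders,
# not the project's real metadata.
def successor_gemspec
  Gem::Specification.new do |s|
    s.name    = "text-hyphen"
    s.version = "2.0.0" # hypothetical major version bump
    s.summary = "Placeholder summary"
    s.authors = ["Austin Ziegler"]
    s.required_ruby_version = ">= 1.9.2"
  end
end

requirement = successor_gemspec.required_ruby_version
puts requirement.satisfied_by?(Gem::Version.new("1.8.7"))
puts requirement.satisfied_by?(Gem::Version.new("1.9.3"))
```

With this in place, RubyGems refuses to install the new major version on 1.8, so older projects stay on the transitional release.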

> [1]
> http://rubyforge.org/tracker/index.php?func=detail&aid=9807&group_id=294&atid=1195

Austin Ziegler

Jul 15, 2011, 9:37:22 AM
to ruby-talk ML
On Fri, Jul 15, 2011 at 9:06 AM, Austin Ziegler <halos...@gmail.com> wrote:

Thanks everyone for the comments received. I've taken the approach
that I mentioned in my last message in response to Kaspar. You can see
the latest test code in the repository (it now has two data files, one
Latin-1 and one UTF-8). I'll be preparing a release this weekend.

Sadly, JRuby in 1.9 mode won't work because of an apparent bug in
JRuby itself, and "jruby --1.9 -S rake test" only looks like it works
because the test actually runs JRuby again in 1.8 mode. A bug has been
filed for the former case, but an improvement has not yet been filed
for the latter case.

-a

Marc Heiler

Jul 15, 2011, 4:11:33 PM
to ruby-talk ML
> It was the umlauts

Man ... Ruby 1.9.x hates umlauts.

*hugs his 1.8.7 install*

--
Posted via http://www.ruby-forum.com/.

Charles Oliver Nutter

Jul 15, 2011, 6:40:40 PM
to ruby-talk ML
On Fri, Jul 15, 2011 at 8:37 AM, Austin Ziegler <halos...@gmail.com> wrote:
> Sadly, JRuby in 1.9 mode won't work because of an apparent bug in
> JRuby itself, and "jruby --1.9 -S rake test" only looks like it works
> because the test actually runs JRuby again in 1.8 mode. A bug has been
> filed for the former case, but an improvement has not yet been filed
> for the latter case.

Ok, I see your bugs. We'll have a look into it.

FWIW, you can specify JRUBY_OPTS=--1.9 and it will pass through to the
child JRuby instances too. But I agree, we need a dotfile or similar
to force it.

- Charlie

Austin Ziegler

Jul 16, 2011, 10:08:50 AM
to ruby-talk ML

I think it's a little more subtle than that, as I noted in my last comment on the --1.9 improvement request. When JRuby starts with --1.9 (whether through an arg, an opt, or a dotfile), it should essentially do:

ENV["JRUBY_OPTS"]="--1.9"

Of course, it should be a bit smarter than that, preserving other values, but this way you get the same expected behaviour that you get when MRI spawns another instance of MRI based on RbConfig::CONFIG["ruby_install_name"].
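
A sketch of the "preserving other values" idea, written as plain Ruby rather than as anything JRuby actually does. The helper name with_jruby_mode is hypothetical.

```ruby
# Illustration only: merge a mode flag into an existing JRUBY_OPTS
# value without dropping options that are already set and without
# adding the flag twice.
def with_jruby_mode(existing_opts, mode_flag = "--1.9")
  opts = existing_opts.to_s.split
  opts << mode_flag unless opts.include?(mode_flag)
  opts.join(" ")
end

# A parent process could export the merged value before spawning
# children, so that they inherit the same mode:
#   ENV["JRUBY_OPTS"] = with_jruby_mode(ENV["JRUBY_OPTS"])
puts with_jruby_mode(ENV["JRUBY_OPTS"])
```

Because child processes inherit the environment, this gives spawned interpreters the same mode as the parent without every caller having to pass the flag explicitly.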

-a « from my iPad
