[webgen-users] the tags content processor doesn't work correctly with ruby 1.9 and UTF-8 encoding


Stefano Crocco

Dec 28, 2009, 7:36:20 AM
to webgen...@rubyforge.org
Hello to everyone

I'm using webgen with ruby 1.9 to build a web site in Italian. If I encode my
.page files as UTF-8 and set ruby default external encoding to UTF-8, the tags
content processor doesn't work (I can't reproduce it right now, but I think it
says "Unbalanced curly brackets for tag"). To make it work, I have to use the
iso-8859-15 encoding both for the page files and for ruby (so, instead of
simply calling webgen, I have to use the command ruby -E iso-8859-15 `which
webgen`).

I investigated a bit and found out that the reason for the error is that the
tags content processor uses StringScanner, and in particular StringScanner#pos,
to locate tags and their contents. The problem is that StringScanner#pos
returns the position in bytes, rather than in characters. The Tags class,
instead, uses the value returned by pos as if it were always a character
index. This works correctly with single-byte encodings like iso-8859, but
fails with multibyte encodings like UTF-8.
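
To make the byte/character mismatch concrete, here is a minimal sketch (not
webgen code; the variable names are only for illustration) that compares the
scanner position with the character index for the same spot in a UTF-8
string. Each "è" takes two bytes in UTF-8, so the two numbers drift apart by
one for every "è" that precedes the brace.

```ruby
# encoding: UTF-8
require 'strscan'

str = "abcdèè {relocatable: xyèz}"
sc = StringScanner.new(str)
sc.skip_until(/\{/)               # move the scanner just past the opening brace

byte_pos = sc.pos                 # position as reported by StringScanner (bytes)
char_pos = str.index("{") + 1     # the same position counted in characters

puts "bytes in string:      #{str.bytesize}"
puts "characters in string: #{str.length}"
puts "sc.pos (bytes):       #{byte_pos}"   # 10
puts "character index:      #{char_pos}"   # 8

# String#[] indexes by character, so using the byte position skips too far:
puts "str[byte_pos..-1]: #{str[byte_pos..-1]}"   # locatable: xyèz}
puts "str[char_pos..-1]: #{str[char_pos..-1]}"   # relocatable: xyèz}
```

With a pure ASCII string the two positions are identical, which is why the
problem does not show up with single-byte encodings.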

You can see what I mean by running the following code in ruby 1.9 (save it in
a file encoded in UTF-8 and run it with ruby -E UTF-8 to be sure to see the
issue). It runs a simplified version of the code used by Tags#replace_tags in
the :before_tag case and prints the various values obtained from the scanner.

#encoding: UTF-8
require 'strscan'

def test_match sc
  sc.skip_until $reg
  puts "data.backslashes: #{sc[1].length}"
  puts "data.tag: #{sc[2]}"
  puts "sc[3]: #{sc[3]}"
  puts "matched length: #{sc.matched.length}"
  puts "data.params_start_pos: #{sc.pos}"
  start_pos = sc.pos - sc.matched.length
  puts "data.start_pos: #{start_pos}"
  puts "data: #{sc.string[start_pos..-1]}"
end


str_a = "abcdee {relocatable: xyez}"
str_u = "abcdèè {relocatable: xyèz}"
$reg = /(\\*)\{(\w+)(::?)/
sc_a = StringScanner.new str_a
sc_u = StringScanner.new str_u
puts "ASCII STRING"
test_match sc_a
puts "---\nUTF-8 STRING"
test_match sc_u

As you can see, in the case of the ascii string, the data is the whole tag:
{relocatable: xyez}
while in the case of the UTF-8 string, the first two characters are missing
(they're two because the string contained two multibyte characters before the
opening brace).

Some time ago, I sent a message about this behaviour of StringScanner#pos (and
possible workarounds) on the ruby mailing list, but got no answer. Now, I
found out that if you replace the last line in the test_match method above
with this:

puts "data char: #{sc.string.bytes.to_a[start_pos..-1].pack('c*')}",

it outputs the expected string (the whole tag from { to }) both in the ascii
and UTF-8 cases.
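
For comparison, here is a sketch of two other ways to get the whole tag. Both
rely on methods that, as far as I know, only appeared in Ruby versions newer
than the one discussed in this message: String#byteslice (Ruby 1.9.3) and
StringScanner#charpos (Ruby 2.0). The helper names tag_from_bytes and
tag_from_chars are made up for this example and are not part of webgen.

```ruby
# encoding: UTF-8
require 'strscan'

REG = /(\\*)\{(\w+)(::?)/

# Stay in bytes throughout: byte position, byte length, byte slice.
def tag_from_bytes(str)
  sc = StringScanner.new(str)
  return nil unless sc.skip_until(REG)
  start_pos = sc.pos - sc.matched.bytesize
  str.byteslice(start_pos, str.bytesize - start_pos)
end

# Stay in characters throughout: character position, character length,
# and the usual character-based String#[].
def tag_from_chars(str)
  sc = StringScanner.new(str)
  return nil unless sc.skip_until(REG)
  start_pos = sc.charpos - sc.matched.length
  str[start_pos..-1]
end

["abcdee {relocatable: xyez}", "abcdèè {relocatable: xyèz}"].each do |s|
  puts "bytes: #{tag_from_bytes(s)}"
  puts "chars: #{tag_from_chars(s)}"
end
```

The point in both cases is the same: the position and the slicing operation
have to use the same unit, either bytes or characters, and never a mix.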

I don't know whether this bug was already known or not (I've looked at the bug
reports and searched the list archives finding nothing), and this is why I'm
sending this mail, instead of reporting the issue as a bug. Please, let me
know whether I should create a bug report or do something else.

Thanks in advance

Stefano
_______________________________________________
webgen-users mailing list
webgen...@rubyforge.org
http://rubyforge.org/mailman/listinfo/webgen-users

Thomas Leitner

Jan 24, 2010, 2:46:25 AM
to Stefano Crocco, webgen...@rubyforge.org
Hi,

> I'm using webgen with ruby 1.9 to build a web site in Italian. If I
> encode my .page files as UTF-8 and set ruby default external encoding
> to UTF-8, the tags content processor doesn't work (I can't reproduce
> it right now, but I think it says "Unbalanced curly brackets for
> tag"). To make it work, I have to use the iso-8859-15 encoding both
> for the page files and for ruby (so, instead of simply calling
> webgen, I have to use the command ruby -E iso-8859-15 `which webgen`).

There is a standard way to invoke a Ruby executable which resides in
PATH with specific command line switches: `ruby -E iso-8859-15 -S
webgen`.

> I investigated a bit and found out that the reason for the error is
> that the tags content processor uses StringScanner, and in particular
> StringScanner#pos to find out tags and their contents. The problem is
> that StringScanner#pos returns the position in bytes, rather than in
> characters. The Tags class, instead, uses the value returned by pos
> as if it always were a character index. This works correctly with
> single-byte encodings like iso-8859, but fails with multibyte
> encodings like UTF-8.
>

> SNIP...


>
> Some time ago, I sent a message about this behaviour of
> StringScanner#pos (and possible workarounds) on the ruby mailing
> list, but got no answer. Now, I found out that if you replace the
> last line in the test_match method above with this:
>
> puts "data char: #{sc.string.bytes.to_a[start_pos..-1].pack('c*')}",
>
> it outputs the expected string (the whole tag from { to }) both in
> the ascii and UTF-8 cases.

Thanks for this bug report, I was not aware of this problem! You may
want to resend the mail you sent to the ruby-talk mailing list to the
ruby-core ML or file a bug report at the Ruby bugtracker at
http://redmine.ruby-lang.org/.

I will fix this in webgen in the meantime!

Best regards,
Thomas
