[ruby-core:19465] [Bug #680] csv.rb: CSV.parse is too late when encoding is mismatch


Takeyuki Fujioka

Oct 23, 2008, 8:58:33 PM
to ruby...@ruby-lang.org
Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch
http://redmine.ruby-lang.org/issues/show/680

Author: Takeyuki Fujioka
Status: Open, Priority: Normal
Category: lib, Target version: 1.9.x

I think this result is correct, but the encoding-mismatch error is raised too late.

see:
% time ruby19 -rcsv -e 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000).force_encoding("shift_jis"))'
ruby19 -rcsv -e 0.30s user 0.02s system 96% cpu 0.330 total

% time ruby19 -rcsv -e 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000))'
/Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `=~': broken UTF-8 string (ArgumentError)
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `init_separators'
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1563:in `initialize'
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `new'
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `parse'
from -e:1:in `<main>'
ruby19 -rcsv -e 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000))' 1.55s user 2.57s system 90% cpu 4.530 total


----------------------------------------
http://redmine.ruby-lang.org

Michael Selig

Oct 23, 2008, 11:12:42 PM
to ruby...@ruby-lang.org
Hi,

This bug actually brings up an interesting issue - should the source
encoding default to something other than UTF-8 (ie: if it is not specified
in the "magic comment")?

Perhaps it should default to the encoding specified by the user's locale?
Or perhaps it should default to the value of "default_internal" if it is
set? Or even default_external?

I suggest that it should default to "default_internal" if that is set, and
then to the locale encoding if not.
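
In Ruby terms, the default I am suggesting would be roughly (just a sketch,
not a real API):

  Encoding.default_internal || <the locale encoding>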

What do others think?
Having it default to the locale in this case would probably avoid the
encoding mismatch entirely (and the resulting confusion).

Cheers
Mike

Martin Duerst

Oct 24, 2008, 2:52:17 AM
to ruby...@ruby-lang.org
A default for the source encoding has been discussed quite a long
time ago (in some Japanese meetings or on ruby-dev, I don't remember),
and the conclusion was that the source encoding has to be given
(with a magic comment) in the file itself (unless the file is all ascii).

The reason for this is that the source encoding is a property of the
source, and nothing else. On very simple scripts, it might occasionally
be slightly easier if it were the same as default_external or
default_internal, but this is only the case as long as you stay
in exactly the same environment, and don't move the script.
But scripts grow and move, so it's better to get the settings
right at the start.

However, as far as I remember, the idea was that for -e,
default_external should be used, because that's what one
is using in a shell. I'm not sure why this doesn't work below.
(assuming Takeyuki is working in a Shift_JIS environment,
which isn't completely certain).

Regards, Martin.


#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:due...@it.aoyama.ac.jp


Michael Selig

Oct 24, 2008, 3:48:04 AM
to ruby...@ruby-lang.org
Hi,

I am not sure I understand your argument about not defaulting the source
encoding.

The problem I am trying to solve is the compatibility of string literals
in your source and strings from other sources.

"default_internal" was introduced to try to make all strings the same
encoding to avoid incompatibilities. But at the moment string literals
seem to default to the source encoding or to UTF-8 if it is not set
(please correct me if I am wrong). What I was suggesting was a way to make
string literals be compatible.
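
For instance (a minimal sketch of the current behaviour as I understand it):

  # encoding: Shift_JIS
  p "\x82\xA0".encoding   # => #<Encoding:Shift_JIS> (the literal follows the source encoding)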

This normally isn't a problem if:
a) All string literals are 7 bit ASCII, or
b) The source encoding matches "default_internal"

If the source encoding of a program containing non-ascii string literals
is set different from default_internal, you are asking for trouble, and
would defeat the purpose of default_internal. Therefore to prevent the
programmer from having to remember to specify both, it makes sense to me
that the source encoding should default to default_internal. I think this
is important.

Now when default_internal is not set, we have some different issues. If
the source encoding defaults to the locale, this may cause different
behaviour and confusion if the script is run in different locales.
However, "default_external" already defaults to the locale, and with
"default_internal" not set, it means that strings read from files are
going to be in the locale's encoding anyhow. So that possibly means
encoding compatibility issues between data read from file and string
literals. Either way, there are possible problems. Possibly the better
solution when default_internal is not set is to default the source
encoding to "default_external" (which is in turn defaulted to the locale).

(By the way, I am not talking about libraries here. As I have stressed
previously, libraries should be carefully written to either use ASCII
string literals only, or to make sure that it transcodes them properly.)

The only other (and perhaps better) solution I can think of is to separate
the notion of "source encoding" and "string literal encoding". Then you
can have the source encoding set to anything, but always force non-ascii
string literals to the "string literal encoding", which can default to
Encoding.default_internal || Encoding.default_external. But perhaps this
is going over-board with too many different encoding settings.

Finally, are you suggesting that "-e" should perform differently to a
single-line ruby script? That seems non-intuitive to me.

Cheers,
Mike

On Fri, 24 Oct 2008 17:52:17 +1100, Martin Duerst <due...@it.aoyama.ac.jp>
wrote:

Yukihiro Matsumoto

Oct 24, 2008, 4:04:24 AM
to ruby...@ruby-lang.org
Hi,

In message "Re: [ruby-core:19473] Re: Default source encoding (Was: [Bug #680] csv.rb: CSV.parse is too late when encoding is mismatch)"


on Fri, 24 Oct 2008 16:48:04 +0900, "Michael Selig" <michae...@fs.com.au> writes:

|The problem I am trying to solve is the compatibility of string literals
|in your source and strings from other sources.
|
|"default_internal" was introduced to try to make all strings the same
|encoding to avoid incompatibilities. But at the moment string literals
|seem to default to the source encoding or to UTF-8 if it is not set
|(please correct me if I am wrong). What I was suggesting was a way to make
|string literals be compatible.

You are correct here.

|This normally isn't a problem if:
|a) All string literals are 7 bit ASCII, or
|b) The source encoding matches "default_internal"
|
|If the source encoding of a program containing non-ascii string literals
|is set different from default_internal, you are asking for trouble, and
|would defeat the purpose of default_internal. Therefore to prevent the
|programmer from having to remember to specify both, it makes sense to me
|that the source encoding should default to default_internal. I think this
|is important.

The point is that when we have source code written in a source
encoding, the literals are naturally encoded in that encoding. So do we
need to convert string literals into the default encoding? But
conversion can bring us more trouble, since it tends to change the
meaning; for example, what does /[<a>-<b>]/ mean, where <a> and <b> are
multibyte characters whose corresponding codepoints (and sorting
order) differ in the converted encoding?

|(By the way, I am not talking about libraries here. As I have stressed
|previously, libraries should be carefully written to either use ASCII
|string literals only, or to make sure that it transcodes them properly.)

That makes me feel much better, since we can limit the issue to
scripts only.

|Finally, are you suggesting that "-e" should perform differently to a
|single-line ruby script? That seems non-intuitive to me.

-e takes programs from command line shell, which probably yields
strings in locale encoding anyway. But we cannot assume that for
scripts contained in files.

matz.

Takeyuki Fujioka

Oct 24, 2008, 6:23:36 AM
to ruby...@ruby-lang.org
Issue #680 has been updated by Takeyuki Fujioka.

File sample.csv added

Please save the attached file as 'sample.csv'.
The first line of this file includes a Japanese UTF-8 string.
The other lines are us-ascii. The line count is 5001.

% time ruby19 -Eutf-8 -rcsv -e 'CSV.parse(open("sample.csv","r").read)'
ruby19 -Eutf-8 -rcsv -e 'CSV.parse(open("sample.csv","r").read)' 0.23s user 0.01s system 96% cpu 0.254 total

This is OK, very fast.
But:

% time ruby19 -Eeuc-jp -rcsv -e 'CSV.parse(open("sample.csv","r").read)'
/Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `=~': broken EUC-JP string (ArgumentError)


from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `init_separators'
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1563:in `initialize'
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `new'
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `parse'
from -e:1:in `<main>'

ruby19 -Eeuc-jp -rcsv -e 'CSV.parse(open("sample.csv","r").read)' 3.93s user 6.38s system 98% cpu 10.457 total

This result is very slow.
I hope it raises as soon as the encoding mismatch is found.
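
For example, something like this early check on the string passed to
CSV.parse (here called data) would do it - just a sketch, not necessarily
the real fix:

  unless data.valid_encoding?
    raise ArgumentError, "invalid byte sequence in #{data.encoding}"
  end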

# Sorry, I don't understand M17N's default_external and default_internal behavior.
# I can't reply about M17N's problem.
----------------------------------------
http://redmine.ruby-lang.org/issues/show/680

----------------------------------------
http://redmine.ruby-lang.org

James Gray

Oct 24, 2008, 9:07:13 AM
to ruby...@ruby-lang.org
On Oct 24, 2008, at 1:52 AM, Martin Duerst wrote:

> A default for the source encoding has been discussed quite a long
> time ago (in some Japanese meetings or on ruby-dev, I don't remember),
> and the conclusion was that the source encoding has to be given
> (with a magic comment) in the file itself (unless the file is all
> ascii).
>
> The reason for this is that the source encoding is a property of the
> source, and nothing else. On very simple scripts, it might
> occasionally
> be slightly easier if it were the same as default_external or
> default_internal, but this is only the case as long as you stay
> in exactly the same environment, and don't move the script.
> But scripts grow and move, so it's better to get the settings
> right at the start.

I think I understand what you are saying here, which is that it's
better to just have Ruby blow up and force authors to set the encoding
of their code via the magic comment. Maybe you are right.

However, I can give a very real counter example to your scripts moving
environment example. I work on TextMate and we use Ruby all over the
place inside that application. I'm sure we have hundreds of scripts
in there. We try hard to make sure everything in TextMate is UTF-8,
so now we get errors out of Ruby 1.9. To fix, we need to add hundreds
of magic comments and worse, train our users who often write their own
automations in Ruby why they have to do the same to make their code
work. So in our case, a default to UTF-8 or the local encoding would
be a huge win.

James Edward Gray II

James Gray

Oct 24, 2008, 10:00:27 AM
to Ruby Core
On Oct 24, 2008, at 8:06 AM, James Gray wrote:

> I work on TextMate and we use Ruby all over the place inside that
> application. I'm sure we have hundreds of scripts in there. We try
> hard to make sure everything in TextMate is UTF-8, so now we get
> errors out of Ruby 1.9. To fix, we need to add hundreds of magic
> comments and worse, train our users who often write their own
> automations in Ruby why they have to do the same to make their code
> work.

The real issue here is that you can argue the user doesn't even know
the proper encoding these scripts should be using. Only TextMate
really knows the encoding it's going to hand-off the data in.

Given that, would it be possible to support an environment variable
like RUBY_SCRIPT_ENCODING for these special cases? TextMate could set
that before the hand-off, since it knows the environment it's running
the scripts under best. Obviously, this would only affect the default
script encoding and a magic comment would still override this value.

James Edward Gray II


Nobuyoshi Nakada

Oct 24, 2008, 12:01:13 PM
to ruby...@ruby-lang.org
Hi,

At Fri, 24 Oct 2008 23:00:27 +0900,
James Gray wrote in [ruby-core:19481]:


> > I work on TextMate and we use Ruby all over the place inside that
> > application. I'm sure we have hundreds of scripts in there. We try
> > hard to make sure everything in TextMate is UTF-8, so now we get
> > errors out of Ruby 1.9. To fix, we need to add hundreds of magic
> > comments and worse, train our users who often write their own
> > automations in Ruby why they have to do the same to make their code
> > work.
>
> The real issue here is that you can argue the user doesn't even know
> the proper encoding these scripts should be using. Only TextMate
> really knows the encoding it's going to hand-off the data in.

Though I don't know about TextMate at all, ruby-mode.el in 1.9
deals with magic comments automatically.

--
Nobu Nakada

James Gray

Oct 24, 2008, 12:17:12 PM
to ruby...@ruby-lang.org

But that's not for Ruby scripts Emacs itself runs, right? You are
just talking about it helping you write Ruby scripts.

TextMate uses Ruby scripts internally. Hundreds of them. Users write
them, but TextMate runs them to do its work. That's the issue.

James Edward Gray II

James Gray

Oct 24, 2008, 8:57:01 PM
to ruby...@ruby-lang.org
Issue #680 has been updated by James Gray.

Status changed from Open to Closed
% Done changed from 0 to 100

Applied in changeset r19931.

James Gray

Oct 24, 2008, 8:58:33 PM
to ruby...@ruby-lang.org
Issue #680 has been updated by James Gray.

Assigned to set to James Gray

Thanks for finding the bug in my logic. It should be much faster now:

$ time ruby_dev -Eeuc-jp -rlib/csv -e 'CSV.parse(open("/Users/james/Desktop/sample.csv","r").read)'
/Users/james/Documents/ruby_source/trunk/lib/csv.rb:1981:in `=~': broken EUC-JP string (ArgumentError)
from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1981:in `init_separators'
from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1563:in `initialize'
from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1350:in `new'
from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1350:in `parse'


from -e:1:in `<main>'

real 0m0.053s
user 0m0.039s
sys 0m0.011s

Michael Selig

Oct 25, 2008, 10:25:58 PM
to ruby...@ruby-lang.org
Hi,

Sorry, perhaps I have been giving a (bad) solution, rather than stating
the problem clearly, so let me try again!
I certainly didn't mean to suggest there should be any transcoding of
string literals by Ruby's parser.

So here are the problems as I see them. They are all to do with the
default encoding of string literals, and they are all fairly minor, but I
think addressing them has merit:

1) The encoding of string literals constructed with "\x..." is ambiguous.
Well not strictly ambiguous, but certainly it can be confusing. The
trouble is that a string literal like the example in bug #680
"\x82\xA0,\x82\xA2" can either be used as a "binary" string (ASCII-8BIT)
or an encoded character string (intended to be Shift_JIS in this case),
but this depends on the source encoding. While technically these are the
same data, they are used in quite different ways in practice. Also, as we
see in the bug report, it can cause mysterious errors such as "Bad UTF-8
string" because the source encoding was apparently UTF-8 not Shift_JIS
(thank you to Martin for pointing this out).

Ruby treats strings constructed with "\u..." differently: they are set to
UTF-8 no matter what the source encoding. I think this is the correct
behaviour - there is no ambiguity. But "\x..." is not treated like this.
When the source encoding is not specified (or is US-ASCII), a "\x.."
string is set to ASCII-8BIT. Again I think this is the correct behaviour.
However if the source encoding is set to anything else, the encoding of
the string is set to the source encoding. I think this is the part that is
wrong, especially as the resultant string can be "broken", and no warning
is given about this by the parser.

My preference would be to *always* encode string literals constructed with
"\x.." as ASCII-8BIT, ignoring the source encoding. This means that if you
really want to use such a literal as an encoded string, you must use
"force_encoding". I think this would be much clearer and get rid of the
"ambiguity".

2) I find it slightly redundant to have to specify BOTH the
default_internal and the source encoding at the top of an m17n script
which contains multibyte string literals, when in all practical cases they
should be the same. eg:

#! /usr/bin/ruby -E:UTF-8
# encoding: UTF-8

My suggestion for "defaulting" the source encoding was an attempt to avoid
having to do this (but probably not a good way!). It isn't a big deal, and
I understand the argument that the source encoding is a property of the
script. My original suggestion (last month) of a special magic comment was
to have a way of specifying BOTH the default_internal and source encoding
once, but this idea was rejected.

3) I think there should be some check (warning message?) that the (non
ASCII-8BIT) string literals in a library file are compatible with the
"default_internal" of the calling program (if it is set). Ideally this
check would be done when the "require" is called to flag possible
incompatibilities early.
Perhaps this check could be based on the library's source encoding? If
this were done, most libraries would have to use a source encoding of
US-ASCII (or just have no encoding magic comment) *not* UTF-8, so that
non-Unicode default_internal's will work. Perhaps Ruby could be smarter,
and only flag an error if there actually is an incompatible string literal
in the library?

4) I was surprised at the different source encoding behaviour when using
"-e" compared to a script in a file. (Again thank you to Martin for
telling me about it)
Matz wrote:

> -e takes programs from command line shell, which probably yields
> strings in locale encoding anyway. But we cannot assume that for
> scripts contained in files.

Again I understand the sentiment, but for a simple non-m17n, non-ascii
ruby script that was likely written with an editor on the same machine or
in the same locale, why force it to have an "encoding" magic comment?

Also it means that:
ruby test.rb
may perform differently than:
ruby -e "`cat test.rb`"

Again potentially confusing, but not a big deal.

I hope I have made myself clearer this time!

Thanks,
Mike.

Nobuyoshi Nakada

Oct 26, 2008, 2:26:32 AM
to ruby...@ruby-lang.org
Hi,

At Sun, 26 Oct 2008 11:25:58 +0900,
Michael Selig wrote in [ruby-core:19515]:
> 1)


> My preference would be to *always* encode string literals constructed with
> "\x.." as ASCII-8BIT, ignoring the source encoding. This means that if you
> really want to use such a literal as an encoded string, you must use
> "force_encoding". I think this would be much clearer and get rid of the
> "ambiguity".

> 2)


> My suggestion for "defaulting" the source encoding was an attempt to avoid
> having to do this (but probably not a good way!). It isn't a big deal, and
> I understand the argument that the source encoding is a property of the
> script. My original suggestion (last month) of a special magic comment was
> to have a way of specifying BOTH the default_internal and source encoding
> once, but this idea was rejected.

I'd prefer to default the internal encoding to the source
encoding of the main script.

> 3)


> Perhaps this check could be based on the library's source encoding? If
> this were done, most libraries would have to use a source encoding of
> US-ASCII (or just have no encoding magic comment) *not* UTF-8, so that
> non-Unicode default_internal's will work. Perhaps Ruby could be smarter,
> and only flag an error if there actually is an incompatible string literal
> in the library?

What about comments? I suspect it might not be a good idea.

> 4)


> Also it means that:
> ruby test.rb
> may perform differently than:
> ruby -e "`cat test.rb`"

magic comments are effective with -e too.

$ ruby19 -e 'p __ENCODING__'
#<Encoding:EUC-JP>

$ ruby19 -e '#-*- encoding:utf-8 -*-' -e 'p __ENCODING__'
#<Encoding:UTF-8>

Therefore no differences if the file has the magic comment.

--
Nobu Nakada

Michael Selig

Oct 26, 2008, 4:20:17 AM
to ruby...@ruby-lang.org
On Sun, 26 Oct 2008 17:26:32 +1100, Nobuyoshi Nakada <no...@ruby-lang.org>
wrote:


> I'd prefer to default the internal encoding to the source
> encoding of the main script.

But then how do you tell Ruby NOT to set "default_internal"?
It also means that comments must be in the default_internal encoding (see
your comment below).

>> 3)
>> Perhaps this check could be based on the library's source encoding? If
>> this were done, most libraries would have to use a source encoding of
>> US-ASCII (or just have no encoding magic comment) *not* UTF-8, so that
>> non-Unicode default_internal's will work. Perhaps Ruby could be smarter,
>> and only flag an error if there actually is an incompatible string
>> literal
>> in the library?
>
> What about comments? I suspect it might not be a good idea.

That is why I suggested that, if possible, it should check for *actual*
string literals in the library, rather than the source encoding. This may
be hard to implement though.

>
>> 4)
>> Also it means that:
>> ruby test.rb
>> may perform differently than:
>> ruby -e "`cat test.rb`"
>
> magic comments are effective with -e too.
>
> $ ruby19 -e 'p __ENCODING__'
> #<Encoding:EUC-JP>
>
> $ ruby19 -e '#-*- encoding:utf-8 -*-' -e 'p __ENCODING__'
> #<Encoding:UTF-8>
>
> Therefore no differences if the file has the magic comment.

That's true, but my point was "why should a simple non-m17n non-ascii ruby
program have to contain the magic comment"?

Thanks,
Mike

Nobuyoshi Nakada

Oct 26, 2008, 8:34:26 AM
to ruby...@ruby-lang.org
Hi,

At Sun, 26 Oct 2008 17:20:17 +0900,
Michael Selig wrote in [ruby-core:19518]:


> > I'd prefer to default the internal encoding to the source
> > encoding of the main script.
>
> But then how do you tell Ruby NOT to set "default_internal"?

I think defaulting the internal encoding to something else is
bad.

> It also means that comments must be in the default_internal encoding (see
> your comment below).

I don't follow you here; all comments should be written in the
source encoding. Why would default_internal affect them?

> > Therefore no differences if the file has the magic comment.
>
> That's true, but my point was "why should a simple non-m17n non-ascii ruby
> program have to contain the magic comment"?

Because it's non-ascii. That's definitely enough reason.

--
Nobu Nakada

Michael Selig

Oct 26, 2008, 6:28:42 PM
to ruby...@ruby-lang.org
On Sun, 26 Oct 2008 23:34:26 +1100, Nobuyoshi Nakada <no...@ruby-lang.org>
wrote:

>> > I'd prefer to default the internal encoding to the source


>> > encoding of the main script.
>>
>> But then how do you tell Ruby NOT to set "default_internal"?
>
> I think defaulting the internal encoding to something else is
> bad.

Yes you are right, and I was not suggesting doing that.
But Matz wants to default default_internal to nil. With your proposal, how
do you do that and still set the source encoding?
My original suggestion was to use an extended "magic comment" to set both.
I still think that is the best way!


>> It also means that comments must be in the default_internal encoding
>> (see
>> your comment below).
>
> I don't follow you here; all comments should be written in the
> source encoding. Why would default_internal affect them?

I thought one of your points was that you would like to be able to write
Japanese (or other non-ascii) comments in a script which is otherwise
ascii-only (and which may use "\u" in literals and want default_internal
to be UTF-8). This means that the source encoding would have to be
Japanese. Your suggestion of defaulting default_internal to the source
encoding means that it will be set to Japanese. I am not sure that this is
always desirable. (This is very minor - you can always override it.)

>> > Therefore no differences if the file has the magic comment.
>>
>> That's true, but my point was "why should a simple non-m17n non-ascii
>> ruby
>> program have to contain the magic comment"?
>
> Because it's non-ascii. That's definitely enough reason.

Isn't backward compatibility with 1.8 scripts more important?
You are now forcing anyone with 1.8 scripts containing non-ascii string
literals to put in a magic comment, otherwise you get an "invalid multibyte
char (US-ASCII)" error in 1.9.

Cheers
Mike

Michael Selig

Oct 26, 2008, 7:48:39 PM
to ruby...@ruby-lang.org
On Sat, 25 Oct 2008 00:07:13 +1100, James Gray <ja...@grayproductions.net>
wrote:

> However, I can give a very real counter example to your scripts moving
> environment example. I work on TextMate and we use Ruby all over the
> place inside that application. I'm sure we have hundreds of scripts in
> there. We try hard to make sure everything in TextMate is UTF-8, so now
> we get errors out of Ruby 1.9. To fix, we need to add hundreds of magic
> comments and worse, train our users who often write their own
> automations in Ruby why they have to do the same to make their code
> work. So in our case, a default to UTF-8 or the local encoding would be
> a huge win.

I too have been concerned about 1.8 script compatibility regarding
encodings.
For my own interest, can you explain a bit more about what goes wrong with
these scripts in 1.9? I know absolutely nothing about TextMate.
Do they contain multibyte string literals?

Cheers
Mike

James Gray

Oct 26, 2008, 11:24:38 PM
to ruby...@ruby-lang.org

They sure could, yeah. Our policy for TextMate development has always
been that UTF-8 is king. We use it heavily and I'm sure some scripts
do contain multibyte characters in UTF-8.

James Edward Gray II

Nobuyoshi Nakada

Oct 27, 2008, 1:07:54 AM
to ruby...@ruby-lang.org
Hi,

At Mon, 27 Oct 2008 07:28:42 +0900,
Michael Selig wrote in [ruby-core:19525]:


> Yes you are right, and I was not suggesting doing that.
> But Matz wants to default default_internal to nil. With your proposal, how
> do you do that and still set the source encoding?

I don't like the idea of setting default_internal from the source
encoding; by "prefer" I meant "it feels less bad".

> My original suggestion was to use an extended "magic comment" to set both.

But it can't keep the source encoding unset, and
"internal_encoding" has no effect for Emacs.

> Isn't backward compatibility with 1.8 scripts more important?
> You are now forcing anyone with 1.8 scripts containing non-ascii string
> literals to put in a magic comment, otherwise you get an "invalid multibyte
> char (US-ASCII)" error in 1.9.

In other words, what you want is -K option?

--
Nobu Nakada

Michael Selig

Oct 27, 2008, 1:48:41 AM
to ruby...@ruby-lang.org
On Mon, 27 Oct 2008 16:07:54 +1100, Nobuyoshi Nakada <no...@ruby-lang.org>
wrote:

> But it can't keep the source encoding unset, and


> "internal_encoding" has no effect for Emacs.

OK, I don't use Emacs, and no one told me that before, thanks! I assumed
it would work, but I admit I didn't test it.
Then is there another form of magic comment that can be used - eg:
"internal encoding: XXXX" or "encoding: XXXX internal" that does work with
Emacs?

I am not sure why you would want to keep the source encoding unset when
setting default_internal at the top of a script. Perhaps you could explain.

>> Isn't backward compatibility with 1.8 scripts more important?
>> You are now forcing anyone with 1.8 scripts containing non-ascii string
>> literals to put in a magic comment, otherwise you get an "invalid multibyte
>> char (US-ASCII)" error in 1.9.
>
> In other words, what you want is -K option?

What I am saying is that we need to consider backward compatibility of
Ruby scripts. James Gray brought up an example with his "TextMate scripts"
which contain UTF-8 multibyte string literals, which used to work with
1.8, but do not in 1.9, because they need either a "magic comment" or, as
you say "-KU". Either way, 1.9 is not truly backward compatible when it
comes to simple, non-m17n, non-ascii scripts, because you have to either
modify the script or add a flag to the ruby options. There must be lots of
Japanese ruby scripts which will have a similar issue.

Defaulting source encoding to locale encoding (like -e does) should fix
this (as long as the end-user's locale is correct), right?

I guess if necessary James can put "-KU" in the RUBYOPT environment
variable to save having to add multiple magic comments, but I feel this
shouldn't be necessary.

Cheers
Mike

Nobuyoshi Nakada

Oct 27, 2008, 2:27:57 AM
to ruby...@ruby-lang.org
Hi,

At Mon, 27 Oct 2008 14:48:41 +0900,
Michael Selig wrote in [ruby-core:19532]:


> OK, I don't use Emacs, and no one told me that before, thanks! I assumed
> it would work, but I admit I didn't test it.
> Then is there another form of magic comment that can be used - eg:
> "internal encoding: XXXX" or "encoding: XXXX internal" that does work with
> Emacs?

No. Magic comments without -*- markers are for VIM, like
# vim: set encoding=UTF-8
and neither VIM nor Emacs would work with your examples.

> What I am saying is that we need to consider backward compatibility of
> Ruby scripts. James Gray brought up an example with his "TextMate scripts"
> which contain UTF-8 multibyte string literals, which used to work with
> 1.8, but do not in 1.9, because they need either a "magic comment" or, as
> you say "-KU". Either way, 1.9 is not truly backward compatible when it
> comes to simple, non-m17n, non-ascii scripts, because you have to either
> modify the script or add a flag to the ruby options. There must be lots of
> Japanese ruby scripts which will have a similar issue.

Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS
sources, so they have had -K in the shebang lines already.

> Defaulting source encoding to locale encoding (like -e does) should fix
> this (as long as the end-user's locale is correct), right?

Yes if they match.

> I guess if necessary James can put "-KU" in the RUBYOPT environment
> variable to save having to add multiple magic comments, but I feel this
> shouldn't be necessary.

-U option would be better.

--
Nobu Nakada

Michael Selig

Oct 27, 2008, 2:57:03 AM
to ruby...@ruby-lang.org
On Mon, 27 Oct 2008 17:27:57 +1100, Nobuyoshi Nakada <no...@ruby-lang.org>
wrote:

> Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS


> sources, so they have had -K in the shebang lines already.

Why then can I write a ruby 1.8 script which does a "puts" of a Shift_JIS
string (no shebang or magic comment), and have it run fine without -Ks?
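
(For reference, t1.rb is a one-liner along the lines of

  puts "Shift_JIS string: あ,い"

with the two characters stored as raw Shift_JIS bytes 82 A0 and 82 A2 -
reconstructed from the od output below.)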

ruby1.8 t1.rb | od -c
0000000 S h i f t _ J I S s t r i n g
0000020 : 202 240 , 202 242 \n
0000030

ruby1.8 -Ks t1.rb | od -c
0000000 S h i f t _ J I S s t r i n g
0000020 : 202 240 , 202 242 \n
0000030

But on 1.9 it only works with -Ks:

ruby -v
ruby 1.9.0 (2008-10-27 revision 19961) [i686-linux]

ruby t1.rb
t1.rb:2: invalid multibyte char (US-ASCII)
t1.rb:2: invalid multibyte char (US-ASCII)

ruby -Ks t1.rb
0000000 S h i f t _ J I S s t r i n g
0000020 : 202 240 , 202 242 \n
0000030

>
>> Defaulting source encoding to locale encoding (like -e does) should fix
>> this (as long as the end-user's locale is correct), right?
>
> Yes if they match.
>
>> I guess if necessary James can put "-KU" in the RUBYOPT environment
>> variable to save having to add multiple magic comments, but I feel this
>> shouldn't be necessary.
>
> -U option would be better.

I don't think that will work:

t2.rb is a single line script which does a puts of a short UTF-8 multibyte
string.
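
(Reconstructed from the od output below, t2.rb is roughly

  puts "Unicode string: abصع"

with the two characters stored as raw UTF-8 bytes D8 B5 and D8 B9.)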

ruby t2.rb
t2.rb:2: invalid multibyte char (US-ASCII)
t2.rb:2: invalid multibyte char (US-ASCII)

ruby -U t2.rb
ruby: "\xD8" on US-ASCII (Encoding::InvalidByteSequenceError)

ruby -KU t2.rb | od -c
0000000 U n i c o d e s t r i n g :
0000020 a b 330 265 330 271 \n
0000030

ruby1.8 t2.rb | od -c
0000000 U n i c o d e s t r i n g :
0000020 a b 330 265 330 271 \n
0000030


Cheers
Mike

Nobuyoshi Nakada

Oct 27, 2008, 5:55:32 AM
to ruby...@ruby-lang.org
Hi,

At Mon, 27 Oct 2008 15:57:03 +0900,
Michael Selig wrote in [ruby-core:19535]:


> > Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS
> > sources, so they have had -K in the shebang lines already.
>
> Why then can I write a ruby 1.8 script which does a "puts" of a Shift_JIS
> string (no shebang or magic comment), and have it run fine without -Ks?

Because you are avoiding troublesome chars. Without such
chars, we can't write the words "display", "table", "software"
and "ruby".

> >> I guess if necessary James can put "-KU" in the RUBYOPT environment
> >> variable to save having to add multiple magic comments, but I feel this
> >> shouldn't be necessary.
> >
> > -U option would be better.
>
> I don't think that will work:
>
> t2.rb is a single line script which does a puts of a short UTF-8 multibyte
> string.

Indeed. -U sets only the internal encoding, whereas -Ku also sets the
external and source encodings. Therefore -U isn't a direct
replacement for -Ku.

But it's very ambiguous and dangerous to infer encodings implicitly. We
can't trust the locale for this purpose, at least.

You can use a BOM to indicate that the source is written in UTF-8.

--
Nobu Nakada

Michael Selig

Oct 27, 2008, 6:17:45 AM
to ruby...@ruby-lang.org
On Mon, 27 Oct 2008 20:55:32 +1100, Nobuyoshi Nakada <no...@ruby-lang.org>
wrote:

> Hi,


>
> At Mon, 27 Oct 2008 15:57:03 +0900,
> Michael Selig wrote in [ruby-core:19535]:
>> > Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS
>> > sources, so they have had -K in the shebang lines already.
>>
>> Why then can I write a ruby 1.8 script which does a "puts" of a
>> Shift_JIS
>> string (no shebang or magic comment), and have it run fine without -Ks?
>
> Because you are avoiding troublesome chars. Without such
> chars, we can't write the words "display", "table", "software"
> and "ruby".

OK, I'm sure you know more about Japanese encodings than I do.
But my original point is that 1.8 scripts exist which contain multibyte
characters (eg UTF-8) which work fine under 1.8 without -K, but will fail
under 1.9 unless a magic comment or -K is provided.

> But it's very ambiguous and dangerous to infer encodings implicitly. We
> can't trust the locale for this purpose, at least.

It's a trade-off between that and backward compatibility. I think the
"danger" is not high and it gives backward compatibility, so my vote would
be to use it.

> You can use BOM to mean that the source is written in UTF-8.

BOM? Byte order marker?
How does that help with backward compatibility? Doesn't it still mean
modifying the 1.8 script to work under 1.9?

Cheers
Mike

Martin Duerst

Oct 27, 2008, 6:37:58 AM
to ruby...@ruby-lang.org
At 07:28 08/10/27, Michael Selig wrote:

>I thought one of your points was that you would like to be able to write
>Japanese (or other non-ascii) comments in a script which is otherwise
>ascii-only (and which may use "\u" in literals and want default_internal
>to be UTF-8). This means that the source encoding would have to be
>Japanese. Your suggestion of defaulting default_internal to the source
>encoding means that it will be set to Japanese. I am not sure that this is
>always desirable. (This is very minor - you can always override it.)

I'm not sure what you mean by "Japanese". It's no problem at all
to use UTF-8 to write Japanese. And I guess if somebody uses
\u literals and wants default_internal to be UTF-8, they'll
in most cases use UTF-8 for the source encoding (comments or
whatever else).

If you mean Japanese legacy encodings (such as Shift_JIS and
EUC-JP), then you are correct, but it would be very rare
for somebody to use Shift_JIS or EUC-JP for comments when
the program is otherwise supposed to run all-UTF-8.


>Isn't backward compatibility with 1.8 scripts more important?
>You are now forcing anyone with 1.8 scripts containing non-ascii string
>literals to put in a magic comment, otherwise you get an "invalid multibyte
>char (US-ASCII)" error in 1.9.

Well, yes, that's actually the point of it. Wherever necessary,
get everybody to declare their encoding. It may be somewhat suboptimal
in the transition phase, but after that, we know what we're dealing
with.

Regards, Martin.

Martin Duerst

Oct 27, 2008, 6:38:00 AM
to ruby...@ruby-lang.org
At 14:48 08/10/27, Michael Selig wrote:

>I am not sure why you would want to keep the source encoding unset when
>setting default_internal at the top of a script. Perhaps you could explain.

The simplest case is a script in US-ASCII only, but where you want
the data to be handled e.g. in UTF-8.

Martin Duerst

Oct 27, 2008, 6:39:12 AM
to ruby...@ruby-lang.org
At 12:24 08/10/27, James Gray wrote:

>They sure could, yeah. Our policy for TextMate development has always
>been that UTF-8 is king. We use it heavily and I'm sure some scripts
>do contain multibyte characters in UTF-8.

Wouldn't it be only these scripts (including those that contain
\x escapes for UTF-8) that need the encoding indication at the top?
(please note that literals with \u escapes are automatically UTF-8).
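
For example (a quick check):

  $ ruby19 -e 'p "\u3042".encoding'
  #<Encoding:UTF-8>

regardless of the source encoding.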

Regards, Martin.

Martin Duerst

Oct 27, 2008, 6:58:52 AM
to ruby...@ruby-lang.org
At 19:17 08/10/27, Michael Selig wrote:
>On Mon, 27 Oct 2008 20:55:32 +1100, Nobuyoshi Nakada <no...@ruby-lang.org>
>wrote:
>
>> Hi,
>>
>> At Mon, 27 Oct 2008 15:57:03 +0900,
>> Michael Selig wrote in [ruby-core:19535]:
>>> > Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS
>>> > sources, so they have had -K in the shebang lines already.
>>>
>>> Why then can I write a ruby 1.8 script which does a "puts" of a
>>> Shift_JIS
>>> string (no shebang or magic comment), and have it run fine without -Ks?
>>
>> Because you are avoiding troublesome chars. Without such
>> chars, we can't write the words "display", "table", "software"
>> and "ruby".
>
>OK, I'm sure you know more about Japanese encodings than I do.

To give you the details: in Shift_JIS, these characters are encoded
with two bytes, the second of which can be the same byte as e.g. a
backslash (for example, 表 is the bytes 0x95 0x5C, and 0x5C is the
backslash byte).

>But my original point is that 1.8 scripts exist which contain multibyte
>characters (eg UTF-8) which work fine under 1.8 without -K, but will fail
>under 1.9 unless a magic comment or -K is provided.

Yes, that's because 1.8 is essentially garbage-in, garbage-out.
If you are careful about certain bytes, you can essentially have
arbitrary byte sequences in your script, and Ruby 1.8 won't complain.

Michael Selig

Oct 27, 2008, 7:07:21 AM
to ruby...@ruby-lang.org
On Mon, 27 Oct 2008 21:38:00 +1100, Martin Duerst <due...@it.aoyama.ac.jp>
wrote:

> At 14:48 08/10/27, Michael Selig wrote:
>
>> I am not sure why you would want to keep the source encoding unset when
>> setting default_internal at the top of a script. Perhaps you could
>> explain.
>
> The simplest case is a script in US-ASCII only, but where you want
> the data to be handled e.g. in UTF-8.

But as UTF-8 or any other default_internal setting has to be
ascii-compatible, there is no downside to setting the source encoding to
the default_internal setting even if your source is US-ASCII.

Mike

Nobuyoshi Nakada

Oct 27, 2008, 8:07:16 AM
to ruby...@ruby-lang.org
Hi,

At Mon, 27 Oct 2008 19:17:45 +0900,
Michael Selig wrote in [ruby-core:19540]:


> But my original point is that 1.8 scripts exist which contain multibyte
> characters (eg UTF-8) which work fine under 1.8 without -K, but will fail
> under 1.9 unless a magic comment or -K is provided.

It just seemed to work by chance.

> > But it's very ambiguous and dangerous to infer encodings implicitly. We
> > can't trust the locale for this purpose, at least.
>
> It's a trade-off between that and backward compatibility. I think the
> "danger" is not high and it gives backward compatibility, so my vote would
> be to use it.

And it will suddenly crash or behave weirdly when moved to other
locales.

Anyway, I think I understand the need to specify the source
encoding without magic comments. Is an option for that
purpose an acceptable solution?

--
Nobu Nakada

Nobuyoshi Nakada

Oct 27, 2008, 8:11:38 AM
to ruby...@ruby-lang.org
Hi,

At Mon, 27 Oct 2008 19:37:58 +0900,
Martin Duerst wrote in [ruby-core:19541]:


> If you mean Japanese legacy encodings (such as Shift_JIS and
> EUC-JP), then you are correct, but it would be very rare
> for somebody to use Shift_JIS or EUC-JP for comments when
> the program is otherwise supposed to run all-UTF-8.

I don't do it of course, but know that some people love to do
it.

--
Nobu Nakada

James Gray

Oct 27, 2008, 11:12:46 AM
to ruby...@ruby-lang.org
On Oct 27, 2008, at 7:07 AM, Nobuyoshi Nakada wrote:

> Anyway, I think I understand the need to specify the source
> encoding without magic comments. Is an option for that
> purpose an acceptable solution?

I believe so, yes.

I wasn't aware -KU still worked though, as Michael pointed out. I
thought for sure I had tried that and got a warning about it being
ignored now.

It may be that the TextMate team could use that. What all does it set
in 1.9? Source encoding obviously. It seems to affect
default_external as well, but not touch default_internal. Do I have
that right? Does it have any other special effects?

Will -KU stay supported for the foreseeable future?

James Edward Gray II


James Gray

Oct 27, 2008, 11:13:53 AM
to ruby...@ruby-lang.org
On Oct 27, 2008, at 5:39 AM, Martin Duerst wrote:

> At 12:24 08/10/27, James Gray wrote:
>
>> They sure could, yeah. Our policy for TextMate development has
>> always
>> been that UTF-8 is king. We use it heavily and I'm sure some scripts
>> do contain multibyte characters in UTF-8.
>
> Wouldn't it be only these scripts (including those that contain
> \x escapes for UTF-8) that need the encoding indication at the top?
> (please note that literals with \u escapes are automatically UTF-8).

That's correct, just those scripts. I have no idea which ones they
are, but I could probably build a script to find them.
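
Probably something along these lines (a rough, untested sketch):

  # list Ruby scripts that contain non-ASCII bytes but no magic comment
  Dir.glob("**/*.rb") do |path|
    data = File.open(path, "rb") { |f| f.read }
    next unless data.each_byte.any? { |b| b > 127 }  # pure ASCII is fine as-is
    head = data.split("\n").first(2).join("\n")
    puts path unless head =~ /coding[:=]/            # crude magic-comment check
  end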

James Edward Gray II

Yukihiro Matsumoto

Oct 27, 2008, 12:15:13 PM
to ruby...@ruby-lang.org
Hi,

In message "Re: [ruby-core:19550] Re: String literal encoding (Was: Default source encoding (Was: [Bug #680] csv.rb: CSV.parse is too late when encoding is mismatch))"


on Tue, 28 Oct 2008 00:12:46 +0900, James Gray <ja...@grayproductions.net> writes:

|I wasn't aware -KU still worked though, as Michael pointed out. I
|thought for sure I had tried that and got a warning about it being
|ignored now.
|
|It may be that the TextMate team could use that. What all does it set
|in 1.9? Source encoding obviously. It seems to affect
|default_external as well, but not touch default_internal. Do I have
|that right? Does it have any other special effects?

-Ku (or -KU) specifies to

* default script encoding to be UTF-8
* default_external encoding to be UTF-8 unless it's specified
previously by -E or -U
* do not touch default_internal
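
So, for example (a quick sketch of the net effect):

  $ ruby -Ku -e 'p __ENCODING__, Encoding.default_external, Encoding.default_internal'
  #<Encoding:UTF-8>
  #<Encoding:UTF-8>
  nil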

|Will -KU stay supported for the foreseeable future?

Yes.

matz.

James Gray

Oct 27, 2008, 12:32:33 PM
to ruby...@ruby-lang.org

OK, well this meets TextMate's need then. Thank you.

James Edward Gray II

Michael Selig

Oct 27, 2008, 6:20:38 PM
to ruby...@ruby-lang.org
On Mon, 27 Oct 2008 23:07:16 +1100, Nobuyoshi Nakada <no...@ruby-lang.org>
wrote:

> Michael Selig wrote in [ruby-core:19540]:


>> But my original point is that 1.8 scripts exist which contain multibyte
>> characters (eg UTF-8) which work fine under 1.8 without -K, but will fail
>> under 1.9 unless a magic comment or -K is provided.
>
> It just seemed to work by chance.

Actually I think that any ruby program which uses ASCII everywhere except
in string literals will work without -K in 1.8. I think that this might be
a reasonably common scenario, at least in countries which use UTF-8 or an
ISO 8859 variant (outside the ASCII domain). These scripts will then fail
in 1.9 without magic comment or -K.

>> > But it's very ambiguous and dangerous to infer encodings implicitly. We
>> > can't trust the locale for this purpose, at least.
>>
>> It's a trade-off between that and backward compatibility. I think the
>> "danger" is not high and it gives backward compatibility, so my vote
>> would
>> be to use it.
>
> And it will suddenly crash or behave weirdly when moved to other
> locales.

Indeed - that's the danger. But I think the cost of losing backward
compatibility is worse.
How bad is this "danger" in reality? Simple scripts are often written just
to be used on the same computer or LAN. There are probably lots of these,
and there is almost certainly no danger there. I think that when people
send out a script to other countries it would be perfectly reasonable to
say that it must be run with -KU (or whatever) under 1.9 if their locale
is not UTF-8 (or whatever). If they don't, then there is a high chance it
will fail with an obvious error (eg: illegal multibyte char) - not a big
deal to diagnose. There is a very small chance that it may run and behave
weirdly. But then I would think that under 1.8 similar weird behaviour
would have happened anyhow.

> Anyway, I think I understand the need to specify the source
> encoding without magic comments. Is an option for that
> purpose an acceptable solution?

In my opinion it is an acceptable workaround to add -K to the RUBYOPT
environment variable, but it is far from ideal.
My preference is still to default to the locale encoding like -e does.
Another alternative may be to default to ASCII-8BIT (which would be even
more 1.8 compatible :) ), but I haven't thought through if there are any
other repercussions to this.

Losing backward compatibility for "simple" ruby scripts is a major
negative IMHO.

Martin wrote "It may be somewhat suboptimal in the transition phase, but
...."
If I may say, I think that you guys might be underestimating the grief
caused when people upgrade to 1.9 or 2.0 and find that even simple scripts
no longer work.

Cheers
Mike

Nobuyoshi Nakada

Oct 31, 2008, 5:38:24 AM
to ruby...@ruby-lang.org
Hi,

At Mon, 27 Oct 2008 21:07:16 +0900,
Nobuyoshi Nakada wrote in [ruby-core:19546]:


> Anyway, I think I understand the need to specify the source
> encoding without magic comments. Is an option for that
> purpose an acceptable solution?

Here is the patch to add options:

--encoding=external:internal:source
--external-encoding=enc
--internal-encoding=enc
--source-encoding=enc


Index: ruby.c
===================================================================
--- ruby.c (revision 20075)
+++ ruby.c (working copy)
@@ -623,5 +623,5 @@ dump_option(const char *str, int len, vo

static void
-set_internal_encoding_once(struct cmdline_options *opt, const char *e, int elen)
+set_option_encoding_once(const char *type, VALUE *name, const char *e, int elen)
{
VALUE ename;
@@ -630,27 +630,16 @@ set_internal_encoding_once(struct cmdlin
ename = rb_str_new(e, elen);

- if (opt->intern.enc.name &&
- rb_funcall(ename, rb_intern("casecmp"), 1, opt->intern.enc.name) != INT2FIX(0)) {
+ if (*name &&
+ rb_funcall(ename, rb_intern("casecmp"), 1, *name) != INT2FIX(0)) {
rb_raise(rb_eRuntimeError,
- "default_intenal already set to %s", RSTRING_PTR(opt->intern.enc.name));
+ "%s already set to %s", type, RSTRING_PTR(*name));
}
- opt->intern.enc.name = ename;
+ *name = ename;
}

-static void
-set_external_encoding_once(struct cmdline_options *opt, const char *e, int elen)
-{
- VALUE ename;
-
- if (!elen) elen = strlen(e);
- ename = rb_str_new(e, elen);
-
- if (opt->ext.enc.name &&
- rb_funcall(ename, rb_intern("casecmp"), 1, opt->ext.enc.name) != INT2FIX(0)) {
- rb_raise(rb_eRuntimeError,
- "default_external already set to %s", RSTRING_PTR(opt->ext.enc.name));
- }
- opt->ext.enc.name = ename;
-}
+#define set_internal_encoding_once(opt, e, elen) \
+ set_option_encoding_once("default_internal", &opt->intern.enc.name, e, elen)
+#define set_external_encoding_once(opt, e, elen) \
+ set_option_encoding_once("default_external", &opt->ext.enc.name, e, elen)

static int
@@ -956,13 +945,29 @@ proc_options(int argc, char **argv, stru
char *p;
encoding:
- p = strchr(s, ':');
- if (p) {
- if (p > s)
- set_external_encoding_once(opt, s, p-s);
- if (*++p)
- set_internal_encoding_once(opt, p, 0);
- }
- else
- set_external_encoding_once(opt, s, 0);
+ do {
+# define set_encoding_part(type) \
+ if (!(p = strchr(s, ':'))) { \
+ set_##type##_encoding_once(opt, s, 0); \
+ break; \
+ } \
+ else if (p > s) { \
+ set_##type##_encoding_once(opt, s, p-s); \
+ }
+ set_encoding_part(external);
+ if (!*(s = ++p)) break;
+ set_encoding_part(internal);
+ if (!*(s = ++p)) break;
+ set_encoding_part(source);
+# undef set_encoding_part
+ } while (0);
+ }
+ else if (is_option_with_arg("internal-encoding", Qfalse, Qtrue)) {
+ set_internal_encoding_once(opt, s, 0);
+ }
+ else if (is_option_with_arg("external-encoding", Qfalse, Qtrue)) {
+ set_external_encoding_once(opt, s, 0);
+ }
+ else if (is_option_with_arg("source-encoding", Qfalse, Qtrue)) {
+ set_source_encoding_once(opt, s, 0);
}
else if (strcmp("version", s) == 0) {
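
Usage would be e.g.:

  $ ruby --source-encoding=shift_jis script.rb
  $ ruby --encoding=utf-8:utf-8:shift_jis script.rb  # external:internal:source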

--
Nobu Nakada

Nobuyoshi Nakada

Oct 31, 2008, 5:58:02 AM
to ruby...@ruby-lang.org
Hi,

At Fri, 31 Oct 2008 18:38:24 +0900,
Nobuyoshi Nakada wrote in [ruby-core:19655]:


> +#define set_internal_encoding_once(opt, e, elen) \
> + set_option_encoding_once("default_internal", &opt->intern.enc.name, e, elen)
> +#define set_external_encoding_once(opt, e, elen) \
> + set_option_encoding_once("default_external", &opt->ext.enc.name, e, elen)

Sorry, missed these 2 lines.

#define set_source_encoding_once(opt, e, elen) \
set_option_encoding_once("source", &opt->src.enc.name, e, elen)

--
Nobu Nakada

Martin Duerst

Oct 31, 2008, 6:05:25 AM
to ruby...@ruby-lang.org
At 18:38 08/10/31, Nobuyoshi Nakada wrote:
>Hi,
>
>At Mon, 27 Oct 2008 21:07:16 +0900,
>Nobuyoshi Nakada wrote in [ruby-core:19546]:
>> Anyway, I think I understand the needs to specify source
>> encoding without magic comments. Is the option for that
>> purpose an acceptable solution?
>
>Here is the patch to add options:

Great work!

> --encoding=external:internal:source
> --external-encoding=enc
> --internal-encoding=enc
> --source-encoding=enc

I personally don't like the last one, and the :source in the first
one, but I guess there are situations where they can be very helpful
(e.g. testing with different encodings).

I also think that it would be good to have the values of --encoding
and -E look/work the same, so unless :source already works on -E,
I think having just --source-encoding for the case that the
source encoding must be set by an option should be okay.
This will also make it easier to distinguish in documentation
that --source-encoding is really only for very special occasions,
and declaring the source encoding in the script itself is strongly
preferred.

Nobuyoshi Nakada

Oct 31, 2008, 8:48:23 AM
to ruby...@ruby-lang.org
Hi,

At Fri, 31 Oct 2008 19:05:25 +0900,
Martin Duerst wrote in [ruby-core:19657]:


> > --encoding=external:internal:source
> > --external-encoding=enc
> > --internal-encoding=enc
> > --source-encoding=enc
>
> I personally don't like the last one, and the :source in the first
> one, but I guess there are situations where they can be very helpful
> (e.g. testing with different encodings).
>
> I also think that it would be good to have the values of --encoding
> and -E look/work the same, so unless :source already works on -E,
> I think having just --source-encoding for the case that the
> source encoding must be set by an option should be okay.

-E is the same as --encoding.

> This will also make it easier to distinguish in documentation
> that --source-encoding is really only for very special occasions,
> and declaring the source encoding in the script itself is strongly
> preferred.

Since these four options are separate, it's easy to remove
some of them.

--
Nobu Nakada
