String literals will act as byte buffers, just as they used to. However, when creating a string object with the constructor, you can optionally specify the encoding of the input string.
String.new("\352\260\200", "utf-8")
The default encoding is nil if $KCODE is not set or is set to "none". The default encoding is 'utf-8' if $KCODE == 'u'. If the encoding is nil, string objects will act just like the old ruby strings we all know and love. If the encoding is set to a specific charset, the string's instance methods will act more reasonably according to its encoding. The following is a summary of what I'm thinking:
String#encoding gives the character encoding name (e.g. "utf-8").
String#[index] returns a character string if the encoding is set. If the encoding is not set, it returns a fixnum as it used to.
String#[] is always encoding aware if the encoding is set.
String#slice is always a byte buffer operation regardless of the encoding.
String#size always returns the number of bytes in the string.
String#length returns the number of characters in the string according to the encoding specified. If the encoding is not set, it's the same as String#size.
String#+ will return a utf-8 encoded string if the two strings' encodings do not match.
*, <<, <=>, ==, =~, capitalize, casecmp, center, chomp, chop, count, delete, downcase, each, each_line, eql?, gsub, match, succ, scan, split, strip, sub, upcase, upto will all be encoding aware if the encoding is set.
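To make the summary above concrete, here is a minimal sketch of the proposed semantics as a wrapper class. EncodedString is a hypothetical name, not part of the plan or of Ruby. It handles only utf-8, and it assumes Ruby 1.9 or later, since it relies on String#bytesize, String#getbyte and String#force_encoding, which the Ruby discussed in this thread does not have.

```ruby
# Hypothetical wrapper that mimics the proposed String behaviour.
# Only "utf-8" (or nil, meaning plain byte buffer) is handled.
class EncodedString
  attr_reader :encoding

  def initialize(bytes, encoding = nil)
    # Keep the raw bytes untouched and remember the declared encoding.
    @bytes = bytes.dup.force_encoding(Encoding::BINARY)
    @encoding = encoding
  end

  # Proposed String#size: always the number of bytes.
  def size
    @bytes.bytesize
  end

  # Proposed String#length: characters if an encoding is set, bytes otherwise.
  def length
    return size if @encoding.nil?
    codepoints.length
  end

  # Proposed String#[index]: a character string if an encoding is set,
  # otherwise the byte value as an integer.
  def [](index)
    return @bytes.getbyte(index) if @encoding.nil?
    codepoint = codepoints[index]
    codepoint && [codepoint].pack("U")
  end

  # Proposed String#slice: always a byte buffer operation.
  def slice(index)
    @bytes.getbyte(index)
  end

  private

  # Decode the raw bytes as UTF-8 code points.
  def codepoints
    @bytes.unpack("U*")
  end
end
```

With this sketch, EncodedString.new("\352\260\200", "utf-8") reports a size of 3 and a length of 1, while the same bytes created without an encoding report 3 for both.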
The reason I'm differentiating between 'size' and 'length' is that some libraries (like rails) depend on them returning the byte size of the string. Maybe we can establish a custom that 'size' is for byte size and 'length' is for the number of characters. The same reasoning goes for '[]' and 'slice'.
For now, it will support only utf-8 encoding as ruby's regexp doesn't seem to support encodings other than ascii and utf-8. (I could use iconv to convert encoding internally to utf-8 for each method call, but at the moment, I think it's probably too costly and not worth it.)
I would love to get some feedback on this. Matz's feedback would be especially welcome since I want to make this as forward compatible as possible with Ruby 2.0.
On 6/14/06, Dae San Hwang <dae...@gmail.com> wrote:
> String#size always returns the number of bytes in the string. String#length returns the number of characters in the string according to the encoding specified. If the encoding is not set, it's same as String#size.
This is a bad change.
#size and #length are synonymous now and should remain so. Add a new method, like #character_count or something like that.
On Jun 14, 2006, at 10:47 AM, Dae San Hwang wrote:
> The reason I'm differentiating between 'size' and 'length' is because some libraries (like rails) depend on them returning the byte size of the string. Maybe we can establish a custom that 'size' is for byte size and 'length' is for the number of characters. Same reasoning goes for '[]' and 'slice'.
I like these very much, although the choice between [] and slice seems arbitrary (i.e. you could have swapped their meanings and it would have made just as much sense). #size vs. #length is perfect, and #[] being a Fixnum when there was no encoding but a character when there is one is equally brilliant. I salute you, sir!
> >>>>> "A" == Austin Ziegler <halosta...@gmail.com> writes:
> A> #size and #length are synonymous now and should remain so. Add a new method, like #character_count or something like that.
> Say this to matz :-)
I will. Matz, please see above. ;)
The problem I have with this change is that I know that in my code I have used #length and #size interchangeably depending on which reads better in context.
It's not a good, clear, and understandable change. It will *forever* require looking in ri or other resources to remember which one counts characters and which one counts bytes.
On 6/14/06, Austin Ziegler <halosta...@gmail.com> wrote:
> ... in my code I have used #length and #size interchangeably depending on which reads better in context.
I've never been a fan of the Ruby practice of having many names for the same thing, but I'm willing to be convinced. Can you give me an example of two string variables where getting the number of characters reads better with "length" for one and "size" for the other?
On 6/14/06, Mark Volkmann <r.mark.volkm...@gmail.com> wrote:
> On 6/14/06, Austin Ziegler <halosta...@gmail.com> wrote:
> > ... in my code I have used #length and #size interchangeably depending on which reads better in context.
> I've never been a fan of the Ruby practice of having many names for the same thing, but I'm willing to be convinced. Can you give me an example of two string variables where getting the number of characters reads better with "length" for one and "size" for the other?
It's all code context. "name.length" reads better than "name.size" and "box.size" reads better than "box.length". Remember, in Ruby you *don't* know whether you're dealing with a String, Array, or Hash (or something else) when you're dealing with simple method calls. Similarly, I will use #map most of the time, but sometimes I'll use #collect.
In any case, these are well-established names and having them differ would be problematic. That *said*, I'll have to fix stuff in Ruby 2 for PDF::Writer because I'm currently doing byte counting, not character counting.
On Jun 14, 2006, at 11:56 PM, Logan Capaldo wrote:
> On Jun 14, 2006, at 10:47 AM, Dae San Hwang wrote:
>> The reason I'm differentiating between 'size' and 'length' is because some libraries (like rails) depend on them returning the byte size of the string. Maybe we can establish a custom that 'size' is for byte size and 'length' is for the number of characters. Same reasoning goes for '[]' and 'slice'.
> I like these very much, although the choice between [] and slice seems arbitrary (i.e. you could have swapped their meanings and it would have made just as much sense). #size vs. #length is perfect, and #[] being a Fixnum when there was no encoding but a character when there is one is equally brilliant. I salute you, sir!
Thanks for the kind words.
The reason I picked [] for the encoding-aware method is that String#[index] will be used to extract the letter and not the byte in Ruby 2.0, as mentioned in http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html
so that "abc"[0] returns "a" instead of fixnum 97
A way to get the Nth byte of a byte buffer is probably still necessary, and String#slice seemed to be the logical one, I thought.
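For illustration only, the split between character access and byte access can be sketched in a Ruby that already indexes strings by character (1.9 or later, which is an assumption, since this thread predates it). nth_character and nth_byte are hypothetical helper names, and String#getbyte stands in for the byte-level role that String#slice would play in the plan.

```ruby
# Hypothetical helpers; they only wrap calls that exist in Ruby 1.9+.

# Character-level access, the role proposed for String#[index].
def nth_character(str, n)
  str[n]
end

# Byte-level access, the role proposed for String#slice.
def nth_byte(str, n)
  str.getbyte(n)
end

ascii  = "abc"
hangul = "\352\260\200" # one Korean syllable stored as three UTF-8 bytes

puts nth_character(ascii, 0).inspect  # the letter, as a one-character string
puts nth_byte(ascii, 0)               # the byte value of that letter
puts nth_character(hangul, 0).inspect # the whole syllable
puts nth_byte(hangul, 0)              # only its first byte
```

The two helpers agree on how many items an ASCII string has, but for the Korean syllable there is one character and three bytes, which is why both kinds of access are needed.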
> On 6/14/06, Mark Volkmann <r.mark.volkm...@gmail.com> wrote:
>> On 6/14/06, Austin Ziegler <halosta...@gmail.com> wrote:
>> > ... in my code I have used #length and #size interchangeably depending on which reads better in context.
>> I've never been a fan of the Ruby practice of having many names for the same thing, but I'm willing to be convinced. Can you give me an example of two string variables where getting the number of characters reads better with "length" for one and "size" for the other?
> It's all code context. "name.length" reads better than "name.size" and "box.size" reads better than "box.length". Remember, in Ruby you *don't* know whether you're dealing with a String, Array, or Hash (or something else) when you're dealing with simple method calls. Similarly, I will use #map most of the time, but sometimes I'll use #collect.
Are you sure that "box" happened to be a variable for a string object? ;)
> In any case, these are well-established names and having them differ would be problematic. That *said*, I'll have to fix stuff in Ruby 2 for PDF::Writer because I'm currently doing byte counting, not character counting.
My proposed change won't disturb anyone's existing code unless you also set $KCODE to 'u' in that code. If you did set $KCODE to 'u' in your previous projects, you don't have to apply this hack (which hasn't been implemented yet) to those projects.
Matz has said several times that he will maximize the breakage moving to Ruby 2.0. If Matz is going to make these changes for Ruby 2.0, (as implied in Guy Decoux's posting) I think I will just follow along. My goal is to provide Ruby 2.0 forward compatible unicode support until the move is complete.
On 6/14/06, Dae San Hwang <dae...@gmail.com> wrote:
> My proposed change won't disturb anyone's existing code unless you set $KCODE to be 'u' as well in that code. If you did set $KCODE to 'u' in your previous projects, you don't have to apply this hack (which hasn't been implemented yet) to that project.
Um. PDF::Writer is a library, and I think that I use both depending on how the code reads.
> Matz has said several times that he will maximize the breakage moving to Ruby 2.0. If Matz is going to make these changes for Ruby 2.0, (as implied in Guy Decoux's posting) I think I will just follow along. My goal is to provide Ruby 2.0 forward compatible unicode support until the move is complete.
Yes, I understand. Making #size and #length return different values is a mistake. Without referring to documentation, how would you know which returns the number of characters and which one returns the number of bytes?
They should always *either* return characters or bytes (preferably characters) and a separate call should be introduced for the alternative meaning. One that is explicit in its name to match its meaning.
On Jun 14, 2006, at 12:17 PM, Austin Ziegler wrote:
> They should always *either* return characters or bytes (preferably characters) and a separate call should be introduced for the alternative meaning. One that is explicit in its name to match its meaning.
On 14/06/06, Dae San Hwang <dae...@gmail.com> wrote:
> For now, it will support only utf-8 encoding as ruby's regexp doesn't seem to support encodings other than ascii and utf-8. (I could use iconv to convert encoding internally to utf-8 for each method call, but at the moment, I think it's probably too costly and not worth it.)
Regexp also supports EUC (which seems to work for EUC-KR as well as EUC-JP, incidentally) and Shift_JIS. Nevertheless, I think that starting with UTF-8 is the way to go.
> I would love to get some feedback on this. Matz's feedback will be especially great since I want to make this as much forward compatible as possible with Ruby 2.0.
I think it's a great idea. If you want any implementation assistance, I'd be glad to help (I've done quite a bit of Unicode hacking in Ruby).
Dae San Hwang wrote:
> String literals will act as byte buffers, just as they used to. However, when creating string object by using constructor, you can optionally specify the encoding of the input string.
> String.new("\352\260\200", "utf-8")
I'd like to have a different interface, using named parameters.
> On Jun 14, 2006, at 11:56 PM, Logan Capaldo wrote:
> > On Jun 14, 2006, at 10:47 AM, Dae San Hwang wrote:
> >> The reason I'm differentiating between 'size' and 'length' is because some libraries (like rails) depend on them returning the byte size of the string. Maybe we can establish a custom that 'size' is for byte size and 'length' is for the number of characters. Same reasoning goes for '[]' and 'slice'.
> > I like these very much, although the choice between [] and slice seems arbitrary (i.e. you could have swapped their meanings and it would have made just as much sense). #size vs. #length is perfect, and #[] being a Fixnum when there was no encoding but a character when there is one is equally brilliant. I salute you, sir!
> Thanks for the kind words.
> The reason I picked [] for encoding aware method is because String#[index] will be used to extract the letter and not the byte in Ruby 2.0 as mentioned in http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html
> so that "abc"[0] returns "a" instead of fixnum 97
This behaviour, of [] returning different values depending on the argument, has always made me a bit crazy. Does anyone know why it was done that way?
On Jun 14, 2006, at 9:17, Austin Ziegler wrote:
> Yes, I understand. Making #size and #length return different values is a mistake. Without referring to documentation, how would you know which returns the number of characters and which one returns the number of bytes?
I cannot agree. "Length" (to me) unavoidably implies that it's the answer to the question "How LONG is it?" I expect the answer to be "n characters long."
"Size" is the answer to "How BIG is it?" as in "How much space does this thing take up?" and if it's a UTF-8 string, I expect an answer like "1 byte per character + one more byte per character not in the 7-bit ASCII range"
I would never have to look up which is which. Obviously Austin's mileage varies.
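As a quick check of the "how big is it" reading, the sketch below counts the bytes each character occupies. It assumes Ruby 1.9 or later (String#each_char and String#bytesize, neither available in the Ruby of this thread), and bytes_per_character is a hypothetical helper name. UTF-8 uses between one and four bytes per character, so the "one more byte" estimate above is exact only for two-byte characters.

```ruby
# Hypothetical helper: the number of UTF-8 bytes used by each character.
def bytes_per_character(str)
  str.each_char.map { |char| char.bytesize }
end

# One sample character for each UTF-8 width: ASCII letter, accented
# Latin letter, Korean syllable, emoji.
sample = "a\u00E9\uAC00\u{1F600}"

puts "characters: #{sample.length}"
puts "bytes:      #{sample.bytesize}"
puts "per char:   #{bytes_per_character(sample).inspect}"
```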
> On Jun 14, 2006, at 9:17, Austin Ziegler wrote:
> Yes, I understand. Making #size and #length return different values is a mistake. Without referring to documentation, how would you know which returns the number of characters and which one returns the number of bytes?
> I cannot agree. "Length" (to me) unavoidably implies that it's the answer to the question "How LONG is it?" I expect the answer to be "n characters long."
> "Size" is the answer to "How BIG is it?" as in "How much space does this thing take up?" and if it's a UTF-8 string, I expect an answer like "1 byte per character + one more byte per character not in the 7-bit ASCII range"
> I would never have to look up which is which. Obviously Austin's mileage varies.
As much as I like to say that I'm "from Ruby" these days, not everyone will be. Some languages use string.length(); others use string.size(). I do not think that the proposed distinction is meaningful, and it presents problems.
Dae San Hwang wrote:
> The reason I'm differentiating between 'size' and 'length' is because some libraries (like rails) depend on them returning the byte size of the string. Maybe we can establish a custom that 'size' is for byte size and 'length' is for the number of characters. Same reasoning goes for '[]' and 'slice'.
Good idea. This separation of 'length' and 'size' methods is quite reasonable, in my opinion.
> Yes, I understand. Making #size and #length return different values is a mistake. Without referring to documentation, how would you know which returns the number of characters and which one returns the number of bytes?
> I cannot agree. "Length" (to me) unavoidably implies that it's the answer to the question "How LONG is it?" I expect the answer to be "n characters long."
> "Size" is the answer to "How BIG is it?" as in "How much space does this thing take up?" and if it's a UTF-8 string, I expect an answer like "1 byte per character + one more byte per character not in the 7-bit ASCII range"
That's not a bad argument, but Hash#size and Array#size don't behave that way in ruby.
-- vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407
Leslie Viljoen wrote:
> On 6/14/06, Leslie Viljoen <leslievilj...@gmail.com> wrote:
>> On 6/14/06, Dae San Hwang <dae...@gmail.com> wrote:
>> > so that "abc"[0] returns "a" instead of fixnum 97
>> This behaviour - of [] returning different values depending on the argument has always made me a bit crazy. Does anyone know why it was done that way?
> ..returning different *type* values I mean..
I've heard it's due to be fixed by end of next year.
Now, to Ruby's strings, a character is a byte, represented by a Fixnum.
> > On Jun 14, 2006, at 9:17, Austin Ziegler wrote:
> > Yes, I understand. Making #size and #length return different values is a mistake. Without referring to documentation, how would you know which returns the number of characters and which one returns the number of bytes?
> > I cannot agree. "Length" (to me) unavoidably implies that it's the answer to the question "How LONG is it?" I expect the answer to be "n characters long."
> > "Size" is the answer to "How BIG is it?" as in "How much space does this thing take up?" and if it's a UTF-8 string, I expect an answer like "1 byte per character + one more byte per character not in the 7-bit ASCII range"
> That's not a bad argument, but Hash#size and Array#size don't behave that way in ruby.
I agree with Austin on this - the distinction is too vague. I'd leave length and size the same and make a size_in_bytes method.
In message "Re: A plan for another unicode string hack" on Thu, 15 Jun 2006 17:08:41 +0900, "Leslie Viljoen" <leslievilj...@gmail.com> writes:
|I agree with Austin on this - the distinction is too vague. I'd leave length and size the same and make a size_in_bytes method.
On my latest prototype (not checked in anywhere), String#length and String#size behave the same, and there is String#buffer_size to return the size in bytes. The method name might change in the future.
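A rough sketch of that shape, for illustration only: the prototype itself is not available, so the version below borrows String#bytesize from Ruby 1.9 and later (an assumption, since this thread predates it) and defines buffer_size on top of it. The method body is a guess at the behaviour described, not the prototype's code.

```ruby
# Illustrative only: buffer_size is the prototype's method name from the
# message above, defined here in terms of String#bytesize (Ruby 1.9+).
class String
  def buffer_size
    bytesize
  end
end

text = "\352\260\200abc" # one three-byte Korean syllable followed by "abc"

puts text.length      # characters
puts text.size        # characters as well, same as length
puts text.buffer_size # bytes
```

Here length and size both count four characters, while buffer_size counts the six bytes that actually need to be allocated or sent over the wire.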
On 15/06/06, Yukihiro Matsumoto <m...@ruby-lang.org> wrote:
> |I agree with Austin on this - the distinction is too vague. I'd leave length and size the same and make a size_in_bytes method.
> On my latest prototype (not checked in anywhere), String#length and String#size behave same, and there is String#buffer_size to return size in bytes. The method name might change in the future.
Actually, this makes a lot of sense. Why would you ever want to know the actual byte length of a UTF-8 string? It's pretty meaningless for most string-processing tasks: the main times you would need it would be in allocation and interfacing with external systems and libraries. Thus, something like buffer_size maps to real-world usage extremely well, in my opinion.
On 6/15/06, Paul Battley <pbatt...@gmail.com> wrote:
> On 15/06/06, Yukihiro Matsumoto <m...@ruby-lang.org> wrote:
> > |I agree with Austin on this - the distinction is too vague. I'd leave length and size the same and make a size_in_bytes method.
> > On my latest prototype (not checked in anywhere), String#length and String#size behave same, and there is String#buffer_size to return size in bytes. The method name might change in the future.
> Actually, this makes a lot of sense. Why would you ever want to know the actual byte length of a UTF-8 string? It's pretty meaningless for most string-processing tasks: the main times you would need it would be in allocation and interfacing with external systems and libraries. Thus, something like buffer_size maps to real-world usage extremely well, in my opinion.
Of course the confusion here is caused by measurement units. Size in bytes or size in characters? Length and size don't (clearly) indicate that distinction, and neither does buffer_size. The name should indicate the unit so that you could immediately see that adding (e.g.) length_in_characters to length_in_bytes would be in error.
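A minimal sketch of unit-bearing names, assuming Ruby 1.9 or later for String#bytesize. StringUnits is a hypothetical module; only the two method names come from the paragraph above.

```ruby
# Hypothetical module: each method name states the unit it returns.
module StringUnits
  module_function

  # Number of characters according to the string's encoding.
  def length_in_characters(str)
    str.length
  end

  # Number of bytes the string occupies.
  def length_in_bytes(str)
    str.bytesize
  end
end

name = "\352\260\200" # a single Korean syllable in UTF-8

puts StringUnits.length_in_characters(name)
puts StringUnits.length_in_bytes(name)
```

With the unit in the name, an expression that adds length_in_characters to length_in_bytes looks wrong at a glance, which is the point being made.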