A plan for another unicode string hack
Dae San Hwang  
From: Dae San Hwang <dae...@gmail.com>
Date: Wed, 14 Jun 2006 23:47:35 +0900
Local: Wed, Jun 14 2006 10:47 am
Subject: A plan for another unicode string hack
Hi everyone.

I'm implementing yet another Unicode string hack. I'm trying to
rewire the String class so that it acts like the Ruby 2.0 String class
(see http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html).

String literals will act as byte buffers, just as they used to.
However, when creating a string object with the constructor, you can
optionally specify the encoding of the input string.

  String.new("\352\260\200", "utf-8")

The default encoding is nil if $KCODE is not set or is set to
"none", and 'utf-8' if $KCODE == 'u'. If the encoding is nil, string
objects will act just like the old Ruby strings we all know and love.
If the encoding is set to a specific charset, the string's instance
methods will act more reasonably according to that encoding.
The following is a summary of what I'm thinking:

  String#encoding gives the character encoding name (e.g. "utf-8").
  String#[index] returns a character string if the encoding is set. If
the encoding is not set, it returns a Fixnum as it used to.
  String#[] is always encoding aware if the encoding is set.
  String#slice is always a byte buffer operation regardless of the
encoding.
  String#size always returns the number of bytes in the string.
  String#length returns the number of characters in the string
according to the encoding specified. If the encoding is not set, it's
the same as String#size.
  String#+ will return a utf-8 encoded string if the two strings'
encodings do not match.

  *, <<, <=>, ==, =~, capitalize, casecmp, center, chomp, chop,
count, delete, downcase, each, each_line, eql?, gsub, match, succ,
scan, split, strip, sub, upcase, and upto will all be encoding aware if
the encoding is set.
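
To make the summary above concrete, here is a minimal sketch of the proposed semantics. It is written against a current Ruby (1.9 or later), not the 1.8 interpreter the hack targets, and `ProposedString` is a hypothetical wrapper name that does not appear in the proposal. It leans on the built-in `String#b`, `force_encoding`, `bytesize`, and `getbyte`, so it only illustrates the intended behaviour and says nothing about how the hack itself would be implemented.

```ruby
# Hypothetical wrapper illustrating the proposed method semantics.
# Not the author's implementation: it delegates to modern String APIs.
class ProposedString
  attr_reader :encoding

  # encoding is nil (plain byte buffer) or a name such as "utf-8"
  def initialize(bytes, encoding = nil)
    @encoding = encoding
    @bytes = bytes.b # binary copy of the raw bytes
    @text = encoding ? bytes.dup.force_encoding(encoding) : nil
  end

  # size: always the number of bytes
  def size
    @bytes.bytesize
  end

  # length: characters when an encoding is set, bytes otherwise
  def length
    @text ? @text.length : size
  end

  # []: a one-character string when an encoding is set, else a byte value
  def [](index)
    @text ? @text[index] : @bytes.getbyte(index)
  end

  # slice: always a byte-level operation
  def slice(index)
    @bytes.getbyte(index)
  end
end

ga  = ProposedString.new("\352\260\200", "utf-8") # U+AC00, three bytes
raw = ProposedString.new("\352\260\200")          # no encoding set

p [ga.size, ga.length, ga[0], ga.slice(0)]
p [raw.size, raw.length, raw[0]]
```

With an encoding set, `size` and `length` diverge for any non-ASCII text; without one, the object behaves like a plain byte buffer.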

The reason I'm differentiating between 'size' and 'length' is that
some libraries (like Rails) depend on them returning the byte size of
the string. Maybe we can establish a convention that 'size' means the
byte size and 'length' means the number of characters. The same
reasoning goes for '[]' and 'slice'.

For now, it will support only utf-8 encoding as ruby's regexp doesn't  
seem to support encodings other than ascii and utf-8. (I could use  
iconv to convert encoding internally to utf-8 for each method call,  
but at the moment, I think it's probably too costly and not worth it.)
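
For readers on a current Ruby, the conversion step mentioned in the parenthesis can be tried without iconv, because `String#encode` transcodes between encodings. This is a minimal sketch assuming Ruby 1.9 or later with its bundled EUC-KR transcoder; it is not what the 1.8 hack itself would do, and it says nothing about the per-call cost that motivated skipping the conversion.

```ruby
# Transcode a string to another encoding and back again.
# Uses String#encode instead of the iconv library.
utf8 = "\uAC00"               # one Hangul syllable, UTF-8 encoded
euc  = utf8.encode("EUC-KR")  # same character, different bytes

round_trip = euc.encode("UTF-8")

puts utf8.bytesize       # bytes used in UTF-8
puts euc.bytesize        # bytes used in EUC-KR
puts round_trip == utf8  # true if nothing was lost
```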

I would love to get some feedback on this. Matz's feedback would be
especially welcome since I want to make this as forward compatible
as possible with Ruby 2.0.

Thanks!

Daesan

Dae San Hwang
dae...@gmail.com


Austin Ziegler
From: "Austin Ziegler" <halosta...@gmail.com>
Date: Wed, 14 Jun 2006 23:51:53 +0900
Local: Wed, Jun 14 2006 10:51 am
Subject: Re: A plan for another unicode string hack
On 6/14/06, Dae San Hwang <dae...@gmail.com> wrote:

>   String#size always returns the number of bytes in the string.
>   String#length returns the number of characters in the string
> according to the encoding specified. If the encoding is not set, it's
> the same as String#size.

This is a bad change.

#size and #length are synonymous now and should remain so. Add a new
method, like #character_count or something like that.

-austin
--
Austin Ziegler * halosta...@gmail.com * http://www.halostatue.ca/
               * aus...@halostatue.ca * http://www.halostatue.ca/feed/
               * aus...@zieglers.ca


Logan Capaldo
From: Logan Capaldo <logancapa...@gmail.com>
Date: Wed, 14 Jun 2006 23:56:37 +0900
Local: Wed, Jun 14 2006 10:56 am
Subject: Re: A plan for another unicode string hack

On Jun 14, 2006, at 10:47 AM, Dae San Hwang wrote:

> The reason I'm differentiating between 'size' and 'length' is
> that some libraries (like Rails) depend on them returning the
> byte size of the string. Maybe we can establish a convention that
> 'size' means the byte size and 'length' means the number of
> characters. The same reasoning goes for '[]' and 'slice'.

I like these very much, although the choice between [] and slice seems
arbitrary (i.e. you could have swapped their meanings and it would
have made just as much sense). #size vs. #length is perfect, and #[]
returning a Fixnum when there is no encoding but a character when there
is one is equally brilliant. I salute you, sir!

ts
From: ts <dec...@moulon.inra.fr>
Date: Wed, 14 Jun 2006 23:57:26 +0900
Local: Wed, Jun 14 2006 10:57 am
Subject: Re: A plan for another unicode string hack

>>>>> "A" == Austin Ziegler <halosta...@gmail.com> writes:

A> #size and #length are synonymous now and should remain so. Add a new
A> method, like #character_count or something like that.

 Say this to matz :-)

svg% cat b.rb
#!./ruby -ku
a = String.new("Peut-être qu'on n'était pas encore là..", "utf-8")
p a.length
p a.size
svg%

svg% ./b.rb
39
42
svg%

 old ruby_m17n implementation
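
The transcript above comes from a prototype interpreter. In released Ruby (1.9 and later) `size` is an alias of `length` and both count characters, while the byte count is reported by `bytesize`. The following sketch repeats the same sentence on a current Ruby; the `\u` escapes are only there so that the bytes are UTF-8 regardless of how the source file is saved.

```ruby
# The same sentence as in b.rb, written with \u escapes so the
# string is UTF-8 regardless of the source file's encoding.
a = "Peut-\u00EAtre qu'on n'\u00E9tait pas encore l\u00E0.."

p a.length    # characters
p a.size      # same as length on current Ruby
p a.bytesize  # bytes
```

The three accented letters take two bytes each in UTF-8, which accounts for the difference of three between the character count and the byte count.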

Guy Decoux


Austin Ziegler
From: "Austin Ziegler" <halosta...@gmail.com>
Date: Thu, 15 Jun 2006 00:09:17 +0900
Local: Wed, Jun 14 2006 11:09 am
Subject: Re: A plan for another unicode string hack
On 6/14/06, ts <dec...@moulon.inra.fr> wrote:

> >>>>> "A" == Austin Ziegler <halosta...@gmail.com> writes:
> A> #size and #length are synonymous now and should remain so. Add a new
> A> method, like #character_count or something like that.

>  Say this to matz :-)

I will. Matz, please see above. ;)

The problem I have with this change is that I know that in my code I
have used #length and #size interchangeably depending on which reads
better in context.

It's not a good, clear, and understandable change. It will *forever*
require looking in ri or other resources to remember which one counts
characters and which one counts bytes.

-austin
--
Austin Ziegler * halosta...@gmail.com * http://www.halostatue.ca/
               * aus...@halostatue.ca * http://www.halostatue.ca/feed/
               * aus...@zieglers.ca


Mark Volkmann
From: "Mark Volkmann" <r.mark.volkm...@gmail.com>
Date: Thu, 15 Jun 2006 00:21:19 +0900
Local: Wed, Jun 14 2006 11:21 am
Subject: Re: A plan for another unicode string hack
On 6/14/06, Austin Ziegler <halosta...@gmail.com> wrote:

> ... in my code I
> have used #length and #size interchangeably depending on which reads
> better in context.

I've never been a fan of the Ruby practice of having many names for
the same thing, but I'm willing to be convinced. Can you give me an
example of two string variables where getting the number of characters
reads better with "length" for one and "size" for the other?

--
R. Mark Volkmann
Object Computing, Inc.


Austin Ziegler
From: "Austin Ziegler" <halosta...@gmail.com>
Date: Thu, 15 Jun 2006 00:30:20 +0900
Local: Wed, Jun 14 2006 11:30 am
Subject: Re: A plan for another unicode string hack
On 6/14/06, Mark Volkmann <r.mark.volkm...@gmail.com> wrote:

> On 6/14/06, Austin Ziegler <halosta...@gmail.com> wrote:
> > ... in my code I
> > have used #length and #size interchangeably depending on which reads
> > better in context.
> I've never been a fan of the Ruby practice of having many names for
> the same thing, but I'm willing to be convinced. Can you give me an
> example of two string variables where getting the number of characters
> reads better with "length" for one and "size" for the other?

It's all code context. "name.length" reads better than "name.size" and
"box.size" reads better than "box.length". Remember, in Ruby you
*don't* know whether you're dealing with a String, Array, or Hash (or
something else) when you're dealing with simple method calls.
Similarly, I will use #map most of the time, but sometimes I'll use
#collect.

In any case, these are well-established names and having them differ
would be problematic. That *said*, I'll have to fix stuff in Ruby 2
for PDF::Writer because I'm currently doing byte counting, not
character counting.

-austin
--
Austin Ziegler * halosta...@gmail.com * http://www.halostatue.ca/
               * aus...@halostatue.ca * http://www.halostatue.ca/feed/
               * aus...@zieglers.ca


Dae San Hwang
From: Dae San Hwang <dae...@gmail.com>
Date: Thu, 15 Jun 2006 00:47:12 +0900
Local: Wed, Jun 14 2006 11:47 am
Subject: Re: A plan for another unicode string hack
On Jun 14, 2006, at 11:56 PM, Logan Capaldo wrote:

> On Jun 14, 2006, at 10:47 AM, Dae San Hwang wrote:

>> The reason I'm differentiating between 'size' and 'length' is
>> that some libraries (like Rails) depend on them returning the
>> byte size of the string. Maybe we can establish a convention that
>> 'size' means the byte size and 'length' means the number of
>> characters. The same reasoning goes for '[]' and 'slice'.

> I like these very much, although the choice between [] and slice
> seems arbitrary (i.e. you could have swapped their meanings and it
> would have made just as much sense). #size vs. #length is perfect,
> and #[] returning a Fixnum when there is no encoding but a character
> when there is one is equally brilliant. I salute you, sir!

Thanks for the kind words.

The reason I picked [] for the encoding-aware method is that
String#[index] will be used to extract the letter and not the byte in
Ruby 2.0, as mentioned in
http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html

   so that "abc"[0] returns "a" instead of fixnum 97

A way to get the Nth byte of a byte buffer is probably still necessary,
and String#slice seemed to be the logical choice, I thought.
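
For comparison, released Ruby (1.9 and later) kept `slice` as a synonym of `[]` and put byte-level access under separate names, `getbyte` and `byteslice`. The sketch below assumes a current Ruby and shows the character/byte split on the same three bytes used in the original `String.new` example; it does not reproduce the proposed `slice` behaviour.

```ruby
s = "\uAC00bc"                  # U+AC00 followed by "bc"; 5 bytes in UTF-8

first_char = s[0]               # character-level indexing
first_byte = s.getbyte(0)       # byte-level access, returns an Integer
head_bytes = s.byteslice(0, 3)  # byte-level slicing, returns a String

p first_char
p first_byte
p head_bytes
p "abc"[0]          # a one-character string
p "abc".getbyte(0)  # the byte value
```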

Dae San Hwang
dae...@gmail.com


Dae San Hwang
From: Dae San Hwang <dae...@gmail.com>
Date: Thu, 15 Jun 2006 01:09:04 +0900
Local: Wed, Jun 14 2006 12:09 pm
Subject: Re: A plan for another unicode string hack
On Jun 15, 2006, at 12:30 AM, Austin Ziegler wrote:

Are you sure that "box" happened to be a variable for a string  
object? ;)

> In any case, these are well-established names and having them differ
> would be problematic. That *said*, I'll have to fix stuff in Ruby 2
> for PDF::Writer because I'm currently doing byte counting, not
> character counting.

My proposed change won't disturb anyone's existing code unless that
code also sets $KCODE to 'u'. If you did set $KCODE to 'u' in your
previous projects, you don't have to apply this hack (which hasn't
been implemented yet) to those projects.

Matz has said several times that he will maximize the breakage moving  
to Ruby 2.0. If Matz is going to make these changes for Ruby 2.0, (as  
implied in Guy Decoux's posting) I think I will just follow along. My  
goal is to provide Ruby 2.0 forward compatible unicode support until  
the move is complete.

Dae San Hwang
dae...@gmail.com


Austin Ziegler
From: "Austin Ziegler" <halosta...@gmail.com>
Date: Thu, 15 Jun 2006 01:17:17 +0900
Local: Wed, Jun 14 2006 12:17 pm
Subject: Re: A plan for another unicode string hack
On 6/14/06, Dae San Hwang <dae...@gmail.com> wrote:

> My proposed change won't disturb anyone's existing code unless that
> code also sets $KCODE to 'u'. If you did set $KCODE to 'u' in your
> previous projects, you don't have to apply this hack (which hasn't
> been implemented yet) to those projects.

Um. PDF::Writer is a library, and I think that I use both depending on
how the code reads.

> Matz has said several times that he will maximize the breakage moving
> to Ruby 2.0. If Matz is going to make these changes for Ruby 2.0, (as
> implied in Guy Decoux's posting) I think I will just follow along. My
> goal is to provide Ruby 2.0 forward compatible unicode support until
> the move is complete.

Yes, I understand. Making #size and #length return different values
is a mistake. Without referring to documentation, how would you know
which one returns the number of characters and which one returns the
number of bytes?

They should always *either* return characters or bytes (preferably
characters) and a separate call should be introduced for the
alternative meaning. One that is explicit in its name to match its
meaning.

-austin
--
Austin Ziegler * halosta...@gmail.com * http://www.halostatue.ca/
               * aus...@halostatue.ca * http://www.halostatue.ca/feed/
               * aus...@zieglers.ca


gwtm...@mac.com
From: gwtm...@mac.com
Date: Thu, 15 Jun 2006 01:24:19 +0900
Local: Wed, Jun 14 2006 12:24 pm
Subject: Re: A plan for another unicode string hack

On Jun 14, 2006, at 12:17 PM, Austin Ziegler wrote:

> They should always *either* return characters or bytes (preferably
> characters) and a separate call should be introduced for the
> alternative meaning. One that is explicit in its name to match its
> meaning.

+1

Gary Wright


Paul Battley
From: "Paul Battley" <pbatt...@gmail.com>
Date: Thu, 15 Jun 2006 01:50:44 +0900
Local: Wed, Jun 14 2006 12:50 pm
Subject: Re: A plan for another unicode string hack
On 14/06/06, Dae San Hwang <dae...@gmail.com> wrote:

> For now, it will support only utf-8 encoding as ruby's regexp doesn't
> seem to support encodings other than ascii and utf-8. (I could use
> iconv to convert encoding internally to utf-8 for each method call,
> but at the moment, I think it's probably too costly and not worth it.)

Regexp also supports EUC (which seems to work for EUC-KR as well as
EUC-JP, incidentally) and Shift_JIS. Nevertheless, I think that
starting with UTF-8 is the way to go.

> I would love to get some feedback on this. Matz's feedback would be
> especially welcome since I want to make this as forward compatible
> as possible with Ruby 2.0.

I think it's a great idea. If you want any implementation assistance,
I'd be glad to help (I've done quite a bit of Unicode hacking in
Ruby).

Paul.


Daniel Schierbeck
From: Daniel Schierbeck <daniel.schierb...@gmail.com>
Date: Thu, 15 Jun 2006 02:59:16 +0900
Local: Wed, Jun 14 2006 1:59 pm
Subject: Re: A plan for another unicode string hack

Dae San Hwang wrote:
> String literals will act as byte buffers, just as they used to. However,
> when creating a string object with the constructor, you can optionally
> specify the encoding of the input string.

>  String.new("\352\260\200", "utf-8")

I'd like to have a different interface, using named parameters.

   String.new("\352\260\200", encoding: "utf-8")

or

   String.new("\352\260\200", :encoding => "utf-8")

That way it's easier to extend String later on.
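
As it happens, recent Ruby releases accept a keyword much like the first spelling: `String.new` takes an `encoding:` argument that tags the given bytes with that encoding. A minimal sketch, assuming a Ruby new enough to support the keyword (and therefore not the Ruby 1.8 under discussion):

```ruby
raw = "\352\260\200".b  # three raw bytes, tagged as binary

s = String.new(raw, encoding: "utf-8")

p raw.length  # binary string: one "character" per byte
p s.length    # the same bytes interpreted as UTF-8
p s.encoding
```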

Cheers,
Daniel


Leslie Viljoen
From: "Leslie Viljoen" <leslievilj...@gmail.com>
Date: Thu, 15 Jun 2006 03:56:05 +0900
Local: Wed, Jun 14 2006 2:56 pm
Subject: Re: A plan for another unicode string hack
On 6/14/06, Dae San Hwang <dae...@gmail.com> wrote:

This behaviour - of [] returning different values depending on the
argument has always made me a bit crazy. Does anyone know why it was
done that way?

Leslie Viljoen
From: "Leslie Viljoen" <leslievilj...@gmail.com>
Date: Thu, 15 Jun 2006 03:56:48 +0900
Local: Wed, Jun 14 2006 2:56 pm
Subject: Re: A plan for another unicode string hack
On 6/14/06, Leslie Viljoen <leslievilj...@gmail.com> wrote:

..returning different *type* values I mean..

Dave Howell
From: Dave Howell <gro...@grandfenwick.net>
Date: Thu, 15 Jun 2006 05:46:04 +0900
Local: Wed, Jun 14 2006 4:46 pm
Subject: Re: A plan for another unicode string hack

On Jun 14, 2006, at 9:17, Austin Ziegler wrote:

> Yes, I understand. Making #size and #length return different values
> is a mistake. Without referring to documentation, how would you know
> which one returns the number of characters and which one returns the
> number of bytes?

I cannot agree. "Length" (to me) unavoidably implies that it's the
answer to the question "How LONG is it?" I expect the answer to be "n
characters long."

"Size" is the answer to "How BIG is it?" as in "How much space does
this thing take up?" and if it's a UTF-8 string, I expect an answer
like "1 byte per character + one to three more bytes per character
not in the 7-bit ASCII range"

I would never have to look up which is which. Obviously Austin's
mileage varies.
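
To put numbers on the "how big" question: UTF-8 spends between one and four bytes on a character, depending on the code point. The sketch below assumes a current Ruby (1.9 or later), where `bytesize` reports bytes and `length` reports characters; the sample characters are arbitrary picks from each size class.

```ruby
# One sample character from each UTF-8 size class.
samples = ["a", "\u00E9", "\uAC00", "\u{1F600}"]

samples.each do |ch|
  puts format("U+%04X  length=%d  bytesize=%d", ch.ord, ch.length, ch.bytesize)
end

byte_counts = samples.map(&:bytesize)
```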


Austin Ziegler
From: "Austin Ziegler" <halosta...@gmail.com>
Date: Thu, 15 Jun 2006 05:51:04 +0900
Local: Wed, Jun 14 2006 4:51 pm
Subject: Re: A plan for another unicode string hack
On 6/14/06, Dave Howell <gro...@grandfenwick.net> wrote:

As much as I like to say that I'm "from Ruby" these days, not everyone
will be. Some languages use string.length(); others use string.size().
I do not think that the proposed distinction is meaningful, and I think
it presents problems.

-austin
--
Austin Ziegler * halosta...@gmail.com * http://www.halostatue.ca/
               * aus...@halostatue.ca * http://www.halostatue.ca/feed/
               * aus...@zieglers.ca


Suraj N. Kurapati
From: "Suraj N. Kurapati" <skura...@ucsc.edu>
Date: Thu, 15 Jun 2006 13:13:40 +0900
Local: Thurs, Jun 15 2006 12:13 am
Subject: Re: A plan for another unicode string hack
Dae San Hwang wrote:
> The reason I'm differentiating between 'size' and 'length' is that
> some libraries (like Rails) depend on them returning the byte size of
> the string. Maybe we can establish a convention that 'size' means the
> byte size and 'length' means the number of characters. The same
> reasoning goes for '[]' and 'slice'.

Good idea. This separation of 'length' and 'size' methods is quite
reasonable, in my opinion.


Joel VanderWerf
From: Joel VanderWerf <vj...@path.berkeley.edu>
Date: Thu, 15 Jun 2006 13:52:25 +0900
Local: Thurs, Jun 15 2006 12:52 am
Subject: Re: A plan for another unicode string hack

That's not a bad argument, but Hash#size and Array#size don't behave
that way in ruby.

--
      vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407


Dave Burt
From: Dave Burt <d...@burt.id.au>
Date: Thu, 15 Jun 2006 16:33:07 +0900
Local: Thurs, Jun 15 2006 3:33 am
Subject: Re: A plan for another unicode string hack

Leslie Viljoen wrote:
> On 6/14/06, Leslie Viljoen <leslievilj...@gmail.com> wrote:
>> On 6/14/06, Dae San Hwang <dae...@gmail.com> wrote:
>> >    so that "abc"[0] returns "a" instead of fixnum 97

>> This behaviour - of [] returning different values depending on the
>> argument has always made me a bit crazy. Does anyone know why it was
>> done that way?

> ..returning different *type* values I mean..

I've heard it's due to be fixed by end of next year.

Now, to Ruby's strings, a character is a byte, represented by a Fixnum.

The new Ruby character will be a string:

?c  #=> "c"
"c"[0]  #=> "c"
"c"[0].ord  #=> 99

Cheers,
Dave


Leslie Viljoen
From: "Leslie Viljoen" <leslievilj...@gmail.com>
Date: Thu, 15 Jun 2006 17:05:46 +0900
Local: Thurs, Jun 15 2006 4:05 am
Subject: Re: A plan for another unicode string hack
On 6/15/06, Dave Burt <d...@burt.id.au> wrote:

Yahoo!

Leslie Viljoen
From: "Leslie Viljoen" <leslievilj...@gmail.com>
Date: Thu, 15 Jun 2006 17:08:41 +0900
Local: Thurs, Jun 15 2006 4:08 am
Subject: Re: A plan for another unicode string hack
On 6/15/06, Joel VanderWerf <vj...@path.berkeley.edu> wrote:

I agree with Austin on this - the distinction is too vague. I'd leave
length and size the same and make a size_in_bytes method.

Les


Yukihiro Matsumoto
From: Yukihiro Matsumoto <m...@ruby-lang.org>
Date: Thu, 15 Jun 2006 17:55:41 +0900
Local: Thurs, Jun 15 2006 4:55 am
Subject: Re: A plan for another unicode string hack
Hi,

In message "Re: A plan for another unicode string hack"
    on Thu, 15 Jun 2006 17:08:41 +0900, "Leslie Viljoen" <leslievilj...@gmail.com> writes:

|I agree with Austin on this - the distinction is too vague. I'd leave
|length and size the same and make a size_in_bytes method.

On my latest prototype (not checked in anywhere), String#length and
String#size behave the same, and there is String#buffer_size to return
the size in bytes.  The method name might change in the future.

                                                        matz.


Paul Battley
From: "Paul Battley" <pbatt...@gmail.com>
Date: Thu, 15 Jun 2006 18:26:24 +0900
Local: Thurs, Jun 15 2006 5:26 am
Subject: Re: A plan for another unicode string hack
On 15/06/06, Yukihiro Matsumoto <m...@ruby-lang.org> wrote:

> |I agree with Austin on this - the distinction is too vague. I'd leave
> |length and size the same and make a size_in_bytes method.

> On my latest prototype (not checked in anywhere), String#length and
> String#size behave the same, and there is String#buffer_size to return
> the size in bytes.  The method name might change in the future.

Actually, this makes a lot of sense. Why would you ever want to know
the actual byte length of a UTF-8 string? It's pretty meaningless for
most string-processing tasks: the main times you would need it would
be in allocation and interfacing with external systems and libraries.
Thus, something like buffer_size maps to real-world usage extremely
well, in my opinion.
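
In released Ruby (1.9 and later) the byte count ended up under the name `String#bytesize` rather than `buffer_size`. The sketch below, which assumes a current Ruby and uses an in-memory `StringIO` as a stand-in for an external system, illustrates the kind of interfacing case described above: what an IO object reports is bytes written, which matches `bytesize` and not `length`.

```ruby
require "stringio"

s = "l\u00E0"  # two characters, three bytes in UTF-8

# String.new with no arguments returns an empty binary (ASCII-8BIT)
# string, so the buffer stores the bytes exactly as written.
io = StringIO.new(String.new)
written = io.write(s)  # IO-style write returns the number of bytes

p s.length    # characters
p s.bytesize  # bytes
p written     # bytes handed to the "external" buffer
```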

Paul.


Leslie Viljoen
From: "Leslie Viljoen" <leslievilj...@gmail.com>
Date: Thu, 15 Jun 2006 18:43:59 +0900
Local: Thurs, Jun 15 2006 5:43 am
Subject: Re: A plan for another unicode string hack
On 6/15/06, Paul Battley <pbatt...@gmail.com> wrote:

> On 15/06/06, Yukihiro Matsumoto <m...@ruby-lang.org> wrote:
> > |I agree with Austin on this - the distinction is too vague. I'd leave
> > |length and size the same and make a size_in_bytes method.

> > On my latest prototype (not checked in anywhere), String#length and
> > String#size behave the same, and there is String#buffer_size to return
> > the size in bytes.  The method name might change in the future.

> Actually, this makes a lot of sense. Why would you ever want to know
> the actual byte length of a UTF-8 string? It's pretty meaningless for
> most string-processing tasks: the main times you would need it would
> be in allocation and interfacing with external systems and libraries.
> Thus, something like buffer_size maps to real-world usage extremely
> well, in my opinion.

Of course the confusion here is caused by measurement units. Size in
bytes or size in characters? Length and size don't (clearly) indicate
that distinction, and neither does buffer_size. The name should
indicate the unit so that you could immediately see that adding (e.g.)
length_in_characters to length_in_bytes would be an error.
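
One way to picture the unit argument is to carry the unit along with the number, so that the mistaken addition fails loudly instead of only looking wrong. This is a hypothetical sketch for a current Ruby: `Count`, `length_in_characters`, and `length_in_bytes` are illustrative names taken from or invented for this message, not methods that exist on String.

```ruby
# Hypothetical unit-tagged count: adding mismatched units raises.
Count = Struct.new(:value, :unit) do
  def +(other)
    unless unit == other.unit
      raise ArgumentError, "unit mismatch: #{unit} vs #{other.unit}"
    end
    self.class.new(value + other.value, unit)
  end
end

# Illustrative helpers that name the unit they return.
def length_in_characters(str)
  Count.new(str.length, :characters)
end

def length_in_bytes(str)
  Count.new(str.bytesize, :bytes)
end

s = "l\u00E0"
total_bytes = length_in_bytes(s) + length_in_bytes("abc")
p total_bytes.value

begin
  length_in_characters(s) + length_in_bytes(s)
rescue ArgumentError => e
  puts e.message
end
```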

Here's some naming convention insight:
http://www.joelonsoftware.com/articles/Wrong.html

Les


Messages 1 - 25 of 41