Ruby/Unicode library
From: Rob Leslie <r...@mars.org>
Date: Mon, 19 Jun 2006 03:11:32 +0900
Local: Sun, Jun 18 2006 2:11 pm
Subject: Ruby/Unicode library
Since there's been a lot of talk about Unicode lately, I thought I'd  
throw out a Ruby library I've been working on to support Unicode  
characters and strings based on the 4.1.0 standard and key  
specifications from the Unicode Consortium.

   ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2

The library adds an encoding property to native String objects, and  
allows conversion to and from Unicode::String and Unicode::Character.  
A default encoding is chosen based on $KCODE, or the default can be  
set/changed explicitly via String.default_encoding.

Unicode strings can be obtained by applying the + unary operator to  
native strings, e.g. +"Hello" (where the native string is encoded in  
the default encoding).

   % irb -I. -runicode -Ku
   irb(main):001:0> ustr = +"π is pi"
   => +"π is pi"

Native strings are obtained from Unicode strings by calling to_s,  
which accepts an optional argument to indicate the desired encoding.

   irb(main):002:0> str = ustr.to_s
   => "π is pi"
   irb(main):003:0> str.encoding
   => Unicode::Encoding::UTF8
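For comparison, here is a minimal sketch of the same idea (a string that carries its own encoding and can be transcoded) using only the String#encoding and String#encode methods built into Ruby 1.9 and later. It does not use the Unicode:: classes described in this post, and the variable names are illustrative only.

```ruby
# Assumption: core Ruby 1.9+ only; this is not the Unicode:: API from the post.
str = "\u03C0 is pi"                      # \u escapes always yield a UTF-8 string
utf16 = str.encode(Encoding::UTF_16BE)    # transcode to a different encoding
round_trip = utf16.encode(Encoding::UTF_8)

puts str.encoding       # the encoding travels with the string
puts str.bytesize       # GREEK SMALL LETTER PI takes two bytes in UTF-8
puts utf16.bytesize     # every character here takes two bytes in UTF-16
puts round_trip == str  # transcoding back restores the original
```

The character count is identical in both encodings; only the byte representation changes.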

Individual characters can be indexed from Unicode strings, returning  
a Unicode::Character object.

   irb(main):004:0> ustr[0]
   => U+03C0 GREEK SMALL LETTER PI
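A rough equivalent in core Ruby (1.9+), again as a sketch rather than this library's API: indexing returns a one-character String, and String#ord gives its code point. Core Ruby has no character-name lookup as far as I know, so only the U+XXXX label is reproduced here.

```ruby
# Assumption: core Ruby 1.9+; Unicode::Character from the post is not used.
ustr = "\u03C0 is pi"
first = ustr[0]                          # a one-character String, not a character object
codepoint = first.ord                    # integer code point of that character
label = format("U+%04X", codepoint)      # conventional U+XXXX notation

puts label
puts ustr.codepoints.length              # number of code points in the string
```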

Case conversion is handled as with native strings.

   irb(main):005:0> ustr.upcase
   => +"Π IS PI"
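For reference, Ruby 2.4 and later apply Unicode case mapping in the built-in String#upcase and String#downcase, so the same result can be sketched without this library. The second example shows a mapping that changes the string length.

```ruby
# Assumption: Ruby 2.4+ built-in case mapping; not the Unicode::String#upcase from the post.
greek = "\u03C0 is pi".upcase            # small pi (U+03C0) maps to capital pi (U+03A0)
german = "stra\u00DFe".upcase            # sharp s (U+00DF) expands to "SS"

puts greek == "\u03A0 IS PI"
puts german
puts "\u03A0".downcase == "\u03C0"
```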

Normalization is accomplished with the ~ unary operator.

   irb(main):006:0> ustr = +"mí"
   => +"mí"
   irb(main):007:0> ustr.to_a
   => [U+006D LATIN SMALL LETTER M, U+00ED LATIN SMALL LETTER I WITH ACUTE]
   irb(main):008:0> (~ustr).each_char { |ch| p ch }
   U+006D LATIN SMALL LETTER M
   U+0069 LATIN SMALL LETTER I
   U+0301 COMBINING ACUTE ACCENT
   => +"mí"
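The decomposition shown above corresponds to canonical decomposition (NFD). A minimal sketch of the same round trip with String#unicode_normalize, which ships with Ruby 2.2 and later, follows. It stands in for the ~ operator of this library rather than reproducing it.

```ruby
# Assumption: Ruby 2.2+ String#unicode_normalize; not the ~ operator from the post.
composed = "m\u00ED"                                  # m + i-with-acute (one code point)
decomposed = composed.unicode_normalize(:nfd)         # m + i + combining acute accent
labels = decomposed.each_char.map { |ch| format("U+%04X", ch.ord) }
recomposed = decomposed.unicode_normalize(:nfc)       # back to the composed form

puts labels.join(" ")
puts composed == decomposed    # false: String#== compares the underlying bytes
puts composed == recomposed    # true after recomposition
```

Because the two forms render identically but compare unequal, normalizing both sides to one form before comparison is the usual precaution.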

There is much more -- character properties, text boundaries (grapheme  
clusters and words), Hangul decompositions, modular encodings (ASCII,  
Latin1, EUC, SJIS, UTF32, UTF16, UTF8) -- yet the project is  
unfinished. If anyone is interested in helping develop it further,  
let me know.
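To make two of these features concrete, here is a sketch, not taken from this library. It shows grapheme clusters via String#grapheme_clusters (Ruby 2.5+) and the arithmetic Hangul syllable decomposition defined in the Unicode Standard. The helper name decompose_hangul is hypothetical; the constants are the ones the standard specifies.

```ruby
# Assumption: Ruby 2.5+; decompose_hangul is an illustrative helper, not the library's API.

# Grapheme clusters: a base letter plus a combining mark is one user-perceived character.
text = "mi\u0301"                         # m, i, combining acute accent
clusters = text.grapheme_clusters         # groups i + accent into one cluster

# Hangul syllable decomposition constants from the Unicode Standard.
S_BASE = 0xAC00   # first precomposed syllable
L_BASE = 0x1100   # first leading consonant
V_BASE = 0x1161   # first vowel
T_BASE = 0x11A7   # one before the first trailing consonant
T_COUNT = 28
N_COUNT = 21 * T_COUNT    # 588 vowel/trailing combinations per leading consonant
S_COUNT = 19 * N_COUNT    # 11172 precomposed syllables

# Returns the jamo code points for a precomposed syllable, or the input unchanged.
def decompose_hangul(codepoint)
  s_index = codepoint - S_BASE
  return [codepoint] unless (0...S_COUNT).cover?(s_index)

  jamo = [L_BASE + s_index / N_COUNT,
          V_BASE + (s_index % N_COUNT) / T_COUNT]
  t_index = s_index % T_COUNT
  jamo << T_BASE + t_index unless t_index.zero?   # no trailing consonant when zero
  jamo
end

han = decompose_hangul(0xD55C)            # HANGUL SYLLABLE HAN
puts clusters.length
puts han.map { |cp| format("U+%04X", cp) }.join(" ")
```

The arithmetic result can be cross-checked against "\uD55C".unicode_normalize(:nfd).codepoints, which is computed independently by Ruby's own normalization tables.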

The library incorporates the entire Unicode 4.1.0 Character Database  
(demand-loaded!) which is why the archive is rather large.

Cheers,

--
Rob Leslie
r...@mars.org


From: Julian 'Julik' Tarkhanov <list...@julik.nl>
Date: Mon, 19 Jun 2006 03:51:11 +0900
Local: Sun, Jun 18 2006 2:51 pm
Subject: Re: Ruby/Unicode library

On 18-jun-2006, at 20:11, Rob Leslie wrote:

> Since there's been a lot of talk about Unicode lately, I thought  
> I'd throw out a Ruby library I've been working on to support  
> Unicode characters and strings based on the 4.1.0 standard and key  
> specifications from the Unicode Consortium.

Holy wow. But the tables are just _huge_.

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl


From: Rob Leslie <r...@mars.org>
Date: Mon, 19 Jun 2006 04:12:27 +0900
Local: Sun, Jun 18 2006 3:12 pm
Subject: Re: Ruby/Unicode library
On Jun 18, 2006, at 11:51 AM, Julian 'Julik' Tarkhanov wrote:

>> Since there's been a lot of talk about Unicode lately, I thought  
>> I'd throw out a Ruby library I've been working on to support  
>> Unicode characters and strings based on the 4.1.0 standard and key  
>> specifications from the Unicode Consortium.

> Holy wow. But the tables are just _huge_.

I should point out that I'm not presently using most of these tables;  
Unihan.txt alone is 27M. They're included purely for completeness as  
I've been developing the library.

No doubt the actual data storage requirements can be reduced  
considerably.

--
Rob Leslie
r...@mars.org


From: "Paul Battley" <pbatt...@gmail.com>
Date: Mon, 19 Jun 2006 06:33:59 +0900
Local: Sun, Jun 18 2006 5:33 pm
Subject: Re: Ruby/Unicode library
On 18/06/06, Rob Leslie <r...@mars.org> wrote:

> I should point out that I'm not presently using most of these tables;
> Unihan.txt alone is 27M. They're included purely for completeness as
> I've been developing the library.

> No doubt the actual data storage requirements can be reduced
> considerably.

That's an impressive achievement. It looks like a textbook
implementation. Thanks for sharing!

Coincidentally, I just dug up my own dormant UnicodeData.txt-based
effort - nowhere near as developed as yours - and hacked a bit on it
today, trying out some storage-reduction ideas. I'm looking forward to
trying things with your library.

Paul.

