Date: Mon, 19 Jun 2006 03:11:32 +0900
Posted: Sun, 18 Jun 2006 11:09:36 -0700
From: Rob Leslie <r...@mars.org>
Reply-To: ruby-t...@ruby-lang.org
Subject: Ruby/Unicode library
To: ruby-t...@ruby-lang.org (ruby-talk ML)

Since there's been a lot of talk about Unicode lately, I thought I'd
throw out a Ruby library I've been working on to support Unicode
characters and strings based on the 4.1.0 standard and key
specifications from the Unicode Consortium.

   ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2

The library adds an encoding property to native String objects, and
allows conversion to and from Unicode::String and Unicode::Character.
A default encoding is chosen based on $KCODE, or the default can be
set/changed explicitly via String.default_encoding.

Unicode strings can be obtained by applying the + unary operator to
native strings, e.g. +"Hello" (where the native string is encoded in
the default encoding).

   % irb -I. -runicode -Ku
   irb(main):001:0> ustr = +"π is pi"
   => +"π is pi"

Native strings are obtained from Unicode strings by calling to_s,
which accepts an optional argument to indicate the desired encoding.

   irb(main):002:0> str = ustr.to_s
   => "π is pi"
   irb(main):003:0> str.encoding
   => Unicode::Encoding::UTF8
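
For readers on Ruby 1.9 or later, roughly the same round trip can be tried with the String methods built into the language. This is a sketch for comparison only: it uses core Ruby's String#encoding and String#encode, not the Unicode library described in this post, and the choice of UTF-16BE as the target encoding is just an example.

```ruby
# Core Ruby (1.9+) only; this is NOT the Unicode library's API.
str   = "\u03C0 is pi"                    # "π is pi"; \u escapes yield a UTF-8 string
utf16 = str.encode(Encoding::UTF_16BE)    # re-encode the same text as big-endian UTF-16
back  = utf16.encode(Encoding::UTF_8)     # and convert it back again

puts str.encoding      # UTF-8
puts utf16.bytesize    # 14 (7 characters, 2 bytes each)
puts back == str       # true
```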

Individual characters can be indexed from Unicode strings, returning
a Unicode::Character object.

   irb(main):004:0> ustr[0]
   => U+03C0 GREEK SMALL LETTER PI
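
The U+03C0 part of that result is the character's code point in hexadecimal. As a rough illustration using only core Ruby, the code point of an indexed character can be formatted as below. The character names come from the library's bundled Unicode Character Database and are not reproduced here, and code_point_label is a hypothetical helper name, not something the library provides.

```ruby
# Core Ruby only; `code_point_label` is a made-up helper, not part of the library.
def code_point_label(char)
  format("U+%04X", char.ord)   # e.g. 0x3C0 becomes "U+03C0"
end

str = "\u03C0 is pi"           # "π is pi"
puts code_point_label(str[0])  # U+03C0
puts code_point_label(str[2])  # U+0069
```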

Case conversion is handled as with native strings.

   irb(main):005:0> ustr.upcase
   => +"Π IS PI"
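
Core Ruby gained comparable behaviour later: from Ruby 2.4 on, String#upcase and String#downcase apply Unicode case mappings rather than ASCII-only ones. A small check, again using built-in methods rather than this library:

```ruby
# Core Ruby 2.4+ only; not the Unicode library's API.
str   = "\u03C0 is pi"           # "π is pi"
upper = str.upcase               # Unicode-aware case mapping

puts upper == "\u03A0 IS PI"     # true: U+03C0 maps to U+03A0 (GREEK CAPITAL LETTER PI)
puts upper.downcase == str       # true
```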

Normalization is accomplished with the ~ unary operator.

   irb(main):006:0> ustr = +"mí"
   => +"mí"
   irb(main):007:0> ustr.to_a
   => [U+006D LATIN SMALL LETTER M, U+00ED LATIN SMALL LETTER I WITH ACUTE]
   irb(main):008:0> (~ustr).each_char { |ch| p ch }
   U+006D LATIN SMALL LETTER M
   U+0069 LATIN SMALL LETTER I
   U+0301 COMBINING ACUTE ACCENT
   => +"mí"
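
To make the decomposition above concrete, here is a minimal sketch of how a unary ~ can be wired to normalization in plain Ruby. The NormalizableText class is hypothetical and only stands in for the library's Unicode::String. It leans on core Ruby's String#unicode_normalize (Ruby 2.2+), and NFD is an assumption, since the session shows a canonical decomposition but the post doesn't name the normalization form.

```ruby
# Hypothetical wrapper; not the library's Unicode::String.
class NormalizableText
  def initialize(text)
    @text = text
  end

  # Unary ~ returns a new object holding the decomposed (NFD, assumed) text.
  def ~
    NormalizableText.new(@text.unicode_normalize(:nfd))
  end

  # Code points formatted like the session above, but without character names.
  def labels
    @text.codepoints.map { |cp| format("U+%04X", cp) }
  end
end

text = NormalizableText.new("m\u00ED")   # "mí" with a precomposed i-acute
p(text.labels)       # ["U+006D", "U+00ED"]
p((~text).labels)    # ["U+006D", "U+0069", "U+0301"]
```

Returning a new object from ~ rather than modifying the receiver mirrors the session above, where ustr still holds the precomposed form after (~ustr) is evaluated.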

There is much more -- character properties, text boundaries (grapheme
clusters and words), Hangul decompositions, modular encodings (ASCII,
Latin1, EUC, SJIS, UTF32, UTF16, UTF8) -- yet the project is
unfinished. If anyone is interested in helping develop it further,
let me know.

The library incorporates the entire Unicode 4.1.0 Character Database
(demand-loaded!), which is why the archive is rather large.

Cheers,

-- 
Rob Leslie
r...@mars.org



