Ruby/Unicode library
Date: Mon, 19 Jun 2006 03:11:32 +0900
Posted: Sun, 18 Jun 2006 11:09:36 -0700
From: Rob Leslie <r...@mars.org>
Reply-To: ruby-t...@ruby-lang.org
Subject: Ruby/Unicode library
To: ruby-t...@ruby-lang.org (ruby-talk ML)
Message-Id: <3D218B7A-07C1-4A6C-AE72-747F1460F0BF@mars.org>
X-ML-Name: ruby-talk
X-Mail-Count: 197946
Since there's been a lot of talk about Unicode lately, I thought I'd
throw out a Ruby library I've been working on to support Unicode
characters and strings based on the 4.1.0 standard and key
specifications from the Unicode Consortium.
ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2
The library adds an encoding property to native String objects, and
allows conversion to and from Unicode::String and Unicode::Character.
A default encoding is chosen based on $KCODE, or the default can be
set/changed explicitly via String.default_encoding.
Unicode strings can be obtained by applying the + unary operator to
native strings, e.g. +"Hello" (where the native string is encoded in
the default encoding).
% irb -I. -runicode -Ku
irb(main):001:0> ustr = +"π is pi"
=> +"π is pi"
Native strings are obtained from Unicode strings by calling to_s,
which accepts an optional argument to indicate the desired encoding.
irb(main):002:0> str = ustr.to_s
=> "π is pi"
irb(main):003:0> str.encoding
=> Unicode::Encoding::UTF8
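For comparison, here is a minimal sketch of the same idea (a string that carries its encoding and can be converted to another one) using only the String methods built into Ruby 1.9 and later. It does not use this library's API; hex_bytes is a hypothetical helper added only to make the byte-level difference visible.

```ruby
# Sketch with built-in Ruby (1.9+) encoding support, NOT the library's API.

# Hypothetical helper: render a string's bytes as space-separated hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

str = "\u03C0 is pi"                      # "π is pi"; the \u escape makes it UTF-8
puts str.encoding                         # encoding attached to the native string

utf16 = str.encode(Encoding::UTF_16BE)    # convert to a different encoding
puts utf16.encoding

puts hex_bytes(str[0])                    # bytes of π in UTF-8
puts hex_bytes(utf16[0])                  # bytes of π in UTF-16BE
```

The same character is two different byte sequences in the two encodings, which is why a string needs to know its encoding before it can be interpreted as Unicode characters.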
Individual characters can be indexed from Unicode strings, returning
a Unicode::Character object.
irb(main):004:0> ustr[0]
=> U+03C0 GREEK SMALL LETTER PI
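A rough equivalent of character indexing can be sketched with built-in Ruby strings (1.9 and later), where indexing works on characters rather than bytes. This is an illustration only: describe is a hypothetical helper, and the core String class does not provide character names such as GREEK SMALL LETTER PI, so only the code point is shown.

```ruby
# Sketch with built-in Ruby strings, NOT the library's Unicode::Character.

# Hypothetical helper: format a one-character string as U+XXXX.
def describe(ch)
  format("U+%04X", ch.ord)
end

ustr  = "\u03C0 is pi"      # "π is pi"
first = ustr[0]             # indexing returns the first character, not the first byte

puts describe(first)        # code point of π
puts first.bytesize         # π occupies two bytes in UTF-8
puts ustr.length            # length counted in characters
puts ustr.bytesize          # length counted in bytes
```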
Case conversion is handled as with native strings.
irb(main):005:0> ustr.upcase
=> +"Π IS PI"
Normalization is accomplished with the ~ unary operator.
irb(main):006:0> ustr = +"mí"
=> +"mí"
irb(main):007:0> ustr.to_a
=> [U+006D LATIN SMALL LETTER M, U+00ED LATIN SMALL LETTER I WITH ACUTE]
irb(main):008:0> (~ustr).each_char { |ch| p ch }
U+006D LATIN SMALL LETTER M
U+0069 LATIN SMALL LETTER I
U+0301 COMBINING ACUTE ACCENT
=> +"mí"
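The session above shows the precomposed U+00ED being split into a base letter plus a combining accent, which is what a canonical decomposition produces (the post does not say which normalization form the ~ operator applies). As a sketch under that assumption, the same decomposition can be reproduced with Ruby's built-in String#unicode_normalize (Ruby 2.2 and later), not with this library; code_points is a hypothetical helper.

```ruby
# Sketch with built-in String#unicode_normalize (Ruby 2.2+), NOT the library's ~ operator.

# Hypothetical helper: list a string's characters as U+XXXX code points.
def code_points(str)
  str.each_char.map { |ch| format("U+%04X", ch.ord) }
end

composed   = "m\u00ED"                          # "mí" with precomposed U+00ED
decomposed = composed.unicode_normalize(:nfd)   # canonical decomposition (NFD)

p code_points(composed)                         # two code points
p code_points(decomposed)                       # three: m, i, combining acute accent

# Recomposing (NFC) gives back the original precomposed string.
p decomposed.unicode_normalize(:nfc) == composed
```

Both forms render identically, so comparing strings reliably requires normalizing them to the same form first.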
There is much more -- character properties, text boundaries (grapheme
clusters and words), Hangul decompositions, modular encodings (ASCII,
Latin1, EUC, SJIS, UTF32, UTF16, UTF8) -- yet the project is
unfinished. If anyone is interested in helping develop it further,
let me know.
The library incorporates the entire Unicode 4.1.0 Character Database
(demand-loaded!) which is why the archive is rather large.
Cheers,
-- 
Rob Leslie
r...@mars.org