Unicode Normalization


Marcos Fragomeni

Aug 5, 2010, 2:39:39 PM
to xtf-...@googlegroups.com, Joao Alberto de Oliveira Lima
Hi Dan,

We needed to normalize Unicode strings before
indexing because our data providers use both
forms of diacritics: combining sequences (like e^) and
precomposed characters (like ê).

More on unicode normalization
http://en.wikipedia.org/wiki/Unicode_equivalence

To accomplish that I wrote a "UnicodeNormalizationFilter.java"
(attached) and applied it just after LowerCaseFilter in
XTFTextAnalyzer.

In case you want to add it to XTF, there is one issue:
the java.text.Normalizer class has only been available
since Java 1.6.
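
For readers without the attachment, here is a minimal sketch of what normalizing a term with java.text.Normalizer looks like. It is not the attached filter: the class and helper names are hypothetical, and the choice of NFC (canonical composition) is an assumption, since the thread does not say which normalization form was used.

```java
import java.text.Normalizer;

// Illustrative sketch only: this is not the attached UnicodeNormalizationFilter.java.
class NormalizationSketch {

    // Hypothetical helper: returns the NFC (canonical composition) form of a term.
    // NFC is an assumption here; the thread does not name the form used.
    static String toNfc(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String combining = "e\u0302";   // 'e' followed by U+0302 COMBINING CIRCUMFLEX ACCENT
        String precomposed = "\u00EA";  // U+00EA LATIN SMALL LETTER E WITH CIRCUMFLEX

        // The raw strings differ, so an index would treat them as different terms.
        System.out.println(combining.equals(precomposed));               // false
        // After normalization both spellings yield the same term.
        System.out.println(toNfc(combining).equals(toNfc(precomposed))); // true
    }
}
```

In an analyzer chain, a call like this would presumably be applied to each token's text after lower-casing, as described above.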

Cheers,

Marcos Fragomeni
Systems Analyst
Federal Senate of Brazil

Marcos Fragomeni

Aug 5, 2010, 2:51:14 PM
to xtf-...@googlegroups.com, Joao Alberto de Oliveira Lima
Now with the attachment.


Cheers,

Marcos Fragomeni
Systems Analyst
Federal Senate of Brazil



UnicodeNormalizationFilter.java

Martin Haye

Aug 5, 2010, 3:12:55 PM
to xtf-...@googlegroups.com, Joao Alberto de Oliveira Lima
Hello Marcos,

Actually, there is already a class in XTF that performs normalization, but it doesn't appear to be used for XML documents. I'm a bit surprised that Java's XML parser isn't already normalizing. Anyway, here's the class: WEB-INF/org/cdlib/xtf/util/Normalizer.java

Any chance your filter could use the XTF class? Then it'll work with Java 1.5 as well.

--Martin



From: Marcos Fragomeni <marcos.f...@gmail.com>
Reply-To: <xtf-...@googlegroups.com>
Date: Thu, 5 Aug 2010 15:51:14 -0300

To: <xtf-...@googlegroups.com>
Cc: Joao Alberto de Oliveira Lima <joao...@senado.gov.br>
Subject: [xtf-devel] Fwd: Unicode Normalization

Marcos Fragomeni

Aug 5, 2010, 4:30:02 PM
to xtf-...@googlegroups.com, Joao Alberto de Oliveira Lima
Thanks, Martin.

I'm sending the class, changed to use WEB-INF/org/cdlib/xtf/util/Normalizer.java.
UnicodeNormalizationFilter.class

Marcos Fragomeni

Aug 5, 2010, 4:32:30 PM
to xtf-...@googlegroups.com, Joao Alberto de Oliveira Lima
Here is the .java file.

---------- Forwarded message ----------
From: Marcos Fragomeni <marcos.f...@gmail.com>
Date: 2010/8/5
Subject: Re: [xtf-devel] Fwd: Unicode Normalization
To: xtf-...@googlegroups.com
Cc: Joao Alberto de Oliveira Lima <joao...@senado.gov.br>


UnicodeNormalizationFilter.java

Martin Haye

Aug 11, 2010, 7:53:04 PM
to xtf-...@googlegroups.com, Joao Alberto de Oliveira Lima
Hi Marcos,

I'm making good progress integrating normalization into the indexer. Two questions:

  1. Do you have a sample file containing some non-normalized words so I can test the code?
  2. Should I also add normalization to query term processing, or should we rely on web browsers to do it?

Thanks,

--Martin



From: Marcos Fragomeni <marcos.f...@gmail.com>
Reply-To: <xtf-...@googlegroups.com>
Date: Thu, 5 Aug 2010 17:32:30 -0300

To: <xtf-...@googlegroups.com>
Cc: Joao Alberto de Oliveira Lima <joao...@senado.gov.br>
Subject: Fwd: [xtf-devel] Fwd: Unicode Normalization

Marcos Fragomeni

Aug 12, 2010, 10:22:57 AM
to xtf-...@googlegroups.com, Joao Alberto de Oliveira Lima
Hi Martin,

I've attached a sample file with some non-normalized characters taken from our database.

I believe we should normalize query terms as well. I haven't found anything indicating
that the HTTP protocol guarantees Unicode normalization, and normalization won't
add much overhead to query term processing.
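
To illustrate why both sides need the same treatment, here is a rough sketch in which indexed terms and query terms pass through one shared normalization step. The class, the method name, and the toy in-memory index are hypothetical and are not XTF code; NFC is again an assumption. The isNormalized check is one way to keep the cost low for input that is already normalized.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; the names here are hypothetical, not XTF code.
class QueryNormalizationSketch {

    // Applies the same steps to indexed terms and query terms: lower-case, then NFC.
    static String normalizeTerm(String term) {
        String lower = term.toLowerCase(Locale.ROOT);
        // Cheap check first: text that is already in NFC is returned unchanged.
        if (Normalizer.isNormalized(lower, Normalizer.Form.NFC)) {
            return lower;
        }
        return Normalizer.normalize(lower, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Toy stand-in for an index: a set of normalized terms.
        Set<String> index = new HashSet<String>();
        index.add(normalizeTerm("voc\u00EA"));   // indexed from precomposed data

        String query = "VOCE\u0302";             // query sent with a combining accent

        // Looking up the raw query misses the indexed term.
        System.out.println(index.contains(query));                // false
        // Normalizing the query the same way as the index finds it.
        System.out.println(index.contains(normalizeTerm(query))); // true
    }
}
```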


Marcos Fragomeni

marcos.f...@gmail.com
(61) 8165-4874

"A felicidade é o caminho!"


2010/8/11 Martin Haye <marti...@ucop.edu>
non-normalized.txt