Hebrew Transliteration


Yuval Adam

Feb 10, 2014, 9:47:35 AM
to pywe...@googlegroups.com
Does anyone know of transliteration packages that support Hebrew?

So far I've found
https://pypi.python.org/pypi/translitcodec
https://pypi.python.org/pypi/transliterate

But neither seems to support Hebrew just yet.

I need a simple function that can do
>>> transliterate('אבגד')
abgd

for as many languages as possible, Hebrew included.

Transliterate supports registering custom languages, and I can easily add Hebrew, but I'm wondering what other solutions I'm overlooking.
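
For illustration, the kind of function described above could be sketched as a plain character-for-character table. The Latin letters chosen here are an assumption, not an established romanization standard, and the table carries no vowel information, so the output is only a rough ASCII rendering.

```python
# Hypothetical Latin stand-ins for U+05D0..U+05EA (alef through tav, final forms
# included), listed in Unicode code point order. The choices are illustrative only.
LATIN = [
    "a", "b", "g", "d", "h", "v", "z", "ch", "t", "y",   # alef .. yod
    "k", "k", "l", "m", "m", "n", "n", "s", "a",         # final kaf .. ayin
    "f", "p", "ts", "ts", "k", "r", "sh", "t",           # final pe .. tav
]
HEBREW = [chr(cp) for cp in range(0x05D0, 0x05EB)]
assert len(HEBREW) == len(LATIN)

# str.translate accepts a table mapping one character to a replacement string.
_TABLE = str.maketrans(dict(zip(HEBREW, LATIN)))


def transliterate(text: str) -> str:
    """Replace each Hebrew letter with its Latin stand-in; leave the rest as-is."""
    return text.translate(_TABLE)


print(transliterate("אבגד"))  # abgd
```

Characters outside the table (digits, punctuation, other scripts) pass through unchanged, so other languages would need their own tables.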

Shlomi Fish

Feb 10, 2014, 11:44:33 AM
to pywe...@googlegroups.com, amir.a...@mail.huji.ac.il
Hi Yuval,

On Mon, 10 Feb 2014 16:47:35 +0200
Yuval Adam <yuv...@gmail.com> wrote:

> Does anyone know of transliteration packages that support Hebrew?
>
> So far I've found
> https://pypi.python.org/pypi/translitcodec
> https://pypi.python.org/pypi/transliterate
>
> But neither seem to support Hebrew just yet.
>
> I need a simple function that can do
> >>> transliterate('אבגד')
> abgd
>

Well, if by transliteration you mean תעתיק (rendering one human language in
another's script), and not the UNIX tr command, which blindly maps one
character to another (it is also implemented by Perl 5's tr/// and y///
operators and can be used for ASCII-only rot13, lowercasing/uppercasing, case
switching, or similar things), then the only thing I know of is this:

https://metacpan.org/release/Lingua-IT-Ita2heb

It's a CPAN module (written in Perl and based on Moose) that transliterates from
Italian to Hebrew. It was originally written by Amir Aharoni (CCed on this
message), and he and I later refactored it to use Moose. The code is modular but
quite complex, as natural language processing code goes. He considered adapting
it to transliterate to/from several other languages, but that has not been done
yet. I should note that it may originally have been intended as something rather
limited for some work on Wikimedia projects, and may not work in the general
case.

Properly transliterating from Hebrew (at least in Ktiv Chasser - without
diacritics) or from English will be much harder than from Italian.

> for as many languages as possible, Hebrew included.
>
> Transliterate supports registering custom languages, and I can easily add
> Hebrew, but I'm wondering what other solutions I'm overlooking.
>

Now I see that what you've shown is not exactly what linguists (and
non-software-developers) call transliteration
( https://en.wikipedia.org/wiki/Transliteration ), but rather something like
the UNIX tr command after all, which doesn't yield text with the same sound as
the original in a different alphabet/script+language. If so, I suggest simply
implementing something similar yourself and sending a pull request/patch/etc.
for inclusion in the upstream project, because implementing it yourself will
probably be faster than waiting for something similar to turn up.

Regards,

Shlomi Fish

--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/
First stop for Perl beginners - http://perl-begin.org/

And truth be told, I miss you.
And truth be told, I’m lying.
— The All American Rejects, “Gives You Hell”

Please reply to list if it's a mailing list post - http://shlom.in/reply .

Yuval Adam

Feb 10, 2014, 11:56:45 AM
to pywe...@googlegroups.com, amir.a...@mail.huji.ac.il
Thanks for the thorough answer, Shlomi.

I should've explained my use case:
I'm generating a default username from a user's first and last name, which might be in a non-English language, and I prefer it to consist of ASCII characters (even though Unicode is supported in this case). It can always be overridden later on, and I don't really care about precision. (Consider how Facebook does the same when generating default page names for non-English pages.)
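
A rough sketch of that use case, using only the standard library: accented Latin letters are reduced with NFKD normalization, and a small hypothetical table (`EXTRA`, covering just four Hebrew letters here) handles characters that normalization cannot. The function name, the `first.last` format, and the `"user"` fallback are all assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical and deliberately tiny; a real table would cover whole alphabets.
EXTRA = {"\u05d0": "a", "\u05d1": "b", "\u05d2": "g", "\u05d3": "d"}


def default_username(first: str, last: str, fallback: str = "user") -> str:
    """Build an ASCII-only default username; precision is not a goal."""
    raw = f"{first}.{last}".lower()
    # Apply the explicit table first (Hebrew letters have no ASCII decomposition).
    mapped = "".join(EXTRA.get(ch, ch) for ch in raw)
    # NFKD splits accented letters into a base letter plus combining marks,
    # which the ASCII encode step then drops.
    decomposed = unicodedata.normalize("NFKD", mapped)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Keep only lowercase letters, digits and dots; trim stray leading/trailing dots.
    cleaned = re.sub(r"[^a-z0-9.]+", "", ascii_only).strip(".")
    return cleaned or fallback


print(default_username("José", "Núñez"))  # jose.nunez
```

If nothing survives the conversion, the fallback is returned, which fits the idea that the user can override the default later.
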

Unidecode is a Python package that seems to get the job done in most languages -

Arik Baratz

Feb 10, 2014, 6:47:34 PM
to pywe...@googlegroups.com

What's the purpose of the transliterated text?

It's not as easy as it seems, because Hebrew words don't contain their vowels (for example, ברקת would be "bareket": the information that gives the first syllable the vowel 'a' and the second and third the vowel 'e' is not contained in the word itself, unless it has diacritics).

There's a scholar who has been working on precisely this type of problem for a while: Uzzi Ornan from the Technion. He also devised his own transliteration (or rather transcription) technique, which also helps with searches in Hebrew. It adds a few characters, based on the idea that each character should represent the correct pronunciation: for example, '$' for שׁ but 's' for שׂ, and an apostrophe for א or ע when they represent a consonant, as in the word אוכל, which becomes 'oxel (ח is 'x'), but not when they act as a vowel, so בא would be "ba". The result can then be converted to a more common script, e.g. $ --> sh, x --> ch or h, etc.
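
As a rough sketch of that last conversion step only (not Ornan's actual method or code), the two replacements mentioned above can be applied to an already-transcribed string. Picking "ch" rather than "h" for 'x', and leaving the apostrophe untouched, are assumptions; the function and table names are made up for illustration.

```python
# Only the two replacements named above; "ch" (rather than "h") for 'x' is an
# arbitrary choice made for this sketch.
COMMON = {"$": "sh", "x": "ch"}


def to_common_script(intermediate: str) -> str:
    """Convert an intermediate transcription to a more familiar Latin spelling."""
    return "".join(COMMON.get(ch, ch) for ch in intermediate)


print(to_common_script("'oxel"))  # 'ochel
```

The hard part, producing the intermediate form from unvocalized Hebrew, is not covered here.
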

There are some articles on his home page at http://www.cs.technion.ac.il/~ornan and additionally:

http://www.academia.edu/4751321/A_Morphologically-Analyzed_CHILDES_Corpus_of_Hebrew
http://delivery.acm.org/10.1145/1120000/1118645/p8-ornan.pdf

I can't find actual code but I didn't dig too deep.

-- Arik

Fruch

Mar 6, 2014, 12:34:18 AM
to pywe...@googlegroups.com
Yuval,

The package you mentioned is quite straightforward;
fixing it to also support Hebrew should be quite simple,
almost trivial: just find the Unicode code points for Hebrew and update that file.

Israel