reading hebrew text file

hag...@gmail.com

Oct 17, 2005, 10:11:19 AM
I have a Hebrew text file, which I want to read in Python.
I don't know which encoding I need to use or how to do that.

thanks,
hagai

Alex Martelli

Oct 17, 2005, 10:20:19 AM
<hag...@gmail.com> wrote:

> I have a Hebrew text file, which I want to read in Python
> I don't know which encoding I need to use & how I do that

As for the "how", look to the codecs module -- but if you don't know
what codec the textfile is written in, I know of no ways to guess from
here!-)


Alex

jep...@unpythonic.net

Oct 17, 2005, 10:26:46 AM
to hag...@gmail.com, pytho...@python.org
I looked for "VAV" in the files in the "encodings" directory
(/usr/lib/python2.4/encodings/*.py on my machine). I found that the following
character encodings seem to include Hebrew characters:
cp1255
cp424
cp856
cp862
iso8859-8
A file containing Hebrew text might be in any one of these encodings, or
in any Unicode-based encoding such as UTF-8 or UTF-16.
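The same search can be done from inside the interpreter instead of by reading the codec source files. This is only a sketch, written for Python 3 (the session in this post is Python 2); the helper name `codecs_that_encode` and the extra `utf-8` and `latin-1` entries are illustrative additions, not part of the original list.

```python
# Hebrew letter VAV (U+05D5), the character searched for above.
VAV = "\u05d5"

def codecs_that_encode(char, names):
    """Return the codec names from `names` that can encode `char`."""
    supported = []
    for name in names:
        try:
            char.encode(name)
        except (UnicodeEncodeError, LookupError):
            # Either the codec lacks the character or the codec is unknown.
            continue
        supported.append(name)
    return supported

# The five codecs listed above, plus two extras for comparison (assumption).
candidates = ["cp1255", "cp424", "cp856", "cp862", "iso8859-8", "utf-8", "latin-1"]
hebrew_capable = codecs_that_encode(VAV, candidates)
print(hebrew_capable)
```

A codec that can encode VAV is only a candidate. Whether a given file actually uses it still has to be checked against the file's bytes.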

To open an encoded file for reading, import the codecs module and use
f = codecs.open(filename, 'r', encoding='...')
Now, calls like 'f.readline()' will return Unicode strings.

Here's an example, using a file in UTF-8 I have lying around:
>>> import codecs
>>> f = codecs.open("/users/jepler/txt/UTF-8-demo.txt", "r", "utf-8")
>>> for i in range(5): print repr(f.readline())
...
u'UTF-8 encoded sample plain-text file\n'
u'\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\u203e\n'
u'\n'
u'Markus Kuhn [\u02c8ma\u02b3k\u028as ku\u02d0n] <mk...@acm.org> \u2014 1999-08-20\n'
u'\n'
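For readers on Python 3, the built-in open() accepts an encoding argument and plays the role codecs.open plays in the session above. The sketch below assumes Python 3 and uses a throwaway file so it can run anywhere; the file name `hebrew.txt`, the helper `read_lines`, and the choice of cp1255 are illustrative only.

```python
import os
import tempfile

def read_lines(path, encoding):
    """Read a text file and return its lines as decoded (Unicode) strings."""
    with open(path, "r", encoding=encoding) as f:
        return f.readlines()

# Hypothetical sample: two Hebrew words, one per line.
sample = "\u05e9\u05dc\u05d5\u05dd\n\u05e2\u05d5\u05dc\u05dd\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hebrew.txt")
    # Write the sample as cp1255 bytes, then read it back with the same codec.
    with open(path, "w", encoding="cp1255", newline="") as f:
        f.write(sample)
    lines = read_lines(path, "cp1255")

for line in lines:
    print(ascii(line))  # escaped form, comparable to repr() in the session above
```

Reading the file with a different encoding than the one it was written in would either raise UnicodeDecodeError or silently return the wrong characters, which is why the encoding has to be known first.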

Jeff

Fredrik Lundh

Oct 17, 2005, 10:31:31 AM
to pytho...@python.org
hag...@gmail.com wrote:

> I have a Hebrew text file, which I want to read in Python
> I don't know which encoding I need to use

that's not a good start. but maybe it's one of these:

http://sites.huji.ac.il/tex/hebtex_fontsrep.html

?

> how I do that

f = open(myfile)
text = f.readline()

followed by one of

text = text.decode("iso-8859-8")
text = text.decode("cp1255")
text = text.decode("cp862")

alternatively, use:

f = codecs.open(myfile, "r", encoding)

to get a stream that decodes things on the fly.
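The "try one of these" step can be automated to a degree. The sketch below (Python 3, hypothetical helper names, candidate list taken from the decode calls above plus UTF-8) decodes the raw bytes with each candidate and keeps the one whose result contains the largest share of Hebrew letters. It is a heuristic under those assumptions, not a reliable detector.

```python
HEBREW_FIRST, HEBREW_LAST = "\u05d0", "\u05ea"  # ALEF .. TAV

def hebrew_ratio(text):
    """Fraction of alphabetic characters in `text` that are Hebrew letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    hebrew = sum(1 for c in letters if HEBREW_FIRST <= c <= HEBREW_LAST)
    return hebrew / len(letters)

def guess_hebrew_encoding(raw, candidates=("utf-8", "cp1255", "iso-8859-8", "cp862")):
    """Return the candidate whose decoding looks most like Hebrew, or None."""
    best_name, best_score = None, 0.0
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        score = hebrew_ratio(text)
        if score > best_score:  # ties keep the earlier candidate
            best_name, best_score = name, score
    return best_name

# Sample bytes: two Hebrew words encoded as cp862 (illustrative data).
raw = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd".encode("cp862")
print(guess_hebrew_encoding(raw))
```

Note that cp1255 and iso-8859-8 place the Hebrew letters at the same byte values, so this check cannot tell them apart for plain letter text; it simply returns whichever comes first in the candidate list.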

</F>

hag...@gmail.com

Oct 18, 2005, 12:08:57 PM
Thanks a lot.

hagai
