Extract dictionary from txt file in Python?

Masoud Komeily

Dec 28, 2014, 7:38:42 PM

to nltk-...@googlegroups.com

Hi,

From the lists below, which I wrote by hand, I can build a dictionary ('words' contains Persian text):

words = ['آیا', 'او', 'روزی', 'می‌آید']

heads = [4, 4, 4, 0]

But when I extract the same two lists from a file ('words' from the 2nd column and 'heads' from the 15th column), the dictionary is empty! The txt file looks like this:

1 آیا آیا ADV ADV _ 4 PART _ _

2 او او PRO PRO _ 4 SBJ _ _

3 روزی روز N N _ 4 NVE _ _

4 می‌آید آمد#آ V V _ 0 ROOT _ _

And here is my code:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from collections import defaultdict

myfile = open('parsed.txt', "r")
raw = myfile.read()
raw = unicode(raw, encoding="utf-8")
lines = raw.splitlines()

col = []
words = []
heads = []
labels = []
rev_heads = defaultdict(list)

for line in lines:
    if line.strip():
        col = line.split()
        key = col[1]
        value = col[15]
        rev_heads[str(key)].append(str(value))

print rev_heads
myfile.close()

What's wrong with my code?

Thanks for your help.

- Masoud

Fred Mailhot

Dec 28, 2014, 8:55:19 PM

to nltk-...@googlegroups.com

Are the columns in that text file separated by tabs or spaces? Also, your indexing seems inconsistent: you say "second column" for words and index it with [1], but you say "15th column" for heads and index it with [15]. Python indices start at 0, so the 15th column is [14]. Here's some (untested!) code:

import codecs

to_dict = []
for line in codecs.open("parsed.txt", "r", "utf8"):
    cols = line.strip().split()
    # for tab-separated files:
    # cols = line.strip().split("\t")
    if not cols:
        continue
    to_dict.append((cols[1], cols[14]))

rev_heads = dict(to_dict)
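For anyone reading this later on Python 3, here is a minimal sketch of the same idea that keeps the defaultdict from the original question. The function name `build_rev_heads` and its parameters are made up for illustration. It also assumes the head value sits in the 7th whitespace-separated field (index 6), which is where it appears in the sample lines quoted in the question; if the real file has more columns, `head_col` needs to change accordingly.

```python
from collections import defaultdict


def build_rev_heads(lines, word_col=1, head_col=6):
    """Map each word to the list of head indices found for it.

    word_col and head_col are zero-based, so the 2nd column is index 1.
    Both defaults are assumptions based on the sample in the question.
    """
    rev_heads = defaultdict(list)
    for line in lines:
        cols = line.split()  # splits on any run of whitespace, tabs included
        if len(cols) <= max(word_col, head_col):
            continue  # skip blank lines and lines with too few columns
        rev_heads[cols[word_col]].append(int(cols[head_col]))
    return rev_heads


# The four sample lines from the question, used as in-memory test data.
sample = """\
1 آیا آیا ADV ADV _ 4 PART _ _
2 او او PRO PRO _ 4 SBJ _ _
3 روزی روز N N _ 4 NVE _ _
4 می‌آید آمد#آ V V _ 0 ROOT _ _
"""

rev_heads = build_rev_heads(sample.splitlines())
print(dict(rev_heads))
```

To run it on the actual file, something like `build_rev_heads(io.open("parsed.txt", encoding="utf-8"))` should work on both Python 2 and 3, since `io.open` decodes the text while reading.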

Alexis Dimitriadis

Dec 29, 2014, 12:28:20 PM

to nltk-...@googlegroups.com

Is the file really UTF-8 encoded? If it's something like UTF-16, it'll be full of null bytes.
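One rough way to check is to look at the raw bytes. The sketch below is only a heuristic, and the helper name `looks_like_utf16` is invented for this example: it reports whether the data starts with a UTF-16 byte order mark or contains any null byte, neither of which should occur in a UTF-8 text file like this one.

```python
import codecs


def looks_like_utf16(data):
    """Heuristic only: UTF-16 text usually starts with a BOM or has null bytes."""
    has_bom = data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
    return has_bom or b"\x00" in data


# Demonstration on an in-memory line rather than a real file.
line = "1 آیا ADV"
utf8_result = looks_like_utf16(line.encode("utf-8"))
utf16_result = looks_like_utf16(line.encode("utf-16"))
print(utf8_result, utf16_result)
```

For the real file, open it in binary mode (`open("parsed.txt", "rb")`) and pass the first few hundred bytes to the function.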

Also, try using `key` and `value` instead of `str(key)` and `str(value)` when you add to the dictionary. Applying `str()` to Persian text should give you a UnicodeEncodeError, since there's no ASCII equivalent. If you don't get an error, something is probably wrong elsewhere.
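To illustrate the point about `str()`: in Python 2, calling `str()` on a unicode object encodes it with the default codec, which is normally ASCII. The sketch below is written for Python 3 and makes that encoding step explicit with `.encode("ascii")`, so it is an approximation of what Python 2 does behind the scenes rather than the same call.

```python
word = "آیا"

try:
    # Roughly what Python 2's str(unicode_object) attempts implicitly.
    word.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

print(failed)

# Using the text directly as a dictionary key needs no conversion at all.
rev_heads = {word: [4]}
print(word in rev_heads)
```

So if the original script reaches the `str(key)` line without raising this error, the loop body is probably not seeing the Persian text you expect.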

Alexis
