Extract dictionary from txt file in Python?

Masoud Komeily

Dec 28, 2014, 7:38:42 PM
to nltk-...@googlegroups.com
Hi,

From the lists below, written by hand, I can build a dictionary ('words' contains Persian text):
  words = ['آیا', 'او', 'روزی', 'می‌آید']
  heads = [4, 4, 4, 0]

But when I extract the same two lists from a text file ('words' from the 2nd column and 'heads' from the 15th column), the dictionary is empty! The txt file looks like this:
1 آیا آیا ADV ADV _ 4 PART _ _
2 او او PRO PRO _ 4 SBJ _ _
3 روزی روز N N _ 4 NVE _ _
4 می‌آید آمد#آ V V _ 0 ROOT _ _


And here is my code:

# .*. coding: utf-8 .*.
from __future__ import unicode_literals
from collections import defaultdict

myfile = open('parsed.txt', "r")
raw = myfile.read()
raw = unicode(raw, encoding="utf-8")
lines = raw.splitlines()

col = []
words = []
heads = []
labels = []

rev_heads = defaultdict(list)

for line in lines:
    if line.strip():
        col = line.split()
        key=col[1], value = col[15]
        rev_heads[str(key)].append(str(value))
print rev_heads
myfile.colse()


What's wrong with my code?
Thanks for your help.
- Masoud

Fred Mailhot

Dec 28, 2014, 8:55:19 PM
to nltk-...@googlegroups.com
Are the columns in that text file separated by tabs or spaces? Also, your indexing seems to be confused: you say "2nd column" for words and index it with [1], which is consistent with zero-based indexing, but you say "15th column" for heads and index it with [15], which is actually the 16th column. Here's some (untested!) code:

import codecs

to_dict = []
for line in codecs.open("parsed.txt", "r", "utf8"):
    cols = line.strip().split()
    if not cols:
        continue
    #for tab-separated
    #cols = line.strip().split("\t")
    to_dict.append((cols[1], cols[14]))
rev_heads = dict(to_dict)
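
If the goal is the defaultdict-of-lists shape from the original post rather than a plain dict, the same idea might be sketched as below. This is only an illustration under assumptions: the in-memory `sample` string is copied from the excerpt in the question, which shows ten whitespace-separated fields with the head in the 7th (index 6). The names `WORD_COL` and `HEAD_COL` are made up here, and the indices need adjusting if the real file has more columns.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from collections import defaultdict

# Assumption: these four lines are copied from the excerpt in the question.
sample = """\
1 آیا آیا ADV ADV _ 4 PART _ _
2 او او PRO PRO _ 4 SBJ _ _
3 روزی روز N N _ 4 NVE _ _
4 می‌آید آمد#آ V V _ 0 ROOT _ _
"""

# Zero-based indices: the 2nd column is [1], the 7th column is [6].
WORD_COL = 1
HEAD_COL = 6

rev_heads = defaultdict(list)
for line in sample.splitlines():
    cols = line.split()
    # Skip blank lines and lines too short to contain the head column.
    if len(cols) <= HEAD_COL:
        continue
    # Keep the word as text (no str() call); store the head as an int.
    rev_heads[cols[WORD_COL]].append(int(cols[HEAD_COL]))
```

Checking `len(cols)` before indexing avoids an IndexError on short lines, which is what `col[15]` would raise on ten-column input.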



Alexis Dimitriadis

Dec 29, 2014, 12:28:20 PM
to nltk-...@googlegroups.com
Is the file really UTF-8 encoded? If it's something like UTF-16, it'll be full of null bytes.
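
One rough way to check is to look at the raw bytes before decoding anything. The sketch below is a heuristic only, and `looks_like_utf16` is a hypothetical helper name, not something from NLTK or the standard library. It flags data that starts with a UTF-16 byte-order mark or contains null bytes, which UTF-8 text normally does not.

```python
import codecs

def looks_like_utf16(raw_bytes):
    """Hypothetical helper: heuristic check for UTF-16-encoded bytes."""
    # A UTF-16 file often starts with a byte-order mark (BOM).
    has_bom = raw_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
    # ASCII characters (digits, spaces, tags like ADV) become byte pairs
    # containing a null byte in UTF-16; plain UTF-8 text has none.
    return has_bom or b"\x00" in raw_bytes

# Illustration with an in-memory line rather than the real file.
sample_line = u"1 \u0622\u06cc\u0627 ADV"
as_utf8 = sample_line.encode("utf-8")
as_utf16 = sample_line.encode("utf-16")
```

For the real file, one would read it in binary mode, for example `open('parsed.txt', 'rb').read()`, and pass the result to the helper.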

And try using `key` and `value` directly instead of `str(key)` and `str(value)` when you add to the dictionary. Applying `str()` to Persian text should give you a UnicodeEncodeError, since there's no ASCII equivalent. If you don't get an error, something is probably wrong elsewhere.
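
To illustrate: in Python 2, `str()` on a unicode string implicitly encodes it as ASCII. The sketch below makes that encoding step explicit so it behaves the same way under Python 2 and Python 3; the variable names are just for illustration.

```python
# The Persian word from the first line of the sample, written with escapes.
word = u"\u0622\u06cc\u0627"

# Explicit version of what Python 2's str(word) attempts implicitly.
try:
    word.encode("ascii")
    raised = False
except UnicodeEncodeError:
    raised = True

# Unicode strings work fine as dictionary keys without any conversion.
heads_by_word = {word: [4]}
```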

Alexis

Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis
