How to import large corpora in CSV format


Jembatan

May 22, 2017, 4:36:43 AM
to nltk-users

Hi all, 

I'm totally new to NLTK, and to coding more generally, so please excuse any naive questions/assumptions. 

I'm hoping to import a CSV file of an Indonesian language corpus and create simple word frequencies from the data. Looking at the CSV in Excel, I can see that the transcriptions are in a particular column, with other data, such as speaker and duration, in other columns. The corpus is quite large: around 900,000 utterances.  

In the information about the files it says: "The files generally adhere to http://tools.ietf.org/html/rfc4180 and are UTF-8 encoded."

I would appreciate any suggestions about where I should start. Online training modules (I've already started on the NLTK book), relevant discussion threads, and other types of advice would all be greatly appreciated. 

Thanks!


Alex Rudnick

May 22, 2017, 4:54:37 PM
to nltk-...@googlegroups.com
Hey there,

This is the place to ask questions, no worries! :D

I don't think NLTK has particular support for CSV files (somebody can
correct me if I'm mistaken), but it's OK because Python itself does!

One possible approach would be to use Python's CSV support like this:
https://docs.python.org/3/library/csv.html

... and just read all the utterances out of your file and then write
them into a flat text file, which should be easier to analyze with
NLTK, like so: http://www.nltk.org/book/ch02.html#loading-your-own-corpus
Let us know how it goes!

--
-- alexr

Jembatan

May 27, 2017, 4:16:46 PM
to nltk-users
Hi Alex, 

Thanks for your ideas; I'm having a look at that first link now. A couple of questions:
1. Am I correct in thinking that converting it to a flat text file would mean I won't be able to use NLTK to tell me anything that wasn't in the utterance column, such as how often a word is used by a particular speaker?
2. How long would you expect it to take to write a 516 MB CSV file with over 1 million rows into a flat text file? And how large would you expect that file to be?
3. I'm a little worried that the file also contains data from other corpora recorded by the same institute. Unfortunately, I can't check, because when I open the CSV it says the whole file won't load (I think just because it is so large), and while what does open has over 1 million rows of data, it all seems to be from the corpus I want. Is there a way to use Python to extract only those transcriptions marked with a particular session id? (I have all the session ids for the corpus I want; session ids are in another column.) 
4. Can you get a sense of the complexity of this work? If yes, what level of expertise would you think necessary to do it? 

Again, sorry for any naivety. Also a big thank you for your time! 

Dimitriadis, A. (Alexis)

May 27, 2017, 5:45:46 PM
to nltk-...@googlegroups.com
Hi Jembatan,

The answer to your question 2 is, “not very long; try it!” (and “smaller than the original file”). But I have a different suggestion from Alex's: there's no real reason to create an intermediate file. Use Python's `csv` module as Alex suggested, and you can work directly with the text of the relevant column. You can loop over the CSV file row by row (so you won't need to load the whole file at once) and use the nltk's methods to tokenize and process it just as you would an ordinary text file. Supposing each row contains the text of one sentence in column 7 (counting from 0), your code might be as simple as this:

    import csv
    import nltk

    fp = open("filename.csv")
    myreader = csv.reader(fp)

    for row in myreader:
        text = row[7]                       # the transcription column
        words = nltk.word_tokenize(text)
        # <Now do something with this sentence>

Of course, the other columns are available too, so you can easily check the speaker, the session id, and anything else you need at any time. (Actually, you'll probably want to use `csv.DictReader(fp)`, which allows you to refer to columns by their name. It's just easier to make up an example with the plain `csv.reader()`.)
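
That also covers your question 3: the session id is just another column, so you can skip any rows whose id isn't in your list. A sketch with `DictReader`, assuming (hypothetically) that the columns are named "session_id" and "utterance":

    import csv
    import nltk

    wanted = {"HIZ-020301", "HIZ-020302"}   # your own session ids here

    with open("filename.csv", encoding="utf-8", newline="") as fp:
        for row in csv.DictReader(fp):
            if row["session_id"] in wanted:            # keep only your corpus
                words = nltk.word_tokenize(row["utterance"])
                # ... count words, check the speaker, etc.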

4. Can you get a sense of the complexity of this work? If yes, what level of expertise would you think necessary to do it? 

It all depends on what you want to do, but working with this kind of file format and computing simple lexicostatistics is pretty straightforward. All you need is a moderately good grasp of Python, and some familiarity with what the NLTK offers. Begin with a good Python tutorial, then come back to the NLTK book and study the first five chapters. By then you'll probably know what you still need, and you can pick and choose what else to study.
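
For the simple word frequencies you mentioned, the nltk's `FreqDist` does the counting for you; a minimal sketch, using column 7 again as in the example above:

    import csv
    import nltk

    freqs = nltk.FreqDist()
    with open("filename.csv", encoding="utf-8", newline="") as fp:
        for row in csv.reader(fp):
            freqs.update(nltk.word_tokenize(row[7]))

    print(freqs.most_common(20))   # the twenty most frequent words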

Good luck,

Alexis





Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis


Jembatan

Jun 7, 2017, 9:37:05 AM
to nltk-users
Hi Alexis and Alex, 

Thank you both for your help. I'm back with more questions. Unfortunately, I realised that the CSV had other projects in it, so in the meantime I've been busy getting help to selectively write parts of the CSV into a new file containing only the data I want. 

Now I have 998 .txt files with the various transcribed sessions. Would you suggest writing these into one .txt file for analysis, or is there a way to get NLTK to run through that many files? 

Using the NLTK book's instructions on loading your own corpus, I can only read one file at a time (see code below). At this point I really just want to run a simple word-frequency analysis that incorporates all the files, but I'm finding this annoyingly hard to do. 

Any help appreciated,

Jembatan



import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/myname/MPI_EVA_dins'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()

# This gives me a list of all 998 fileids

wordlists.words('HIZ-020301.txt')

# And this prints the first part of the transcription in that file

Dimitriadis, A. (Alexis)

Jun 7, 2017, 1:26:39 PM
to nltk-...@googlegroups.com
Hi Jembatan,

Call `wordlists.words()` without arguments to get all words from all files in your corpus. It works exactly like the NLTK's own corpora (which use the same family of readers). You can also specify a list of files (e.g., a sublist of `wordlists.fileids()`) as the argument. Do yourself a favor and study the NLTK book; it's all laid out there.
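
For example, a quick frequency count over the whole corpus, continuing from your own code (same directory as in your earlier message):

    import nltk
    from nltk.corpus import PlaintextCorpusReader

    wordlists = PlaintextCorpusReader('/Users/myname/MPI_EVA_dins', '.*')
    fdist = nltk.FreqDist(w.lower() for w in wordlists.words())
    print(fdist.most_common(20))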

Alexis



Jembatan

Jun 8, 2017, 8:51:38 PM
to nltk-users
Hi Alexis, 

Sorry, my last set of questions didn't really make my problem clear. I had tried wordlists.words(), and had misread the error message as a problem finding the files, since it didn't occur when I gave a particular file name as the argument. I'm guessing this is due to strange characters in the transcribed language? The error message I get with no file name is pasted at the bottom. I now realise it is the same error I encountered when first trying to read in the original CSV. 

A friend helped me overcome the issue reading in the CSV with
with open("/Users/myname/myname_code/testmydata.csv", encoding='utf-8', errors='ignore') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')

so I tried to insert something similar:

import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/zaramaxwell-smith/MPI_EVA_dins'
wordlists = PlaintextCorpusReader(corpus_root, '.*', encoding='utf-8' , errors='ignore')

  File "<stdin>", line 1

    wordlists = PlaintextCorpuReader(corpus_root, '.*', encoding='utf-8' , errors='ignore)

                                                                                         ^

SyntaxError: EOL while scanning string literal



I'm guessing this is an impossible attempt to modify how PlaintextCorpusReader (a module?) works with the files? 


Is there a way to do something like this within NLTK or will I need to modify my files beforehand? 



Again, apologies for my inarticulate and probably very roundabout way of stating my problems. 


Jembatan




>>> wordlists.words()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/collections.py", line 226, in __repr__
    for elt in self:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 402, in iterate_from
    for tok in piece.iterate_from(max(0, start_tok-offset)):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 296, in iterate_from
    tokens = self.read_block(self._stream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/plaintext.py", line 122, in _read_word_block
    words.extend(self._word_tokenizer.tokenize(stream.readline()))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1148, in readline
    new_chars = self._read(readsize)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1380, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1411, in _incr_decode
    return self.decode(bytes, 'strict')
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 10: invalid continuation byte
>>>




 

