I'm attaching some code to do what you describe; take a look and see if it's what you're looking for.
You should be able to run it with the following command
The only unusual external dependency is `nltk`, which I assume you already have installed.
The expected output has the following format:
```
a list of all the words present in the book ,
some text files ? The words in the list could be
the order in which these words occur and only want to
want to track the unique words , you can probably get
to generate a list of words from a text file /
I believe - to a list of all the words present
me to build a word list from some text files ?
? The words in the list could be highly precisely distinguished
case , or maybe the list could be slightly smarter ,
the counter to generate a list of words from a text
convert a book - in this case an EPUB , which
( Andronix ). I imagine this would be something NLTK could
, Julius -- You received this message because you are subscribed
group . To unsubscribe from this group and stop receiving emails
. com . To view this discussion on the web ,
```
(Note that the "input" documents are the three previous emails on this thread and are hard-coded as the variable BOOKS. You'll have to implement the appropriate input-ingestion code for this script to work in your particular setting.)
Just for the sake of clarity, I think there are several distinct concepts involved in what you described:
- "convert a book to a list of all the words present in the book" ---> Under that definition, that's simply the vocabulary of the texts, the unique set of tokens occurring in those texts. In the code attached, that's the behavior implemented as function `a_list_of_all_the_words_present_in_the_books`.
- "as I believe it's called, a concordance" ---> Not exactly, as far as I know. The technical definition of a concordance also requires those words to appear in context (a window of words to the left or to the right, as shown in the expected output above) and, since the contexts are non-unique, the result can't be "just the vocabulary" in that sense. The result usually contains many more output records. In the code attached, this behavior is implemented as function `a_list_of_all_the_words_present_in_the_books_along_with_contexts`.
- "you can probably get away with just using Python's built-in Counter. This can also give you frequencies, in case you decide you need those." ---> Ilia's explanation is already great. I've simply implemented it for convenience; it's function `a_histogram_of_all_the_words_present_in_the_books` in the code attached.
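In case the attachment doesn't make it through the list, here's roughly what those three functions look like. This is only a sketch: the regex tokenizer is a stand-in (the attached script presumably uses `nltk.word_tokenize`, which splits punctuation into separate tokens the same way the sample output above does), and the `window` parameter is my assumption about how the context width is configured.

```python
import re
from collections import Counter

def tokenize(text):
    """Stand-in tokenizer: words or single punctuation marks.
    (The attached script presumably uses nltk.word_tokenize instead.)"""
    return re.findall(r"\w+|[^\w\s]", text)

def a_list_of_all_the_words_present_in_the_books(books):
    """The vocabulary: the unique set of tokens occurring in the texts."""
    vocabulary = set()
    for book in books:
        vocabulary.update(tokenize(book))
    return sorted(vocabulary)

def a_list_of_all_the_words_present_in_the_books_along_with_contexts(books, window=4):
    """KWIC-style concordance: one record per token occurrence, with up
    to `window` tokens of left and right context. Contexts are
    non-unique, so this yields many more records than the vocabulary."""
    records = []
    for book in books:
        tokens = tokenize(book)
        for i, token in enumerate(tokens):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            records.append((left, token, right))
    return records

def a_histogram_of_all_the_words_present_in_the_books(books):
    """Token frequencies across all texts, via Python's built-in Counter."""
    counts = Counter()
    for book in books:
        counts.update(tokenize(book))
    return counts
```

Note how the three functions differ only in what they keep per token: just its identity (a set), its surrounding context (a record per occurrence), or its count.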
I've also implemented a very basic lemmatization function for case and grammatical number (function `lemmatize`, which requires running `empirical_lemmatizer` first; refer to the code for details).
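The lemmatizer might look something like the following. This is a guess at the attached implementation, so treat the details as assumptions: it lowercases for case, and it calls a plural "empirical" when the corpus itself also attests the bare singular, which is why `empirical_lemmatizer` has to run over the corpus first.

```python
import re

def tokenize(text):
    # Stand-in tokenizer (words or single punctuation marks).
    return re.findall(r"\w+|[^\w\s]", text)

def empirical_lemmatizer(books):
    """Build a lemma table from the corpus itself: map an -s plural to
    its singular form whenever the singular also occurs in the corpus.
    Only case and grammatical number are handled."""
    vocabulary = {token.lower() for book in books for token in tokenize(book)}
    return {word: word[:-1]
            for word in vocabulary
            if word.endswith("s") and word[:-1] in vocabulary}

def lemmatize(word, table):
    word = word.lower()            # normalize case
    return table.get(word, word)   # collapse number when attested
```

The nice property of the corpus-driven table is that it never invents a singular the texts don't contain, at the cost of missing plurals whose singular happens not to occur.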
Hope this helps, cheers!