
Julius Hamilton

Jul 6, 2021, 6:22:17 PM

I'd like to perform a very simple function: convert a book - in this case an EPUB, which I believe is a directory of XML files - to a list of all the words present in the book, or, as I believe it's called, a concordance.

The tool AntConc was recommended to me, but I don't think it's open source (even though it's free), and I can't run the executable on the system I'm on (Andronix).

I imagine this would be something NLTK could do. Could anyone recommend a method that would allow me to build a word list from some text files? The words in the list could be distinguished very precisely, i.e. case-sensitive, or the list could be slightly smarter, e.g. able to recognize plural forms and convert them to singular.

Thanks very much,

Jul 20, 2021, 4:37:40 AM
to nltk-users
Hi Julius!

Assuming you don't care about the order in which these words occur and only want to track the unique words, you can probably get away with just using Python's built-in Counter. This can also give you frequencies, in case you decide you need those.
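For example, a minimal sketch (the sample text and the regex-based tokenization here are just illustrative):

```python
import re
from collections import Counter

text = "The cat sat on the mat. The cat slept."

# Rough tokenization: lowercase everything, then pull out alphabetic runs.
tokens = re.findall(r"[a-z']+", text.lower())

counts = Counter(tokens)      # word -> frequency
vocabulary = sorted(counts)   # the unique words, alphabetized
```

`counts.most_common()` will also give you the words ranked by frequency, if you end up needing that.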

If I understood your task correctly, I would expect the bulk of the work to lie in extracting the text from EPUB's XML format.

I hope that helps,

Julius Hamilton

Jul 25, 2021, 5:35:24 PM
Thanks very much.
Could you please provide some sample code for using the counter to generate a list of words from a text file/string?

Thanks very much,


Jordi Carrera

Jul 27, 2021, 4:38:59 AM
to nltk-users
Hey Julius,

I'm attaching some code to do what you describe; take a look and see if it's what you're looking for.

You should be able to run it with the following command.

The only unusual external dependency is `nltk`, which I assume you already have installed.

The expected output has the following format:
             a list of all the       words    present in the book ,              
         some text files ? The       words    in the list could be               
      the order in which these       words    occur and only want to             
      want to track the unique       words    , you can probably get             
         to generate a list of       words    from a text file /                 
              I believe - to a        list    of all the words present           
            me to build a word        list    from some text files ?             
            ? The words in the        list    could be highly precisely distinguished
           case , or maybe the        list    could be slightly smarter ,        
     the counter to generate a        list    of words from a text               
           convert a book - in        this    case an EPUB , which               
       ( Andronix ). I imagine        this    would be something NLTK could      
      , Julius -- You received        this    message because you are subscribed 
   group . To unsubscribe from        this    group and stop receiving emails    
               . com . To view        this    discussion on the web ,            
(Note that the "input" documents are the three previous emails on this thread and are hard-coded as the variable BOOKS. You'll have to implement the appropriate input-ingestion code for this script to work in your particular setting.)

Just for the sake of clarity, I think there are several concepts involved in what you described:
  1. "convert a book to a list of all the words present in the book" ---> Under that definition, that's simply the vocabulary of the texts, the unique set of tokens occurring in those texts. In the code attached, that's the behavior implemented as function `a_list_of_all_the_words_present_in_the_books`.
  2. "as I believe it's called, a concordance" ---> Not exactly, as far as I know. The technical definition of a concordance also requires those words to appear in context (a window of words to the left or to the right, as shown in the expected output above) and, since the contexts are non-unique, the result isn't "just the vocabulary" in that sense. The result usually contains many more output records. In the code attached, this behavior is implemented as function `a_list_of_all_the_words_present_in_the_books_along_with_contexts`.
  3. "you can probably get away with just using Python's built-in Counter. This can also give you frequencies, in case you decide you need those." ---> Ilia's explanation is already great. I've simply implemented it for convenience; it's the function `a_histogram_of_all_the_words_present_in_the_books` in the code attached.
I've also implemented a very basic lemmatization function for case and grammatical number (function `lemmatize`, which requires running `empirical_lemmatizer` first, refer to the code for details).
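In case the attachment doesn't come through, here is a rough, self-contained sketch of those functions (the names match the ones mentioned above, but the bodies are my own reconstruction: plain whitespace tokenization stands in for a real NLTK tokenizer, the `lemmatize` here is only a naive plural stripper, and `BOOKS` is placeholder text):

```python
from collections import Counter

BOOKS = [
    "convert a book to a list of all the words present in the book",
    "the words in the list could be highly precisely distinguished",
]  # placeholder: the real script hard-codes the thread's emails here

def a_list_of_all_the_words_present_in_the_books(books):
    # 1. The vocabulary: the unique set of tokens, sorted alphabetically.
    return sorted({tok for book in books for tok in book.split()})

def a_histogram_of_all_the_words_present_in_the_books(books):
    # 3. Word frequencies, via Python's built-in Counter.
    return Counter(tok for book in books for tok in book.split())

def a_list_of_all_the_words_present_in_the_books_along_with_contexts(
        books, word, window=4):
    # 2. A concordance: every occurrence of `word`, shown with up to
    #    `window` tokens of context on each side.
    lines = []
    for book in books:
        tokens = book.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left:>30}    {word}    {right}")
    return lines

def lemmatize(word):
    # Very rough case + plural normalization; a stand-in for the real
    # `lemmatize` / `empirical_lemmatizer` pair mentioned above.
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word
```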

Hope this helps, cheers!