
Julius Hamilton

Jul 6, 2021, 6:22:17 PM

I'd like to perform a very simple function: convert a book - in this case an EPUB, which I believe is a directory of XML files - to a list of all the words present in the book, or, as I believe it's called, a concordance.

The tool AntConc was recommended to me, but I don't think it's open source (even though it's free), and I can't run the executable on the system I'm on (Andronix).

I imagine this would be something NLTK could do. Could anyone recommend a method that would allow me to build a word list from some text files? The words in the list could be distinguished very precisely, i.e. case-sensitive, or the list could be slightly smarter, e.g. able to recognize plural forms and convert them to singular.

Thanks very much,

Jul 20, 2021, 4:37:40 AM
to nltk-users
Hi Julius!

Assuming you don't care about the order in which these words occur and only want to track the unique words, you can probably get away with just using Python's built-in Counter. This can also give you frequencies, in case you decide you need those.
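For example, a minimal sketch (the sample text and the regex-based tokenization here are just illustrative):

```python
import re
from collections import Counter

text = "The cat sat on the mat. The cat slept."

# Rough tokenization: lowercase everything, then pull out alphabetic runs.
tokens = re.findall(r"[a-z']+", text.lower())

counts = Counter(tokens)      # word -> frequency
vocabulary = sorted(counts)   # the unique words, alphabetized
```

`counts.most_common()` will also give you the words ranked by frequency, if you end up needing that.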

If I understood your task correctly, I would expect the bulk of the work to lie in extracting the text from EPUB's XML format.

I hope that helps,

Julius Hamilton

Jul 25, 2021, 5:35:24 PM
Thanks very much.
Could you please provide some sample code for using the counter to generate a list of words from a text file/string?

Thanks very much,


Jordi Carrera

Jul 27, 2021, 4:38:59 AM
to nltk-users
Hey Julius,

I'm attaching some code to do what you describe; take a look and see if it's what you're looking for.

You should be able to run it with the following command.

The only unusual external dependency is `nltk`, which I assume you already have installed.

The expected output has the following format:
             a list of all the       words    present in the book ,              
         some text files ? The       words    in the list could be               
      the order in which these       words    occur and only want to             
      want to track the unique       words    , you can probably get             
         to generate a list of       words    from a text file /                 
              I believe - to a        list    of all the words present           
            me to build a word        list    from some text files ?             
            ? The words in the        list    could be highly precisely distinguished
           case , or maybe the        list    could be slightly smarter ,        
     the counter to generate a        list    of words from a text               
           convert a book - in        this    case an EPUB , which               
       ( Andronix ). I imagine        this    would be something NLTK could      
      , Julius -- You received        this    message because you are subscribed 
   group . To unsubscribe from        this    group and stop receiving emails    
               . com . To view        this    discussion on the web ,            
(Note that the "input" documents are the three previous emails on this thread and are hard-coded as the variable BOOKS. You'll have to implement the appropriate input-ingestion code for this script to work in your particular setting.)

Just for the sake of clarity, I think there are several concepts involved in what you described:
  1. "convert a book to a list of all the words present in the book" ---> Under that definition, that's simply the vocabulary of the texts, the unique set of tokens occurring in those texts. In the code attached, that's the behavior implemented as function `a_list_of_all_the_words_present_in_the_books`.
  2. "as I believe it's called, a concordance" ---> Not exactly, as far as I know. The technical definition of a concordance also requires those words to appear in context (a window of words to the left or to the right, as shown in the expected output above) and, since the contexts are non-unique, the result isn't "just the vocabulary" in that sense. The result usually contains many more output records. In the code attached, this behavior is implemented as function `a_list_of_all_the_words_present_in_the_books_along_with_contexts`.
  3. "you can probably get away with just using Python's built-in Counter. This can also give you frequencies, in case you decide you need those." ---> Ilia's explanation is already great. I've simply implemented it for convenience; it's the function `a_histogram_of_all_the_words_present_in_the_books` in the code attached.
I've also implemented a very basic lemmatization function for case and grammatical number (function `lemmatize`, which requires running `empirical_lemmatizer` first, refer to the code for details).
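In case the attachment doesn't come through, here is a rough, self-contained sketch of those functions (the names match the ones mentioned above, but the bodies are my own reconstruction: plain whitespace tokenization stands in for a real NLTK tokenizer, the `lemmatize` here is only a naive plural stripper, and `BOOKS` is placeholder text):

```python
from collections import Counter

BOOKS = [
    "convert a book to a list of all the words present in the book",
    "the words in the list could be highly precisely distinguished",
]  # placeholder: the real script hard-codes the thread's emails here

def a_list_of_all_the_words_present_in_the_books(books):
    # 1. The vocabulary: the unique set of tokens, sorted alphabetically.
    return sorted({tok for book in books for tok in book.split()})

def a_histogram_of_all_the_words_present_in_the_books(books):
    # 3. Word frequencies, via Python's built-in Counter.
    return Counter(tok for book in books for tok in book.split())

def a_list_of_all_the_words_present_in_the_books_along_with_contexts(
        books, word, window=4):
    # 2. A concordance: every occurrence of `word`, shown with up to
    #    `window` tokens of context on each side.
    lines = []
    for book in books:
        tokens = book.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left:>30}    {word}    {right}")
    return lines

def lemmatize(word):
    # Very rough case + plural normalization; a stand-in for the real
    # `lemmatize` / `empirical_lemmatizer` pair mentioned above.
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word
```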

Hope this helps, cheers!