Find most common sentences/phrases in text


dskiol...@gmail.com

unread,
Oct 22, 2017, 8:17:34 AM
to nltk-users
Hi,

I’m helping a friend with a translation project. The work is pretty manual and requires reading lots of strings and translating them one by one. I noticed that in several cases there are very common recurring phrases, like “The answer is:” etc. So I got the idea that one could automate a large part of the translation job by parsing the entire text and finding the most common recurring phrases/sentences. If I then translate these manually, we can insert/replace them in the text and save a lot of time.

 

Problem is, I know nothing about this topic, or whether it’s even possible.

 

Summary of what I want to do:

Write a Python script which parses an XML file containing a bunch of strings and then, in an unsupervised manner, finds common phrases like “The answer is:”, “Click here”, etc.

 

I browsed through the NLTK book and found some material on frequency plots and collocations, but this seems to only work for individual words or bigrams. Does the problem get too complex with whole sentences?

 

Thanks for any helpful info.



sujitpal

unread,
Oct 22, 2017, 11:17:51 AM
to nltk-users
One approach may be to construct n-grams (say, for n = 1 to 5) and count their frequencies, optionally discarding n-grams made up of words with low IDF (i.e., words that occur in nearly every string). Another approach may be to extract phrases using a pre-built phrase chunker (OpenNLP has one; I’m not sure about NLTK) and then count their frequencies.
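A minimal sketch of the n-gram counting idea, pure stdlib; the sample strings are made up for illustration, and simple recurrence across strings stands in for the IDF filter:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up sample strings standing in for the XML file contents.
strings = [
    "The answer is: yes",
    "The answer is: no",
    "Click here to continue",
    "Click here to cancel",
]

# Count n-grams for n = 1..5 across all strings.
counts = Counter()
for s in strings:
    tokens = s.lower().split()
    for n in range(1, 6):
        counts.update(ngrams(tokens, n))

# n-grams that recur across strings are the translation candidates.
common = {g: c for g, c in counts.items() if c > 1}
```

Here `common` ends up holding things like ("the", "answer", "is:") and ("click", "here", "to"), which is roughly the kind of recurring phrase you described.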

If you think sentences are recurring (perhaps because the text might have been generated using some template), then you could chunk the text into sentences (nltk.sent_tokenize()) and do a similar count. I suspect that you might find this less useful though.
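A rough sketch of that sentence-level count; I’m using a naive regex splitter here so it runs standalone, but nltk.sent_tokenize() would be the more robust choice:

```python
import re
from collections import Counter

text = "The answer is: yes. Click here. The answer is: no. Click here."

# Naive split on sentence-final punctuation; nltk.sent_tokenize is more robust.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

sentence_counts = Counter(sentences)
repeats = [(s, c) for s, c in sentence_counts.most_common() if c > 1]
```

Any sentence that comes out of a template verbatim will show up in `repeats`; sentences with even small variations will not, which is why I suspect this is less useful than the n-gram counts.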

-sujit

Dimitriadis, A. (Alexis)

unread,
Oct 22, 2017, 12:29:11 PM
to nltk-...@googlegroups.com
It’s pretty easy to make n-grams of all useful lengths; it’s a matter of a few lines of code. NLTK even has a function, `nltk.everygrams()`, that makes it easier still.

However, I don’t think this will gain you much, since the chunks will not necessarily be translatable as a unit. Have you considered running the text through a machine translation service like Google Translate? It probably won’t work all that well, but it should still be better than anything you can hack together yourself.

Alexis



--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nltk-users+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

dskiol...@gmail.com

unread,
Oct 23, 2017, 3:56:53 PM
to nltk-users
Thanks both of you for the fast input and help.

I thought of another, perhaps similar, more brute-force approach.

How about this: for each string in my file, I normalize it, remove spaces, and compare it with the remaining strings in the text using the Hamming distance. If the normalized similarity is, say, > 0.95 (to allow for slight variations), I assume the string is recurring and print it to my output file while counting the number of occurrences. I also remove each found string from the search space to make things more efficient.

The output file will be a table with: each recurring string, its number of occurrences, and its percentage of all strings. If I then translate these manually, I can replace all the recurring strings in the text.

Possible shortcomings I see with this approach are:
1. Hamming distance may be a poor metric to use.
2. The time complexity of the approach: I get O(N·(N−1)), i.e. O(N²), and the number of strings is in the thousands.
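A sketch of what I mean by this greedy pairwise loop; I’m using difflib’s SequenceMatcher ratio as a stand-in for the Hamming comparison (Hamming strictly requires equal-length strings), and the example strings are made up:

```python
from difflib import SequenceMatcher

def normalize(s):
    """Lowercase and strip all whitespace."""
    return "".join(s.lower().split())

def group_similar(strings, threshold=0.95):
    """Greedy O(N^2) grouping: each string absorbs later near-duplicates,
    and matched strings are removed from the search space."""
    remaining = list(strings)
    groups = []
    while remaining:
        rep = remaining.pop(0)   # current representative string
        count = 1
        kept = []
        for s in remaining:
            sim = SequenceMatcher(None, normalize(rep), normalize(s)).ratio()
            if sim >= threshold:
                count += 1       # near-duplicate: absorb it
            else:
                kept.append(s)   # keep for later rounds
        remaining = kept
        groups.append((rep, count))
    return groups

groups = group_similar(["The answer is: yes!", "The answer is: yes", "Click here"])
```

The output table would then be each representative string plus its count, as described above.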

Any feedback to this?

Have a nice evening!

Sujit Pal

unread,
Oct 23, 2017, 4:32:12 PM
to nltk-...@googlegroups.com
I think the problem with your approach is the one you pointed out: the number of comparisons needed. I am guessing your concern is that the n-gram/phrase-counting approach would overlook small differences between strings. Hamming is a decent measure, but there are others as well (Jaccard and cosine, for example).
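For illustration, both of those measures can be sketched with the stdlib on whitespace tokens (a rough version, not tied to any particular library):

```python
import math
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two strings as token sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """Cosine similarity of two strings as token-count vectors."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)
```

Unlike Hamming, neither requires the two strings to have the same length, which matters for your data.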

One other possibility may be to combine my suggestion with yours, i.e., construct the n-grams or phrases first, then normalize them using the approach you suggested.

Another possibility (slightly more heavyweight than counting but probably less than your matching strategy) is to MinHash the strings you want to compare. The theory is explained in chapter 3 of the Mining of Massive Datasets book [1] and there is a Python library called datasketch [2] which implements MinHash, see the SO discussion here for usage example [3].
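To give a flavor of the idea, here is a from-scratch toy sketch of MinHash itself; the datasketch library packages this up properly, and this version just fakes the permutations with salted MD5 hashes:

```python
import hashlib

def minhash_signature(tokens, num_perm=64):
    """One simulated 'permutation' per salt; keep the minimum salted
    hash of the token set for each salt."""
    sig = []
    for i in range(num_perm):
        salt = str(i).encode()
        sig.append(min(
            int(hashlib.md5(salt + t.encode()).hexdigest(), 16)
            for t in set(tokens)
        ))
    return sig

def minhash_similarity(sig_a, sig_b):
    """Fraction of matching minima; this estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the answer is yes".split())
b = minhash_signature("the answer is yes".split())
c = minhash_signature("click here".split())
```

The win over the pairwise loop is that signatures are computed once per string, and near-duplicates can then be bucketed (LSH) instead of compared all-pairs.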





dskiol...@gmail.com

unread,
Oct 24, 2017, 3:33:23 PM
to nltk-users
Hi,

I have not disregarded your suggestion. It's just that I've never worked with n-grams and got stuck reading about them. Is this http://www.nltk.org/book/ch05.html#n_gram_tagger_index_term what you mean by the n-gram method? If not, could you please refer me to some proper literature or a tutorial?

I will check the links you wrote as well. Thanks! I'll keep you posted on my progress once I've actually implemented something.

Sujit Pal

unread,
Oct 24, 2017, 4:26:47 PM
to nltk-...@googlegroups.com
I did not mean to imply that you had disregarded my suggestion; apologies if it came across that way. In any case, you are closest to your problem and would know best which solution(s) to use. For the n-gram part, I meant something along the lines described in the accepted answer in the SO link below:
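Roughly, the recipe I have in mind is the usual zip-based one (a sketch, not that link's exact code):

```python
def ngrams(tokens, n):
    """Contiguous n-grams via zip over n shifted slices of the token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the answer is yes".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

That is, the page you linked is about n-gram *taggers*; for counting phrases you just want the raw n-grams themselves.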


-sujit




Dimitriadis, A. (Alexis)

unread,
Oct 24, 2017, 4:29:43 PM
to nltk-...@googlegroups.com
I guess you must be bored with the repetitive translation, but be aware that your forays into armchair NLP stand a very low chance of speeding things up. The difficulty of language computation problems is notoriously prone to underestimation.

Anyway, since you’re lost, here’s how to make a frequency table of all repeating sequences of at least two words/tokens (not crossing sentence boundaries):

import nltk

data = ...  # a list of sentences, each a list of word tokens
wordgroups = nltk.FreqDist(gram for s in data for gram in nltk.everygrams(s, 2))

`data` must be in the format used by NLTK: a list of sentences, where each sentence is a list of words. The above defines a “frequency distribution” that you can examine to find the most common sequences, etc. Look up “FreqDist” in the NLTK book.
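If you want to see what that computes without installing nltk, a stdlib sketch on toy data (Counter standing in for FreqDist, the inner loops standing in for `everygrams(s, 2)`):

```python
from collections import Counter

# Toy data in the required format: a list of sentences, each a list of words.
data = [
    ["the", "answer", "is", "yes"],
    ["the", "answer", "is", "no"],
]

# Count every contiguous word sequence of length >= 2 within each sentence.
wordgroups = Counter(
    tuple(s[i:i + n])
    for s in data
    for n in range(2, len(s) + 1)
    for i in range(len(s) - n + 1)
)
```

`wordgroups.most_common()` then gives you exactly the frequency table of repeating sequences described above.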

Have fun,

Alexis



Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis

Sujit Pal

unread,
Oct 24, 2017, 4:42:30 PM
to nltk-...@googlegroups.com
I do somewhat agree with Alexis on his point, though... with the approaches we have been discussing you will be able to isolate highly repetitive phrases and translate them once, which will reduce your translation burden somewhat. But be aware that phrases may differ in meaning depending on context, so this may not be as effective as it appears.

You will probably have better success tackling the machine translation job as a whole, following his suggestion of relying on a third-party tool such as Google Translate: it's not perfect, but it's close enough for most practical purposes.

But you know your data best, you would be the best judge of that.

-sujit

