Levenshtein edit-distance and Ratio


pierre baral

Dec 1, 2011, 2:00:07 AM
to nltk-...@googlegroups.com
Hi,

I'm pretty new to NLTK, so forgive the beginner questions ;)

I want to compare various sentences in my program in order to find similar sentences.

Example of similar sentences (for me):
string1 = "La Chine abaisse les ratios de réserves obligatoires des banques"
string2 = "La Chine a abaissé les ratios des réserves des banques"

In the past, I used the python-Levenshtein library which is pretty good.
https://github.com/miohtama/python-Levenshtein/tree/

This library gives you the equivalent of edit_distance(), but also a ratio.
I don't know how this ratio is computed, but it works pretty well (better than just the raw edit_distance result).

I can't find any such ratio in NLTK. Does someone know what this ratio is? (I read part of the library code,
but it's too difficult for me.)

Is there a way to get this ratio in NLTK? :-)

Am I using the right technique, or should I use jaccard_distance or something else?

Best Regards,

Pierre

Alexis Dimitriadis

Dec 1, 2011, 7:03:15 AM
to nltk-...@googlegroups.com
> So does someone know what this ratio is (I have read a part of the library code
> but that's too difficult for me).

I took a look out of curiosity. The python-Levenshtein ratio is computed as follows (in ratio_py):

    return (lensum - ldist) / lensum

ldist is the Levenshtein distance, lensum is the sum of the two string lengths. If lensum is zero (two empty strings), ratio_py returns 1.0 as a special case. Have fun with it!

Alexis

static PyObject*
ratio_py(PyObject *self, PyObject *args)
{
  size_t lensum;
  long int ldist;
  if ((ldist = levenshtein_common(args, "ratio", 1, &lensum)) < 0)
    return NULL;
  if (lensum == 0)
    return PyFloat_FromDouble(1.0);
  return PyFloat_FromDouble((double)(lensum - ldist)/(lensum));
}
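For readers without the C extension, the same formula can be sketched in pure Python. This is a sketch using the plain unit-cost edit distance, not the library's actual internals; the names levenshtein and lev_ratio are made up here:

```python
def levenshtein(s1, s2):
    # classic dynamic-programming edit distance with unit-cost
    # insertions, deletions, and substitutions
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

def lev_ratio(s1, s2):
    # (lensum - ldist) / lensum, with ratio_py's empty-string special case
    lensum = len(s1) + len(s2)
    if lensum == 0:
        return 1.0
    return (lensum - levenshtein(s1, s2)) / lensum

print(lev_ratio("kitten", "sitting"))  # (13 - 3) / 13
```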

-- 
_____________________________________________
 
Alexis Dimitriadis
A.Dimi...@uu.nl
+31-30-253-6568
 
Utrecht Institute of Linguistics OTS
Trans 10
3512 JK Utrecht
The Netherlands

pierre baral

Dec 1, 2011, 8:18:07 AM
to nltk-...@googlegroups.com
Thanks Alexis, I didn't think it was that simple, but it works better in my case :)
Here's my test file :p

#!/usr/bin/python

from __future__ import division  # so the ratio is computed in floating point
from Levenshtein import ratio    # explicit import instead of a wildcard

import nltk

string1 = "La Chine abaisse les ratios de reserves obligatoires des banques"
string2 = "La Chine a abaisse les ratios des reserves des banques"

len1 = len(string1)
len2 = len(string2)
lensum = len1 + len2
levdist = nltk.edit_distance(string1, string2)

print "len1: %s" % len1
print "len2: %s" % len2
print "lensum: %s" % lensum
print "levdist: %s" % levdist

# recompute python-Levenshtein's ratio from NLTK's edit distance
nltkratio = (lensum - levdist) / lensum
print "nltkratio %s" % nltkratio

# compare against python-Levenshtein's own ratio()
pylevratio = ratio(string1, string2)
print "pylevratio %s" % pylevratio

$ python compareratios.py
len1: 64
len2: 54
lensum: 118
levdist: 16
nltkratio 0.864406779661
pylevratio 0.864406779661
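As a side note, the standard library's difflib offers a related measure: SequenceMatcher.ratio() returns 2*M/T, where M is the number of matching characters it finds and T is the total length of both strings. It is related to, but not always equal to, (lensum - ldist)/lensum:

```python
from difflib import SequenceMatcher

string1 = "La Chine abaisse les ratios de reserves obligatoires des banques"
string2 = "La Chine a abaisse les ratios des reserves des banques"

# SequenceMatcher.ratio() = 2*M / T, with M = matching characters
# and T = len(string1) + len(string2)
sm = SequenceMatcher(None, string1, string2)
print(sm.ratio())
```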


BTW, is this a good method for comparing similarity, or is there another one in NLTK?

Regards,

2011/12/1 Alexis Dimitriadis <alexis.di...@gmail.com>

Steven Bird

Dec 1, 2011, 8:52:30 PM
to nltk-...@googlegroups.com
On 2 December 2011 00:18, pierre baral <pierre...@gmail.com> wrote:
> BTW, is it the good method I use to compare similarity?

Note that the latest issue of the Computational Linguistics journal,
published yesterday, has an article on Levenshtein distance as used to
measure language similarity:

http://www.aclweb.org/anthology/J/J11/#4000

-Steven Bird

Stuart Robinson

Dec 1, 2011, 8:59:13 PM
to nltk-...@googlegroups.com
Yes, and the conclusion is basically that it's not very good for that task, which isn't exactly surprising given how linguistically naive the approach is.

pierre baral

Dec 2, 2011, 1:33:53 AM
to nltk-...@googlegroups.com
OK about the naive approach, but is there another method implemented in NLTK?
Is there something I'm missing? (I hope not ;))

2011/12/2 Stuart Robinson <stuartpr...@gmail.com>
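One token-level alternative that NLTK does ship is jaccard_distance (in nltk.metrics), which works on sets. The self-contained sketch below mirrors what nltk.metrics.jaccard_distance computes, applied to naive whitespace tokens (the naive tokenization is an assumption here):

```python
def jaccard_distance(a, b):
    # 1 - |A ∩ B| / |A ∪ B|; nltk.metrics.jaccard_distance computes the same
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

tokens1 = "La Chine abaisse les ratios de reserves obligatoires des banques".split()
tokens2 = "La Chine a abaisse les ratios des reserves des banques".split()
print(jaccard_distance(tokens1, tokens2))  # 3/11, about 0.27
```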

Alexis Dimitriadis

Dec 2, 2011, 6:17:00 AM
to nltk-...@googlegroups.com
> BTW, is it the good method I use to compare similarity?
What are you trying to do? What counts as a "good" similarity metric depends on what you're using it for.

Best,

Alexis

pierre baral

Dec 2, 2011, 12:45:46 PM
to nltk-...@googlegroups.com
I just want to compare titles of articles (so, short sentences).

2011/12/2 Alexis Dimitriadis <alexis.di...@gmail.com>

Luciano M. Guasco

Dec 10, 2011, 3:30:36 PM
to nltk-...@googlegroups.com
Hi Pierre. I used Euclidean distance over features such as words shared between the two sentences, synonyms, heads of NPs obtained with chunking, etc.

But maybe someone here can give a tip about Latent Semantics? I've heard it can work well for sentence similarity. I'll look for some papers on it, since I'm working on sentence similarity right now... Euclidean distance is OK, but many sentences have the same semantics without sharing words, synonyms, or even structure.
If I find anything new, I'll post it.

Best,
Luciano.
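The feature-vector idea can be sketched minimally. The single feature type below (bag-of-words counts over the union vocabulary) is an illustrative assumption, not Luciano's actual feature set, which also included synonyms and NP heads:

```python
from collections import Counter
from math import sqrt

def euclidean_distance(s1, s2):
    # build bag-of-words count vectors over the union vocabulary,
    # then take the Euclidean distance between them
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    vocab = set(c1) | set(c2)
    return sqrt(sum((c1[w] - c2[w]) ** 2 for w in vocab))

print(euclidean_distance("the cat sat", "the cat ran"))  # sqrt(2)
```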

pranav waila

May 11, 2016, 5:23:45 AM
to nltk-users, pierre...@gmail.com
One problem with this approach is that it doesn't handle the case where both strings have length 0: lensum is then 0 and the division fails (ratio_py special-cases this and returns 1.0).
You need to check for that case, I guess.
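The guard can be sketched like this, mirroring the lensum == 0 special case in the ratio_py code quoted earlier (safe_ratio is a hypothetical helper name):

```python
def safe_ratio(lensum, levdist):
    # mirror ratio_py's special case: two empty strings give lensum == 0,
    # so return 1.0 instead of dividing by zero
    if lensum == 0:
        return 1.0
    return (lensum - levdist) / lensum

print(safe_ratio(0, 0))     # two empty strings
print(safe_ratio(118, 16))  # the lensum/levdist values from the thread's example
```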

Regards,
Pranav Waila