I am trying to implement a way of showing similar documents (html pages,
mostly), and sorting them according to the degree of similarity.
The easiest way to do this is to calculate an Edit Distance (i.e. the
minimum number of changes required to transform one document to another)
from the target document.
However, the naive Levenshtein algorithm is O(n^2), and when you are
comparing two 10k documents it can take quite a while (9 or so seconds
per comparison on my Thinkpad T30!).
Does anyone know more computationally efficient comparison algorithms?
Many thanks!
Rogan
Rogan Dawes wrote:
> [Levenshtein algorithm]
>
> Does anyone know more computationally efficient comparison algorithms?
You could modify the Levenshtein distance so that it operates on words
(or sentences/paragraphs) instead of characters. "diff" compares lines,
for example.
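For example, here is a rough (untested) sketch in Java of a word-based
variant; only the tokenization changes, the dynamic-programming
recurrence stays the same:

public class WordEditDistance {
    // Levenshtein distance over word tokens instead of characters,
    // using two rolling rows so memory stays linear in one document.
    public static int distance(String a, String b) {
        String[] s = a.split("\\s+");
        String[] t = b.split("\\s+");
        int[] prev = new int[t.length + 1];
        int[] curr = new int[t.length + 1];
        for (int j = 0; j <= t.length; j++) prev[j] = j;
        for (int i = 1; i <= s.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length; j++) {
                int cost = s[i - 1].equals(t[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.length];
    }
}

It's still O(n*m), but n and m are now word counts rather than
character counts, which shrinks the table considerably for a 10k
document.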
Ciao,
Ingo
Thanks for the suggestion. I'll have to see how easy that is to implement.
diff x y | wc -l
http://citeseer.ist.psu.edu/broder97resemblance.html
It was good enough for AltaVista.
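The paper estimates resemblance from sets of word "shingles"
(contiguous word sequences). A rough sketch in Java of the underlying
set-resemblance idea (my own naming, and without the min-wise sampling
the paper uses to make it scale to many documents):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Resemblance {
    // The set of w-word shingles (contiguous word sequences) of a document.
    static Set<String> shingles(String text, int w) {
        String[] words = text.toLowerCase().split("\\W+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + w <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + w)));
        }
        return result;
    }

    // Resemblance = |A intersect B| / |A union B|, a value between 0 and 1.
    static double resemblance(String a, String b, int w) {
        Set<String> sa = shingles(a, w);
        Set<String> sb = shingles(b, w);
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }
}

Sorting by 1 - resemblance would give you the difference-ordered list
you're after.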
Richard
Rogan Dawes wrote:
> I am trying to implement a way of showing similar documents (html pages,
> mostly), and sorting them according to the degree of similarity.
Count the number of common trigrams, possibly not character-based but
(normalized) word-based. It might help to weight trigrams according to
their frequency, or according to the frequencies of the words they
contain.
I've never tried this on whole documents, but for words or multi-word
terms, trigram similarity seems to work reasonably well.
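A rough sketch of the simplest version of this in Java (untested;
character trigrams with raw frequency weighting and cosine similarity,
where the weighting scheme is just one possibility):

import java.util.HashMap;
import java.util.Map;

public class TrigramSimilarity {
    // Frequency map of character trigrams; word-based trigrams would
    // work the same way with a different tokenization step.
    static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase().replaceAll("\\s+", " ");
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity of the two trigram frequency vectors (0 to 1).
    static double similarity(String a, String b) {
        Map<String, Integer> ta = trigrams(a);
        Map<String, Integer> tb = trigrams(b);
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : ta.entrySet()) {
            na += (double) e.getValue() * e.getValue();
            Integer other = tb.get(e.getKey());
            if (other != null) dot += (double) e.getValue() * other;
        }
        for (int v : tb.values()) nb += (double) v * v;
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}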
Harald.
Interesting! Thanks for the link.
Would it be valid to do the opposite, namely measure the quantity of
*difference* instead of a measure of similarity?
> The easiest way to do this is to calculate an Edit Distance (i.e. the
> minimum number of changes required to transform one document to
> another) from the target document.
Note: now you're doing the opposite of what you originally said; you're
computing a measure of *difference* instead of similarity. So, despite
what you originally said, you're taking my preferred overall
methodology. Within that general approach, have you considered a
ProxHash as a proxy for Edit Distance? Basically, for each document you
compile statistics on trigrams (or whatever you think will make a good
proxy for position in abstract information space), then you feed those
into a ProxHash, which reduces the unbounded-dimension
trigram-or-whatever point to a point in a fixed-dimension normalized
linear space whose Euclidean metric is easy and fast to compute.

With document -> trigramStatistics -> euclideanVector accomplished,
clustering such vectors is much faster than clustering the original
documents directly, because calculating Euclidean distance (the square
root of the sum of squares of per-coordinate differences) is much
faster than computing the Edit Distance between two original documents.
Once you have the proxhashes clustered, you can either assume that
distances between original documents are proportional to proxhash
distances, if that's a good enough approximation for your purposes, or
you can compute the actual Edit Distance just for those few documents
that are close enough together to be considered for the user's result
set.
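Very roughly, one way to realize that pipeline in Java (a sketch of the
general idea only, not necessarily how ProxHash itself is defined; DIM
and the names are arbitrary choices of mine):

import java.util.HashMap;
import java.util.Map;

public class TrigramVector {
    static final int DIM = 128; // fixed dimension of the reduced space

    // Character trigram counts of a document.
    static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase().replaceAll("\\s+", " ");
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Hash each trigram into one of DIM buckets, then normalize to unit
    // length, giving a fixed-dimension vector per document.
    static double[] vector(String text) {
        double[] v = new double[DIM];
        for (Map.Entry<String, Integer> e : trigrams(text).entrySet()) {
            v[Math.floorMod(e.getKey().hashCode(), DIM)] += e.getValue();
        }
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        if (norm > 0) for (int i = 0; i < DIM; i++) v[i] /= norm;
        return v;
    }

    // Euclidean distance between two reduced vectors: O(DIM) per pair,
    // independent of document length.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}

Comparing two documents is then a matter of comparing two 128-element
arrays, which is what makes the clustering step cheap.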
By the way, your question has nothing to do with Java, so you should
have posted to comp.programming instead, and I'm cross-posting my
response there.