Ideally, I am looking for an elegant way of comparing two non- identical strings. I want to find the longest substring of stringOne that is the #sameAs any substring of stringTwo, what that substring is, and have some indicator of how similar the two strings are to each other (like 68% similar) . My first few attempts at writing this method have sprawled out well beyond anything I could call worthy of ST style and grace. Besides its inherent ugliness, I haven't even got it fully functional yet. Basically I am just taking the substrings of stringOne, from largest to smallest, and comparing them to the substrings of stringTwo, largest to smallest. I am making the substrings by removing characters #at: index whatever . It is very tedious and ugly. There must be a better way.
> I want to find the longest substring of stringOne > that is the #sameAs any substring of stringTwo, what that substring > is, and have some indicator of how similar the two strings are to each > other (like 68% similar) .
Hi Adam, I have not implemented this, but might suggest checking 'The Algorithm Design Manual', S. Skiena, ISBN 0-387-94860-0. Among the algorithms is pg 422-424. It discusses and input of a set, S, of strings S1...Sn. To answer, what the longest string S' such that for each Si, 1<=i<=n, the characters of S appear as a subsequence of Si ? There is an accompanying CD with source and links to implementations. There are also related problems discussed as approximate string matching, and shortest commmon superstring. Hope this point you in the rigth direction. ACG
> > I want to find the longest substring of stringOne > > that is the #sameAs any substring of stringTwo, what that substring > > is, and have some indicator of how similar the two strings are to each > > other (like 68% similar) .
> Hi Adam, > I have not implemented this, but might suggest checking 'The Algorithm > Design Manual', S. Skiena, ISBN 0-387-94860-0. Among the algorithms > is pg 422-424. It discusses and input of a set, S, of strings S1...Sn. > To answer, what the longest string S' such that for each Si, 1<=i<=n, > the characters of S appear as a subsequence of Si ? There is an > accompanying CD with source and links to implementations. There are > also related problems discussed as approximate string matching, and > shortest commmon superstring. > Hope this point you in the rigth direction. > ACG
thanks. I found a copy online. It was (and still is) a little over my head. I waded through it last night until about midnight, went to sleep, woke up, and rewrote my method like this: findLongestSubstringBetween: S and: T | substring l t i n m test | l := S size. t := 0. i := 1. [l > 0] whileTrue: [[t < i] whileTrue: [n := t + 1. m := t + l. substring := S copyFrom: n to: m. test := T findString: substring startingAt: 1. test = 0 ifFalse: [^substring]. t := t + 1]. l := l - 1. i := i + 1. t := 0] ^nil
Besides being relatively short, it actually finds the longest substring. Right now this method resides in a class I made, but I wonder if it go somewhere else. Any thoughts?
Adam <a...@crumptons.org> wrote: > findLongestSubstringBetween: S and: T > | substring l t i n m test | > l := S size. > t := 0. > i := 1. > [l > 0] whileTrue: > [[t < i] whileTrue: > [n := t + 1. > m := t + l. > substring := S copyFrom: n to: m. > test := T findString: substring startingAt: 1. > test = 0 ifFalse: [^substring]. > t := t + 1]. > l := l - 1. > i := i + 1. > t := 0] > ^nil
> Besides being relatively short, it actually finds the longest > substring. Right now this method resides in a class I made, but I > wonder if it go somewhere else. Any thoughts?
I'd either use S as receiver and define String>>findLongestSubstringIn: T or use T as receiver and define String>>findLongestSubstringOf: S
"Between ... and ..." sounds very symmetrical in its arguments to me, while your code searches for the longest substring of S in T. IMO, it's a better match to your implementation.
> > > I want to find the longest substring of stringOne > > > that is the #sameAs any substring of stringTwo, what that substring > > > is, and have some indicator of how similar the two strings are to each > > > other (like 68% similar) .
> > Hi Adam, > > I have not implemented this, but might suggest checking 'The Algorithm > > Design Manual', S. Skiena, ISBN 0-387-94860-0. Among the algorithms > > is pg 422-424. It discusses and input of a set, S, of strings S1...Sn. > > To answer, what the longest string S' such that for each Si, 1<=i<=n, > > the characters of S appear as a subsequence of Si ? There is an > > accompanying CD with source and links to implementations. There are > > also related problems discussed as approximate string matching, and > > shortest commmon superstring. > > Hope this point you in the rigth direction. > > ACG
> thanks. I found a copy online. It was (and still is) a little over my > head. I waded through it last night until about midnight, went to > sleep, woke up, and rewrote my method like this: > findLongestSubstringBetween: S and: T > | substring l t i n m test | > l := S size. > t := 0. > i := 1. > [l > 0] whileTrue: > [[t < i] whileTrue: > [n := t + 1. > m := t + l. > substring := S copyFrom: n to: m. > test := T findString: substring startingAt: 1. > test = 0 ifFalse: [^substring]. > t := t + 1]. > l := l - 1. > i := i + 1. > t := 0] > ^nil
> Besides being relatively short, it actually finds the longest > substring. Right now this method resides in a class I made, but I > wonder if it go somewhere else. Any thoughts?
> Adam
Elegant solution.
Below code is same as yours. I just made it more Smalltalkish(I hope).
findLongestSubstringBetween: S and: T | length subString | length := S size. length to: 1 by: -1 do: [:l | 0 to: length - l do: [:t | subString := S copyFrom: t + 1 to: t + l. (T findString: subString startingAt: 1) ~= 0 ifTrue: [^subString] ]]. ^nil