For example
I want to search for Chicken in the string
string s1 = "This is Great Chicken";
string s2 = "Chicken";
return s1.IndexOf(s2) >= 0
The trick is how would I search for: Person doing the search
misspelled the word Chickin (Happens all the time);
string s1 = "This is Great Chicken";
string s2 = "Chickin";
return ???
or
Worst yet
string s1 = "This is Great Chickin";
string s2 = "Chicken";
Person enter the search phrase misspelled the word Chicken;
string s1 = "This is Great Chickin";
string s2 = "Chicken";
Check out the Regex namespace.
If you search something such as "Chickin" in a string which does not
contain this token, such as "This is Great Chicken", what would you
expect to return? I wouldn't expect it to return anything. Maybe you
wanna do something like google's key words suggestion if user entry
has some typos?
Correction: I meant the System.Text.RegularExpressions Namespace
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.aspx
Regex is a class of this namespace.
> Any idea on how I would be able to do a search within C# that does
> ranges or words
There are a variety of techniques for dealing with partial matching as you
describe.
I've actually implemented a very simple approach myself in the past, in
which I do a sort of character-by-character merge between the target text
and the search word, finding the spot in the target text where the search
word fits the best with the least number of excluded characters. This
would work perfectly for the examples you gave, but whether that would
address your more general goals I can't say.
If that sounds promising, I might be able to dig up the code and post it
here. It's pretty inefficient, but it's small and simple so what the
heck? :)
You might also Google for "soundex". There's even a Wikipedia article on
the topic:
http://en.wikipedia.org/wiki/Soundex
That's a commonly used technique for matching words based on phonetics and
if you want a solution that is based more on language than on typography,
it might be a better approach.
A Google search on "dictionary spell check algorithm" also turned up a
number of promising links.
Pete
Hi,
I do not how to do it in code, I'm pretty sure that no version of the
framework provides such a feature.
In T-SQL you have this feature though take a look for SOUNDEX function
I do not know how to use a Regex in this case.can you elaborate?
> The trick is how would I search for: Person doing the search
> misspelled the word Chickin (Happens all the time);
The answer is that what you're looking for is not at all simple. I'd
recommend you do a Google search to see if you can find sample code for a
spell checker.
But, with addition to already mentioned Soundex, you might like to check
Levenshtein distance
(http://en.wikipedia.org/wiki/Levenshtein_distance). It tells how
similar two texts are; i.e. how many simple edit commands are needed to
make the second text from the first. For example distance between
Chicken and Chikin is one.
About Soundex: I have not studied it much, but I have a feeling Knuth
built it for english words. How well does it apply to other languages
or names?
-
Arto Viitanen
>>> The trick is how would I search for: Person doing the search
>>> misspelled the word Chickin (Happens all the time);
>>
>> The answer is that what you're looking for is not at all simple. I'd
>> recommend you do a Google search to see if you can find sample code for a
>> spell checker.
> Spelling checker does not help. For example the text contains "Cut" and
> the user searches for "Cat". Both are spelled right, so the search gives
> no result.
I agree, IF you say, "Hey, Mr. Spell Checker algorithm, tell me what's
misspelled in this sentence." However, if you are in control of the code,
I'd think you'd be able to say, "Hey, Mr. Spell Checker algorithm, make a
list of all the probable misspellings of 'chicken' and tell me if this
string contains any of them." See where I'm going?
> But, with addition to already mentioned Soundex, you might like to check
> Levenshtein distance
> (http://en.wikipedia.org/wiki/Levenshtein_distance). It tells how
> similar two texts are; i.e. how many simple edit commands are needed to
> make the second text from the first. For example distance between
> Chicken and Chikin is one.
Sounds good.
> About Soundex: I have not studied it much, but I have a feeling Knuth
> built it for english words. How well does it apply to other languages
> or names?
I've heard people say that Soundex is an ancient crutch and shouldn't be
relied upon too much.
That sounds interesting. Is there any chance of seeing it as a
CodeProject article? :D
> [...]
> Spelling checker does not help. For example the text contains "Cut" and
> the user searches for "Cat". Both are spelled right, so the search gives
> no result.
A spell checker itself wouldn't help. But if you apply the same algorithm
that a spell checker would, but with a dictionary that only has the word
you're specifically looking for at the moment, presumably those algorithms
would spit out some reasonable assessment of "closeness" for the word as
compared to the text being searched.
After all, for a spell checker to evaluate what the "best match" is for
misspelled words, it has to be able to compute some metric so that
possible matches can be ordered according to "closeness".
For all I know, the mentioned "Levenshtein distance" computation is in
fact something that a spell checker might use. But regardless, it seems
both would be good starting points for the question.
Pete
> [...]
>> I've actually implemented a very simple approach myself in the past,
>> in which I do a sort of character-by-character merge between the target
>> text and the search word, finding the spot in the target text where the
>> search word fits the best with the least number of excluded
>> characters. This would work perfectly for the examples you gave, but
>> whether that would address your more general goals I can't say.
>>
> [SNIP]
>> Pete
>
> That sounds interesting. Is there any chance of seeing it as a
> CodeProject article? :D
Nope. But I'll look for the code and try to remember to post it here.
After that, someone else can post it to CodeProject if they like.
Note that I wasn't kidding when I said it's not efficient code. The
application I needed it for involved very short strings and brute force
worked great. I haven't looked at the Levenshtein article yet, but I
suspect that the algorithm described there is similar to what I did, but
with some actual thought and cleverness applied to it so that it has a
reasonable computational cost. :)
Pete
That would be nice, thank you.
> Note that I wasn't kidding when I said it's not efficient code. The
> application I needed it for involved very short strings and brute force
> worked great. I haven't looked at the Levenshtein article yet, but I
> suspect that the algorithm described there is similar to what I did, but
> with some actual thought and cleverness applied to it so that it has a
> reasonable computational cost. :)
>
I see. However perhaps it could be improved. It would be interesting, at
least, to see the approach you used, so if you have the time and mood,
I'd be most grateful if you could post such code.
Thanks. Best Regards.
> [...]
>> Note that I wasn't kidding when I said it's not efficient code. The
>> application I needed it for involved very short strings and brute force
>> worked great. I haven't looked at the Levenshtein article yet, but I
>> suspect that the algorithm described there is similar to what I did,
>> but with some actual thought and cleverness applied to it so that it
>> has a reasonable computational cost. :)
>
> I see. However perhaps it could be improved. It would be interesting, at
> least, to see the approach you used, so if you have the time and mood,
> I'd be most grateful if you could post such code.
Here you go (see below). Note that in my original problem, I was
comparing two complete strings and producing a metric that indicated how
similar the strings were, rather than searching for one string in
another. But you can easily modify this approach by using the same basic
technique, but only looking at substrings of the string being searched.
Though, actually...now that I look at this implementation again, I think
you could probably feed it a string to be searched and a string to look
for, and it would work okay. It could even be modified to return the
character index of the starting point of the best match for the string
being looked for. You'd want to limit how far it looks ahead, maybe
simply by delimiting attempted matches, or otherwise take into account
gaps within the matched characters, because otherwise it will consider
itself to have successfully matched the string even if each character it
found was actually just located in a different word. But other than that,
it's probably pretty close to what could work.
The basic idea below is to start out doing a basic string compare, but
when reaching a character that doesn't match, scan forward in each string
to try to match the current character in the other string (for two scans
each mis-match). If a matching character can be found, continue trying to
match characters again, looking for the point at which a comparison of the
remaining characters returns the highest "score". In this context, the
score is the number of characters that the algorithm could match before
running out of characters in one of the strings.
You'll note that the algorithm basically enumerates all of the possible
ways that the strings could match. There's essentially a binary search
tree that bifurcates each time there's a character mis-match, which means
that the cost of the algorithm has a worst-case cost that is exponential
by the length of the longest search. In practice, it will hit the
worst-case scenario only rarely, but it's at least worth understanding,
especially since you don't have to get very far into the algorithm for
even an average-case scenario to be relatively expensive as compared to
some other algorithm that might be designed more optimally.
Like I said, I was dealing with such short strings, I made no attempt at
all to be clever. It worked very well for my needs at the time, but I
make no promises about anyone else's. :)
Pete
static int ScoreStrings(string str1, string str2)
{
int ich = 0, cchMax = Math.Min(str1.Length, str2.Length);
int scoreMax;
// Scan the strings, stopping at the first difference
while (ich < cchMax && str1[ich] == str2[ich])
{
ich++;
}
// If we've exhausted one of the strings, that's the best we're
// going to do
if (ich == cchMax)
{
return ich;
}
// For each string, scan forward looking for the current
character
// in the other. Check each possible match, and use the best
score
// we find.
scoreMax = Math.Max(ich,
Math.Max(ScoreStringsScan(str1.Substring(ich), str2, ich),
ScoreStringsScan(str2.Substring(ich), str1, ich)));
return scoreMax;
}
static int ScoreStringsScan(string str1, string str2, int ichStart)
{
int scoreMax = 0;
for (int ichScan = ichStart; ichScan < str2.Length; ichScan++)
{
if (str1[0] == str2[ichScan])
{
int score = ScoreStrings(str1,
str2.Substring(ichScan));
if (score > scoreMax)
{
scoreMax = score;
}
}
}
return scoreMax;
}
Thanks a lot Pete, it looks interesting.
Regards,
Fernando.