Rob wrote:
> Does anyone know the algorithm used to compare strings in the Similar() function? I have to create a similar function, excuse the pun, in SQL Server.
John,

would it be possible to send the sample to me, too? I have used SIMILAR for about 10 years to do "automated customer data clearing" stuff, and it works quite well in conjunction with LIKE and the like... I once ran tests with user-defined and external functions (e.g. with the "Levenshtein" algorithm), but they were far too slow compared to SIMILAR.
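For reference, the Levenshtein distance mentioned above counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The following is a minimal Python sketch of the textbook dynamic-programming version. It is not the user-defined function from those tests and says nothing about how SIMILAR works internally; the function name is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the rows short

    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Stefan", "Stephan"))  # 2
```

Each comparison costs time proportional to the product of the two string lengths, which may be part of why a row-by-row user-defined implementation feels slow next to a built-in function.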
Fortunately, the algorithm seems to have been the same since V5.5...so I would like to have a chance to look at it more closely.
TIA Volker (Feeling fine not to have to port that to MS SQL / ASE)
You may need to send me an email first or give me another email address. My email to you bounced with the following message: "Mail rejected for policy reasons."
-john.
-- John Smirnios Senior Software Developer iAnywhere Solutions Engineering
> "John Smirnios" <smirnios_at_sybase.com> wrote in > news:462ceb90$1@forums-1-dub... >> Yes. I'll send you a sample via email.
>> Whitepapers, TechDocs, bug fixes are all available through the iAnywhere >> Developer Community at http://www.ianywhere.com/developer
thanks for sending the source code! - I still have to study it in detail...
I have just read about UCA and collation tailoring in the 10.0.1 docs.
As this seems to be one of your favourite topics:
In the application I talked about, we typically have to compare German names (of persons and places). As you may know, these may contain "umlauts" like 'ä' or special characters like 'ß' (the "sharp s"). However, in older, restricted charsets (or in internationalized uses like mail addresses), these characters have often been expanded to two characters, e.g. 'ä' to 'ae' or 'ß' to 'ss'. So one task we face is to make 'ä' and 'ae' compare as equal.
AFAIK, single-byte collations can only compare characters one by one and therefore cannot treat 'ä' and 'ae' as wanted. Is this the same for Unicode collations, or could I establish some rule to make 'ä' and 'ae' the same?
(So far, we have solved this problem by storing both the original names and a "normalized" form, where umlauts are expanded, everything is uppercase and some phonetic simplifications are done (e.g. 'ph' sounds like 'f' and is therefore normalized to 'f'). The normalized form is stored as a computed column and is automatically calculated by a user-defined function. Comparisons are then done on the normalized forms. This works well with the typical German '1252LATIN1' single-byte collation.)
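The normalization described above can be sketched roughly as follows. This is a Python illustration only, not the actual user-defined function: the rule tables and the name `normalize_name` are assumptions, the real rule set is not shown in this thread, and the sketch assumes precomposed characters (a single 'ä', not 'a' plus a combining mark).

```python
# Hypothetical rule tables; the rules of the real function are not given in the thread.
UMLAUT_EXPANSIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "AE", "Ö": "OE", "Ü": "UE",
    "ß": "ss",
}
PHONETIC_RULES = [("PH", "F")]  # applied after uppercasing


def normalize_name(name: str) -> str:
    """Expand umlauts, uppercase, then apply simple phonetic replacements."""
    expanded = "".join(UMLAUT_EXPANSIONS.get(ch, ch) for ch in name)
    result = expanded.upper()
    for pattern, replacement in PHONETIC_RULES:
        result = result.replace(pattern, replacement)
    return result


print(normalize_name("Müller"))   # MUELLER
print(normalize_name("Stephan"))  # STEFAN
```

With both the original and the normalized value stored, an equality comparison on the normalized values treats 'Müller' and 'Mueller', or 'Stephan' and 'Stefan', as the same name.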
Any hint if UCA may give better facilities is highly appreciated...
In queries, UCA collations definitely use UCA to perform the comparison in a linguistically correct way so that 'SS' = 'ß' (not sure offhand if you need to specify the right locale/tailoring for that though).
However, the "similar" function in SQLAnywhere still uses a character-by-character match, as seen in the code I sent you. When scanning two strings such as 'SS' and 'ß', it will try to match the first 'S' with 'ß' and not find a match (since 'S' != 'ß').
If it's any consolation, the UPPER function will convert 'ß' to 'SS' if you are using an ICU collation (again, the correct locale/tailoring may be needed).
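The effect can be illustrated with a small Python sketch. It is not the "similar" source and does not use SQL Anywhere or ICU; `per_character_equal` is a hypothetical helper, and Python's built-in `str.upper()` (which applies the Unicode full case mapping) stands in for the UPPER function here.

```python
def per_character_equal(a: str, b: str) -> bool:
    """Compare two strings one position at a time, with no expansions."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


# A position-by-position scan cannot equate one character with two.
print(per_character_equal("SS", "ß"))          # False

# Uppercasing first expands 'ß' to 'SS', so the same scan now matches.
print(per_character_equal("SS", "ß".upper()))  # True
```

So applying an uppercase conversion to both arguments before a per-character comparison may be a workable way around the 'ß' case, though it does nothing for 'ä' versus 'ae'.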
-john.
-- John Smirnios Senior Software Developer iAnywhere Solutions Engineering
>> Volker Barth wrote: >>> John,
I would love to; however, when I asked for permission a long time ago to send out the source for "similar" I was told to include a statement to the effect of "you are free to use it and modify it however you like but you cannot redistribute the source without Sybase's permission". That pretty much precludes posting the source. C'est la vie. I don't mind emailing it to whomever asks for it.
-john.
-- John Smirnios Senior Software Developer iAnywhere Solutions Engineering
> Would it be possible to post this on a website somewhere (or just include it in a response to this post)? I'm sure others might be interested too.
So I guess I'm going to do some tests with UCA in the (not so near) future, though the particular solution we are using now may still be more appropriate for treating names like 'Stefan' and 'Stephan' (which are misspelled or mixed up quite often) as equal.
On Friday, April 27, 2007 at 17:42:29 UTC+2, John Smirnios wrote:
> I would love to; however, when I asked for permission a long time ago to > send out the source for "similar" I was told to include a statement to > the effect of "you are free to use it and modify it however you like but > you cannot redistribute the source without Sybase's permission". That > pretty much precludes posting the source. C'est la vie. I don't mind > emailing it to whomever asks for it.
> Rob wrote:
> > Does anyone know the algorithm used to compare strings in the Similar() > > function? I have to create a similar function, excuse the pun, in SQL > > Server.
> > TIA
> > P.S I hate SQL Server!
John,
I need to implement the Sybase Similar() function on Teradata. It would be great if you could share the algorithm with me.