Transformation to use??

Rutvik Sheth

Sep 23, 2012, 11:30:56 PM
to ld...@googlegroups.com
Hi,

I have been using the single-machine and Hadoop versions of LDIF on a test basis and need help selecting the best transformation to use in the linkspec file.
My use case is to match company names and generate sameAs links. To do so, I need to remove suffixes such as Ltd, Co, Corp, etc. from the company names.
I started by using the replace transformation, but I find I can only replace one string with it and cannot give it a list.
I tried the removeValues transformation, but that does not seem to work at all.
I then used the regexReplace transformation, and while this works, it replaces matches anywhere in the company name and not just in the suffix. So, for example, it would not work for
Colgate Co Ltd. because it would convert it into "lgate", since Co and Ltd are the regexes I have specified.
I also tried tokenize, but that reduces the number of matches I get in the result. Please suggest what I can use.

Best,

Rutvik.

a.sc...@mediaeventservices.com

Sep 24, 2012, 9:30:43 AM
to ld...@googlegroups.com
Hi Rutvik,

You can solve this problem by anchoring the regex to the end of the string with a trailing dollar sign, so that it only matches a suffix:

\s+(Corp|Co|Inc)$

This pattern also requires at least one whitespace character before the suffix.

However, for your example to work, you would need to add "Co\s+Ltd" to the list of suffixes, because "Co" alone no longer matches once the pattern is anchored to the end of the string.
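
To illustrate the difference, here is a minimal sketch in Python. It uses Python's re module rather than the regexReplace transformation itself, and the suffix list and the strip_suffix helper are assumptions made up for this example, not part of LDIF. The pattern only uses syntax (\s, alternation, $) that is common to most regex engines.

```python
import re

# Hypothetical suffix list for illustration; extend it for your own data.
# The pattern is anchored with "$", so it only matches at the end of the name,
# and "\s+" requires whitespace before the suffix.
SUFFIX_RE = re.compile(r"\s+(Co\s+Ltd|Corp|Co|Inc|Ltd)$")

# The unanchored variant from the question: matches "Co" or "Ltd" anywhere.
UNANCHORED_RE = re.compile(r"Co|Ltd")


def strip_suffix(name: str) -> str:
    """Remove one trailing company suffix, leaving the rest of the name intact."""
    return SUFFIX_RE.sub("", name)


if __name__ == "__main__":
    # Unanchored replacement also eats the "Co" at the start of "Colgate".
    print(UNANCHORED_RE.sub("", "Colgate Co Ltd").strip())
    # Anchored replacement only removes the trailing " Co Ltd".
    print(strip_suffix("Colgate Co Ltd"))
    print(strip_suffix("Costco Inc"))
```

Note that a name ending in a period, such as "Colgate Co Ltd.", would not match this pattern as written; allowing an optional trailing period would be a further assumption about the data.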

Cheers,
Andreas