Dear Owen, thank you for taking the time to answer my questions.
I have done the first step as you suggested: I used "Add column based on this column" to create a duplicate set of values, used that copy to remove the stop words etc., and then clustered on that column. This lets me find the clusters while ignoring the stop words and create a "cleaned" version of the value (and, as you said, when resolving the clusters I can use whatever exact phrasing I want for the corrected value).
Where I do not get the results I need is in linking each cleaned value back to its original one. I need to know which original values ended up in each cluster,
since I will make replacements like this:
Wiley
wiley and sons
After the removal of "and" and "sons", these will not show up as a cluster in the fingerprint algorithm, since they will be considered a single value.
Both will be transformed to "wiley", so in the new column I will get far fewer clusters than my original data would produce.
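To make the concern concrete, here is a minimal Python sketch of what I mean. It is not OpenRefine code: `fingerprint` is only a simplified approximation of a fingerprint-style key (lowercase, strip punctuation, sort and de-duplicate tokens), and `STOP_WORDS` and `group_originals` are hypothetical names I made up for illustration. It groups the original values by their cleaned key, which is the mapping I am trying to keep.

```python
import string
from collections import defaultdict

# Hypothetical stop-word list, taken from the example above.
STOP_WORDS = {"and", "sons"}


def fingerprint(value, stop_words=frozenset()):
    """Simplified fingerprint-style key (an approximation, not OpenRefine's code):
    trim, lowercase, strip ASCII punctuation, drop stop words,
    then sort and de-duplicate the remaining tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()) - set(stop_words))
    return " ".join(tokens)


def group_originals(values, stop_words=frozenset()):
    """Map each cleaned key to the set of original values that produced it."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value, stop_words)].add(value)
    return dict(groups)


originals = ["Wiley", "wiley and sons"]

# Without stop-word removal the two values get different keys.
without_cleaning = group_originals(originals)

# With stop-word removal both collapse to the same key,
# while the original spellings are still available per key.
with_cleaning = group_originals(originals, STOP_WORDS)

for key, members in with_cleaning.items():
    print(key, "<-", sorted(members))
```

In this sketch both originals end up under the single key "wiley", which is exactly why they no longer appear as two distinct values to cluster in the cleaned column, yet the grouping still records which original values went into it.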
Am I wrong?