How to alter the key collision + fingerprint method


Petros Liveris

Sep 15, 2020, 4:07:08 AM
to OpenRefine
Hello,

For my specific needs, I need to make a small adjustment to the way the key collision + fingerprint algorithm works. In one step of creating the clusters, the algorithm removes all punctuation and control characters. In that same step, I also need to remove some stop words, for example "&", "and", "co", and "the".
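To make the request concrete, here is a minimal Python sketch of a fingerprint-style key with an extra stop-word step. It is an illustration only, not OpenRefine's actual Java source: the function name `fingerprint`, the `STOP_WORDS` set, and the exact character classes are assumptions made for this example.

```python
import re
import string
import unicodedata

# Hypothetical stop-word list, taken from the examples in this post.
# "&" needs no entry: it is already stripped as punctuation below.
STOP_WORDS = {"and", "co", "the"}

# ASCII punctuation plus non-whitespace control characters.
_PUNCT_CTRL = re.compile(
    "[" + re.escape(string.punctuation) + r"\x00-\x08\x0e-\x1f\x7f]"
)

def fingerprint(value, stop_words=STOP_WORDS):
    """Build a clustering key; the original value is left untouched."""
    s = value.strip().lower()
    s = _PUNCT_CTRL.sub("", s)
    # Fold accented characters to their closest ASCII form.
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    # Split into tokens, drop duplicates and stop words, then sort.
    tokens = sorted(set(s.split()) - set(stop_words))
    return " ".join(tokens)
```

Because only the key changes, values such as "Francis & Taylor" and "Francis Taylor co" would share a key while still being displayed in their original form.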

Where in the source code should I make this change?

And what steps would be required to build a new compiled version for Linux?

Thank you very much in advance

Tom Morris

Sep 15, 2020, 3:43:44 PM
to openr...@googlegroups.com
The openrefine-dev list is the best place to ask developer questions, but changing the source code seems like it should be a last resort. 

Are you sure you can't just make a copy of the column of interest and preprocess it to remove your stopwords before clustering? That would be a lot less work.

Tom


Petros Liveris

Sep 16, 2020, 7:15:19 AM
to OpenRefine
I need exactly the same behavior that punctuation already gets. The stop words (&, and, co) should be ignored when the clusters are created, but the suggested merge value should be based on the original values, and the cluster results should list all of the original values.

Francis & Taylor, 
Francis and Taylor
Francis Taylor co

If I removed the stop words first and then ran the clustering, all of the above would become

Francis Taylor (and would not even form a cluster, since the values would already be identical)

If this can still somehow be achieved via the GUI, please let me know.

Thank you again

Tom Morris

Sep 16, 2020, 12:36:26 PM
to openr...@googlegroups.com
I would use OpenRefine's record mode for this.


I would duplicate the column, move the copy to the left, remove your stop words from it, sort on it, and make the sort permanent, to end up with this:

Normalized Name    Name
Francis Taylor     Francis & Taylor
Francis Taylor     Francis and Taylor
Francis Taylor     Francis Taylor co


You could then sort on Normalized Name, make the sort permanent, apply Blank Down to that column, and then start your clustering operations. All of the rows with the same Normalized Name will be grouped into the same record when displayed in OpenRefine's Record mode.

You may need to vary the order of operations slightly depending on what you're trying to achieve, but this basic outline should work for a large number of cases.
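Outside of OpenRefine, the grouping that this outline produces can be sketched in a few lines of Python. This is only an illustration of the idea, not something OpenRefine runs: the names `normalize`, `group_by_normalized`, and the `STOP_WORDS` set are hypothetical and chosen for the example.

```python
from collections import defaultdict

# Hypothetical stop-word list, taken from the examples in this thread.
STOP_WORDS = {"&", "and", "co", "the"}

def normalize(name):
    """Return the name with stop words and trailing commas/periods removed."""
    tokens = [t.strip(",.") for t in name.split()]
    return " ".join(t for t in tokens if t and t.lower() not in STOP_WORDS)

def group_by_normalized(names):
    """Map each normalized name to the list of original values behind it."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)

names = ["Francis & Taylor, ", "Francis and Taylor", "Francis Taylor co"]
for key, originals in group_by_normalized(names).items():
    print(key, "<-", originals)
```

Each key plays the role of the Normalized Name column, and the list under it corresponds to the rows of one record, so the original values stay visible.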

An entirely different approach would be to reconcile against a reconciliation service that knows most or all of the name variants.

Tom