N-Gram Fingerprint method: need to add stop words


code4l...@gmail.com

Nov 18, 2020, 7:12:43 AM
to OpenRefine

Dear All,

I am having some problems with my GitHub account; it does not let me create an issue.
The issue I need to create is the following:

I need to accomplish the same as this issue describes:
https://github.com/OpenRefine/OpenRefine/issues/3200
but for the N-Gram Fingerprint method.
In which part of the N-Gram Fingerprint code should I make a change like the one below?
In the line of code where punctuation is removed, I need to add some words such as "publisher", "editor", etc.
https://github.com/OpenRefine/OpenRefine/blob/c76e2b9a461ed5b353ebf5c80e0e0cad2163331c/main/src/com/google/refine/clustering/binning/FingerprintKeyer.java#L93

In the code line above, one can alter the string being processed, so that is where I can add my stop words. In the N-Gram Fingerprint algorithm, however, I have not found the line where I could do the same. I am looking for the place in the code where punctuation is removed from the string, so that I can add the removal of my stop words there.
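For readers following along, here is a minimal sketch in Python (not OpenRefine's Java code) of the idea being asked about: stop words are dropped while word boundaries still exist, and only afterwards are whitespace and punctuation stripped and the n-grams built. The steps only approximate how the n-gram fingerprint method is usually described, and `STOP_WORDS`, `strip_stop_words` and `ngram_fingerprint` are hypothetical names chosen for illustration.

```python
import re
import unicodedata

# Hypothetical stop-word list; adjust it to your own data.
STOP_WORDS = {"publisher", "editor", "and", "sons"}


def strip_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text and drop whole words that are in stop_words."""
    words = re.findall(r"\w+", text.lower())
    return " ".join(w for w in words if w not in stop_words)


def ngram_fingerprint(text, n=2, stop_words=STOP_WORDS):
    """Approximate n-gram fingerprint with stop words removed first."""
    cleaned = strip_stop_words(text, stop_words)
    # Only now remove whitespace and punctuation, so that stop words
    # could still be matched as whole words in the step above.
    compact = re.sub(r"[\W_]+", "", cleaned)
    # Collect the distinct n-grams, sort them, and join them into one key.
    grams = {compact[i:i + n] for i in range(len(compact) - n + 1)}
    joined = "".join(sorted(grams))
    # Fold accented characters to plain ASCII.
    folded = unicodedata.normalize("NFKD", joined)
    return folded.encode("ascii", "ignore").decode("ascii")


print(ngram_fingerprint("Wiley"))           # eyillewi
print(ngram_fingerprint("wiley and sons"))  # eyillewi
```

The ordering matters: once whitespace is gone, "and" can no longer be told apart from the same letters inside a longer word, so the stop-word step has to come before the punctuation and whitespace step.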

If you believe it is necessary, please create an issue on GitHub, since my account is flagged.

I do not know why; I just created it from my campus network.

Thank you in advance for your kind help,

Owen Stephens

Dec 9, 2020, 10:28:01 AM
to OpenRefine
It may depend on how often you need to do this and what your exact use case is. If you want to do this without making changes to the code, you can use "Add column based on this column" to create a duplicate set of values, remove the stop words etc. from that copy, and then cluster on that column. This would let you find the clusters while ignoring the stop words, and create a "cleaned" version of the value (and when you resolve the clusters you can use whatever exact phrasing you want for the corrected value).

Best wishes

Owen

Petros Liveris

Dec 10, 2020, 6:38:34 AM
to OpenRefine
Thank you for taking the time to answer my questions, dear Owen.

I have done the first step: I used "Add column based on this column" to create a duplicate set of values, removed the stop words from that copy, and clustered on that column, which gives a "cleaned" version of each value.

Where I do not get the results I need is in linking each cleaned value back to the original one. I need to know which original values ended up in each cluster,

since I will make replacements like this:

Wiley 
wiley and sons 

These will not appear as a cluster in the fingerprint algorithm after the removal of "and" and "sons", since they will be considered a single value.

They will both be transformed to "wiley", so in the new column I will have far fewer clusters than my original data would produce.

Am I wrong?



Owen Stephens

Dec 10, 2020, 6:49:15 AM
to OpenRefine
I agree this isn't a perfect approach, but it may help you.
What I suggest is:

Once you have the new column of data without stop words, you can then apply a transform to create the "fingerprint" in that column:

value.fingerprint()

Then create a text facet based on this column, and another based on the original column.
The values in the first facet represent the clusters you would have found if you had been able to do a fingerprint clustering that removed the stop words first.
Select a value in the first facet to see, in the second facet, the original values that would have fallen into that cluster.
Correct any values in the second facet (either directly or through transformations/hand edits in the data grid).
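To make the two-facet idea concrete, here is a small sketch in Python of the same grouping done outside OpenRefine: each original value is keyed by a word-level fingerprint computed after stop-word removal, and the originals are collected per key. The key function only approximates what `value.fingerprint()` does, and `STOP_WORDS`, `fingerprint_without_stop_words` and `group_originals` are hypothetical names used for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; adjust it to your own data.
STOP_WORDS = {"and", "sons"}


def fingerprint_without_stop_words(value, stop_words=STOP_WORDS):
    """Lowercase, split into words, drop stop words, sort unique tokens."""
    tokens = re.findall(r"\w+", value.lower())
    kept = sorted({t for t in tokens if t not in stop_words})
    return " ".join(kept)


def group_originals(values):
    """Return {key: sorted list of original values sharing that key}."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint_without_stop_words(value)].add(value)
    return {key: sorted(members) for key, members in groups.items()}


clusters = group_originals(["Wiley", "wiley and sons", "Wiley & Sons", "Elsevier"])
for key, originals in clusters.items():
    print(key, "->", originals)
```

Each key plays the role of a value in the first facet, and the list attached to it is what the second facet would show once that value is selected.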

I hope that helps explain how you can use this approach.

Owen

Petros Liveris

Dec 10, 2020, 6:57:15 AM
to OpenRefine
I have also tried this approach, but the problem is that I cannot export the facets' results. This approach would mean fixing all the values manually, and in my case I need to automate the process somehow.
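One possible way to automate this outside OpenRefine, sketched in Python under stated assumptions: the column is exported as a plain list of values, a key is computed after stop-word removal, and every original is mapped to the most frequent original in its group. The choice of "most frequent" as the canonical value, and the names `key_of`, `build_replacements` and `write_mapping_csv`, are assumptions made for illustration and not part of OpenRefine.

```python
import csv
import io
import re
from collections import Counter, defaultdict


def key_of(value, stop_words):
    """Word-level key: lowercase, drop stop words, sort unique tokens."""
    tokens = re.findall(r"\w+", value.lower())
    return " ".join(sorted({t for t in tokens if t not in stop_words}))


def build_replacements(values, stop_words):
    """Map each original value to the most frequent original in its group."""
    counts = Counter(values)
    groups = defaultdict(list)
    for value in counts:
        groups[key_of(value, stop_words)].append(value)
    mapping = {}
    for members in groups.values():
        # Most frequent member wins; ties are broken alphabetically.
        canonical = min(members, key=lambda m: (-counts[m], m))
        for member in members:
            mapping[member] = canonical
    return mapping


def write_mapping_csv(mapping, out):
    """Write an original,replacement table that can be reviewed or reused."""
    writer = csv.writer(out)
    writer.writerow(["original", "replacement"])
    for original in sorted(mapping):
        writer.writerow([original, mapping[original]])


values = ["Wiley", "Wiley", "wiley and sons", "Elsevier"]
mapping = build_replacements(values, stop_words={"and", "sons"})
buffer = io.StringIO()
write_mapping_csv(mapping, buffer)
print(buffer.getvalue())
```

The resulting table keeps the link between every original value and its replacement, which is the piece that was missing from the facet-based workflow.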
