External Clustering API

Dave Waterworth

Oct 14, 2021, 10:41:06 PM
to OpenRefine
Is it possible to cluster a column using an external API? I want to find the longest common prefixes in a column so that I can remove them. Clustering with a custom distance works, for example:

1 - len(LCP(s,t)) / avg(len(s), len(t))

I could implement my own web service to do this, but my initial evaluation of OpenRefine leads me to believe that I would need to write a Java-based extension, and I don't have any knowledge of Java. Is there an easier way? I also want to integrate other functions, such as ML models or even heuristic-based column transforms, but I am not sure how to go about it.
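
For reference, the distance above can be sketched in a few lines of Python. This is only an illustration, not something OpenRefine provides: the function names `lcp_distance` and `strip_common_prefix` are made up here, and treating two empty strings as distance 0 is an assumption, since the formula is undefined in that case.

```python
from os.path import commonprefix  # character-wise common prefix of a list of strings


def lcp_distance(s: str, t: str) -> float:
    """Return 1 - len(LCP(s, t)) / avg(len(s), len(t))."""
    if not s and not t:
        # Assumption: two empty strings are considered identical.
        return 0.0
    lcp_len = len(commonprefix([s, t]))
    average_len = (len(s) + len(t)) / 2
    return 1.0 - lcp_len / average_len


def strip_common_prefix(values: list[str]) -> list[str]:
    """Remove the longest prefix shared by every value in the list."""
    prefix = commonprefix(values)
    return [value[len(prefix):] for value in values]
```

Identical strings get distance 0 and strings with no shared prefix get distance 1, so a clustering threshold somewhere in between controls how long the shared prefix must be relative to the string lengths.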

Vladimir Stavrov

Nov 15, 2021, 4:04:49 PM
to OpenRefine
I did something similar; hopefully this video gives an idea of how to do it: https://youtube.com/watch?v=Uqsrp04erfM&feature=share
Briefly, we need two web services to solve this task: 1) a publishing service, which accepts, say, the facet data from the desired column and does the initial clustering, and 2) a search or identification service, which returns, say, the cluster number for a given cell value. The second service can be called using 'Add column by fetching URLs'. This approach is explained in the video above. No Java programming is required.
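
A minimal sketch of the identification service, using only the Python standard library, might look like the following. It is not the code from the video: the `/lookup?value=...` endpoint, the JSON response shape, the port, and the greedy prefix-based clustering in `cluster_by_prefix` are all assumptions made for illustration, and the "publishing" step is reduced to passing a list of values to `serve`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from os.path import commonprefix
from urllib.parse import parse_qs, urlparse

CLUSTERS = {}  # value -> cluster number, filled by the "publishing" step


def cluster_by_prefix(values, min_prefix=3):
    """Greedy clustering (an assumption, not a standard algorithm): walk the
    sorted distinct values and start a new cluster whenever a value shares
    fewer than `min_prefix` leading characters with the cluster's first member."""
    clusters = {}
    cluster_id = -1
    anchor = None
    for value in sorted(set(values)):
        if anchor is None or len(commonprefix([anchor, value])) < min_prefix:
            cluster_id += 1
            anchor = value
        clusters[value] = cluster_id
    return clusters


class LookupHandler(BaseHTTPRequestHandler):
    """Answers GET /lookup?value=... with the cluster number as JSON."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        value = query.get("value", [""])[0]
        body = json.dumps({"value": value, "cluster": CLUSTERS.get(value)})
        payload = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(values, port=8000):
    """Cluster the published values, then answer lookups until interrupted."""
    CLUSTERS.update(cluster_by_prefix(values))
    HTTPServer(("127.0.0.1", port), LookupHandler).serve_forever()
```

With the service running locally, 'Add column by fetching URLs' could use a GREL expression such as `"http://127.0.0.1:8000/lookup?value=" + escape(value, "url")`, and the cluster number could then be extracted from the fetched column with `value.parseJson().cluster`.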

On Friday, October 15, 2021 at 05:41:06 UTC+3, wat...@gmail.com wrote:

Antonin Delpeuch (lists)

Nov 15, 2021, 4:38:22 PM
to openr...@googlegroups.com
Hi both,

Very interesting! Thank you Dave for the request and Vladimir for the video!

I think this would definitely be worth implementing. I have opened an
issue about it:
https://github.com/OpenRefine/OpenRefine/issues/4301
I hope I got the use case right?

Antonin

Vladimir Stavrov

Nov 16, 2021, 7:48:01 AM
to OpenRefine
Hi Antonin,
thank you for opening the issue; from my point of view it looks appropriate.

I would add a couple more features, or at least discuss their usefulness for the community:
1) Export of facets to a JSON file.
It would be great if the facets open in the left panel, together with their data, could be saved to a JSON file with a single button click,
something like ...{"FacetVariable":{values:[list of values], frequencies:[list of frequencies]}...}

2) Export of the application history to, and import from, a JSON file.
Sure, we can copy and paste the text in this window, but in the context of my application
a file would be a more user-friendly way to set up the correspondence between the base columns coming from the operations JSON
and the columns of the current project.
It would also allow the community to exchange operation history JSON files that solve particular problems.

Vladimir
On Tuesday, November 16, 2021 at 00:38:22 UTC+3, Antonin Delpeuch (lists) wrote:

Antonin Delpeuch (lists)

Nov 16, 2021, 11:37:25 AM
to openr...@googlegroups.com
Hi Vladimir,

Thanks for the other feature requests!
On 16/11/2021 13:48, Vladimir Stavrov wrote:
> 1) export of facets to json file.
> It would be great if facets and their data, opened at the left panel,
> could be saved to json file by single button click,
> something like ...{"FacetVariable":{values:[list of values],
> frequencies:[list of frequencies]}...}

I think we can already export the frequencies from text facets in TSV
form; are you aware of this feature? (It is not very visible, I
think.)

> 2) Application history export to/import from json file
> Sure, we could use copy/paste text in this window, but in context of my
> application
> it would be more user-friendly way to set up correspondence between base
> columns, coming from operations json,
> and columns of the current project.
> It would allow community to exchange operation history json files,
> solving particular problems.

Yes… That is also something I would love to do, I hope to go more in
this direction in the 4.0 branch for the new architecture. Stay tuned!

Antonin

Vladimir Stavrov

Nov 17, 2021, 4:53:49 AM
to OpenRefine
Hi Antonin,

Regarding point 1: I found a discussion about that at https://github.com/OpenRefine/OpenRefine/pull/2685
The only way I found to export facets with their content is to copy and paste the content of each facet individually via the clipboard, after clicking its "Facet choices" link.
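
As a stopgap, the text copied from a facet this way can be converted to the JSON shape suggested earlier in the thread with a short Python script. This is only a sketch under assumptions: it expects one `value<TAB>count` pair per line in the pasted text, and the function name `facet_tsv_to_json` is made up for illustration.

```python
import json


def facet_tsv_to_json(facet_name: str, tsv_text: str) -> str:
    """Convert pasted facet choices (assumed 'value<TAB>count' per line) to
    {"<facet name>": {"values": [...], "frequencies": [...]}} as a JSON string."""
    values, frequencies = [], []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip empty lines, e.g. a trailing newline from the paste
        # Split on the last tab so that values containing tabs stay intact.
        value, _, count = line.rpartition("\t")
        values.append(value)
        frequencies.append(int(count))
    return json.dumps({facet_name: {"values": values, "frequencies": frequencies}})
```

For example, `facet_tsv_to_json("city", "Paris\t3\nLyon\t1\n")` produces a JSON object with the values `["Paris", "Lyon"]` and the frequencies `[3, 1]` under the key `"city"`.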

By a "single facet export button" I meant an "export facet as json" button placed somewhere next to the "Refresh"/"Reset All"/"Remove All" buttons
under the Facet/Filter tab.
Judging from that discussion, this feature has been waiting in the queue for a long time.

Regarding the second point: it would be great to have this feature in the upcoming 4.0 release/branch, thank you!

On Tuesday, November 16, 2021 at 19:37:25 UTC+3, Antonin Delpeuch (lists) wrote:

Antonin Delpeuch (lists)

Nov 17, 2021, 4:59:52 AM
to openr...@googlegroups.com
Hi Vladimir,

On 17/11/2021 10:53, Vladimir Stavrov wrote:
> As I see from discussion, this feature stays in a queue for a long time....

Yes unfortunately our capacity is quite limited… We have more than 650
open issues and it can take a few hours to solve just a single one. We
always welcome new contributors though!

>
> Regarding the 2nd point - it would be great to have this feature in
> coming release/branch 4.0, thank you!

It will probably not be in the upcoming release of the 4.0 branch, but I
hope to be working on this (among other things) in the following ones.

Best,
Antonin