Add column by fetching URLs: caching?

Antonin Delpeuch (lists)

Mar 4, 2017, 2:30:21 PM
to openr...@googlegroups.com
Hi,

Looking at the source code of OpenRefine, it seems that if I use "Create
column by fetching URLs", the same URL can be fetched many times (if
some value occurs multiple times in the dataset).

Wouldn't it be better to fetch every URL only once? That would require
some caching. I'm not familiar with the internals of OpenRefine but I'm
happy to implement that! I just need to know what kind of cache I should
use for that.

Cheers,
Antonin

Thad Guidry

Mar 4, 2017, 5:40:40 PM
to openr...@googlegroups.com
Hi Antonin !

Welcome back !

OpenRefine is a power tool, as you know.  You can slice and dice a column's values...or remove rows that have duplicate values in that column...PRIOR to running the "Add column by fetching URLs".

We even have a Facet for that function !  Facet -> Customized Facets -> Duplicates facet

Removing duplicate values in a column is left to our users: they can Flag or Star the rows that have duplicate values and then use All -> Edit rows -> Remove all matching rows, if they want to... or not... they can just run the Fetch URL while having the Duplicates facet clicked on only the "false" rows :) :)

Lots of hidden power, just play around and visit all the sub menus...and if you get lost or can't figure something out...we're here to help !

Antonin Delpeuch (lists)

Mar 4, 2017, 6:14:54 PM
to openr...@googlegroups.com
Hi Thad,

Right, but in my case I don't want to remove matching rows. Here is the
use case:

I have two columns, A and B. For instance A is an organization and B is
its country. I want to fetch a column C using a URL that only depends on
B. This will yield many duplicate URLs, as many rows share the same value
for B. However the data in column A is important so I don't want to
merge all rows with the same value in B!

And I don't want the column C to be filled for only one row per value in
B: I want all rows to have the value obtained at the URL.

Concerning the method with the facet: I'm not sure I see how you would
do that. If I filter on the duplicate facet with "true", I get only the
duplicate rows. If it's "false", I get only the ones with a single
occurrence. But I want to fetch the URL for all values!

I can probably start with the ones with "false", but then I would need
to do the ones with "true" after having merged them. And then "un-merge"
and duplicate the results of the URLs to fill the whole column?

That seems quite complicated for something that could be handled very
simply by the software!

Antonin

Thad Guidry

Mar 4, 2017, 6:54:13 PM
to openr...@googlegroups.com
Then don't do the OPTIONAL ... All -> Edit rows -> Remove all matching rows

Just use Facets... to temporarily change the view (the grid) to only show non-duplicate rows on Column C...and Fetch on the URLs in Column C.


If your data doesn't look like the above...then perhaps put it into that form using OpenRefine's Facets and Fill Down, etc.
Then perform your de-duplicating facet on Column C
Then change to RECORDS MODE !  Not ROWS MODE.
Move your KEY column into Column 1 position... (I have no idea what that might be for you, but perhaps it's your Column B???)
Then perform the fetch on Column C by Add column by fetching URLs into Column D

Records mode is an important step for your use case that you're not performing and probably need to :)

Happy to do a hangout with you on Google, just ping...I'm online.

Antonin Delpeuch (lists)

Mar 5, 2017, 5:23:56 AM
to openr...@googlegroups.com
Thanks Thad for the explanations but I still don't see how to do it!
Fetching the URL in records mode still fetches the URLs multiple times
for each record.

Here is my dataset:

fruit,city
apple,milano
pear,milano
plum,trento
strawberry,milano
banana,nairobi
kiwi,trento

Let's say I want to add a column by fetching URLs based on the "city"
column, so that URLs are fetched only once for a given value of the
"city" column.

To make sure every URL is fetched only once, I use this URL expression:
"https://www.random.org/integers/?num=1&min=1&max=100&col=1&base=10&format=plain&rnd=new&city="+value

This will return a random number every time the URL is fetched, ignoring
the city passed as parameter. Once URLs have been fetched, the dataset
should look like this:

fruit,city,number
apple,milano,34
pear,milano,34
plum,trento,89
strawberry,milano,34
banana,nairobi,8
kiwi,trento,89

If you could give me a step by step solution for that dataset it would
be fantastic. I think this could be a good thing to write in the docs
too (I can help do that once I have understood your solution). But I
still believe this should really be handled by the software!

Antonin

Thad Guidry

Mar 5, 2017, 11:28:09 AM
to openr...@googlegroups.com
Ah, now that you have given me a good grid example, I can see that you really don't have Records in the traditional sense, but just Row data with a need for holding the original value at each row (not at each Record, since in your case each Row IS YOUR Record)... So...

We live in a world where Copy Paste simplifies our lives for this use case.  In OpenRefine you don't want to lose your original values, so we use the idea of Copy Paste to accomplish that.  We can use "Add Column based on this column" whenever we want to do a Copy Paste of a Column in OpenRefine :)

So just copy paste your City column values to a new Column C to work with it even more ! ...while still keeping the original column B values so that you don't lose those original values.  (if things ever get cluttered with many columns, just View -> Collapse a few of them) 

With your grid populated, do 

1. Edit Column -> Add column based on this column...and call it copycity or whatever
2. Sort ... as text (a-z) and click OK
3. On the blue Sort link dropdown along the top of the grid, use "Reorder rows permanently"  <-- Important Step !!
4. On your copycity column, use Edit cells -> Blank down
5. Now you should have only 1 value for each city in your copycity column
6. On copycity, Edit column -> Add column by fetching URLs... using your GREL URL expression from before
7. Have fun and pay us back for the help by creating a small video showing this and give us a link so that I can put it on the wiki to help others in the future with a tutorial.
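In plain-code terms, steps 2 through 6 boil down to: fetch each distinct value once, then fill the cached result back onto every row that shares the value. Here is a minimal Java sketch of that idea (the names and the stubbed fetch function are illustrative, not OpenRefine's internal API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class FetchOncePerValue {

    /**
     * For a column of keys (e.g. the "city" column), call fetch at most once
     * per distinct key, then give every row the result for its key.
     */
    static List<String> fetchOnce(List<String> keys, Function<String, String> fetch) {
        // LinkedHashMap keeps first-seen order, much like sort + blank down would
        Map<String, String> cache = new LinkedHashMap<>();
        for (String key : keys) {
            cache.computeIfAbsent(key, fetch);   // network call only on a cache miss
        }
        // "Un-blank": broadcast each cached value back to all matching rows
        List<String> results = new ArrayList<>();
        for (String key : keys) {
            results.add(cache.get(key));
        }
        return results;
    }
}
```

With the milano/trento/nairobi dataset above, fetch would run three times for six rows, and all "milano" rows would end up with the same value.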

Here is the OpenRefine Project of the steps above that you can import into OpenRefine

-Thad

Antonin Delpeuch (lists)

Mar 5, 2017, 12:27:07 PM
to openr...@googlegroups.com
OK… Do you agree that my use case is quite common and that users might
not want to follow these 6 steps just to save a few HTTP requests?
Wouldn't some caching be useful then?

BTW about paying back, I've spent most of my weekend demonstrating
OpenRefine and Wikidata at a hackathon for Open Data Day:
https://github.com/sparcopen/open-research-doathon

A few outcomes:
- new Wikidata and OpenRefine enthusiasts!
- bug fixes in the reconciliation interface!
- updated OpenRefine doc about reconciliation, with pictures!
https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation

Cheers,
Antonin

Thad Guidry

Mar 5, 2017, 12:30:27 PM
to openr...@googlegroups.com
Oops, forgot the last step !

You'll then need to just Export to CSV and then Create a NEW project based on that CSV File.
Then with that new project,
On its city column,
  Edit Column -> Add column based on this column

   cell.cross("MyOriginalProject", "city")[0].cells["number"].value


Unfortunately, we don't have a way to do an INLINE CROSS() function while staying in the current project.
But I can see that being a cool feature for your use case if we did have that function...it would save you from having to even create a New project and then do cross() inside of it.

-Thad

Thad Guidry

Mar 5, 2017, 12:40:08 PM
to openr...@googlegroups.com
Yes, it would be useful...perhaps it can be a variant of the existing function, called "Add column by fetching Unique URLs". Behind the scenes it could use a simple HashMap to hold only the unique URLs needed, and maybe have a checkbox option on the dialog to also "copy retrieved value to other record rows", which would save you from the new project creation and the cross() function.

I don't have the necessary full Java chops to implement that but I bet others could hack on that feature, and I'd be glad to merge it in.
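A sketch of that HashMap idea (hypothetical class and method names, not OpenRefine's actual operation code): the row loop asks a small cache first, and only goes to the network on a miss:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Wraps a URL-fetching function so each distinct URL is fetched only once. */
public class CachingFetcher {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> httpGet;  // e.g. an HTTP GET returning the body
    private int networkCalls = 0;                    // for illustration: count real fetches

    public CachingFetcher(Function<String, String> httpGet) {
        this.httpGet = httpGet;
    }

    public String fetch(String url) {
        // computeIfAbsent runs httpGet only when the URL is not cached yet
        return cache.computeIfAbsent(url, u -> {
            networkCalls++;
            return httpGet.apply(u);
        });
    }

    public int networkCalls() {
        return networkCalls;
    }
}
```

The cache only needs to live for the duration of one "Add column by fetching URLs" run, so a plain HashMap keyed by URL is enough; duplicate rows simply hit the cache.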

Thad Guidry

Mar 5, 2017, 12:50:46 PM
to openr...@googlegroups.com
Btw Antonin,
If you are inclined to hack on OpenRefine (and I would be happy to help guide you) on the Fetch Unique URLs feature...I'd probably choose implementing it with Apache Ignite's data grid and its on-heap/off-heap features, among other nice-to-have features. https://ignite.apache.org/features/datagrid.html

Thad Guidry

Mar 5, 2017, 12:59:03 PM
to openr...@googlegroups.com
OH WOW ! Thanks so much for updating our Wiki about Reconciling !!!

Antonin Delpeuch (lists)

Mar 6, 2017, 6:01:52 PM
to openr...@googlegroups.com
Isn't Ignite a bit overkill in that case? I mean, the cache will likely
never be distributed as it only exists within one "Add column by
fetching URLs" operation.

Plus, it does not seem to be used yet, so it would add a dependency.

Otherwise, a simple HashMap would seem to work…

Antonin

Thad Guidry

Mar 6, 2017, 8:08:12 PM
to openr...@googlegroups.com
Big Data ? :)

Yes, it's overkill, but I was thinking that others might want/need it for large lookup lists.  But pragmatically I agree, it adds more complexity for us where it's already a corner case anyways.  Those few that have billions of rows can lease some time on AWS instances, I suppose.

HashMap is our go-to class for dealing with things like this and we already use it in many areas of OpenRefine.
