Importing data stored remotely


Daniel Fenn

Sep 19, 2017, 9:33:57 PM
to OpenRefine
Hi,

I've never used OpenRefine before, and I'm trying to figure out if it will work for my use case. All of my data is stored on a remote cluster, and I do all of my work there. I've been able to run OpenRefine remotely and connect to it just fine, but I can't figure out how to import data from the remote machine that it's actually running on. I see there's a client-side option for importing data from the computer, but why no server-side option?

This functionality is essential for me to use the software, since I can't work with my data locally. Ideally, I'd like to be able to run OpenRefine in parallel on the cluster, but whether I attempt that depends on whether I can even get the data imported.

Can anyone tell me if this is possible?

Thanks,

Dan

Thad Guidry

Sep 19, 2017, 10:27:59 PM
to openrefine
Nope.

OpenRefine was designed to work with data locally.  Then you can export the transformed data.

It sounds like you really want to use something like Apache NiFi https://nifi.apache.org/
which is really cool for working with data, moving it between systems, and transforming it.  This is called ETL: Extract, Transform, Load.
NiFi can ingest data from anywhere, by stream or by batch, writing to disk if there is not enough memory, and send it out to wherever it needs to go. NiFi is not a distributed computing platform, though; it moves data to and from such platforms, transforming it along the way if need be.
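
To make the Extract, Transform, Load idea concrete, here is a minimal sketch in plain Python. It is not NiFi and uses no NiFi API; the function names, the "name" column, the cleaning rule, and the batch size are all illustrative assumptions. It only shows the shape of the pattern: rows are read lazily, cleaned one at a time, and written out in small batches so the whole dataset never has to sit in memory.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator, List, TextIO


def extract(source: TextIO) -> Iterator[dict]:
    """Extract: lazily read rows from a CSV stream, one at a time."""
    yield from csv.DictReader(source)


def transform(rows: Iterable[dict]) -> Iterator[dict]:
    """Transform: trim whitespace and drop rows with an empty 'name'.

    The 'name' column and this rule are hypothetical; rows are assumed
    to have no more fields than the header.
    """
    for row in rows:
        cleaned = {key: (value or "").strip() for key, value in row.items()}
        if cleaned.get("name"):
            yield cleaned


def load(rows: Iterable[dict], sink: TextIO, fieldnames: List[str],
         batch_size: int = 2) -> int:
    """Load: write rows to the sink in batches and return the row count."""
    writer = csv.DictWriter(sink, fieldnames=fieldnames)
    writer.writeheader()
    rows = iter(rows)
    written = 0
    while True:
        # Pull at most batch_size rows so memory use stays bounded.
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        writer.writerows(batch)
        written += len(batch)
    return written


def run_etl(source: TextIO, sink: TextIO, fieldnames: List[str],
            batch_size: int = 2) -> int:
    """Chain the three stages; nothing is read until load() pulls rows."""
    return load(transform(extract(source)), sink, fieldnames, batch_size)
```

Calling `run_etl(open("in.csv", newline=""), open("out.csv", "w", newline=""), ["name", "city"])` would run the whole pipeline on files; the file names and columns here are placeholders.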

The Hadoop Summit 2016 and OSCON 2015 videos are excellent.

Data provenance can be disabled; see their wiki and community docs, and of course ask questions on their mailing list.

Good luck, Daniel. If you can tell me more about your problem, I'd be happy to help further.
