Hi Friends,
I recently followed this very well-written guide:
http://jonathansoma.com/lede/algorithms-2017/servers/openrefine/
..to host an OpenRefine instance on my DigitalOcean account. There were some false starts (go directly to the last part about making it run in the background, and it's normal for it to appear frozen for about 45 minutes during startup - check back in an hour!) but it all worked out great. Now our team (all remotely located) has a cloud-hosted OpenRefine instance which we can all visit from a browser (even on our tablets!) to do a bit of data cleaning work now and then on our common projects. It has immensely boosted the amount of work we're able to get done, with no more sending project files back and forth. (By the way, you can just as easily launch it for everyone to use over your local wifi network, but in my case the teammates aren't in the same place, so we needed to go cloud.) One major benefit is that I can scale the server up to very high performance for a day or a few hours when we know there'll be a lot of activity, and afterwards wind it back down to the most basic 2GB level. It only amounts to a few dollars. The guide also covers doing this on Amazon; I haven't tried that yet.
But there's also a big vulnerability: our OpenRefine instance is accessible to *anyone* who can get the short IP address and put a :3333 at the end. The guide gives instructions for IP-based restrictions, but that didn't cut it for us, because we need different people from all over the place to be able to access it every now and then, even new people, without somebody continuously having to go under the ssh hood and mess around with root privileges. Also, one person's IP address (including the top levels) changes depending on whether they're connecting from home broadband, office wifi, a phone hotspot, a 4G dongle, a family member's phone hotspot, and so on. And then somebody's going to be at a cafe or an airport at 3am, and I'll have to give an entire airport access to our stuff, and I'll be sleepy and mess everything up by accident.. nope, wrong solution!
What would really work for us here is a simple password-based access restriction when someone visits our URL. I've seen that on several back-ends: you go to a URL you're "not supposed to" and you get a browser prompt demanding a username and password. I saw it ages ago, long before API keys, sophisticated server systems etc. came about - you could put that restriction on a simple folder of files, so it's something old-school. I want to implement that on our server, specifically on this OpenRefine endpoint, which is simply http://[My digitalocean IP address]:3333 . But I haven't been able to find out how. If anyone can point me in the right direction I'd really appreciate it.
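For anyone wondering what that old-school browser prompt is: it looks like HTTP Basic authentication. The server answers with a 401 status and a WWW-Authenticate header, and the browser then resends the request with an Authorization header containing base64("username:password"). Below is a minimal Python sketch of just the checking step, so the mechanism is visible. The function names, the "team"/"s3cret" credentials and the "OpenRefine" realm are made up for illustration; this is not part of OpenRefine or of any guide linked in this thread, and in practice a web server placed in front of port 3333 would do this for you.

```python
import base64
import hmac


def parse_basic_auth(header_value):
    """Return (username, password) from an Authorization header, or None."""
    if not header_value:
        return None
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic" or not encoded:
        return None
    try:
        # The payload is base64("username:password").
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes.
        return None
    username, sep, password = decoded.partition(":")
    if not sep:
        return None
    return username, password


def is_authorized(header_value, expected_user, expected_password):
    """Check the header against one hypothetical shared username/password."""
    creds = parse_basic_auth(header_value)
    if creds is None:
        return False
    # compare_digest avoids leaking information through timing differences.
    user_ok = hmac.compare_digest(creds[0].encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(creds[1].encode(), expected_password.encode())
    return user_ok and pass_ok


def challenge_headers(realm="OpenRefine"):
    """Headers to send with a 401 so the browser shows its login prompt."""
    return {"WWW-Authenticate": f'Basic realm="{realm}"'}


if __name__ == "__main__":
    # What a browser would send after the user types team / s3cret.
    header = "Basic " + base64.b64encode(b"team:s3cret").decode()
    print(is_authorized(header, "team", "s3cret"))
    print(challenge_headers())
```

Keep in mind that Basic authentication only base64-encodes the password, it does not encrypt it, so over plain http:// anyone on the network path could read it. It is usually combined with HTTPS.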
If anyone needs one-on-one help in starting OpenRefine on their cloud server account then you can ping me, I'll be glad to help you get set up.
By the way, a tip for coordination: we've set up a GitHub repo just for deciding amongst ourselves which choice to go with when clustering data. There's no code or anything - we're just using the Issues section - the interface is awesome and very helpful for keeping track of things. Take a look: https://github.com/answerquest/hyd-stop-names-cleaning/issues
One link:
- https://github.com/asapach/peerflix-server/wiki/How-to-put-a-password-on-peerflix-server - A guide that might be on to it, but again it doesn't look like the old-school thing I was referring to, and dear Lord, why so many ways to mess up royally!
Note that you need mod_proxy and mod_proxy_http installed (most Linux distributions don't install those by default, so you have to hunt for them). If you're building Apache on your own, you have to enable them at the ./configure stage because they are not built by default.
"This tutorial is a bit more up to date and you may find it easier to follow. Please note that you will be able to skip Step 2, as you already have your backend server.
https://www.digitalocean.com/community/tutorials/how-to-use-apache-as-a-reverse-proxy-with-mod_proxy-on-ubuntu-16-04
We also have a tutorial available about how to set up the password protection, which you can find here:
https://www.digitalocean.com/community/tutorials/how-to-set-up-password-authentication-with-apache-on-ubuntu-16-04
"