How to password-protect an on-the-net OpenRefine instance?


Nikhil VJ

Dec 12, 2018, 11:56:21 AM
to OpenRefine

Hi Friends,


I recently followed this very well-written guide:

http://jonathansoma.com/lede/algorithms-2017/servers/openrefine/


...to host an OpenRefine instance on my DigitalOcean account. There were some false starts (go directly to the last part, about making it run in the background; and it's normal for it to be frozen for 45 minutes during startup, so check back in an hour!), but it all worked out great. Now our team (all remotely located) has an on-the-cloud OpenRefine instance which we can all visit from a browser (even on our tablets!) to do a little data cleaning work now and then on our common projects. It has immensely boosted the amount of work we're able to get done, with no more sending project files back and forth. (By the way, you can just as easily launch it for everyone to use over your local wifi network, but my teammates aren't in the same place, so we needed to go cloud.) One major benefit is that I can juice up the server to very high performance for a day or a few hours when we know there'll be a lot of action, and afterwards wind it down to the most basic 2 GB level again. It only amounts to a few dollars. The guide also covers doing this on Amazon; I haven't tried that yet.


But there's also a big vulnerability: our OpenRefine instance is accessible to *anyone* who can get the short IP address and put a :3333 at the end. That guide gives instructions for IP-based restrictions, but they didn't cut it for us: we need different people from all over the place, even new people, to be able to access it every now and then, without somebody continuously having to go under the SSH hood and mess around with root privileges. Also, one person's IP address (including the top levels) changes depending on whether they're on home broadband, office wifi, a phone hotspot, a 4G dongle, a family member's phone hotspot, and so on. And then somebody's going to be at a cafe or an airport at 3am, and I'll have to give an entire airport access to our stuff, and I'll be sleepy and mess everything up by accident. Nope, wrong solution!


What would really work for us is a simple password-based access restriction when one visits our URL. I've seen that on several back-ends: you go to a URL you're "not supposed to" and you get a browser prompt demanding a username and password. I saw it ages ago, long before API keys and sophisticated server systems came about; you could put that restriction on a simple folder of files, so it's something old-school. I want to implement that on our server, specifically on this OpenRefine endpoint, which is simply http://[My digitalocean IP address]:3333 . But I'm not able to find out how. If anyone can point me in the right direction, I'd really appreciate it.
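
The browser prompt described above is usually HTTP Basic Authentication. As a rough illustration only, here is a small Python sketch of what the browser sends once a username and password have been typed in. The username and password are made-up placeholders, not anything from this setup.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header value sent after the login prompt."""
    # The credentials are joined with a colon and base64-encoded.
    # This is an encoding, not encryption.
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Hypothetical credentials, for illustration only.
print(basic_auth_header("refine-user", "change-me"))
```

Because base64 is trivially reversible, the password travels in readable form over plain http://, so this kind of protection is best paired with HTTPS.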


If anyone needs one-on-one help in starting OpenRefine on their cloud server account then you can ping me, I'll be glad to help you get set up.


By the way, a tip for coordination: we've set up a GitHub repo just for deciding amongst ourselves which choice to go with when clustering data. There's no code or anything; we're just using the Issues section. The interface is awesome and very helpful for keeping track of things. Take a look: https://github.com/answerquest/hyd-stop-names-cleaning/issues


One link:

- https://github.com/asapach/peerflix-server/wiki/How-to-put-a-password-on-peerflix-server - A guide that might be on to it, but again it doesn't look like the old-school thing I was referring to, and dear Lord, why so many ways to mess up royally!


Thad Guidry

Dec 12, 2018, 12:53:48 PM
to openr...@googlegroups.com

Nikhil VJ

Dec 12, 2018, 9:40:47 PM
to OpenRefine
Hi Thad,

Thanks for pointing me to the developer group conversation.

There are some things that may be too obvious to mention for people very familiar with managing servers, but others may be clueless about. So I'm listing some questions that are coming up for me. I will be exploring this in the coming week or two (have to work on some other things right now, R&D takes time!), will post back here if I find out any answers myself.


1. How do I begin if I've just SSH'd into my server? I'm logged in as root user, copying the prompt output below:

root@ubuntu-openrefine:~# ls
helloworld.txt  nohup.out  openrefine-3.1  openrefine-linux-3.1.tar.gz
root@ubuntu-openrefine:~# ls /
bin   etc         lib         media  proc  sbin  sys  var
boot  home        lib64       mnt    root  snap  tmp  vmlinuz
dev   initrd.img  lost+found  opt    run   srv   usr

It's an Ubuntu 16.04 droplet hosted on a digitalocean.com account.


2. Where do I save the file mentioned in that thread, whose contents run from <VirtualHost ... to </VirtualHost>?
What should it be named?


3. How do I set up a /path/to/my/password.htpasswd file on my server? Do I just make a simple text file with the password written in it? Wouldn't there be a username to set up as well? And would I have to create a user on the Ubuntu OS of my server? Which user group, what privileges? Does that user need write access to the OpenRefine data folder, ~/.local/share/openrefine/, which is presently under my root user?


4. I understand that after this I should run the OpenRefine instance normally, so it starts at the default http://127.0.0.1:3333/ endpoint instead of being available to all on 0.0.0.0. And the <VirtualHost ... file mentioned above will redirect internet visitors arriving at http://[MY_IP]:80 to the OpenRefine endpoint. Please correct me if I'm wrong.


5. After saving the configuration file in step 2, are there any services / programs to shut down and restart? Should I reboot the droplet/server from my digitalocean dashboard?


6. Are there any install commands for this? That thread notes:
"Note that you need the mod_proxy and mod_proxy_http installed (most linux distributions don't install those by default so you have to hunt for them). If you're building apache on your own, you have to enable them in your ./configure stage because they are not built by default."


7. Any "oh yeah I forgot to mention you also need to... " ?
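
On question 3: as far as I understand it, an Apache password file is a plain text file with one `username:hash` line per user. Those usernames live only in that file, so they are not Ubuntu accounts and need no permissions on the OpenRefine data folder. The usual way to create the file is the `htpasswd` tool (in the apache2-utils package on Ubuntu) rather than writing it by hand. The Python sketch below only illustrates the layout of one line, using the legacy `{SHA}` scheme that Apache accepts. That scheme is unsalted SHA-1 and is weak, so treat it as an illustration, not a recommendation. The username and password are hypothetical.

```python
import base64
import hashlib


def htpasswd_sha1_line(username: str, password: str) -> str:
    """Return one password-file line in the legacy {SHA} format."""
    # {SHA} entries are the base64 of the raw SHA-1 digest of the password.
    digest = hashlib.sha1(password.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    return f"{username}:{{SHA}}{encoded}"


# Hypothetical user, for illustration only.
print(htpasswd_sha1_line("alice", "password"))
```

The file never contains the password itself, only the username and a hash of the password.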


Additional: Risks of having multiple users:
I'll add here that yes, we have read the warnings and disclaimers, and we understand the overwriting issues with multiple people working on the same OpenRefine instance. We have our ways of coordinating to ensure that only one person is making edits at a time.
I wanted to ask how the data is saved internally once a [text facet > cluster] operation is done: is the entire dataset, or the entire column including rows we weren't dealing with, overwritten with a freshly minted version containing the changes (aka a full overwrite)? Or are only the affected rows / cells changed? If it's the latter, then we can still have some degree of concurrency.

Looking forward to setting this up!

Regards
Nikhil VJ
Pune, India


PS: Apologies if the formatting is out of whack. I'm composing this message from google groups, I've seen they don't render well in emails.

Thad Guidry

Dec 12, 2018, 10:18:36 PM
to openr...@googlegroups.com
Nikhil,

I don't have time to teach these things through email, nor would I want to.  Understand that EVERYONE's time is valuable.  This mailing list is for helping users with OpenRefine as it was designed to be used; you are taking it into areas it was not designed for, and asking those of us who designed it to help with that.

I understand this is challenging for you, so my suggestion is to continue learning, or perhaps to hire someone trusted through Upwork.com or Fiverr.com to assist and guide you.

Best of Luck!



Nikhil VJ

Dec 12, 2018, 10:44:40 PM
to OpenRefine
Hi Thad, that's totally fine, no worries.

I posted on the mailing list precisely to reach out to other users (and not the developer team) in case somebody already knows. So even my follow-up email with the questions was intended for the whole list and not just for you - I'm so sorry if I made it sound that way. I've been in your situation and can understand. I don't want to bug the development team members at all regarding this non-standard use case.. you folks please continue doing your amazing work and ignore this one.

Paul R Pival

Dec 13, 2018, 12:52:02 PM
to OpenRefine
Definitely beyond my area of expertise as well Nikhil, but you might find the following helpful for question 3:

or

Paul

Nikhil VJ

Dec 14, 2018, 11:29:48 PM
to OpenRefine
Hi Folks,

I contacted DigitalOcean support; they said sure, it's possible, and shared some guides for it. I'll get back here when I have something to share. I'm sharing the links they gave below in case someone else also wants to try this out.

I had asked another question that concerned multiple-users (but hey just imagine it's just a single-user thing only and I'm just being curious): How does OpenRefine save project data?  Full file over-write, or record-level ops?

If I facet > cluster > merge about 50 values in a column in a dataset having a million rows, then does the entire column (or table) get over-written with a fresh version? Or are only those specific records altered while the remaining data is untouched? I'm 75% sure it's record-level only, but want to double-check.

-----

Links shared by DigitalOcean support folks:

"This tutorial is a bit more updated and you may find it easier to follow. Please note that you will be able to skip Step 2, as you already have your backend server. 
https://www.digitalocean.com/community/tutorials/how-to-use-apache-as-a-reverse-proxy-with-mod_proxy-on-ubuntu-16-04
We also have a tutorial available about how to set up the password protection, which you can find here:
https://www.digitalocean.com/community/tutorials/how-to-set-up-password-authentication-with-apache-on-ubuntu-16-04
"
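
The reverse-proxy arrangement in these tutorials only protects OpenRefine if port 3333 itself is no longer reachable from outside, i.e. OpenRefine listens on 127.0.0.1 and Apache is the only public entry point. Here is a minimal Python sketch for checking that. Port 3333 comes from this thread; the function name is my own.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Run on the server, `port_is_open("127.0.0.1", 3333)` should be True once OpenRefine has finished starting. Run from a different machine against the droplet's public IP, the same check on port 3333 should come back False if the instance is no longer exposed directly.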

-Nikhil VJ
Pune, India

Tony Hirst

Dec 17, 2018, 11:59:29 AM
to OpenRefine
One way would be to proxy OpenRefine via JupyterHub (e.g. https://github.com/betatim/openrefineder ) and piggyback on JupyterHub's authentication.


--tony

Nikhil VJ

Dec 31, 2018, 12:25:32 PM
to OpenRefine
Hi folks,

Thanks a ton for the help. Happy to report that we've achieved password-protection of our online OpenRefine instance.

I followed the two guides shared by the DigitalOcean support team, but there were many places where I had to change something to get ahead. I'll compile a detailed walkthrough later on; for now I'm sharing the two main links:


And here's a comment with some details on the last leg.

Tony Hirst

Jan 7, 2019, 7:45:25 AM
to OpenRefine
Hi

I've just posted a recipe of my own for getting up and running with OpenRefine on Digital Ocean behind nginx simple auth. I think this is a minimal recipe, so it'd be good to compare notes with what you have.

I generally work at the personal/disposable application level, so good production practice is completely missing (not least because I don't really know what good practice is!).


On my to do list is another variant that will show how to achieve a similar effect using docker-compose, rather than installing "raw" Linux packages.
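
Whichever recipe is used (Apache or nginx), one quick way to confirm the password wall is actually up is to request the page without credentials and then with them. Below is a minimal Python sketch using only the standard library. The function names are my own, and any URL, username, or password passed in is a placeholder for your own values.

```python
import base64
import urllib.error
import urllib.request


def build_request(url, username=None, password=None):
    """Create a GET request, adding a Basic auth header when credentials are given."""
    request = urllib.request.Request(url)
    if username is not None and password is not None:
        raw = f"{username}:{password}".encode("utf-8")
        token = base64.b64encode(raw).decode("ascii")
        request.add_header("Authorization", "Basic " + token)
    return request


def http_status(url, username=None, password=None, timeout=5.0):
    """Return the HTTP status code for url; 401 means the server asked for a login."""
    request = build_request(url, username, password)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still what we want.
        return err.code
```

With the proxy in place, `http_status("http://[MY_IP]/")` would be expected to return 401, and the same call with a valid username and password would be expected to return 200.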

--tony

Thad Guidry

Jan 7, 2019, 9:44:16 AM
to openr...@googlegroups.com
Tony,

Feel free to add that blog link to our new Wiki page on Jupyter somewhere as a non-production example with nginx simple auth.



Tony Hirst

Jan 7, 2019, 2:12:57 PM
to OpenRefine
Thad

That post doesn't really have anything to do with Jupyter? It's about running OpenRefine on a remote host.

However, this post - https://blog.ouseful.info/2019/01/07/autostarting-a-headless-openrefine-server-in-repo2docker-with-start-config-file/ - on autostarting a headless OpenRefine server in MyBinder is relevant, and I have already added it to the OR/Jupyter wiki page ;-)

--tony

Tony Hirst

Jan 8, 2019, 11:26:32 AM
to OpenRefine
Following up on my tinkering from yesterday, here's a hopefully easier, cut-and-paste way of setting up an authenticated OpenRefine server on a Digital Ocean cloud host: https://blog.ouseful.info/2019/01/08/authenticated-openrefine-server-on-digital-ocean-redux/

--tony

On Wednesday, 12 December 2018 16:56:21 UTC, Nikhil VJ wrote: