Server Configurations: What is Used in The Dataverse Community


Sherry Lake

Apr 5, 2021, 4:35:30 PM
to Dataverse Users Community
Hello,

I am evaluating a server upgrade for our Local Dataverse Repository (University of Virginia). 

My main question is: should we continue to use a local VMware host, or go to the cloud?

What are the advantages of going to the cloud? (I think I know, but would love to get your responses to verify what I think.)

If you are using the cloud, what cloud environment?

Thanks for any and all responses.

Thanks.
Sherry Lake


ofuuzo ofuuzo

Apr 26, 2021, 7:29:17 AM
to Dataverse Users Community
Hi Sherry,
Did you receive any answers to your question? We are planning to move our Dataverse installation, DataverseNO.no, to the cloud.

Thanks
Obi

Philipp at UiT

Apr 28, 2021, 7:25:19 AM
to Dataverse Users Community
Hi Sherry,

As my colleague Obi said, we're planning to move DataverseNO to the cloud.
Generally, I think one of the main reasons our IT department wants to move services to the cloud is that they don't want to spend resources on operating and maintaining local servers.
More specifically, one advantage of moving a Dataverse installation to the cloud could be better support for uploading and downloading larger files. Maybe some organizations already running their Dataverse instance in the cloud could add more advantages, and maybe also some disadvantages?

Best, Philipp

DAVID PIEDRA

Apr 29, 2021, 10:36:22 AM
to Dataverse Users Community
Hi everyone,

I was planning to ask something similar, so if you don't mind, I will contribute some personal opinions and add a few more questions :P

On "on premise" vs. "cloud", some points to take into account:
  • Price: right now it is probably cheaper to move to the cloud than to run an on-premise installation.
  • Resources: a priori, I would say it is going to be easier to scale up resources with cloud services (unless your IT department has "infinite" resources and a flexible infrastructure).
  • Network usage: the traffic generated by Dataverse (if it is widely used, which is the aim) can affect network performance at the institution (again, unless you have virtually unlimited resources).
  • Backups: setting up backups in the cloud is easier and cheaper; you do not need to worry about hardware, disks, backup cabinets, ...
  • Data ownership: with cloud solutions you are uploading your data to an outside party and, to some extent, you lose control over it. Data protection policies should be checked carefully.
  • Security risks: a service in the cloud is isolated from your network. If an attacker gains access to the server, the damage is "contained" to the Dataverse data. Then again, if your IT infrastructure uses a DMZ, or your security policies are strong enough, an on-premise solution should not be dangerous.
  • Which cloud solutions are better? I have used AWS and Google Cloud, but Azure is pretty cool as well. If you use Google Suite as your email service, Google Cloud may be a good fit, as you will have your services in the same place (though we can discuss whether that is good or bad). On price, I have heard that AWS is cheaper, but I am not sure (the price calculations tend to be so chaotic that I cannot understand the results).

That said... currently, at my institution, we are in a test phase, playing around with a small Dataverse instance running on Google Cloud. At the moment we are using an "old school" environment (no Docker, no Kubernetes), but in the near future we will need to think about a production test.

We have these two scenarios in mind:
  • On-premise installation: using an "old school" environment.
  • In the cloud: we can use an "old school" environment, or a cool Docker/Kubernetes one.

We will probably use a cloud solution (from my answers above it's clear I love it), but here is my question: for a new Dataverse installation, with no forecast of how much it will be used, do you recommend starting with an "old school" environment (a classical install, maybe with one server for the app and another for the database) or a Docker/Kubernetes solution?

Thanks in advance, and sorry Sherry, Obi and Philipp if my contribution adds some entropy to the thread.

danny...@g.harvard.edu

Apr 29, 2021, 3:34:28 PM
to Dataverse Users Community
Hi everyone, 

At Harvard, we run on AWS, with two app servers behind a load balancer. Storage is on AWS S3 and we use Glacier for backups.

Similar to what Philipp mentioned, one thing that may inform how people set up their installations is the infrastructure strategy of the institution supporting the service. At Harvard there's a big push to move things to the cloud, and we moved dataverse.harvard.edu to AWS because the data center we were using was getting shut down. :)

- Danny

Sherry Lake

Apr 30, 2021, 11:47:58 AM
to dataverse...@googlegroups.com
Hi Obi,

Here's one opinion (pros/cons) for going to AWS:

pros:
• AWS' hardware is FAST and they manage it
• S3 storage is cheap and backups via object versioning and bucket replication are easy
• RDS will run/manage Postgres and Postgres backups for you, for a fee
• AWS offers a serial console through their web interface in case of network trouble

cons:
• AWS is the worst choose-your-own-adventure book ever. do everything in a test capacity first because you will *always* wind up wanting to change one thing, but it's something they thought you should've done in the first place and you have to tear everything down and start over.
• you can save money by spinning up temporary and test resources in the cloud, then tearing them down. but if you're going to run permanent services, you're paying a private company to manage what you could do in-house, and given enough time that would absolutely cost more.
• AWS offers a serial console, but if things get totally hosed, that's not the same level of control VMware (or XCP-ng, I love XCP-ng) would grant you over the VM
• AWS charges egress fees for data downloaded - just ask Danny
• E-mail delivery is kind of a pain to set up, but works well once you slog through the setup
• RDS requires custom AWS Postgres extensions to be created inside your database. this raises vendor lock-in concerns for me. they also won't/can't export a DB whose "dump" file would exceed 500MB.
• AWS only allows 5 permanent "Elastic" ipv4s, instead they want to sell you "Amazon Route 53" - understandable and tenable, but another can of worms.
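
To make the cost trade-off in the cons above a bit more concrete, here is a minimal Python sketch that compares cumulative cloud spend (a flat monthly fee plus egress charges) with in-house spend (one-off hardware plus monthly running costs). Every figure and function name here is a hypothetical placeholder, not a real AWS or VMware price; plug in your own quotes before drawing any conclusion.

```python
def cumulative_cloud_cost(months, monthly_fee, egress_gb_per_month, egress_price_per_gb):
    """Total cloud spend after `months`: flat monthly fee plus egress charges."""
    return months * (monthly_fee + egress_gb_per_month * egress_price_per_gb)


def cumulative_onprem_cost(months, upfront_hardware, monthly_running_cost):
    """Total in-house spend after `months`: one-off hardware plus running costs."""
    return upfront_hardware + months * monthly_running_cost


def break_even_month(cloud_args, onprem_args, horizon_months=120):
    """First month in which cumulative cloud cost exceeds in-house cost.

    Returns None if the cloud stays cheaper over the whole horizon.
    """
    for month in range(1, horizon_months + 1):
        cloud = cumulative_cloud_cost(month, *cloud_args)
        onprem = cumulative_onprem_cost(month, *onprem_args)
        if cloud > onprem:
            return month
    return None


# Hypothetical inputs (assumptions for illustration only):
# cloud: 400/month fee, 500 GB egress per month at 0.09 per GB
# in-house: 6000 upfront hardware, 200/month running costs
example_cloud = (400, 500, 0.09)
example_onprem = (6000, 200)
print(break_even_month(example_cloud, example_onprem))
```

With these made-up numbers the cloud becomes the more expensive option a little after two years; with lower egress volume or higher in-house staffing costs the answer can flip, which is why it is worth running with your own figures.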


--
You received this message because you are subscribed to a topic in the Google Groups "Dataverse Users Community" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/dataverse-community/EZEQKw3gj-k/unsubscribe.
To unsubscribe from this group and all its topics, send an email to dataverse-commu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/fb841f17-e7b9-4cdc-acb3-5377c15487bbn%40googlegroups.com.

Philipp at UiT

May 3, 2021, 12:40:02 AM
to Dataverse Users Community

Thanks for your feedback, Sherry and Danny.
Danny: Any chance we could have a round on this on the community call tomorrow?
Best, Philipp

ofuuzo ofuuzo

May 3, 2021, 4:42:59 AM
to Dataverse Users Community
Thanks Sherry, Danny and David. 
Do you know if anyone has investigated or tested running Dataverse on an OpenStack cloud?

Cheers
Obi

Vyacheslav Tikhonov

May 3, 2021, 6:14:07 AM
to dataverse...@googlegroups.com
Hi Obi,

We upgraded Dataverse Kubernetes to version 5.3 for the EOSC Synergy partners; you can find the repo here: http://github.com/EOSC-synergy/dataverse-kubernetes/tree/5.3
It was successfully tested and deployed on OpenStack Cloud, AWS and Google Cloud (CESSDA Cloud).

Best,
Slava
DANS-KNAW R&D


danny...@g.harvard.edu

May 3, 2021, 9:56:34 AM
to Dataverse Users Community
Philipp, sure thing! I've added it to the agenda for Community Call 2. 

- Danny

Philipp at UiT

May 3, 2021, 12:30:34 PM
to Dataverse Users Community
Thanks, Danny!
- Philipp

DAVID PIEDRA

May 5, 2021, 9:59:48 AM
to Dataverse Users Community
About the chosen architecture (k8s or dedicated servers), two questions about k8s:

  • I am afraid that the k8s architecture, being a community-driven project, could be discontinued in the future. Would you choose this architecture for a production environment?
  • Is there any other advantage (besides scalability and flexibility) to choosing a k8s architecture instead of a "more classical" one, such as a load balancer, 2 frontend servers, and 1 database server?

Thanks,

David

Philipp at UiT

May 6, 2021, 12:08:42 AM
to Dataverse Users Community

Just a small comment on your fear that community-driven projects might be discontinued in the future. For all I know, CentOS, PostgreSQL, Apache, etc. are also community-driven projects, and to my knowledge a lot of organizations use these tools despite (or I'd say because of) that.

Best, Philipp

DAVID PIEDRA

May 6, 2021, 3:24:22 AM
to Dataverse Users Community
Sorry Philipp, I didn't mean to give the impression that I don't trust community-driven projects at all! In fact, they are sometimes a better option, as the improvements are constant. It's just a point to consider. In fact, a k8s project is hardly going to be discontinued, as k8s is becoming a standard for deploying apps.

That said, I would also emphasize that I am more interested in the second point.

Thanks for your reply, and sorry again if my words were misleading.

David

Donald Sizemore II

May 6, 2021, 4:37:06 AM
to dataverse...@googlegroups.com
Philipp,

CentOS is a prime example of what Mr. Piedra fears can happen to a community project!

Last year, CentOS 8 was just out, stable and promised support through 2029. This year CentOS 8’s EOL has moved to December 2021 and its “8-Stream” replacement is a less stable testing ground for RHEL. A decision by their Board of Directors following IBM’s acquisition of Red Hat.

I’m sorry I missed the community call discussion on this topic, as I’m interested in it. For my part at Odum, we’re still doing traditional deployments in virtual machines.

Whenever I hear “Docker,” I think “temporal.” Too many times have I had to stop Docker/podman, remove the entire overlayfs, rebuild the containers, and start over. For testing and for non-critical services this is tenable, and dataverse-k8s uses persistent volumes extensively to preserve state.

Maybe I just have too many grey hairs in my beard… I keep finding more of those.

Don

painstakingly pecked on my iphone.


Philipp at UiT

May 6, 2021, 1:00:03 PM
to Dataverse Users Community

Thanks, Don. Interesting to hear! I think it would be useful to keep sharing our experiences on this, maybe in a session at the next community meeting, on a community call, or at an IG/WG meeting.

It seems community-driven projects need some larger organizations committed to coordinating and sustaining the development and maintenance of the project, as Harvard and others do in the Dataverse project?

Best, Philipp

Crosas, Mercè

May 7, 2021, 3:20:47 AM
to dataverse...@googlegroups.com
Thanks all of you for the interesting discussion.  

The sustainability of open-source software (OSS) projects has been a hot topic recently. There is usually no single reason why one OSS project lasts and another doesn't; instead, a set of factors (Governance, Technology, Resources (Financial and Human), and Community Engagement) contributes to success, together with the right timing. One of the extensive studies in this area was done by the It Takes a Village project (see https://lyrasisnow.org/tag/it-takes-a-village/ and https://academiccommons.columbia.edu/doi/10.7916/D89G70BS). An extended, revised guidebook will be released this fall.

For the CentOS case, I'll share a comment from a colleague: "the open source community which does a lot of development for linux and core packages, didn’t have a RHEL flavor of Linux to work on that was up-to-date. CentOS always lagged behind RHEL. Now developers can develop on the OS that will become RHEL on the next iteration, so that packages are stable in the production filesystem going forward (instead of fixing things after changes come out in the OS)."

Merce

--
Mercè Crosas, Ph.D.
University Research Data Management Officer, Harvard University Information Technology
Chief Data Science and Technology Officer, Institute for Quantitative Social Science
Harvard University


Vyacheslav Tikhonov

May 7, 2021, 3:23:51 AM
to Dataverse Users Community
Hi David and others,

I understand your concerns about Docker and Kubernetes. However, this discussion is less about any particular technology and more about knowledge exchange, software reusability, and further adoption.

In the usual situation, where your software is manually installed on a server/VM, you are paying for maintenance (human resources): there is a lot of DevOps work involved in underlying OS upgrades, software security, etc.

Docker can be considered a reusable layer on top that allows the common infrastructure to be maintained and tested in one central place, with exactly the same setup deployed on different nodes. You basically pay for maintenance once, and others can reuse it for free. As soon as someone takes responsibility for maintaining things, it is less expensive for all parties. The more people who bring their own DevOps resources to work on the same thing, the cheaper it becomes. That's exactly why large companies like Google, Microsoft, Amazon, etc. make their technologies open source and let the community help maintain them.

The same goes for Kubernetes: it's a community-based ecosystem supported by all the big players. Currently Dataverse lives in isolation, but in the near future you can think about a Software as a Service (SaaS) layer integrated with Dataverse nodes and using them as a transparent layer to keep data FAIR. I'm pretty sure this will happen once Dataverse increases its own maturity with more unit and integration tests, a CI/CD pipeline, and more service integrations. There are a lot of data-driven projects that are supported by open source communities and have thousands of contributors; think of Apache Airflow, Superset, Kafka, etc.

I really believe Dataverse itself should become a widely used service for data management and will cross the chasm. I expect this to happen around the middle of next year (2022), when the Cloud maturity layer will be enough to convince both business and academia to use it. I gave a presentation on this topic some time ago; please watch it here if you're interested: https://youtu.be/bcQk6ykxjis?t=1260

Best,
Slava
DANS-KNAW R&D

DAVID PIEDRA

May 7, 2021, 3:28:35 AM
to Dataverse Users Community
Thanks for the reply Mercè.

Even though it's an interesting topic, I will reformulate the question about Dataverse on k8s (leaving aside the hypothetical project life cycle :P). For a new production Dataverse installation, what are the pros and cons of a k8s architecture versus a more classical one (web server + DB server + load balancer, ...)?

I would say that k8s is more flexible, but there are classical solutions (like instance groups in Google Cloud) that would also provide flexible, scalable environments.

Thanks in advance,

David

DAVID PIEDRA

May 7, 2021, 3:29:36 AM
to Dataverse Users Community
Sorry Slava, I sent my reply before yours had loaded!

Thanks

DAVID PIEDRA

May 20, 2021, 3:22:16 AM
to Dataverse Users Community
Hi Sherry,

Could you share the hardware specs of your Dataverse installation? Are you evaluating the server upgrade because of performance issues?

Thanks,

David

Sherry Lake

May 21, 2021, 1:35:30 PM
to dataverse...@googlegroups.com
Hello David,

We have a simple local VM host setup (the server is VMware, 1 CPU (one core), 16GB RAM) with local file storage attached to our server. My sysadmin basically took the installation documentation and files from the IQSS/Dataverse GitHub repository and installed them. We haven’t done any system customizations.

I would love to evaluate the performance issues, but unfortunately, our Library doesn't have many resources to devote to our Dataverse Repository. That's the main reason we went with "simple". It's just me and a sysadmin when I need upgrades. 

Looks like there is much faster file transfers if you go to the cloud, but that is not where my institution wants to go at this time.
 


DAVID PIEDRA

May 25, 2021, 4:04:20 AM
to Dataverse Users Community
Thanks for your feedback, Sherry! Right now we are in a similar scenario, so similar hardware will probably work for us. Thanks!!

James Myers

May 25, 2021, 6:56:48 AM
to dataverse...@googlegroups.com

Re: Looks like there is much faster file transfers if you go to the cloud, but that is not where my institution wants to go at this time.

 

FWIW: There are things like https://github.com/minio/minio that provide S3 over a local file store. Running that would allow you to use the direct S3 upload/download support in Dataverse on local hardware. I don’t have experience with this myself, but I believe TDL/TACC and Scholar’s Portal have both set up minio (not for main storage, but as part of larger data storage efforts).
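
As a rough illustration only: a small Python sketch that builds the JVM options one might pass to `asadmin create-jvm-options` to point a Dataverse file store at an S3-compatible endpoint such as minio. The option names follow the S3 storage section of the Dataverse Installation Guide as I understand it, so please check them against the guide for your version. The store id `minio`, the label, the bucket name `dataverse`, the endpoint `http://localhost:9000`, and both function names are hypothetical values chosen for the example.

```python
def s3_store_jvm_options(store_id, label, bucket, endpoint_url):
    """Return -D JVM option strings for an S3-compatible Dataverse file store.

    The option names are assumed from the Dataverse Installation Guide;
    verify them against the documentation for your Dataverse version.
    """
    prefix = f"-Ddataverse.files.{store_id}"
    settings = {
        "type": "s3",
        "label": label,
        "bucket-name": bucket,
        "custom-endpoint-url": endpoint_url,  # the local S3-compatible server
        "path-style-access": "true",          # commonly needed for non-AWS endpoints
        "upload-redirect": "true",            # direct upload to the store
        "download-redirect": "true",          # direct download from the store
    }
    return [f"{prefix}.{key}={value}" for key, value in settings.items()]


def escape_for_asadmin(option):
    """Escape colons, which asadmin treats as separators between JVM options."""
    return option.replace(":", "\\:")


# Hypothetical values for illustration only.
for opt in s3_store_jvm_options("minio", "MinIO", "dataverse", "http://localhost:9000"):
    print(f'./asadmin create-jvm-options "{escape_for_asadmin(opt)}"')
```

This only prints the commands rather than running them, so they can be reviewed (and credentials and bucket set up separately) before anything touches the application server.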

 

-- Jim

