How do you deal with "too much data" in MyTardis?


Stephen Crawley

Sep 3, 2012, 11:09:34 PM
to tardis...@googlegroups.com
Hi Guys,

One of the things that MyTardis doesn't have (yet) is a generic answer to the problem of what to do when you have too much data.

I need to develop a solution for our (CMM's) installation, but before I launch into implementing this myself, I'd like to hear how other installations are dealing with this problem.

To set the scene, here's how I see the problem:
  •  We (or at least, our users) have an obligation to keep data for at least 5 years due to various policies and regulations.  It is not entirely clear whether this applies to all data, but if it doesn't we (CMM) are not in a position to decide which data should be kept ... from the regulatory perspective.
  •  We don't have enough local disc space to hold that much data online, and we don't have local hierarchical storage (and can't afford it).
  •  We should be able to get disc storage and / or tape space from external providers, on-campus or in "the cloud".
  •  External provision has the problem that we don't know if the service will still be available in N-years time, and we may not be able to control the costs in the long term.
My initial thinking is that there are two models for dealing with the problem from a technical perspective: the "archive" model and the "migrate" model.

In the "archive" model, a suitable unit of data (e.g. a dataset) would be turned into an archive object (a ZIP / TAR file with a suitable manifest), and the archive object copied to the external storage provider, where it is kept online or offline.  Once that has been done with sufficient confidence that we can get it back, the data files are deleted from primary storage.  Ideally (IMO) the MyTardis experiment/dataset/datafile metadata should be kept online, to allow the user to discover that the data exists and to record where the archive has been sent.  I would envisage that the user could request that the archived data be automatically restored ... subject to site-specific policies, etc.
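To make the "archive object" idea concrete, here is a minimal Python sketch; the function name and the manifest layout are hypothetical illustrations, not existing MyTardis code:

```python
import json
import tarfile
from pathlib import Path

def archive_dataset(dataset_dir: str, archive_path: str) -> str:
    """Bundle a dataset directory into a gzipped TAR archive with a
    simple JSON manifest, returning the path of the archive created.

    Hypothetical helper for illustration only: a real implementation
    would also record checksums and the archive's external location
    against the MyTardis dataset metadata.
    """
    dataset_dir = Path(dataset_dir)
    files = [p for p in dataset_dir.rglob("*") if p.is_file()]

    # The manifest makes the archive self-describing, so the data is
    # still identifiable even if the MyTardis database is lost.
    manifest = {
        "dataset": dataset_dir.name,
        "files": [{"name": str(p.relative_to(dataset_dir)),
                   "size": p.stat().st_size} for p in files],
    }
    (dataset_dir / "MANIFEST.json").write_text(json.dumps(manifest, indent=2))

    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(dataset_dir, arcname=dataset_dir.name)
    return archive_path
```

Only after the archive has been copied out and verified would the local data files be deleted.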

In the "migrate" model, individual data files would be copied to external storage, and the MyTardis URL for the Dataset_File is updated to point to the new location.  This allows the user to access the data immediately, but it requires that there is something running on the external storage system that can deliver the data file in response via HTTP, FTP or whatever when MyTardis or the user's browser resolves / fetches the URL.  (In this model, the external storage could be a hierarchical storage system ...)
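A rough sketch of the migrate step, assuming a Dataset_File-like object with a local-path `url` field and a `save()` method (the real MyTardis field names and the copy transport will differ per site):

```python
import os
import shutil

def migrate_datafile(datafile, external_root, external_base_url):
    """Copy one data file to external storage and repoint its URL.

    Sketch only: `datafile` stands in for a Dataset_File-like record;
    `external_root` is a mounted path standing in for whatever
    transport the provider actually offers (HTTP PUT, rsync, S3, ...).
    """
    filename = os.path.basename(datafile.url)
    # 1. Copy the bytes to the external store.
    shutil.copy(datafile.url, os.path.join(external_root, filename))
    # 2. Repoint the record so future downloads resolve externally.
    datafile.url = f"{external_base_url}/{filename}"
    datafile.save()
    # 3. Only after a checksum verification (not shown) would it be
    #    safe to delete the local copy.
    return datafile.url
```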

We probably need to support both models to give us the flexibility to adapt to changes in external provisioning, data volumes, storage costs, and regulatory issues.

The other issue is whether this should be core functionality or "app" functionality, and in the latter case, whether we can do this without modifying the MyTardis experiment/dataset/datafile schemas.

-- Steve
 
 


Mathew Wyatt

Sep 3, 2012, 11:31:06 PM
to tardis...@googlegroups.com
Hi Steve,

We are facing the same issue on another NeCTAR project (CATAMI) at iVEC: we don't have enough disk, but we do have plenty of tape.

We are currently thinking of using a caching approach, where down-sampled, web-viewable data is kept on disk but all raw data goes to tape (yet to be implemented though :) ).

A couple of comments:
 - If you are using a tape HSM back end, it is difficult (near impossible, depending on how old the technology is) to build responsive web interfaces on top. I've tried in the past, and failed miserably. We are currently thinking we will have to build the knowledge of that delay into our system: people invoke a download and are told they will receive an email when their data is 'actually' ready to collect from a staging area.
 - Using your 'migrate' model, storage admins will hate you if you invoke numerous individual stages of small files from tape. Traditionally they like a small number of stages of larger data files, which reduces the delay in their robots moving from one side of the room to the other.
 - There are no standard APIs for working with HSM systems as far as I'm aware, so depending on how many different storage systems you are planning to use, you may have a bit of work to do.
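The "email when it's actually ready" flow above could be sketched like this; every name is hypothetical, and `send_email` stands in for whatever mail hook a site already has:

```python
from enum import Enum

class StageState(Enum):
    REQUESTED = "requested"
    STAGING = "staging"   # the HSM is recalling files from tape
    READY = "ready"       # files have landed in the staging area

def handle_stage_complete(request, send_email):
    """Mark a staging request ready and notify the user by email.

    `request` stands in for a record with `user_email`, `staging_url`
    and `state` attributes; a real system would invoke this from the
    HSM's completion hook or a polling job.
    """
    request.state = StageState.READY
    send_email(
        to=request.user_email,
        subject="Your data is ready to download",
        body=f"Collect it from the staging area: {request.staging_url}",
    )
```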

I'd be interested in hearing how you go. We will be working with python/django as well, so there could be an opportunity to borrow each other's work.

Cheers,
Mat

Stephen Crawley

Sep 4, 2012, 1:33:42 AM
to tardis...@googlegroups.com
Mat,

Thanks for those tips.

For the record, in my "migrate" model, I was assuming that (hypothetical) HSM would be handled transparently by the storage provider, at whatever level of granularity they decided was appropriate.  I don't particularly want to get into dealing directly with HSM APIs, standard or otherwise.  That would be more appropriate for the "archive" model.

-- Steve

 

Mathew Wyatt

Sep 4, 2012, 1:50:48 AM
to tardis...@googlegroups.com
No worries Steve, please post your progress, we'd be genuinely interested to see how you approach your implementation...

Cheers,
Mat

Nigel Ward

Sep 4, 2012, 8:12:54 PM
to tardis...@googlegroups.com, Tom Fifield, Glenn Moloney
Hi Stephen,

Regarding your data storage models ... RDSI should in theory support both the "archive" and "migrate" models.  The RDSI Tinman http://www.rdsi.uq.edu.au/docs/ReDS_tinman_20120120.pdf suggests that nodes will "support highly active project activities through a 'Market' service, and longer-term archival for less actively-used collections through a 'Vault' service".

I have heard some RDSI nodes discuss supporting an AWS S3 or OpenStack SWIFT API that could meet your "migrate" requirement for a system that can "deliver the data file in response via HTTP".  The question then becomes whether you can get an RDSI allocation large enough to store your data.
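For the HTTP-delivery side, object stores typically offer expiring signed URLs (OpenStack SWIFT calls this TempURL; S3 has presigned URLs). Here is a toy illustration of the idea using only the standard library; it deliberately mimics the shape of SWIFT's scheme but is NOT the real S3 or SWIFT signature algorithm:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

def presigned_url(base_url, object_key, secret_key,
                  expires_in=3600, now=None):
    """Build an expiring, HMAC-signed download URL.

    Illustration only: a real object store signs and verifies on its
    side, letting MyTardis hand out time-limited links to data it no
    longer holds locally, without exposing the store permanently.
    """
    expires = int(now if now is not None else time.time()) + expires_in
    message = f"GET\n{expires}\n/{object_key}".encode()
    signature = hmac.new(secret_key.encode(), message,
                         hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "signature": signature})
    return f"{base_url}/{object_key}?{query}"
```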

I believe that Steve Androulakis was considering adding object storage support to MyTardis as part of his NeCTAR project, so there may be potential for joint effort there. Steve?

Regarding the archive model ... I have not heard any RDSI nodes discuss the API they would expose for their Vault services, but agree with Mat that it can be messy.  I do note that a few things are emerging which might help out down the track: 
- AWS Glacier http://aws.amazon.com/glacier/ which looks to me a bit like an object store, but with notification-based retrieval to deal with slow response times.
- The beginnings of a discussion of how OpenStack might support this type of Vault storage http://www.buildcloudstorage.com/2012/08/cold-storage-using-openstack-swift-vs.html
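The notification-based retrieval pattern that Glacier uses (initiate a job, be notified, fetch the output) might look roughly like this in outline; the client and its method names are illustrative stand-ins, not a real SDK API:

```python
def retrieve_from_vault(client, vault_name, archive_id, on_ready):
    """Start an asynchronous cold-storage retrieval and register a
    callback for when the data becomes available.

    Sketch only: Glacier's actual flow is initiate-job -> SNS
    notification -> get-job-output; this mirrors that shape without
    naming any real API.
    """
    # Kick off the retrieval; cold storage can take hours, so the
    # caller must not block waiting for it.
    job_id = client.initiate_retrieval(vault_name, archive_id)
    # When the store signals completion, hand the output to the
    # caller (e.g. copy it to a staging area and email the user).
    client.on_job_complete(
        job_id,
        lambda: on_ready(client.get_job_output(vault_name, job_id)),
    )
    return job_id
```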


Nigel Ward
Data management coordinator, eResearch Lab, The University of Queensland
phone: +61 7 3365 4553 | mobile: 0414 234 040 | email: n.w...@uq.edu.au



Steve Androulakis

Sep 4, 2012, 8:33:10 PM
to tardis...@googlegroups.com
Hi Nigel, Stephen,

Object storage support (i.e. SWIFT support) is still being considered for development under my NeCTAR Research Tool. At this point it's likely, but a few milestones (6 months or more) off.

Cheers,
Steve

Stephen Crawley

Sep 4, 2012, 10:26:00 PM
to tardis...@googlegroups.com
Hi Steve A & Nigel,

Since I need the functionality sooner than that, I'm expecting to do the implementation work myself, at least for one of the two models.  But I'm aiming to get this functionality into the official MyTardis code-base, so discussion and agreement on approaches and technical details will be essential.

Technologies such as S3, Glacier, SWIFT and so on are all well and good (and there are probably more), but what we >>also<< need is support within the MyTardis framework for knowing where the data is / has gone, what / when / how / where to "move" it, and what the user needs to do to get it back.
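One possible shape for that bookkeeping is a per-copy "replica" record plus a rule that tells the user how to get the data back. Purely illustrative; MyTardis has no such model today and the names are invented:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    """One known copy of a datafile (hypothetical record)."""
    url: str
    online: bool    # immediately fetchable, as in the "migrate" model?
    verified: bool  # checksum confirmed after the copy completed?

def access_plan(replicas):
    """Tell the user what getting the data back involves."""
    online = [r for r in replicas if r.online and r.verified]
    if online:
        return ("download", online[0].url)
    offline = [r for r in replicas if r.verified]
    if offline:
        # An archived copy exists: a restore must be requested first,
        # subject to site-specific policy.
        return ("request-restore", offline[0].url)
    return ("unavailable", None)
```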

FWIW, my current thinking is that the "migrate" model is going to be easiest to implement (and better for the users), so I'll probably address that side first.  Stay tuned ...

On the storage provisioning front, while the RDSI technology base could support this, that doesn't mean we can get RDSI storage.  I doubt that we'd satisfy all of the RDSI eligibility criteria.  But that's OK, because we should be able to get storage provisioning via an arrangement with the local node.  (That's what Graham Chen said to me, and based on my guesstimates, data volumes would be no problem.)  Commercial provisioning (e.g. via Amazon S3 and/or Glacier) is also an option, though I'm currently inclined to use that for "archive" only.

-- Steve C

