Hi Guys,
One of the things that MyTardis doesn't have (yet) is a generic answer to the problem of what to do when you have too much data.
I need to develop a solution for our (CMM's) installation, but before I launch into implementing this myself, I'd like to hear how other installations are dealing with this problem.
To set the scene, here's how I see the problem:
- We (or at least, our users) have an obligation to keep data for at least 5 years under various policies and regulations. It is not entirely clear whether this applies to all data, but if it doesn't, we (CMM) are not in a position, from the regulatory perspective, to decide which data should be kept.
- We don't have enough local disc space to hold that much data online, and we don't have local hierarchical storage (and can't afford it).
- We should be able to get disc storage and / or tape space from external providers, on-campus or in "the cloud".
- External provision has the problem that we don't know whether the service will still be available in N years' time, and we may not be able to control the costs in the long term.
My initial thinking is that there are two models for dealing with the problem from a technical perspective: the "archive" model and the "migrate" model.
In the "archive" model, a suitable unit of data (e.g. dataset) would be turned into an archive object (a ZIP / TAR file with a suitable manifest), and the archive object is copied to the external storage provider where it is kept online or offline. Once that has been done with a sufficient level of confidence that we can get it back, the data files are deleted from primary storage. Ideally, (IMO) the MyTardis experiment/dataset/datafile metadata should be kept online to allow the user to find out that the data exists, and to record where the archive has been sent. I would envisage that the user could request that the archived data be automatically restored ... subject to site-specific policies, etc.
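To make the "archive" model concrete, here's a rough sketch of the bundling step in Python (all names here are made up for illustration, not the real MyTardis schema or API): build a checksum manifest for a dataset directory, include it in the archive object, and return the manifest so the caller can verify the copy on external storage before deleting anything from primary storage.

```python
import hashlib
import json
import tarfile
from pathlib import Path

def archive_dataset(dataset_dir: str, archive_path: str) -> dict:
    """Bundle a dataset directory into a gzipped tar with a checksum manifest.

    Returns the manifest (relative path -> SHA-512) so the caller can
    re-verify the archive after it lands on the external store, *before*
    the local data files are deleted.
    """
    dataset_dir = Path(dataset_dir)
    manifest = {}
    for f in sorted(dataset_dir.rglob("*")):
        if f.is_file():
            rel = str(f.relative_to(dataset_dir))
            manifest[rel] = hashlib.sha512(f.read_bytes()).hexdigest()
    # Write the manifest into the dataset so it travels inside the archive.
    (dataset_dir / "MANIFEST.json").write_text(json.dumps(manifest, indent=2))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(dataset_dir, arcname=dataset_dir.name)
    return manifest
```

The experiment/dataset/datafile metadata would stay in the MyTardis database, with a record of where the archive object went, so a restore request can locate it later.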
In the "migrate" model, individual data files would be copied to external storage, and the MyTardis URL for the Dataset_File is updated to point to the new location. This allows the user to access the data immediately, but it requires that there is something running on the external storage system that can deliver the data file in response via HTTP, FTP or whatever when MyTardis or the user's browser resolves / fetches the URL. (In this model, the external storage could be a hierarchical storage system ...)
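For the "migrate" model, the key invariant is that the Dataset_File URL is only repointed after the external copy has been verified, so the user never sees a dangling reference. A rough sketch (again, the record layout and helper names are hypothetical; a real deployment would push to an HTTP/FTP-fronted store rather than a local directory):

```python
import hashlib
import shutil
from pathlib import Path

def migrate_datafile(record: dict, remote_dir: str, base_url: str) -> dict:
    """Copy one data file to 'external' storage and repoint its URL.

    `record` stands in for a Dataset_File row, e.g.
    {"url": "file:///data/img.tif", "sha512": "..."}. Here the external
    store is just another directory standing in for the remote service.
    """
    src = Path(record["url"].replace("file://", "", 1))
    dest = Path(remote_dir) / src.name
    shutil.copy2(src, dest)
    # Verify the copy before repointing; until then the old URL stays valid.
    if hashlib.sha512(dest.read_bytes()).hexdigest() != record["sha512"]:
        dest.unlink()
        raise IOError("checksum mismatch; leaving original URL in place")
    record["url"] = f"{base_url}/{src.name}"
    return record
```

After this, deleting the local copy is a separate, policy-controlled step, which would also make it easy to point at a hierarchical storage system that stages files back on demand.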
We probably need to support both models to give us the flexibility to adapt to changes in external provisioning, data volumes, storage costs, and regulatory issues.
The other issue is whether this should be core functionality or "app" functionality, and in the latter case whether we can do this without modifying the MyTardis experiment/dataset/datafile schemas.
-- Steve