Currently, metadata associated with folders doesn't come with a versioning scheme, meaning that metadata written today with version 0.5.3 might not be readable by future versions if they introduce incompatibilities.
How can we best store versions along with the metadata?
I see three possibilities, in order of increasing complexity and flexibility: Globally, Locally and Per-entry.
Version is assigned across an entire system or across a hierarchy of directories.
$ setx OM_VERSION 0.5.3
Here, a system is hard-wired with a version. Changing the version invalidates all existing metadata and paves the way for future metadata.
Pros
Cons
$ /root/.om_version
Here, a sidecar file represents the version for all subsequent directories within this hierarchy.
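As a rough sketch of how such a lookup might work: start at the directory in question and walk upwards until a sidecar is found. The function name `find_hierarchy_version` is made up for illustration and is not part of OM; only the `.om_version` filename comes from the example above.

```python
from pathlib import Path
from typing import Optional

VERSION_FILENAME = ".om_version"  # filename taken from the example above


def find_hierarchy_version(directory) -> Optional[str]:
    """Return the version from the nearest sidecar at or above `directory`.

    Hypothetical helper: the closest sidecar wins, so a nested hierarchy
    can override the version set further up.
    """
    directory = Path(directory).resolve()
    for candidate in (directory, *directory.parents):
        sidecar = candidate / VERSION_FILENAME
        if sidecar.is_file():
            return sidecar.read_text().strip()
    return None  # no sidecar anywhere above this directory
```

Note the cost: every lookup may touch one file per level of the hierarchy unless the result is cached.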
Pros
Cons
Versions are stored on a per-directory basis
$ /home/marcus/.meta/__version__
This would make each folder capable of distinguishing which version of Open Metadata is to be used in interpreting the contained metadata.
Pros
Cons
Versions are stored together with their respective entry, as meta-metadata.
$ /home/marcus/.meta/age.int
$ /home/marcus/.meta/__version__/age
Pros
Cons
--
You received this message because you are subscribed to the Google Groups "Open Metadata" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-metadat...@googlegroups.com.
To post to this group, send email to open-m...@googlegroups.com.
Visit this group at http://groups.google.com/group/open-metadata.
For more options, visit https://groups.google.com/d/optout.
Thanks Sebastian.
Could you illustrate what you envision the per-file version looking like?
This is how I’m thinking. Currently, all data is written without meta-metadata of any sort:
mydata.string
"this is my string"
With versions per-file, it could instead look like this
mydata.string
{
"value": "this is my string",
"version": "0.5.3"
}
Initially, I was thinking that keeping the versions separate from the data would mean that reading and writing data wouldn’t be affected by version bookkeeping, since reading this mydata.string is obviously heavier and involves more parsing (it is a dict rather than a string). Versions could then be read only when necessary, possibly upon user request.
But then it struck me that for this to be truly watertight, versions would always have to be read and written for every entry.
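To make the embedded variant concrete, here is a minimal sketch. `write_entry`, `read_entry` and `OM_VERSION` are hypothetical names, not the current OM API; the point is only that value and version travel in the same file, so the version is parsed on every read whether it is needed or not.

```python
import json
from pathlib import Path

OM_VERSION = "0.5.3"  # assumption: version of the library doing the writing


def write_entry(path: Path, value) -> None:
    """Store the value together with the version that wrote it."""
    path.write_text(json.dumps({"value": value, "version": OM_VERSION}))


def read_entry(path: Path):
    """Return (value, version); both come out of the same parse."""
    payload = json.loads(path.read_text())
    return payload["value"], payload["version"]
```

Since there is a single write per entry, data and version cannot drift apart.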
The only disadvantage I can think of at the moment is that neither folders nor blobs can get the same treatment as native OM files:
$ /home/marcus/.meta/playblast.list/image1.jpg
Here, the folder playblast.list can’t be “impregnated” with versions, because it’s a folder and can’t be written to. The image1.jpg is binary and can’t be modified either.
However, both of these could get version support with side-car files.
$ /home/marcus/.meta/__version__/playblast.list --> "0.4.3"
$ /home/marcus/.meta/playblast.list/__version__/image1.jpg --> "0.6.1"
For bullet-proofness, versions would still need to be read and written upon any new entry, so performance in this case would obviously be worse than in the above case. Performance aside, an added disadvantage is that multiple files have to be written or modified every time, making it possible for one write to succeed and the other to fail, resulting in inconsistencies between data and their versions.
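A sketch of the side-car variant, following the path layout in the two examples above. The helper names are invented for illustration. Note the two independent writes in `write_blob`; nothing ties them together, which is exactly where the inconsistency can creep in.

```python
from pathlib import Path
from typing import Optional


def sidecar_path(entry: Path) -> Path:
    """Map <container>/<entry> to <container>/__version__/<entry>."""
    return entry.parent / "__version__" / entry.name


def write_blob(entry: Path, data: bytes, version: str) -> None:
    entry.write_bytes(data)              # first write: the blob itself
    sidecar = sidecar_path(entry)
    sidecar.parent.mkdir(exist_ok=True)
    sidecar.write_text(version)          # second write: can fail on its own


def read_version(entry: Path) -> Optional[str]:
    """Return the side-car version, or None if it was never written."""
    sidecar = sidecar_path(entry)
    return sidecar.read_text() if sidecar.is_file() else None
```

A reader has to be prepared for `read_version` returning None for an entry that does exist.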
Thoughts?
Yeah, there are a few things going on here about the future expansion of OM, things I haven’t really brought to light just yet; mainly how meta-metadata is to be implemented and whether versioning should be part of it.
I see two alternatives. Either versioning remains a separate mechanism, such as embedding it into each file:
{
"value": "5",
"version": "0.5.3"
}
In which case, versions would have to be read every time, which may be just what we want. However, in this case the file has metadata about itself, i.e. meta-metadata, so should this be used as the main mechanism for meta-metadata?
The downsides of this as meta-metadata are:
Whereas the benefits are:
The other alternative, for meta-metadata and thus including versioning, is this:
$ /home/marcus/.meta/age.int
$ /home/marcus/.meta/__version__/age/version.string <-- e.g. "0.5.3"
This would then be the convention for meta-metadata: each entry, file or folder, would look for meta-metadata by key (“version”) and value (its corresponding entry name, “age”). Also note the lack of suffix for “age” here; this is so that meta-metadata persists across type changes.
$ /home/marcus/.meta/age.float <-- changed from 5 to 5.90
$ /home/marcus/.meta/__version__/age/version.string <-- meta-metadata persists
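The mapping from entry to meta-metadata could be sketched like this. `version_path` is a hypothetical helper; it simply drops the suffix so that `age.int` and `age.float` resolve to the same location.

```python
from pathlib import PurePosixPath


def version_path(entry: str) -> PurePosixPath:
    """Locate the version meta-metadata for an entry, ignoring its suffix."""
    entry_path = PurePosixPath(entry)
    # .stem removes the last suffix: "age.int" and "age.float" both give "age"
    return entry_path.parent / "__version__" / entry_path.stem / "version.string"
```

Both `/home/marcus/.meta/age.int` and `/home/marcus/.meta/age.float` map to `/home/marcus/.meta/__version__/age/version.string`.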
The benefits here are:
Applies equally well to blobs
Versions may be optionally read; e.g. may be discarded during performance-critical tasks where versions can be guaranteed prior to using.
The disadvantages being:
Thoughts?
Hey Justin, thanks for your input.
I think I just discovered an amazing middle-ground, based on this: the container of metadata, within a folder, is the equivalent of a binary file.
Conceptually, entries (both files and folders) are the equivalent of variables in dynamically typed languages (e.g. Python):
# Which data-type does `my_variable` end up with? (spoiler: a boolean)
>>> my_variable.value = 'hello'
>>> my_variable.value = 5
>>> my_variable.value = True
Which means that the suffix of my_variable changes based on the type of data it holds. This means greater flexibility with what can be done by the library, at the cost of not being able to rely on absolute paths into a metadata container.
# Danger: suffix may change!
$ export IMPORTANT_PATH=/home/marcus/.meta/group.list/myvariable.string
Which brings me to the conclusion of the amazing middle-ground; the metadata container is today a folder with files, but may in the future be a compressed binary (such as HDF5), a database (like MongoDB), a cloud service (like S3) or, put simply, any datastore. What matters is the front-end (entries and their values), not the back-end.
Consider Maya: scenes are generally stored as compressed binaries but can also be stored as plain-text files, .ma. However, in this plain-text file each line is specific to the file as a whole, and not versioned individually:
...
createNode maya2009Node 'awesomeName'
setAttr 'awesomeName.2009SpecificAttr' 15
...
These two lines are specific to the 2009 version of Maya. In the same manner, entries within a metadata container may be specific to the version of their container.
$ /home/marcus/.meta/.version <-- 0.5.3
$ /home/marcus/.meta/length.float
Here, length is specific to 0.5.3 and only guaranteed to work with 0.5.3, with the possibility of forwards-compatibility based on the version of OM that reads from the container.
So, the middle-ground, how about storing the version at the container level?
I see the following benefits:
# myfile.hdf5
version=0.5.3
age=5
length=12
height=1.87
$ /home/marcus/.meta/.version <-- 0.5.3
$ /home/justin/.meta/.version <-- 2.5.1
# Assume a version, and don't bother checking;
# meaning less IO, meaning more performance and safer op.
>>> om.read('/home/marcus', 'age', version=(0, 5, 3))
# Here, three entries are being read in close proximity.
# The library could collapse the three requests into one and
# perform only one version-query.
>>> om.read('/home/marcus', 'age')
>>> om.read('/home/marcus', 'length')
>>> om.read('/home/marcus', 'height')
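One way the collapsing might look, sketched as an explicit batch call rather than anything automatic. `read_many` and `read_container_version` are hypothetical names, and the suffix-agnostic lookup via `glob` is an assumption about how entries would be resolved.

```python
from pathlib import Path


def read_container_version(container: Path) -> tuple:
    """Parse <container>/.version, e.g. "0.5.3" becomes (0, 5, 3)."""
    text = (container / ".version").read_text().strip()
    return tuple(int(part) for part in text.split("."))


def read_many(container: Path, names, version=None):
    """Read several entries with at most one version-query.

    If the caller passes `version`, the .version file is not read at all.
    """
    if version is None:
        version = read_container_version(container)  # one query per batch
    values = {}
    for name in names:
        # Assumed lookup: match the entry by name, whatever its suffix is.
        matches = sorted(container.glob(name + ".*"))
        values[name] = matches[0].read_text()
    return version, values
```

With a per-entry scheme the same batch would cost one version read per entry instead of one per container.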
Thoughts?
It would seem beneficial to keep as much data in a single file as possible, as every time you split up data into more files you reduce your consistency and increase your system calls to interact with them.
This is a slightly different discussion (though no less important), but in short this is where OM differs: on the back-end, many small files are preferred over fewer large files.
Best,
Marcus
Although I am not really clear what that would look like if it were a database backend. Paths would still be the way of referencing the data, but the backend translates that into the way to access it?
The interface to OM looks something like this:
>>> om.read(path, metapath)
Where path is still an absolute path to a folder on disk (OM is about associating metadata to folders, nothing else) and metapath the hierarchical location (OM is hierarchical). For example
>>> om.read('/home/marcus', '/deeply/nested/data')
In the case of a database as a backend, depending on your schema, hierarchies may reside within individual columns or tables. OM would parse the given metapath into whatever is required of the datastore; SQL queries or what not. The key lies in the interface given by om.read()
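A small sketch of what that separation could look like. Both backend classes are invented for illustration; the table-like one is just a dict standing in for a real database, and suffix handling is left out to keep it short.

```python
from pathlib import Path


class FolderBackend:
    """Entries are files under <path>/.meta, as in the current layout."""

    def read(self, path: str, metapath: str) -> str:
        parts = [part for part in metapath.split("/") if part]
        return Path(path, ".meta", *parts).read_text()


class TableBackend:
    """Stand-in for a database: one row per (path, metapath) pair."""

    def __init__(self, rows: dict):
        self.rows = rows

    def read(self, path: str, metapath: str) -> str:
        return self.rows[(path, metapath)]


def read(backend, path: str, metapath: str) -> str:
    """Same call and same arguments, wherever the data lives."""
    return backend.read(path, metapath)
```

The caller only ever sees `path` and `metapath`; how those are turned into file reads or queries stays inside the backend.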
Also, what happens if someone makes a change to data in a container using 0.5.4 instead of 0.5.3? Does that bump the version up for the container?
Good question, and a definite design concern. :) If we go with a Maya-style approach, then yes: any change to a .ma file within a newer version of Maya re-tags the file with the newer version, and vice versa.
Another approach would be to say “no” and let the library decide, prior to reading or writing, whether it conforms to the given version.
>>> import old_openmetadata as om
>>> om.read('/home/marcus', '/newer/metadata')
VersionError: I am simply too old I can't make sense out of this metadata.
At which point you could potentially force a read:
>>> om.read('/home/marcus', '/newer/metadata', version=(3,0))
"Potentially scRAmbleD data"
Thoughts?
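For reference, the check-then-force behaviour above could be sketched as follows. All names here are hypothetical, and the rule "refuse anything newer than the library" is only one possible policy.

```python
LIBRARY_VERSION = (0, 5, 3)  # assumption: version of the library doing the reading


class VersionError(Exception):
    """The container is newer than this library understands."""


def resolve_version(container_version, assumed=None):
    """Pick the version used to interpret a container.

    `assumed` plays the role of the `version=` argument above: when given,
    the check is skipped and the caller accepts possibly scrambled data.
    """
    if assumed is not None:
        return assumed
    if container_version > LIBRARY_VERSION:
        raise VersionError(
            "container is %r, library is %r" % (container_version, LIBRARY_VERSION)
        )
    return container_version
```

Versions are compared as tuples here, so (3, 0) counts as newer than (0, 5, 3).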