Versioning


Marcus Ottosson

May 15, 2014, 5:08:20 PM
to open-m...@googlegroups.com

Currently, metadata associated with folders doesn't come with a versioning scheme, meaning metadata written today with version 0.5.3 might not be readable by future versions if they introduce incompatibilities.

How can we best store versions along with the metadata?

I see three possibilities of increasing complexity/flexibility: Globally, Locally and Per-entry.

1: Globally

Version is assigned across an entire system or across a hierarchy of directories.

1a: Environment

$ export OM_VERSION=0.5.3

Here, a system is hard-wired with a version. Changing the version invalidates all existing metadata and paves the way for future metadata.

Pros

  • Clean
  • Easy to understand
  • Simple to implement

Cons

  • No room for transitioning between versions
  • Very little room for flexibility

1b: Cascading root

$ /root/.om_version

Here, a sidecar file defines the version for all directories beneath it in the hierarchy.

Pros

  • Clean
  • Easy to understand

Cons

  • Little room for transitioning

2: Locally

Versions are stored on a per-directory basis.

$ /home/marcus/.meta/__version__

This would make each folder capable of distinguishing which version of Open Metadata is to be used in interpreting the contained metadata.

Pros

  • Flexible transition, folders could vary in versions across a hierarchy
  • Quite easy to understand

Cons

  • Difficult to implement
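
As an aside, here's a minimal sketch of how option 2 could be resolved at read time; the __version__ file name follows the example above, and none of this is existing OM code:

import os

def local_version(folder):
    """Return the OM version recorded for this folder's metadata, if any."""
    version_file = os.path.join(folder, ".meta", "__version__")
    if os.path.isfile(version_file):
        with open(version_file) as f:
            return f.read().strip()
    return None  # no metadata here, or written before versioning existed

# e.g. local_version('/home/marcus') -> '0.5.3'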

3: Per-entry

Versions are stored together with their respective entry, as meta-metadata.

$ /home/marcus/.meta/age.int
$ /home/marcus/.meta/__version__/age

Pros

  • Very flexible

Cons

  • Less easy to understand
  • Very difficult to implement
  • Possibly chaotic

I'm leaning more towards method 2, Locally.

What do you guys think?

Best,
Marcus

Krzysztof

May 15, 2014, 7:37:10 PM
to Marcus Ottosson, open-m...@googlegroups.com
Hmm…
I’d say go with 1a or 1b! 

I think that a global version is the simplest and cleanest solution.

My thinking process:

What if something changes between versions 1.3 and 1.4, and for instance OM no longer accepts files like *.String for a variable but instead looks for *.s files?

-> Then, if the current version is less than 1.4, an upgrade script is run for a location specified by the user (or as a mass global change) to convert *.String files into *.s

This example may be a bit dumb :), but hopefully it illustrates the idea.
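
To make the idea concrete, here's a minimal sketch of such an upgrade script in Python; the *.String -> *.s rename is the hypothetical example above, and the .meta layout assumptions are mine:

import os

def upgrade_meta(root, old_suffix=".String", new_suffix=".s"):
    """Walk `root` and rename old-format entries inside every .meta folder."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Only touch Open Metadata containers (and anything nested in them)
        if ".meta" not in dirpath.split(os.sep):
            continue
        for name in filenames:
            if name.endswith(old_suffix):
                old_path = os.path.join(dirpath, name)
                new_path = old_path[:-len(old_suffix)] + new_suffix
                os.rename(old_path, new_path)

# e.g. upgrade_meta('/projects/current')  # location specified by the user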


Following that thought, we should now ask: how far could the changes go, and how hard would it be to write an upgrade script?

-> Well, even assuming that the whole OM was rewritten from scratch, using some new fancy code, I THINK that the metadata stored with previous versions won’t be much different.
Metadata is already so atomic that there isn’t really much room for it to move around.

So in summary, even if the OM code changed immensely, I think that the changes to how the data is stored will be small enough for an upgrade script to handle.

my 2c,

kk


Marcus Ottosson

May 16, 2014, 1:48:26 AM
to open-m...@googlegroups.com, Marcus Ottosson
An update script is a good idea. It would keep the code from needing to be backwards- and forwards-compatible by just modifying existing metadata, keeping the code much cleaner (fewer lines) and safer (all metadata has the same version, so no unexpected behaviour).

However this may not always be possible; for example, consider a project that has been going on for the past 9 weeks/months and has 1 week/month left to go. The shop upgrades its OM library to something that *shouldn't* cause any backwards-compatibility issues (but you never know). In this project, all is well, but due to the update, they must run an update script over their entire datastore. Here are some risks.

Risk 1. The script might fail, for whatever reason (software or hardware), in which case:
Risk 1a. The script must include a fail-safe, meaning complexity.
Risk 2. Archived projects are inaccessible, as they use an older version and can't be updated.
Risk 3. In the case of a long-running update, should metadata remain available or should it be taken down?
Risk 3a. If taken down, no one can do work until it's finished.
Risk 3b. If it's available, the script will have to include asynchrony, meaning complexity.

An update script would be really great for rarer occasions, such as:

Update 1. The shop requires a feature-addition; solved by a new feature (e.g. meta-metadata)
Update 2. Internal syntax changes (e.g. time-stamps in history were insufficiently global and now come with a BST stamp)


Thoughts?

Best,
Marcus

Sebastian Thiel

May 16, 2014, 3:12:13 AM
to open-m...@googlegroups.com, Marcus Ottosson
As I understood it, it's entirely possible for different versions of OMD to write pieces of meta-data in a very granular way. Thus a string at '/foo#mystring' could be written in binary compressed form by OMD 2.0, whereas OMD 1.0 uses plain text for '/foo#myotherstring'.

To truly solve the problem, each value would need some sort of version information attached.
It seems prohibitively expensive to add meta-metadata into yet another file.

For that reason, the best protection might be to enforce backward compatibility in code. That means that if the schema/layout/format changes, there must be code to detect the different versions and choose the right one for reading. It would only write the latest format version, keeping old meta-data as is but placing new meta-data in the new format.
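
A minimal sketch of that read-side dispatch, assuming a hypothetical pair of formats (a bare plain-text value versus a JSON wrapper carrying a version field); none of these names come from actual OM/OMD code:

import json

def read_value(path):
    """Detect the on-disk format and pick the matching reader.

    Writers always emit the newest format; readers keep understanding
    the older ones.
    """
    with open(path) as f:
        raw = f.read()
    try:
        wrapped = json.loads(raw)
        if isinstance(wrapped, dict) and "version" in wrapped:
            return wrapped["value"]  # newer, versioned format
    except ValueError:
        pass
    return raw  # older plain-text format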

This is how git does it, for example. Additionally, they have a marker per repository telling them about the repository version. The latter might not be possible in OMD, as each value could possibly be written by a different OMD version. Maybe it's a good thing to write one of these 'schema' versions per .meta folder, to indicate the general layout, or feature support, within that folder. For instance, there might be different ways of encoding meta-meta-meta data between 'schema' versions.

Best,
Sebastian

Marcus Ottosson

May 16, 2014, 3:32:32 AM
to Sebastian Thiel, open-m...@googlegroups.com

Thanks Sebastian.

Could you illustrate how you envision the per-file version looking?

This is how I’m thinking; currently, all data is written without meta-metadata of any sort:

mydata.string

"this is my string"

With versions per-file, it could instead look like this

mydata.string

{
    "value": "this is my string",
    "version": "0.5.3"
}

Initially, I was thinking that keeping the versions separate from the data would mean that reading and writing data wouldn’t be affected by version bookkeeping; reading this mydata.string is obviously heavier and involves more parsing (as it is a dict rather than a string). I was thinking that versions could get read only when necessary, possibly upon user request etc.

But then it struck me that for this to be truly water-proof, versions would always have to be read and written for every entry.

The only disadvantage I can think of at the moment is that neither folders nor blobs can get the same treatment as native OM files:

$ /home/marcus/.meta/playblast.list/image1.jpg

Here, the folder playblast.list can’t be “impregnated” with versions, because it’s a folder and can’t be written to. The image1.jpg is binary and can’t be modified either.

However both of these could get version support with side-car files.

$ /home/marcus/.meta/__version__/playblast.list --> "0.4.3"
$ /home/marcus/.meta/playblast.list/__version__/image1.jpg --> "0.6.1"

For bullet-proofness, versions would still need to get read and written upon any new entry, so performance in this case would obviously be worse than in the above case. Performance aside, an added disadvantage is that multiple files would have to get written to/modified every time, making it possible for one to succeed and the other to fail, resulting in inconsistencies between data and their versions.
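
For illustration, a minimal sketch of how a reader might resolve the version under this hybrid scheme: embedded for native entries, sidecar for folders and blobs (the helper name and exact layout are assumptions, not existing OM behaviour):

import json
import os

def entry_version(meta_dir, entry_name):
    """Return the version string for an entry, or None if untracked.

    Native entries embed the version in their JSON body; folders and
    blobs fall back to a sidecar under __version__/.
    """
    entry_path = os.path.join(meta_dir, entry_name)
    if os.path.isfile(entry_path):
        try:
            with open(entry_path) as f:
                data = json.load(f)
            if isinstance(data, dict):
                return data.get("version")
        except ValueError:
            pass  # not JSON: a blob, or pre-versioning data
    sidecar = os.path.join(meta_dir, "__version__", entry_name)
    if os.path.isfile(sidecar):
        with open(sidecar) as f:
            return f.read().strip()
    return None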

Thoughts?

--
Marcus Ottosson
konstr...@gmail.com

Marcus Ottosson

unread,
May 16, 2014, 3:34:48 AM5/16/14
to Sebastian Thiel, open-m...@googlegroups.com
How about, if versions are kept as side-car files, the reading and writing of them is active by default but can be toggled off?

That way, it would be safe, until someone decides that performance outweighs safety. It would be explicit, and totally their own responsibility.
--
Marcus Ottosson
konstr...@gmail.com

Sebastian Thiel

May 17, 2014, 3:29:23 AM
to open-m...@googlegroups.com, Sebastian Thiel
I am not deep enough into the topic to think any further about it, and am sure you will find a good solution.

Marcus Ottosson

May 17, 2014, 5:49:32 AM
to Sebastian Thiel, open-m...@googlegroups.com

Yeah, there are a few things going on here about the future expansion of OM, things I haven’t really brought to light just yet. Mainly how meta-metadata is to be implemented and whether versioning should be part of it.

I see two alternatives; either versioning remains a separate mechanism, such as embedding it into each file:

age.int

{
    "value": "5",
    "version": "0.5.3"
}

In which case, versions would have to get read every time, which may be just what we want. However, in this case the file has metadata about it, i.e. meta-metadata, so should this be used as the main mechanism for meta-metadata?

The downsides of this as meta-metadata are:

  • Limited to plain-old-data
  • Limited to files
  • Limited to native OM files (i.e. no blobs)

Whereas the benefits are

  • Simple
  • Consistent
  • Mandatory version reading/writing = stability

The other alternative for meta-metadata, and thus also versioning, is this:

$ /home/marcus/.meta/age.int
$ /home/marcus/.meta/__version__/age/version.string  <-- e.g. "0.5.3"

This would then be the convention for meta-metadata; each entry - file or folder - would look up meta-metadata by key (“version”) and value (its corresponding entry-name, “age”). Also note the lack of suffix for “age” here; this is so that meta-metadata persists across type-changes.

$ /home/marcus/.meta/age.float  <-- changed from 5 to 5.90
$ /home/marcus/.meta/__version__/age/version.string  <-- meta-metadata persists

The benefits here are:

  • Meta-metadata can take any shape, blobs, nested and plain data
  • Applies equally well to folders
  • Applies equally well to blobs

  • Versions may be optionally read; e.g. they may be discarded during performance-critical tasks where versions can be guaranteed prior to use.

The disadvantages being:

  • An extra file being read and written for versioning.
  • Double underscores being reserved for meta-metadata, meaning they may not be used for user-defined metadata.
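
A minimal sketch of the suffix-less lookup described above, so the meta-metadata survives a type change of its entry (the directory layout follows the example above; the helper itself is hypothetical):

import glob
import os

def read_meta_metadata(meta_dir, entry_name, key):
    """Look up meta-metadata for an entry by key, regardless of the entry's suffix.

    e.g. read_meta_metadata('/home/marcus/.meta', 'age', 'version')
    would find /home/marcus/.meta/__version__/age/version.string
    and keep working after age.int becomes age.float.
    """
    pattern = os.path.join(meta_dir, "__%s__" % key, entry_name, key + ".*")
    matches = glob.glob(pattern)
    if not matches:
        return None
    with open(matches[0]) as f:
        return f.read().strip()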

Thoughts?






--
Marcus Ottosson
konstr...@gmail.com

Krzysztof

May 17, 2014, 7:48:09 AM
to Marcus Ottosson, open-m...@googlegroups.com
Ok, so just to comment on the scenario and the risks.

The shop will definitely not upgrade a version of OM during an ongoing project. It would probably refrain as much as possible from any upgrades whatsoever for any kind of software they’re using.
The pipeline should be tested and established before the actual project and remain stable throughout.
Imagine someone saying in the middle of a long-running project: “Wow guys, have you heard? Maya 2017 is out and may not be backwards compatible with 2016 that we’re using. C’mon, let’s switch!” :D

Ad Risk 1* ) Well, every application can fail for some kind of software/hardware reason; we just need to make sure that what we write takes different scenarios into account, and that’s it.
If a file could not be opened, modified or saved due to r/w access issues or network problems, we can catch it and create a dump file with a list of problematic spots, which the script can pick up later on and try once again.
If that fails, well, then the issue lies somewhere else and will be solved by other entities.

Ad Risk 2 ) That’s the same with any archives. If a shop wants to open a project from way back when they used Alias Maya 6, then I would assume they have Maya 6 somewhere as well.
In the case of OM it’s easier: people can actually have multiple versions of OM available, since it’s really lightweight, so it’s not a problem to have a default version of OM and also other versions for reading purposes.
Or just store OM with your project and archive it together.

Ad Risk 3* ) The upgrade can be scheduled to run at night, say 4 a.m. If someone is still using OM, let him take a coffee, or let him go home actually ;P
And I also wouldn’t expect the upgrade to take too long, even if there are a bazillion files to process.
We would just pass a dictionary with metadata locations created earlier, and then run the actual update modifying the contents.
Or, making it even better, OM could provide a list of every .meta location in some kind of .global file.
Right click on a folder -> “Register metadata”, or $ om register -p /tmp/.meta
Then we wouldn’t waste time looking for .meta on the file system. We would just pick the dict straight from the .global and bam, upgrade done :P
Obviously I haven’t thought this idea through that much, cause it may not be easy to do :D But yeah, anyway...
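
As an illustration only (the .global file and the register command are the hypothetical idea above, not an existing OM feature), such a registry could be as small as this:

import json
import os

REGISTRY = os.path.expanduser("~/.om_global.json")  # hypothetical location

def registered():
    """All known .meta locations, ready to feed to an upgrade script."""
    if not os.path.exists(REGISTRY):
        return []
    with open(REGISTRY) as f:
        return json.load(f)

def register(meta_path):
    """Remember a .meta location so an upgrade script can find it later."""
    locations = registered()
    if meta_path not in locations:
        locations.append(meta_path)
        with open(REGISTRY, "w") as f:
            json.dump(locations, f, indent=2)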

Ok, let’s get back on the ground.
For me it seems that an upgrade can be handled in a short bash script, since we’re just dealing with simple files. And that would be it :)

What do you think?


Marcus Ottosson

May 17, 2014, 7:54:41 AM
to Krzysztof, open-m...@googlegroups.com
Hey Krzysztof,

Good thoughts, I'll digest these and get back to you. Off the top of my head I'm thinking "I wish things were as easy as this".

I'm sure Sebastian would have a thing or two to say about this as well (he's been on the IT side of the fence). Let's hope he can spare a few more minutes.

Best,
Marcus
--
Marcus Ottosson
konstr...@gmail.com

Justin Israel

May 17, 2014, 5:57:35 PM
to open-m...@googlegroups.com, Krzysztof
It would seem beneficial to keep as much data in a single file as possible, as every time you split up data into more files you reduce your consistency and increase the system calls needed to interact with them. It may not be of much concern on a local disk, but on a network mount like NFS, where the currency of a filer's speed is operations per second, every extra op you have to add in order to read/write something increases your load and your latency. So for each field of data that you already break out into its own file, its corresponding metadata also has to incur the same number of ops. And ops also increase depending on your directory depth. And this compounds when you are talking about cascading reads of a location.
Regarding the comment about consistency, what I meant was that as you split up the data, you further reduce your ability to atomically change a body of metadata as a whole. The version information would be separate from its content. It might be better to keep the data together so that when you read a single file you can also verify its version from the same read. 

I might be talking about matters that are not currently a focus, but it always helps to think ahead at what the system might become. 

Marcus Ottosson

May 18, 2014, 5:36:36 AM
to Justin Israel, open-m...@googlegroups.com, Krzysztof

Hey Justin, thanks for your input.

I think I just discovered an amazing middle-ground, based on this: the container of metadata within a folder, the equivalent of a binary file.

Conceptually, entries (both files and folders) are the equivalent of variables in dynamically typed languages (e.g. Python):

# Which data-type does `my_variable` end up with? (spoiler: a boolean)
>>> my_variable.value = 'hello'
>>> my_variable.value = 5
>>> my_variable.value = True

Which means that the suffix of my_variable changes based on the type of data it holds. This means greater flexibility with what can be done by the library, at the cost of not being able to rely on absolute paths into a metadata container.

# Danger: suffix may change!
$ export IMPORTANT_PATH=/home/marcus/.meta/group.list/myvariable.string

Which brings me to the conclusion of the amazing middle-ground: the metadata container is today a folder with files, but may in the future be a compressed binary (such as HDF5), a database (like MongoDB), a cloud service (like S3) or, put simply, any datastore. What matters is the front-end (entries and their values), not the back-end.

Consider Maya: scenes are generally stored as compressed binaries but can also be stored as plain-text files, .ma. However, in this plain-text file each line is specific to the file as a whole, and not versioned individually:

maya2009_scene.ma

...

createNode maya2009Node 'awesomeName'
setAttr 'awesomeName.2009SpecificAttr' 15
...

These two lines are specific to the 2009 version of Maya. In the same manner, entries within a metadata container may be specific to the version of their container.

$ /home/marcus/.meta/.version  <-- 0.5.3
$ /home/marcus/.meta/length.float

Here, length is specific to 0.5.3 and only guaranteed to work with 0.5.3, with the possibility of forwards-compatibility based on the version of OM that reads from the container.

So, the middle-ground: how about storing the version at the container level?

I see the following benefits:

  • BEN1 - Conceptually aligns with future datastores

        # myfile.hdf5
        version=0.5.3
        age=5
        length=12
        height=1.87

  • BEN2 - Versioning would get written only once per container

        $ /home/marcus/.meta/.version  <-- 0.5.3
        $ /home/justin/.meta/.version  <-- 2.5.1

  • BEN3 - When reading, versioning could be asserted

        # Assume a version, and don't bother checking;
        # meaning less IO, meaning more performance and a safer op.
        >>> om.read('/home/marcus', 'age', version=(0, 5, 3))

  • BEN4 - Reading of version could be deferred

        # Here, three entries are being read in close proximity.
        # The library could collapse the three requests into one and
        # perform only one version-query.
        >>> om.read('/home/marcus', 'age')
        >>> om.read('/home/marcus', 'length')
        >>> om.read('/home/marcus', 'height')
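
To make BEN3/BEN4 concrete, here's a minimal sketch of caching the container-level version so several reads in close proximity only pay for one version lookup; the .version filename follows the example above, and om's internals are my assumptions:

import os

_version_cache = {}  # meta folder -> version string

def container_version(path):
    """Read <path>/.meta/.version once per container and cache it."""
    meta_dir = os.path.join(path, ".meta")
    if meta_dir not in _version_cache:
        try:
            with open(os.path.join(meta_dir, ".version")) as f:
                _version_cache[meta_dir] = f.read().strip()
        except IOError:
            _version_cache[meta_dir] = None  # container predates versioning
    return _version_cache[meta_dir]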

Thoughts?

It would seem beneficial to keep as much data in a single file as possible, as every time you split up data into more files you reduce your consistency and increase your system calls to interact with them.

This is a slightly different discussion (though not any less important), but in short this is where OM differs: on the back-end, many small files are preferred over fewer large files.

Best,
Marcus

Justin Israel

May 18, 2014, 6:12:40 AM
to Marcus Ottosson, open-m...@googlegroups.com, Krzysztof
That last part, about where OM differs, was really my only area of input. Can't add much to the versioning discussion other than I think it is great to have the goal of the swappable backend storage. Although I am not really clear what that would look like if it were a database backend. Paths would still be the way of referencing the data, but the backend translates that into the way to access it? Also, what happens if someone makes a change to data in a container using 0.5.4 instead of 0.5.3? Does that bump the version up for the container?


Marcus Ottosson

May 18, 2014, 10:06:41 AM
to Justin Israel, open-m...@googlegroups.com, Krzysztof

Although I am not really clear what that would look like if it were a database backend. Paths would still be the way of referencing the data, but the backend translates that into the way to access it?

The interface to OM looks something like this:

>>> om.read(path, metapath)

Where path is still an absolute path to a folder on disk (OM is about associating metadata to folders, nothing else) and metapath the hierarchical location (OM is hierarchical). For example

>>> om.read('/home/marcus', '/deeply/nested/data')

In the case of a database as a backend, depending on your schema, hierarchies may reside within individual columns or tables. OM would parse the given metapath into whatever is required of the datastore; SQL queries or what not. The key lies in the interface given by om.read().
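
A minimal sketch of that separation, with the metapath resolved by whichever backend is plugged in (both backend classes and the table schema are assumptions for illustration, not existing OM code):

import glob
import os

class FileBackend(object):
    """Resolves a metapath to files under <path>/.meta."""
    def read(self, path, metapath):
        # '/deeply/nested/data' -> <path>/.meta/deeply/nested/data.<suffix>
        base = os.path.join(path, ".meta", *metapath.strip("/").split("/"))
        matches = glob.glob(base + ".*")
        if not matches:
            raise KeyError(metapath)
        with open(matches[0]) as f:
            return f.read()

class SQLBackend(object):
    """Resolves the same metapath into a database query instead."""
    def __init__(self, connection):
        self.connection = connection
    def read(self, path, metapath):
        row = self.connection.execute(
            "SELECT value FROM metadata WHERE folder = ? AND metapath = ?",
            (path, metapath)).fetchone()
        if row is None:
            raise KeyError(metapath)
        return row[0]

def read(path, metapath, backend=FileBackend()):
    """The front-end stays the same regardless of the datastore behind it."""
    # e.g. read('/home/marcus', '/deeply/nested/data')
    # or   read('/home/marcus', '/deeply/nested/data',
    #           backend=SQLBackend(sqlite3.connect('om.db')))
    return backend.read(path, metapath)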

Also, what happens if someone makes a change to data in a container using 0.5.4 instead of 0.5.3? Does that bump the version up for the container?

Good question, and a definite design concern. :) If we go with a Maya-style approach, then yes, any change to a .ma file within a newer version of Maya re-tags the file with that newer version; and vice versa.

Another approach would be to say “no” and let the library decide, prior to reading or writing, whether it conforms to the given version.

>>> import old_openmetadata as om
>>> om.read('/home/marcus', '/newer/metadata')
VersionError: I am simply too old, I can't make sense of this metadata.

At which point you could potentially force a read:

>>> om.read('/home/marcus', '/newer/metadata', version=(3,0))
"Potentially scRAmbleD data"

Thoughts?

--
Marcus Ottosson
konstr...@gmail.com
