Marcus
Hi Justin, and thanks a lot for your thorough reply. Where to start..

As per your added reply, Open Metadata is indeed an underlying component of asset management and not asset management itself. In fact, it may be helpful to think of it as a file format more than anything. The ".meta" folder underneath each hosting folder is then the file, and its contents are its binary layout; only in this case it isn't binary but plain text.

Then, who writes what, where, on which server and under which load is as separate a concern as it is when writing to any other file. The fact that it isn't binary may even mean more concurrency possibilities than if it were just a single binary blob.
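(To make the file-format analogy concrete, here is a minimal sketch of what such a layout could look like and how a naive reader might pull a value out of it. The folder names and the "string" suffix are illustrative assumptions, not the actual Open Metadata layout.)

# Hypothetical on-disk layout, illustration only:
#
#   /projects/spiderman/shot01/
#   |-- plate.exr                 <- regular content
#   |-- .meta/
#       |-- startFrame.string     <- plain-text metadata; the value is the file body

import os

def read_meta(folder, key, suffix="string"):
    """Naively read one plain-text metadata value belonging to `folder`."""
    path = os.path.join(folder, ".meta", "%s.%s" % (key, suffix))
    with open(path) as f:
        return f.read().strip()

# e.g. read_meta("/projects/spiderman/shot01", "startFrame") -> "1001"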
This could possibly be clarified in that document I sent, and so I separated this part into another document here to further drive its point home. Essentially, anything that handles logic around Open Metadata is put into a layer above, such as networking and concurrency handling.

I actually had this question about the conflicting names before I had gotten to that section (what happens if someone makes A.list / A.dict). But how would you suggest this situation realistically be prevented in the system, when the metadata lives on the filesystem right next to its source data? If people are free to set metadata, then they would be free to create conflicting names, since, like you said, there is no way for OM to stop it; it can't really control the filesystem data storage completely.

I'm actually not too worried about obscured metadata. For one, I feel Windows has already proven that accessing executables without specifying their extension works without hassle, even though there is the chance of one name having multiple extensions. I feel rather safe trusting the user with duplicity, and besides, there are many ways to keep track of whether or not this is happening, like warning messages or automated clean-ups.

The alternative, as with Windows executables, would be either to force the user to type a full name, including the extension, or to rely on another tactic for figuring out the type. And frankly I think both of them pose greater concerns (from a usability standpoint) than that of duplicity in names.

This sounds right and I have no comment on it.

It would probably want to allow concurrent reads, while locking only for writes (like database row locking). If you do atomic writes to the files, it would probably be much safer; like writing to a temporary file first and then moving it over the previous one in one system call. If you were writing a blob (an image, say), then you would probably have to write it in chunks under some circumstances, until it is fully written. Then you just move it over the top of the old one, so that anyone reading the previous one is still on the old inode, and any new readers see the new inode.

This, although interesting, sounds like we may be talking past each other. It sounds like MVC and is something I employ in other areas, but not for Open Metadata itself. I think I covered why in the top part of this reply.

To me it seems like the broker is an absolutely necessary component, unless people really don't care about race conditions in the data access. It would seem like, if you are dead set on a filesystem approach, you would even go the next step and remove write permissions to the metadata for *everyone*, and only allow the broker to do the writes. My friend Matt Chambers, whom I work with at Weta and who started the Plow render farm project, is a big proponent of the "3 tier system": Data Storage / Application Service / Client. As opposed to the 2 tier system, where clients have all the business logic on their side and modify the data storage directly (database queries, filesystem, ...), in the 3 tier system you move all the business logic to the Application Service and require that thin clients issue their requests through it. You gain a number of benefits from this structure. It means you can update the business logic without having to change out every client in play, since if you keep the interfaces compatible, they can continue just accessing the services.
You also move control of data access to the application, which is like your broker, letting it efficiently schedule, queue, notify, track, log, etc.
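(As an aside on the temporary-file-then-rename idea a few paragraphs up: here is a minimal sketch, assuming POSIX rename semantics. The helper name and the .meta layout are illustrative assumptions, not part of the Open Metadata spec.)

import os
import tempfile

def atomic_write(path, data):
    """Write `data` to `path` so readers only ever see the old or the new file.

    The bytes go to a temporary file in the same directory first, then
    os.rename() swaps it into place in a single system call; readers holding
    the old file keep the old inode, new readers get the new one.
    """
    directory = os.path.dirname(path)
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.rename(tmp, path)  # atomic on POSIX when tmp and path share a filesystem
    except Exception:
        os.remove(tmp)
        raise

# e.g. atomic_write("/projects/shot01/.meta/startFrame.string", "1001")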
I think you'll find that my view on what metadata is may differ from the traditional sense. To me, values such as "startFrame" on a shot are metadata just as much as playblasts, reference images and preview geometry are metadata to an asset.

As I got further down the spec, it sounded like a broker wouldn't really scale for handling the writes of massive data on behalf of the client. In my head I kept thinking of metadata as mainly values and not binary data. It's possible that an approach might be to have the application server act as a locking service for the various metadata locations. If you want to write tons of metadata, then you would probably have to acquire the write-lock on that location.

The qualifier is "anything about something", where "about" is what is important. I don't care much for what data-type we've happened to classify that data as (key/value pair, video, web address..).

This is a great idea! As per the Push/Pull pattern in the RFC, I had only thought of doing the temporary storage client-side, but with large files it would indeed become a problem once more, as the copying itself might take some time. Storing it in a temporary spot server-side would solve all that. Perhaps a four-step process? Memory --> local temporary spot --> remote temporary spot --> final destination, as caching locally could help overall network load, I think. This would however be something for RFC13/ZOM, the link from above.

OR you would acquire a unique id to refer to a private staging area, and once the write is complete, you can merge it into place.

I'd have to go with simplicity. As metadata, especially the writing of metadata, is hardly a performance-critical task, I put most of the effort involved in making it fast into making it usable instead.

Last question... What are the pros and cons of a filesystem based storage for the metadata as opposed to a document-based NoSQL database like MongoDB?
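(An aside on the staged write and private staging area discussed just above: a minimal sketch of how that could look. Everything here, including the function names, the staging directory and the use of uuid, is an assumption for illustration and not the RFC13/ZOM design.)

import os
import shutil
import uuid

STAGING_ROOT = "/server/staging"  # assumed server-side scratch area

def begin_staged_write():
    """Reserve a private, uniquely named staging area on the server."""
    ticket = uuid.uuid4().hex
    path = os.path.join(STAGING_ROOT, ticket)
    os.makedirs(path)
    return ticket, path

def commit_staged_write(staging_path, destination):
    """Merge the completed staging area into its final destination."""
    # The slow network copy happened into staging_path already; the final
    # step is a cheap move when staging and destination share a filesystem.
    shutil.move(staging_path, destination)

# Usage sketch:
#   ticket, staging = begin_staged_write()
#   ...copy the large blob into `staging` at whatever pace the network allows...
#   commit_staged_write(staging, "/projects/shot01/.meta/preview.blob")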
Querying, I believe, is quite a different topic than storage. The way I would approach fast querying would most likely be in the form of caching: either via indexing, which many OS's already do (quite successfully too, I have to say), or via a database, in which case the benefits are as you say.

Slow query capabilities - If you don't know the exact location to open and read, then it would probably be slow to say "What is the value of FOO for all items in my sequence that have color = blue?". For that you would have to do a filesystem scan of everything to find all of the items that have color = blue, and then return all the FOO values. This would be equivalent to a constant full table scan in a database on unindexed fields.
However, I see no reason to bog down the simplicity of metadata, not to mention hide all that data away in a strict binary format such as compressed database files, with the headaches that come with that (corruption, migration), for the sake of a feature when a querying mechanism can easily be layered on top.
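(To illustrate what "layered on top" could mean in practice, here is a minimal sketch of the brute-force filesystem query described above, written against the hypothetical .meta layout used earlier in this reply. A real implementation would put an index or cache in front of this; none of these names come from Open Metadata itself.)

import os

def _read(meta, key, suffix="string"):
    """Read one plain-text metadata value from a .meta folder, or None."""
    path = os.path.join(meta, "%s.%s" % (key, suffix))
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read().strip()

def query(root, where_key, where_value, select_key):
    """Naive, unindexed query: for every item under `root` whose metadata
    `where_key` equals `where_value`, yield the value of `select_key`.

    Equivalent to a full table scan; an index or cache would sit in front
    of this in practice.
    """
    for folder, _dirs, _files in os.walk(root):
        meta = os.path.join(folder, ".meta")
        if not os.path.isdir(meta):
            continue
        if _read(meta, where_key) == where_value:
            yield folder, _read(meta, select_key)

# e.g. dict(query("/projects/spiderman/seq01", "color", "blue", "FOO"))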
Speed of metadata read/writes is coupled to the source data filesystem performance - If someone is hammering the filesystem with normal source data, then your ability to read metadata is also affected.

This is true. But just because metadata resides next to the content doesn't mean it has to reside on the same file-system, or even the same computer. There are a few methods of separating bits on the hard-drive from their visual layout in the OS. One is RAID, which would be hardware-bound; another is to symlink, which is file-system-bound; a slightly higher-level one is to simply merge multiple hierarchies upon query, which would be software-bound.

Primary FS
|-- myFolder
|   |-- .meta

Shadow FS
|-- myFolder
|   |-- regularChild

From the user's perspective
|-- myFolder
|   |-- .meta
|   |-- regularChild

In either case, no hard-drive containing regular content would have to wake up to access metadata, and yet they appear in the same hierarchy. I think it's important to point out that I'm not necessarily advocating that the 1s and 0s should reside next to each other, even though that would of course be the most straightforward solution. The key thing is to keep content and metacontent logically coupled.
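(A minimal sketch of the software-bound option, merging the two hierarchies at query time; the two roots and the function name are assumptions for illustration only.)

import os

CONTENT_ROOT = "/mnt/content"    # regular source data
METADATA_ROOT = "/mnt/metadata"  # .meta folders, possibly on another disk or server

def listdir_merged(relative_path):
    """Present one folder to the user by overlaying the metadata hierarchy
    on top of the content hierarchy."""
    merged = set()
    for root in (CONTENT_ROOT, METADATA_ROOT):
        path = os.path.join(root, relative_path)
        if os.path.isdir(path):
            merged.update(os.listdir(path))
    return sorted(merged)

# e.g. listdir_merged("myFolder") -> ['.meta', 'regularChild']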
My point here is that it scales. Possibly even beyond that of databases, but at the very least to the same level performance-wise, depending on your implementation.
Marcus
Hey Justin, thanks for your reply.

This makes sense. How would you keep logging running while developing, without installing an event logger? Only install it by hand in tests and such?

I don't think libraries are supposed to install logging handlers. Otherwise they start outputting information outside of the library consumer's control. The person using the library would install the logging handlers they want, whether it be a StreamHandler, a log file, syslog, ... That is, handlers are usually installed by applications as opposed to libraries.
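(For reference, the usual Python convention looks something like the sketch below; the module layout is made up, but logging.NullHandler and logging.basicConfig are standard library.)

# Inside the library (e.g. a hypothetical openmetadata/__init__.py):
import logging

log = logging.getLogger("openmetadata")
log.addHandler(logging.NullHandler())  # stay silent unless the app opts in

def pull_something():
    log.debug("reading metadata...")  # emitted only if the app installed a handler

# Inside the application (or a test / dev session):
import logging
logging.basicConfig(level=logging.DEBUG)  # installs a StreamHandler on the root logger
# ...now the library's log messages show up during development.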
OM list and dict Groups would act like the Python list and dict. When accessed through code, they should behave like a list and a dict respectively, but are stored as groups with datasets.

What would be the recommended workflow for saving a list data structure?

>>> inputs = om.Group('inputs', parent=location)
>>> inputs.data = ['path1', '/path/2', 'another/path']
>>> print inputs.data
[String('path1'), String('/path/2'), String('another/path')]

Each dataset then provides a .data attribute for access to the equivalent Python data-type.
Here, setting data on groups directly is a convenience, as it converts whatever you give it to the equivalent dataset, if one can be determined (it looks at the Python data-type for clues). If it can't find an equivalent dataset, it is considered a blob (like trying to store a class, or any Python object really; it might try to store those as pickled binaries, but I haven't gone into blobs much yet - what do you think?)
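(Purely to illustrate the "looks at the Python data-type for clues" step, and not the actual Open Metadata implementation, such a conversion could be as simple as a mapping from built-in types to dataset classes, falling back to a blob.)

# Hypothetical stand-ins for dataset classes; not the real Open Metadata types.
class String(object):
    def __init__(self, value):
        self.value = str(value)

class Int(object):
    def __init__(self, value):
        self.value = int(value)

class Blob(object):
    def __init__(self, value):
        self.value = value  # stored opaquely, e.g. pickled to a file

CONVERSIONS = {str: String, int: Int}

def to_dataset(value):
    """Pick a dataset based on the Python data-type, falling back to a blob."""
    return CONVERSIONS.get(type(value), Blob)(value)

# to_dataset('path1') -> String, to_dataset(25) -> Int, to_dataset(object()) -> Blob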
The point of returning datasets rather than strings directly is both consistency and performance. These objects haven't been read from the database yet, but are mere handles to files that can potentially be read via om.pull(). This way, you could return arbitrarily large hierarchies without knowing or caring about how large their datasets are until you actually need to access a specific dataset. Additionally, some datasets couldn't be determined by looking at the content, such as Date and Null types (followed by Vector, Point and Matrix datasets etc). You could then use the OM objects for correlation with UI objects, so that each gets its equivalent display and editor.

A list or dict dataset would be considered a blob, and since blobs are not yet very developed you currently couldn't regain access to them very easily.

But is there a list/dict Dataset? I tried saving a list as a dataset but I had trouble figuring out how to read it back in again. It would just produce a Dataset object that couldn't be accessed fully. Is it the kind of thing where you would expect the API saving the list to serialize it to something and save it as a string type Dataset?
Here is how I see this go down. Any blob is stored as merely an absolute path to the file you assigned; i.e. the blob would first have to exist as a file or equivalent container capable of hosting "binary" data (it applies equally to binary formats). When attempting to read a blob, you would then be given this path in return, which you could treat however you would normally treat files (json.load or builtins.open etc).
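(Something like the following, then, for a list kept as a blob, assuming the description above where a blob holds just an absolute path and reading it hands that path back; the file location here is made up.)

import json

# Writing: serialize the list to a sidecar file yourself, then hand the
# path to the metadata layer, however the blob assignment ends up looking.
blob_path = "/projects/shot01/.meta/inputs.blob"  # made-up location
with open(blob_path, "w") as f:
    json.dump(['path1', '/path/2', 'another/path'], f)

# Reading: the metadata layer gives the path back; you decide how to treat
# the file, e.g. json.load for a serialized list.
with open(blob_path) as f:
    inputs = json.load(f)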
OM should remain very capable of storing arbitrary data, but if you're looking to manage a collection with OM, the recommended way of doing so would be to store its items as datasets; either by letting groups handle the conversion, or by first making the contents of your collection into datasets and assigning those to groups.

That's true. I haven't yet encountered an issue, but I can see where this could become a problem. The reason I made it this limited was mainly for tests against existing datasets within a group or location; remember we spoke earlier of conflicting names, as datasets are accessed only by name, excluding their suffixes? This test would ensure no duplicates could be created. But I suppose I could make such a test more explicit.

This equality test for nodes would seemingly produce a false positive for any Node in one location that has the same name as a Node in another location. Have you considered using a hash that takes into consideration both the Node type and its full path? (type, fullpath) Then you would be sure not to conflict even with Nodes living under different backend types (database, url, ...)
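(A minimal sketch of the (type, fullpath) idea; the Node class here is a hypothetical stand-in, not the Open Metadata one.)

class Node(object):
    """Hypothetical stand-in for an OM node, compared and hashed on (type, fullpath)."""

    def __init__(self, name, parent_path):
        self.name = name
        self.fullpath = parent_path.rstrip('/') + '/' + name

    def __eq__(self, other):
        return (isinstance(other, Node) and
                (type(self), self.fullpath) == (type(other), other.fullpath))

    def __ne__(self, other):  # needed on Python 2
        return not self == other

    def __hash__(self):
        return hash((type(self), self.fullpath))

# Two nodes named "color" under different locations no longer compare equal:
#   Node("color", "/shots/001") != Node("color", "/shots/002")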
Thought-provoking question; I had not considered it before. Blobs should be capable of being any size, but as you say, if they are too big then storing history could become a bottleneck.

What would be the approach for preventing excessive data explosion when history is applied to larger blobs? Would you use symlinks if the blobs are the same? Would there be control/config over which types to version/history and which types should not?
Spontaneously, here is how I see this go down.

Initially, history would only be supported for Datasets. Datasets are supposedly quite small (if not, I would consider the use of them inappropriate for the given situation, at least for now) and shouldn't produce any significant bottleneck. If they still do, then the make_history() method could be delegated to a separate process, possibly writing to another disk at first and ultimately transferring data once things calm down. I consider history to be one of the things that aren't necessarily dependent on high performance and whose access isn't required immediately upon creation, and it may therefore be delayed in favour of overall performance.

As for blobs, when and if they support history, what do you think about a Dropbox approach? I.e. de-duplicating large data by chopping it up into chunks and only storing the chunks that change as part of an item's history. Once stable, this could potentially be used for datasets too, although datasets, being mere text, would possibly fit better with a Git approach; i.e. doing string comparison.
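(A minimal sketch of that chunk-based de-duplication idea, assuming content-addressed chunks on disk; the chunk size, store location and function name are all illustrative assumptions.)

import hashlib
import os

CHUNK_SIZE = 4 * 1024 * 1024      # 4 MB, arbitrary
CHUNK_STORE = "/server/chunks"    # assumed content-addressed chunk storage

def store_version(blob_path):
    """Split a blob into chunks, storing only chunks not already present.

    Returns the list of chunk hashes, which is all a history entry needs to
    keep in order to reconstruct this version later.
    """
    manifest = []
    with open(blob_path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            digest = hashlib.sha1(chunk).hexdigest()
            target = os.path.join(CHUNK_STORE, digest)
            if not os.path.exists(target):  # unchanged chunks are shared across versions
                with open(target, "wb") as out:
                    out.write(chunk)
            manifest.append(digest)
    return manifest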
One question for you. As I'm thinking of making Open Metadata compatible with storing the data within a database in addition to file-systems (hence the separation of database access into a service.py module that is meant to run wherever the database is located, and accessed remotely if need be), are there methods of storing history in SQL-style databases in general? How do you store history currently? I have little experience with databases, so do go basic on me.