Behind the scenes


Marcus Ottosson

Apr 20, 2014, 6:25:38 AM
to open-m...@googlegroups.com
I was speaking with Justin off-list and thought I might add the conversation here for others to take part in. Some very interesting things have been said that have shaped the library into what it is today.

Thanks to Justin for letting me share his thoughts here.


Marcus

Hi Justin,

If you haven't yet read the spec, I'd love for you to read the updated version here. I had some thoughts about concurrency and potentially corrupted writes; it's under the header "Distribution and Concurrent Writes".


Just finished up some examples from the implementation here:

Marcus Ottosson

Apr 20, 2014, 6:27:51 AM
to open-m...@googlegroups.com

Justin

I did look at it earlier, but I just went over it again, and have some questions...

"It is up to the end-user to guarantee that names remain unique as there is no way for Open Metadata to enforce this constraint."

I actually had this question about the conflicting names, before I had gotten to that section (What happens if someone makes A.list / A.dict). But how would you suggest this situation realistically be prevented in the system, when the metadata lives on the filesystem right next to its source data? If people are free to set metadata, then they would be free to create conflicting names, as like you said there is no way for OM to stop it since it can't really control the fs data storage completely. 

"The recieving end (broker) would then manage the actual reads and writes to the file-system(s), thus ensuring that there is only ever one concurrent read and write happening "

It would probably want to allow concurrent reads, while locking only for writes (like database row locking). If you do atomic writes to the files, it would probably be much safer, like writing to a temporary file first and then copying it over the previous one in one system call. Like, if you were writing a blob (image), then you would probably have to be writing it in chunks under some circumstances, until it is written. And then you just move it over the top of the old one so that anyone reading the previous one is just on the old inode, and any new readers would see the new inode. 
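For illustration, a minimal sketch of that write-to-temp-then-rename pattern (the example path is made up; os.rename is only atomic when both paths live on the same filesystem):

import os
import tempfile

def atomic_write(path, data):
    # Write to a temporary file in the same directory, flush it to disk,
    # then rename it over the original in one system call. Readers holding
    # the old file keep the old inode; new readers see the new one.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, path)  # atomic on POSIX; os.replace() on Python 3 covers Windows too
    except Exception:
        os.remove(tmp)
        raise

# e.g. atomic_write('/shots/010/.meta/startFrame.int', b'1001')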

To me it seems like the broker is an absolutely necessary component, unless people really don't care about race conditions in the data access. It would seem like if you are dead set on a filesystem approach, you would even go the next step and remove write permissions for *everyone* to the metadata, and only allow the broker to do the writes. My friend Matt Chambers, whom I work with at Weta and who started the Plow render farm project, is a big proponent of the "3 tier system": Data Storage / Application Service / Client. As opposed to the 2 tier system, where clients have all the business logic on their side and modify the data storage directly (database queries, filesystem, ...), in the 3 tier system you move all the business logic to the Application Service and require that the thin clients issue their requests through it. You gain a number of benefits from this structure. It means you can update the business logic without having to change out every client in play, since if you keep the interfaces compatible, they can continue just accessing the services. You also move the control of data access to the application, which is like your broker, being able to efficiently schedule, queue, notify, track, log, etc. 

As I got further down the spec, it sounded like a broker wouldn't really scale for handling the writes of massive data on behalf of the client. In my head I kept thinking of metadata as mainly values and not binary data. It's possible that an approach might be to have the application server act as a locking service for various metadata locations. If you want to write tons of metadata, then you would probably have to acquire the write-lock on that location. OR you would acquire a unique id to refer to a private staging area, and once the write is complete, you can merge it into place. 

"Esoteric types"

It sounds like you are approaching the concepts of FUSE.

Although your proposal suggests you would just have client-side "handlers", as opposed to user-space mounts, that know how to read and possibly write the various types.

Last question... What are the pros and cons of a filesystem based storage for the metadata as opposed to a document-based nosql database like MongoDB? I can see that a filesystem approach means you don't need a functional database/broker to necessarily read the metadata, if it were to go down, or for situations where the data is transferred somewhere else. But then again, I see a lot of other things against the filesystem:
  • Slow query capabilities - If you don't know your exact location to open and read, then it would probably be slow to say "What is the value of FOO for all items in my sequence that have the color = blue?". For that you would have to do a filesystem scan of everything to find all of the items that have a color = blue, and then return all the FOO values. This would be equivalent to a constant full table scan in a database on unindexed fields. 
  • Speed of metadata read/writes is coupled to the source data filesystem performance - If someone is hammering the filesystem with normal source data, then your ability to read metadata is also affected. 
  • As mentioned earlier, you can't fully protect against how people will modify the filesystem outside of the OM API, as the metadata is stored next to the source data in plain view. So solutions to work around it would be needed, hence the broker / proxy / etc. 
  • Similar to the previous point, synchronizing (locking) dataset commits requires workarounds  
Just brainstorming here, playing devil's advocate. 

-- justin


Justin
 

The more I think about my questions, the more I wonder if I am confusing the goals of OpenMetadata with an underlying component of an Asset Management System. I reread the goal and it seems that it is not trying to be a query engine or a view into the entire project, so I am probably off the mark by asking questions related to this. Your proposal says it will provide metadata alongside the thing you already know you want. Some other system would probably be tracking the relationships and indexing them in a way that they can be tracked and observed as a whole. 
Maybe just the synchronization of the read/writes, and/or atomic writes is the important part.

Am I right or wrong in that assumption?

Marcus Ottosson

Apr 20, 2014, 6:28:26 AM
to open-m...@googlegroups.com

Marcus


Hi Justin, and thanks a lot for your thorough reply. Where to start..

As per your added reply, Open Metadata is indeed an underlying component of asset management and not asset management itself. In fact, it may be helpful to think of it as a file-format more than anything. The underlying ".meta" folder underneath each hosting folder is then the file, and its innards is its binary layout; only in this case, it isn't binary but plain-text.

Then, who writes what where and on which server and under which load is as separate as writing to any other file. The fact that it isn't a binary may even mean more concurrency possibilities than if it were just a single binary blob.

This could possibly be clarified in that document I sent, and so I separated this part into another document here to further drive its point home. Essentially, anything that handles logic around Open Metadata is put into a layer above; such as networking and concurrency handling.

I actually had this question about the conflicting names, before I had gotten to that section (What happens if someone makes A.list / A.dict). But how would you suggest this situation realistically be prevented in the system, when the metadata lives on the filesystem right next to its source data? If people are free to set metadata, then they would be free to create conflicting names, as like you said there is no way for OM to stop it since it can't really control the fs data storage completely. 

I'm actually not too worried about obscured metadata. For one, I feel Windows has proven already that accessing executables without specifying their extension works without hassle, even though there is the chance of one name having multiple extensions. I feel rather safe trusting the user with duplication, and besides, there are many ways to keep track of whether or not this is happening; like warning messages or automated clean-ups.

The alternative to this, as with Windows executables, would be to either force the user to type a full name, including the extension, or to rely on another tactic for figuring out the type. And frankly I think both of them pose greater concerns (from a usability standpoint) than duplicate names.

It would probably want to allow concurrent reads, while locking only for writes (like database row locking). If you do atomic writes to the files, it would probably be much safer, like writing to a temporary file first and then copying it over the previous one in one system call. Like, if you were writing a blob (image), then you would probably have to be writing it in chunks under some circumstances, until it is written. And then you just move it over the top of the old one so that anyone reading the previous one is just on the old inode, and any new readers would see the new inode. 

This sounds right and I have no comment on it.

To me it seems like the broker is an absolutely necessary component, unless people really don't care about race conditions in the data access. It would seem like if you are dead set on a filesystem approach, that you would even go the next step and remove write permissions for *everyone* to the metadata, and only allow the broker to do the writes. My friend Matt Chambers, whom I work with at Weta and whom started that Plow render farm project, is a big proponent of the "3 tier system": Data Storage / Application Service / Client. As opposed to the 2 tier system where clients have all the business logic on their side, and modify the data storage directly (database queries, filesystem, ...), in the 3 tier system, you move all the business logic to the Application service, and require that the thin clients issue their requests through it. You gain a number of benefits from this structure. It means you can update the business logic without having to change out every client in play, since if you keep the interfaces compatible, they can continue just accessing the services. You also move the control of data access to the application, which is like your broker, being able to efficiently schedule, queue, notify, track, log, etc. 

This, although interesting, sounds like we may be talking past each other. It sounds like MVC and is something I employ in other areas, but not for Open Metadata itself. Think I covered why in the top part of this reply.

As I got further down the spec, it sounded like a broker wouldn't really scale for handling the writes of massive data on behalf of the client. In my head I kept thinking of metadata as mainly values and not binary data. Its possible that an approach might be to have the application server act as a locking service for various metadata locations. If you want to write tons of metadata, then you would probably have to acquire the write-lock on that location.

I think you'll find that my view on what metadata is may differ from the traditional sense. To me, values such as "startFrame" on a shot are metadata just as much as playblasts, reference images and preview-geometry are metadata to an asset.

The qualifier is "anything about something", where the about is what is important. I don't care much about which data-type we've happened to classify that data as (key/value pair, video, web-address..).

OR you would acquire a unique id to refer to a private staging area, and once the write is complete, you can merge it into place. 

This is a great idea! As per the Push/Pull pattern in the RFC, I had only thought of doing the temporary storage client-side, but with large files it would indeed become a problem once more, as the copying itself might take some time. Storing it in a temporary spot server-side would solve all that. Perhaps a four-step process? Memory --> local temporary spot --> remote temporary spot --> final destination. Local caching could help network load overall too, I think. This would however be something for RFC13/ZOM, the link from above.
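A rough sketch of that server-side staging idea, just to make the steps concrete (the directory layout and helper names are hypothetical, not part of OM):

import os
import shutil
import uuid

STAGING_ROOT = '/server/om_staging'  # hypothetical scratch area near the final storage

def begin():
    # Reserve a private staging area and hand back its unique id.
    token = uuid.uuid4().hex
    os.makedirs(os.path.join(STAGING_ROOT, token))
    return token

def stage(token, name, source):
    # The slow part: copy data into the staging area, as many times as needed.
    shutil.copy2(source, os.path.join(STAGING_ROOT, token, name))

def commit(token, destination):
    # The quick part: merge the finished staging area into its final .meta
    # location. Only this step needs to be synchronised with readers.
    staged = os.path.join(STAGING_ROOT, token)
    for name in os.listdir(staged):
        shutil.move(os.path.join(staged, name), os.path.join(destination, name))
    os.rmdir(staged)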

Last question... What are the pros and cons of a filesystem based storage for the metadata as opposed to a document-based nosql database like MongoDB?

I'd have to go with simplicity. As metadata, especially the writing of metadata, is hardly a performance-critical task, I put most of the effort that could go into making it fast into making it usable instead.

Slow query capabilities - If you don't know your exact location to open and read, then it would probably be slow to say "What is the value of FOO for all items in my sequence that have the color = blue?". For that you would have to do a filesystem scan of everything to find all of the items that have a color = blue, and then return all the FOO values. This would be equivalent to a constant full table scan in a database on unindexed fields.

Querying, I believe, is quite a different topic from storage. The way I would approach fast querying would most likely be in the form of caching; either via indexing, which many OSes already do (quite successfully too, I have to say), or via a database, in which case the benefits are as you say.

However, I see no reason to bog down the simplicity of metadata, not to mention hide all that data away in a strict binary format such as compressed database files, with the headaches that come with that (corruption, migration), for a feature that can just as easily be layered on top as a separate querying mechanism.

Speed of metadata read/writes is coupled to the source data filesystem performance - If someone is hammering the filesystem with normal source data, then your ability to read metadata is also affected.

This is true. But just because metadata resides next to the content, doesn't mean it will have to reside on the same file-system or even computer.

There are a few methods of separating bits on the hard-drive from their visual layout in the OS. One is RAID, which would be hardware-bound; another is to symlink, which is file-system-bound; a slightly higher-level one is to simply merge multiple hierarchies upon query, which would be software-bound.

Primary FS
|-- myFolder
|   |-- .meta

Shadow FS
|-- myFolder
|   |-- regularChild

From the users perspective
|-- myFolder
|   |-- .meta
|   |-- regularChild

In either case, no hard-drive containing regular content would have to wake up to access metadata and yet they appear in the same hierarchy.
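A small sketch of the software-bound option, merging the two hierarchies at query time (the mount points and the merge rule are assumptions for the example):

import os

PRIMARY = '/mnt/meta'      # hypothetical: hosts the .meta folders
SHADOW = '/mnt/content'    # hypothetical: hosts the regular content

def listdir_merged(relative):
    # Union of a folder's children across both filesystems, so metadata can
    # live on separate (always-on) storage while appearing alongside content.
    children = set()
    for root in (PRIMARY, SHADOW):
        absolute = os.path.join(root, relative)
        if os.path.isdir(absolute):
            children.update(os.listdir(absolute))
    return sorted(children)

# listdir_merged('myFolder') -> ['.meta', 'regularChild']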

I think it's important to point out that I'm not necessarily advocating that the 1s and 0s should reside next to each other, even though that would of course be the most straightforward solution. The key thing is to keep content and metacontent logically coupled.

My point here is that it scales. Possibly even beyond that of databases, but at the very least to the same level performance-wise, depending on the implementation.

As mentioned earlier, you can't fully protect against how people will modify the filesystem outside of the OM API, as the metadata is stored next to the source data in plain view. So solutions to work around it would be needed, hence the broker / proxy / etc.

This is true, but at this point I would think it comes down to skills in managing backups. For Pipi, we're doing distributed storage, like Git, so data, including metadata, would exist in multiple places at all times, and securing one spot in particular isn't really much of a concern.

It sounds like you are approaching the concepts of FUSE.

Quite interesting! Had not come across it before, but it doesn't surprise me that something like this already exists. Their implementations are rather complicated compared to what I have in mind, which brings a smile to my face. :)

--

Thanks again Justin, I'm very thankful for your time and inspiration.

As a finish, I started mocking up another aspect that might further test your view on metadata. ;)

Best,
Marcus

Marcus Ottosson

Apr 20, 2014, 6:30:36 AM
to open-m...@googlegroups.com
Justin

Cool man. Well it sounds like some good information is coming out of these discussions...

On Mon, Mar 31, 2014 at 10:38 AM, Marcus Ottosson <konstr...@gmail.com> wrote:
Hi Justin, and thanks a lot for your thorough reply. Where to start..

As per your added reply, Open Metadata is indeed an underlying component of asset management and not asset management itself. In fact, it may be helpful to think of it as a file-format more than anything. The underlying ".meta" folder underneath each hosting folder is then the file, and its innards is its binary layout; only in this case, it isn't binary but plain-text.

Then, who writes what where and on which server and under which load is as separate as writing to any other file. The fact that it isn't a binary may even mean more concurrency possibilities than if it were just a single binary blob.

I'm not sure I can agree on this, since to me it seems like it is the other way around. The fact that your metadata is composed of numerous files means you don't get atomic updates to a complete data structure without having an application server synchronizing all writes. If I want to write a large json structure to a file, that would be a single file to perform ops on. Whereas a multi-file / multi-hierarchy filesystem structure means I need to write into N number of files. So it implies that synchronizing is even more necessary. 
 

This could possibly be clarified in that document I sent, and so I separated this part into another document here to further drive its point home. Essentially, anything that handles logic around Open Metadata is put into a layer above; such as networking and concurrency handling.

I actually had this question about the conflicting names, before I had gotten to that section (What happens if someone makes A.list / A.dict). But how would you suggest this situation realistically be prevented in the system, when the metadata lives on the filesystem right next to its source data? If people are free to set metadata, then they would be free to create conflicting names, as like you said there is no way for OM to stop it since it can't really control the fs data storage completely. 

I'm actually not too worried about obscured metadata. For one, I feel Windows has proven already that accessing executables without specifying their extension works without hassle, even though there is the chance of one name having multiple extensions. I feel rather safe trusting the user with duplicity and besides, there are many ways to keep track of whether or not this is happening; like warning messages or automated clean-ups.

The alternative to this, as with Windows executables, would be to either force user to type a full name, including the extension or to rely on another tactic for figuring out type. And frankly I think both of them pose greater concerns (from a usability standpoint) than that of duplicity in names.

It would probably want to allow concurrent reads, while locking only for writes (like database row locking). If you do atomic writes to the files, it would probably be much safer, like writing to a temporary file first and then copying it over the previous one in one system call. Like, if you were writing a blob (image), then you would probably have to be writing it in chunks under some circumstances, until it is written. And then you just move it over the top of the old one so that anyone reading the previous one is just on the old inode, and any new readers would see the new inode. 

This sounds right and I have no comment on it.

To me it seems like the broker is an absolutely necessary component, unless people really don't care about race conditions in the data access. It would seem like if you are dead set on a filesystem approach, that you would even go the next step and remove write permissions for *everyone* to the metadata, and only allow the broker to do the writes. My friend Matt Chambers, whom I work with at Weta and whom started that Plow render farm project, is a big proponent of the "3 tier system": Data Storage / Application Service / Client. As opposed to the 2 tier system where clients have all the business logic on their side, and modify the data storage directly (database queries, filesystem, ...), in the 3 tier system, you move all the business logic to the Application service, and require that the thin clients issue their requests through it. You gain a number of benefits from this structure. It means you can update the business logic without having to change out every client in play, since if you keep the interfaces compatible, they can continue just accessing the services. You also move the control of data access to the application, which is like your broker, being able to efficiently schedule, queue, notify, track, log, etc. 

This, although interesting, sounds like we may be talking past each other. It sounds like MVC and is something I employ in other areas, but not for Open Metadata itself. Think I covered why in the top part of this reply.

I'm not really sure that MVC is the correct terminology in this case. It wasn't about separating the data source, from the business logic, from the presentation layer. It was more about needing to compensate for a filesystem datastore that is not really shielded from the end users, and moving all of the logic into a service that can centralize all of the operations. Technically you could store all of your metadata in a fast indexed database, and just have your app server write out the read-only metadata files as a fallback format for any client reads that can't or don't want to access the primary datastore. It also helps with the idea of syncing, since when you transfer data to another location, the application server can be populated from the filesystem metadata (missing db data). 
 

As I got further down the spec, it sounded like a broker wouldn't really scale for handling the writes of massive data on behalf of the client. In my head I kept thinking of metadata as mainly values and not binary data. Its possible that an approach might be to have the application server act as a locking service for various metadata locations. If you want to write tons of metadata, then you would probably have to acquire the write-lock on that location.

It think you'll find that my view on what metadata is may differ from the traditional sense. To me, values such as "startFrame" to a shot are metadata as much as playblasts, reference images and preview-geometry are metadata to an asset.

The qualifier is "anything about something", where about is what is important. Don't care much for what data-type we've happened to classify that data as. (key/value pair, video, web-address..)

OR you would acquire a unique id to refer to a private staging area, and once the write is complete, you can merge it into place. 

This is a great idea! As per the Push/Pull pattern in the RFC, I had only thought of doing the temporary storage client-side, but with large files, it would indeed become a problem once more as the copying itself might take some time. Storing it on a temporary spot server-side would solve all that. Perhaps a four step process? Memory --> local temporary spot --> remote temporary spot --> final destination. As locally caching could help network load overall, I think. This would however be something for RFC13/ZOM, the link from above.

Last question... What are the pros and cons of a filesystem based storage for the metadata as opposed to a document-based nosql database like MongoDB?

I'd have to go with simplicity. As metadata, especially writing of metadata, is hardly a performance critical task, I put most of my effort involved in making it fast into instead making it usable.

It may not be critical upfront, until processes that end up writing metadata as part of their pipeline have to wait for those metadata operations to complete. 
i.e. Do A, Do B, Do Metadata, Do C, Do D.  


Slow query capabilities - If you don't know your exact location to open and read, then it would probably be slow to say "What is the value of FOO for all items in my sequence that have the color = blue?". For that you would have to do a filesystem scan of everything to find all of the items that have a color = blue, and then return all the FOO values. This would be equivalent to a constant full table scan in a database on unindexed fields.

Querying I believe is quite a different topic than storage. The way I would approach fast querying would most likely be in the form of caching; either via indexing which many OS's already do, quite successfully too I have to say and the other would be to use a database in which case the benefits are as you say.

Native filesystem caching, or your own caching? It would have to be caching that can actually index *into* the data to let you express "foo==bar". And if it's your own caching, then you would end up needing to run a service anyways, which leads to the path of storing the metadata in an indexed database. Something like MongoDB + GridFS (which can store blobs), or not even storing blobs in the database, and just contextual metadata like the path to the blob. 
 

However I see no reason to bog down the simplicity of metadata, and not to mention to hide away all that data into a strict binary format such as compressed database files given the headaches that come with that such as corruption and migration, for such a feature when a querying mechanism can easily be layered on top.


The storage part is not really my concern either. It's just the ability to inquire into the data in reasonably efficient ways. Like I said earlier, you could have your primary communication go through an application server that creates read-only metadata on the filesystem. Something that is version controlled and permission locked down.
 
Speed of metadata read/writes is coupled to the source data filesystem performance - If someone is hammering the filesystem with normal source data, then your ability to read metadata is also affected.

This is true. But just because metadata resides next to the content, doesn't mean it will have to reside on the same file-system or even computer.

There are a few methods of separating bits on the hard-drive from their visual layout in the OS. One is RAID which would be hardware-bound and another is to symlink, which is file-system-bound a slightly more higher-level one is to simply merge multiple hierarchies upon query, which would be software-bound.

Primary FS
|-- myFolder
|   |-- .meta

Shadow FS
|-- myFolder
|   |-- regularChild

From the users perspective
|-- myFolder
|   |-- .meta
|   |-- regularChild

In either case, no hard-drive containing regular content would have to wake up to access metadata and yet they appear in the same hierarchy.

I think its important to point out that I'm not necessarily advocating that the 1s and 0s should reside next to each other, even though that would of course be the most straight-forward solution. The key thing is to keep content and metacontent logically coupled.

That's true, you can design your filesystems to have mounts and symlinks, etc. 
 

My point here is that it scales. Possibly even beyond that of databases, but at the very least to the same level performance-wise based on your implementation.

Not sure if that is true. A database can keep indexes and caches in memory. I definitely think the blob data needs to live on the filesystem. It's just the part that I consider metadata (which differs from your view) that lets you query on criteria to find the thing you want. 

Marcus Ottosson

Apr 20, 2014, 6:36:09 AM
to open-m...@googlegroups.com
Marcus

I'm not sure I can agree on this, since to me it seems like it is the other way around. Being that your metadata is composed of numerous files means you don't get atomic updates to a complete data structure without having an application server synchronizing all writes

You might be right, but it may also depend on the level of granularity of your atoms. I'm trying to picture a scenario where atomicity facilitates usability and not just technicalities. What would be the benefits of atomic commits at the level of a whole metadata-structure as opposed to happening at each individual file?

It may not be critical upfront, until processes that end up writing metadata as part of their pipeline have to wait for those metadata operations to complete. 
i.e. Do A, Do B, Do Metadata, Do C, Do D.

I'm with you on this, and it has already produced noticeable delays in UI interaction using the previous version of OM. The solution I'm imagining is to memcache. As metadata is read much more than it is written, having such a delay the first time a user accesses new metadata is something I consider manageable. And besides, we're talking about milliseconds, possibly a second or two of delay, both of which are more a matter of keeping the UI alive and artists happy to wait than a show-stopper in itself. If "Do Metadata" involves writing, then writing locally should be fine too and could be completely decoupled from the time taken to perform the whole chain of events. And to be clear, I'm considering memcache more of a method than a process. Meaning that even though it technically means caching frequently accessed attributes to memory, I think a similar approach could be applied for storing things persistently; such as grouping hardlinks of files in a folder per frequent request - like "give me all assets in this film".
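As an illustration of the idea, something along these lines (read_fn and the cache layout are stand-ins, not OM's actual API):

import os

_cache = {}  # absolute path -> (mtime, parsed value)

def cached_read(path, read_fn):
    # The first access pays the filesystem cost; later ones are served from
    # memory until the file's modification time changes.
    mtime = os.path.getmtime(path)
    hit = _cache.get(path)
    if hit is not None and hit[0] == mtime:
        return hit[1]
    value = read_fn(path)
    _cache[path] = (mtime, value)
    return value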

What about accessing metadata from 40,000+ cores on the renderwall, since anyone has the freedom to run jobs that access various APIs and toolsets. 

Is this really the common case though? I'd consider it an edge case and one I would develop separately for. If I had to make a standing approximation, I'd probably suspect performance critical data to be fairly predictable and simply tag it as requiring fast access and have some other process handling synchronisation and availability.

Overall, I consider the use of a database to require a strong reason, and if that reason is nothing but performance then I'm not sure I buy it. Can you think of another reason why a database would be better suited for metadata? My end goal is based on the belief that what works for a feature film studio also works for a small commercial house. It's something I know a lot of people can't realistically envision.

More importantly perhaps, I'd like for our discussion to be more about Open Metadata and its interface and potential uses, rather than a blunt file-system vs. database debate. At the end of the day, it seems it falls back on opinion, understandably so, as the benefits of either are so tightly connected with how they are being used. It may be true that a file-system approach may not be suitable for 40,000 concurrent reads, but then maybe Open Metadata simply isn't suited for such use, or perhaps there are ways of doing it that wouldn't put Open Metadata or any other technology in such a tough spot.

Again, thanks immensely, I consider your input at this point to be enough to be mentioned as a contributor to Open Metadata, if you'd like.

All the best,
Marcus

Marcus Ottosson

Apr 20, 2014, 6:37:57 AM
to open-m...@googlegroups.com
Marcus

Hi Justin,

I'm happy to share with you an early implementation of the specification I presented earlier.

Also in this version is an early implementation of RFC14 - Temporal Metadata; which essentially provides a retrievable history of changes made to metadata.

Have a look at the included examples for general usage and let me know what you think.

Best,
Marcus

Marcus Ottosson

Apr 20, 2014, 6:38:32 AM
to open-m...@googlegroups.com
Hey there,

I got a chance to read through both the source and the new RFC. Got questions/comments!

1)

I don't think libraries are supposed to install logging handlers. Otherwise they start outputting information outside of the library consumer's control. The person using the library would install the logging handlers they want, whether it be a StreamingHandler, a log file, syslog, ... That is, handlers are usually installed by the applications as opposed to the libraries. 

2) 
What would be the recommended workflow for saving a list data structure? Let's say for a given piece of source data I want to track the related other source paths that went into an edit of a final output. So I would want to save out a list of 20 paths to something like "inputs". I see that you have categorized list and dict as Group types, which I assume are like general collection structures for child metadata. But is there a list/dict Dataset? 
I tried saving a list as a dataset but I had trouble figuring out how to read it back in again. It would just produce a Dataset object that couldn't be accessed fully. Is it the kind of thing where you would expect the API saving the list to serialize it to something and save it as a string type Dataset?

Or is it expected to go through a List Group, and create each element in the list as a Dataset, which then saves to disk as one file per list element?

3) 

This equality test for nodes would seemingly produce a false positive for any Node in one location that has the same name as a Node in another location. 
Have you considered using a hash that takes into consideration both the Node type and its full path?     (type, fullpath)
Then you would be sure to not to conflict even with Nodes living under different backend types (database, url, ...)

4)

What would be the approach for preventing excessive data explosion when history is applied to larger blobs? Would you use symlinks if the blobs are the same? Would there be control/config over which types to version/history and which types should not?

Thats all for now!

Marcus Ottosson

Apr 20, 2014, 6:39:01 AM
to open-m...@googlegroups.com
Marcus
 
Hey Justin, thanks for your reply.

I don't think libraries are supposed to install logging handlers. Otherwise they start outputting information outside of the library consumers control. The person using the library would install the logging handlers they want, whether it be a StreamingHandler, a log file, syslog, ... That is, handlers are usually installed by the applications as opposed to the libraries. 

This makes sense. How would you keep logging running while developing, without installing an event logger? Only install it by hand in tests and such?

What would be the recommended workflow for saving a list data structure?

OM list and dict Groups would act like the Python list and dict. When accessed through code, they should behave like a list and dict respectively, but are stored as groups with datasets.

>>> inputs = om.Group('inputs', parent=location)
>>> inputs.data = ['path1', '/path/2', 'another/path']
>>> print inputs.data
[String('path1'), String('/path/2'), String('another/path')]

Each dataset then provides a .data attribute for access to the equivalent Python data-type.

Here, setting data to groups directly is a convenience, as it converts whatever you give it to the equivalent dataset (if one can be determined, it looks at the Python data-type for clues). If it can't find an equivalent dataset, it is considered a blob (like trying to store a class or any Python object really, it might try and store those as pickled binaries but I haven't gone into much about blobs yet, what do you think?)

The point of returning datasets rather than strings directly is both due to consistency and performance. These objects haven't read from the database yet but are mere handles to files that can potentially be read via om.pull(). This way, you could return arbitrarily large hierarchies without knowing or caring about how large their datasets are until you actually need to access a specific dataset.

Additionally, some datasets couldn't be determined by looking at the content, such as Date and Null types (followed by Vector, Point and Matrix datasets etc). You could then use the OM objects for correlation with UI objects so that each get their equivalent display and editor.

But is there a list/dict Dataset? 
I tried saving a list as a dataset but I had trouble figuring out how to read it back in again. It would just produce a Dataset object that couldn't be accessed fully. Is it the kind of thing where you would expect the API saving the list to serialize it to something and save it as a string type Dataset?

A list or dict dataset would be considered blobs and since blobs are not yet very developed you currently couldn't regain access to them very easily.

Here is how I see this go down. Any blob is stored as merely an absolute path to the file you assigned; i.e. the blob would first have to exist as a file or equivalent container capable of hosting "binary" data (as it applies equally to binary formats). When attempting to read a blob, you would then be given this path in return, which you could treat however you would normally treat files (json.load or builtins.open etc)
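In other words, something along these lines (the .blob suffix and helper names are only illustrative, not the actual OM API):

import os

def write_blob(meta_folder, name, source_path):
    # Only a reference is stored: the absolute path to the existing file.
    with open(os.path.join(meta_folder, name + '.blob'), 'w') as f:
        f.write(os.path.abspath(source_path))

def read_blob(meta_folder, name):
    # Reading hands the path back; the caller opens it however it likes,
    # e.g. json.load(open(path)) or a plain builtins.open.
    with open(os.path.join(meta_folder, name + '.blob')) as f:
        return f.read().strip()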

OM should remain very capable of storing arbitrary data, but if you're looking to manage a collection with OM, the recommended way of doing so would be to store them as datasets, either by letting groups handle the conversion, or by first making the contents of your collection into datasets and assigning those to groups.

This equality test for nodes would seemingly produce a false positive for any Node in one location that has the same name with a Node in another location. 
Have you considered using a hash that takes into consideration both the Node type and its full path?     (type, fullpath)
Then you would be sure to not to conflict even with Nodes living under different backend types (database, url, ...)

That's true. I haven't yet encountered an issue, but I can see where this could become a problem. The reason I made it this limited was mainly for tests against existing datasets within a group or location; remember we spoke earlier of conflicting names as datasets were accessed only by name and excluding their suffixes? This test would ensure no duplicates could be created. But I suppose I could make such a test more explicit.

What would be the approach for preventing excessive data explosion when history is applied to larger blobs? Would you use symlinks if the blobs are the same? Would there be control/config over which types to version/history and which types should not?

Thought provoking question, I had not considered it before. Blobs should be capable of being any size, but as you say, if they are too big then storing history could become a bottleneck.

Spontaneously, here is how I see this go down.

Initially, history would only be supported with Datasets. Datasets are supposedly quite small (if not, I would consider the use of them to be inappropriate for the given situation, at least for now) and shouldn't produce any significant bottleneck. If they still do, then the make_history() method could be delegated to a separate process, possibly writing to another disk at first and ultimately transferring data once things calm down. I consider history to be one of the things that aren't necessarily dependent on high performance and whose access isn't required immediately upon creation, and it may therefore be delayed in favour of overall performance.

As for blobs, when and if they support history, what do you think about a Dropbox approach? I.e. de-duplicating large data by chopping them up into chunks and only storing chunks that change as part of an items history. Once stable, this could potentially be used for datasets too, although datasets, being mere text, would possibly fit better with a Git approach; i.e. doing string comparison.

One question for you. As I'm thinking of making Open Metadata compatible with storing the data within a database in addition to file-systems (hence the separation of database access into a service.py module that is meant to run wherever the database is located and accessed remotely if need be), are there methods of storing history in SQL-style databases in general? How do you store history currently? I have little experience with databases, so do go basic on me.

Thanks for the great questions, and hope I managed to answer them for you!

Best,
Marcus 

Marcus Ottosson

Apr 20, 2014, 6:40:48 AM
to open-m...@googlegroups.com
Justin

On Sun, Apr 6, 2014 at 7:45 PM, Marcus Ottosson <konstr...@gmail.com> wrote:
Hey Justin, thanks for your reply.

I don't think libraries are supposed to install logging handlers. Otherwise they start outputting information outside of the library consumers control. The person using the library would install the logging handlers they want, whether it be a StreamingHandler, a log file, syslog, ... That is, handlers are usually installed by the applications as opposed to the libraries. 

This makes sense. How would you keep logging running while developing without installing an event logger? Only install by hand it in tests and such?

Exactly. Your unittests are executable entry points, so they can install handlers. 
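For reference, the usual arrangement looks something like this (the logger name is illustrative):

import logging

# In the library (e.g. openmetadata/__init__.py): no output configuration,
# just silence the "no handler found" warning.
log = logging.getLogger('openmetadata')
log.addHandler(logging.NullHandler())

# In an application or test entry point: opt in to whatever output you want.
if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG,
                        format='%(name)s %(levelname)s: %(message)s')
    log.debug('handlers are installed by the entry point, not the library')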
 

What would be the recommended workflow for saving a list data structure?

OM list and dict Groups would act like the Python list and dict. When accessed through code, they should behave like a list and dict respectively, but are stored as groups with datasets.

>>> inputs = om.Group('inputs', parent=location)
>>> inputs.data = ['path1', '/path/2', 'another/path']
>>> print inputs.data
[String('path1'), String('/path/2'), String('another/path')]

Each dataset then provides a .data attribute for access to the equivalent Python data-type.


Strange. I tried this snippet and got:

inputs.data = ['path1', '/path/2', 'another/path']
AttributeError: can't set attribute

I didn't really investigate the source though.
 
Here, setting data to groups directly is a convenience, as it converts whatever you give it to the equivalent dataset (if one can be determined, it looks at the Python data-type for clues). If it can't find an equivalent dataset, it is considered a blob (like trying to store a class or any Python object really, it might try and store those as pickled binaries but I haven't gone into much about blobs yet, what do you think?)

But I didn't see a List datatype. What does it end up doing with all 20 items in that list? How do they get arranged to disk, under that Group structure?
As for storing pickled binaries, maybe it is better to use json to keep it portable, in case you want to allow non-Python clients under OpenMetadata. Or at least some portable and performant serialization format for data structures. 
 

The point of returning datasets rather than strings directly is both due to consistency and performance. These objects haven't read from the database yet but are mere handles to files that can potentially be read via om.pull(). This way, you could could return arbitrary large hierarchies without knowing or caring about how large their datasets are until you actually need to access a specific dataset.

Additionally, some datasets couldn't be determined by looking at the content, such as Date and Null types (followed by Vector, Point and Matrix datasets etc). You could then use the OM objects for correlation with UI objects so that each get their equivalent display and editor.

But is there a list/dict Dataset? 
I tried saving a list as a dataset but I had trouble figuring out how to read it back in again. It would just produce a Dataset object that couldn't be accessed fully. Is it the kind of thing where you would expect the API saving the list to serialize it to something and save it as a string type Dataset?

A list or dict dataset would be considered blobs and since blobs are not yet very developed you currently couldn't regain access to them very easily.

Is this related to my previous question? So the list of 20 paths would be a blob dataset, although not yet very well supported.
 

Here is how I see this go down. Any blob is stored as merely an absolute path to the file you assigned; i.e. the blob would first have to exist as a file or equivalent container capable of hosting "binary" data (as it applies equally to binary formats). When attempting to read a blob, you would then be given this path in return, which you could treat however you would normally treat files (json.load or builtins.open etc)

I think that is an excellent idea and answers something I had been thinking previously. For some reason I thought you were implying blobs (let's say saving a thumbnail) would be directly stored within the meta system. I think it is much better, like you said, to store blobs as references to already existing concrete data. 
 

OM should remain very capable of storing arbitrary data, but if you're looking to manage a collection with OM, the recommended way of doing so would be to store them as datasets, either by letting groups handle the conversion, or by first making the contents of your collection into datasets and assigning those to groups.

This equality test for nodes would seemingly produce a false positive for any Node in one location that has the same name with a Node in another location. 
Have you considered using a hash that takes into consideration both the Node type and its full path?     (type, fullpath)
Then you would be sure to not to conflict even with Nodes living under different backend types (database, url, ...)

That's true. I haven't yet encountered an issue, but I can see where this could become a problem. The reason I made it this limited was mainly for tests against existing datasets within a group or location; remember we spoke earlier of conflicting names as datasets were accessed only by name and excluding their suffixes? This test would ensure no duplicates could be created. But I suppose I could make such a test more explicit.

Well ya but you can only define one __eq__ method for the class. If someone has a list of Nodes and wants to know if one they have is the same as one they are encountering in some loop, they would probably want to know if A == B. It definitely would handle your specific case where you want to prevent them from being at the exact same location as siblings. 
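A sketch of what that could look like; the path property here is a guess at what a Node might expose, not the actual implementation:

class Node(object):
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    @property
    def path(self):
        # Full path from the root location down to this node.
        if self.parent is None:
            return self.name
        return self.parent.path + '/' + self.name

    def __eq__(self, other):
        # Same type and same full path; comparing names alone would make
        # /shotA/.meta/notes equal to /shotB/.meta/notes.
        return type(self) is type(other) and self.path == other.path

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        return hash((type(self), self.path))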
 

What would be the approach for preventing excessive data explosion when history is applied to larger blobs? Would you use symlinks if the blobs are the same? Would there be control/config over which types to version/history and which types should not?

Thought provoking question, I had not considered it before. Blobs should be capable of being any size, but as you say if they are too big than storing history could become a bottleneck.

Part of this was answered by your other statement that blobs can really just be path references. But I do see that you can have arbitrary data blobs. 

You might want to consider researching this project, Camlistore:

It is authored by Brad Fitzpatrick, the same guy who wrote memcached. He created this system as a way to store anything, forever. It uses all kinds of awesome and fancy tricks for creating references to things, storing only deltas between versions, etc. And everything seems to be a "reference". Watch some of the demos for inspiration. 
 

Spontaneously, here is how I see this go down.

Initially, history would only be supported with Datasets. Datasets are supposedly quite small (if not, I would consider the use of them to inappropriate for the given situation, at least for now) and shouldn't produce any significant bottleneck. If they still do, then the make_history() method could be delegated into a separate process, possibly writing to another disk at first and ultimately transferring data once things calm down. I consider history to be one of the things that aren't necessarily dependent on high-performance and whose access isn't required immediately upon creation and may therefore be delayed in favour of overall performance.

As for blobs, when and if they support history, what do you think about a Dropbox approach? I.e. de-duplicating large data by chopping them up into chunks and only storing chunks that change as part of an items history. Once stable, this could potentially be used for datasets too, although datasets, being mere text, would possibly fit better with a Git approach; i.e. doing string comparison.

Yea, that Camlistore project has a similar approach, where I think he uses 32KB blocks to determine changes. 
But it really could just be an API opt-in thing where you say something tracks version/history. Maybe as a meta-config option for the dataset, like autohistory=True.
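In miniature, the chunked approach could look like this (fixed 32KB chunks and a flat content-addressed store are assumptions for the example):

import hashlib
import os

CHUNK_SIZE = 32 * 1024

def store_version(blob_path, chunk_store):
    # Split the blob into chunks and store each under its content hash.
    # A version is just the ordered list of hashes, so unchanged chunks are
    # shared between versions instead of being copied into history again.
    manifest = []
    with open(blob_path, 'rb') as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            digest = hashlib.sha1(chunk).hexdigest()
            target = os.path.join(chunk_store, digest)
            if not os.path.exists(target):  # the de-duplication step
                with open(target, 'wb') as out:
                    out.write(chunk)
            manifest.append(digest)
    return manifest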
 

One question for you. As I'm thinking of making Open Metadata compatible with storing the data within a database in addition to file-systems (hence the separation of databse access into a service.py module that is meant to run wherever the database is located and accessed remotely if need be), is there methods of storing history in SQL-style databases in general? How do you store history currently? I have little experience with databases so do go basic on me.

History could be in terms of rows in a table or something. It really depends on how you structure your schema. You can have a <foo>_history table which stores the same schema + version fields. Or you can have one table that always contains version fields and just has all of the history items.
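Sketched with sqlite, the <foo>_history idea might look like the following (table and column names are made up for the example):

import sqlite3

db = sqlite3.connect(':memory:')
db.executescript("""
    CREATE TABLE dataset (
        path  TEXT PRIMARY KEY,
        value TEXT
    );
    -- Same columns as the main table, plus versioning fields.
    CREATE TABLE dataset_history (
        path     TEXT,
        value    TEXT,
        version  INTEGER,
        saved_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
""")

def save(path, value):
    # Copy the current row into history before overwriting it.
    db.execute("""INSERT INTO dataset_history (path, value, version)
                  SELECT path, value,
                         (SELECT COUNT(*) FROM dataset_history WHERE path = ?)
                  FROM dataset WHERE path = ?""", (path, path))
    db.execute("INSERT OR REPLACE INTO dataset (path, value) VALUES (?, ?)",
               (path, value))
    db.commit()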