What is the fastest way to insert a group of linked documents?


Eric24

Jun 2, 2016, 11:40:01 PM
to OrientDB
Using orientjs (or a JS or SQL function), I'm trying to insert a related group of documents (up to six or eight at a time) so that the LINKMAP in each document can be set as it is inserted (i.e. D1->D2->D3->D4). I don't want to use edges, because the links are only ever uni-directional, and bi-directional "links" would give me nothing but additional overhead.

Of course, I can do this "sequentially": insert D4, then insert D3 (now knowing the RID for D4), and so on. But I'm looking for a more efficient way. The missing piece of the puzzle is that I don't have any way to know ahead of time what the RID is (or will be). If I have to use the sequential method, my thought would be to do it in a JS or SQL function (to save the communication round-trip latency of doing it in orientjs), but I can't figure out how to capture the RID of each document as it's inserted in a function.
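For reference, the sequential fallback described here can be sketched in plain JavaScript. `insertDoc` below is a stand-in for whatever actually performs the insert (e.g. an orientjs call) and is assumed to resolve to the new record's RID; it is a placeholder for illustration, not a real API, and the LINKMAP key `child` is likewise made up:

```javascript
// Sequential chain insert: insert the deepest document first, then each
// upstream document with a LINKMAP entry pointing at the RID we just got
// back. One round trip per document, which is exactly the cost being
// discussed here.
async function insertChain(insertDoc, docs) {
  let downstreamRid = null;
  const rids = [];
  // Walk from D4 back to D1 so each insert can link to the previous one.
  for (const doc of docs.slice().reverse()) {
    const record = { ...doc };
    if (downstreamRid) record.next = { child: downstreamRid }; // LINKMAP entry
    downstreamRid = await insertDoc(record);
    rids.unshift(downstreamRid);
  }
  return rids; // RIDs for D1..D4, in document order
}
```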

Another idea, which I don't think is much better, would be to insert all of the documents without LINKMAPs and then update them afterward, but that seems even less efficient than the sequential approach.

Any insight would be much appreciated.

scott molinari

Jun 3, 2016, 1:06:05 AM
to OrientDB
Hey Eric. 

If I may ask, what data model do you have (or can you give an example) that requires a unidirectional relationship between classes?

Scott

Eric24

Jun 3, 2016, 8:51:36 AM
to OrientDB
Time-series data is a pretty good representation of what I'm doing (see p.22 of this presentation: http://www.slideshare.net/LuigiDellAquila/orientdb-time-representation). Consider: a) there is never a need to traverse "up" from a lower-level node; b) the data is relatively static (i.e. write once, read mostly); and c) when finally deleting old data (if ever), the delete will also only traverse down.

Given that, and taking into account the large number of nodes, I not only gain nothing from bi-directional links, I specifically don't want to incur the storage overhead of the "reverse" pointers and the additional edge records (even lightweight edges, which I can't use in this case anyway, carry extra storage overhead that provides no value). There has been discussion of supporting uni-directional (or mono-directional) links in ODB, but as far as I can tell, this hasn't happened yet. Thus my approach of using LINKMAPs. Their only downside is that I want to do a single insert with the LINKMAP property already set, and to do that I need to know the RID of the downstream node for each upstream insert operation. That's what I'm trying to optimize.
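A schema along these lines might be declared with DDL like the following sketch. All class and property names are illustrative assumptions, generated here as plain strings; note that plain document classes are used (no `EXTENDS V`), since the whole point is to avoid the graph machinery:

```javascript
// Generate DDL for a document-only time hierarchy, with one LINKMAP per
// level keyed by the child's time-unit value. Links only ever point one
// level "down" the chain, so there are no reverse pointers to store.
const levels = ['Root', 'Year', 'Month', 'Day', 'Hour', 'Minute', 'Sample'];
const ddl = [];
// Create every class first, then the LINKMAP properties that reference
// the next class down.
levels.forEach((l) => ddl.push(`CREATE CLASS ${l}`));
for (let i = 0; i < levels.length - 1; i++) {
  ddl.push(`CREATE PROPERTY ${levels[i]}.children LINKMAP ${levels[i + 1]}`);
}
console.log(ddl.join('\n'));
```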

scott molinari

Jun 5, 2016, 9:48:11 AM
to OrientDB
You say your problem is inserting a number of documents at the same time. At what point in storing the time series data is that necessary? 

Scott

Eric Lenington

Jun 5, 2016, 12:09:14 PM
to OrientDB
The time-series data is stored in a "cascade" of linked records (I'm currently using LINKMAPs in documents, but the same could be done with vertices and edges). There are several possible structures, but one is root->year->month->day->hour->minute->sample (this one allows multiple samples per minute; and yes, the samples could be stored as an EMBEDDEDMAP in the "minute" record, but that doesn't change the original issue). So let's say I'm storing a sample for 2016-06-01 00:00:00, and this is the first sample stored in that month and day. The "root" and "year" records exist, but none of the others do, so I need to create month->day->hour->minute->sample at once. A sample that comes in a few seconds later would just add on to the "minute" record, while the sample that comes in at 00:01:00 would need to create a new "minute" record and link it from the corresponding "hour" record.
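The per-level keys for a given timestamp can be derived with a small helper; a minimal sketch (field names are illustrative, and UTC is assumed):

```javascript
// Split a timestamp into the per-level keys used to walk the chain
// root -> year -> month -> day -> hour -> minute.
function levelKeys(date) {
  return {
    year: date.getUTCFullYear(),
    month: date.getUTCMonth() + 1, // getUTCMonth() is 0-based
    day: date.getUTCDate(),
    hour: date.getUTCHours(),
    minute: date.getUTCMinutes(),
  };
}
```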

So the solution I've arrived at since asking the original question is to build a SQL batch dynamically, based on the last known sample time for a given time-series. Knowing that time tells me what "chain" of records already exists and what needs to be created, ranging from a single new record (with one "upstream" link update) to as many as six new records in a linked chain. This works and seems fairly efficient, but maybe there's a better way?
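A hedged sketch of that dynamic batch construction, assuming illustrative class/property names: OrientDB's SQL batch `LET` variables capture the RID of each `INSERT` so the next statement can reference it, which also addresses the original RID-capture question in a single round trip. The sample key `"0"` and the `children` property are assumptions, not from the thread:

```javascript
// Given the keys of the last known sample time and the new sample time,
// work out where the existing chain ends, then emit one SQL batch that
// creates the missing records (deepest first, so each LET variable can be
// referenced by the next insert's LINKMAP) and patches a single upstream
// link on the last record that already exists (identified by parentRid).
const LEVELS = ['year', 'month', 'day', 'hour', 'minute'];
const CLASSES = { year: 'Year', month: 'Month', day: 'Day', hour: 'Hour', minute: 'Minute' };

function firstMissingLevel(lastKeys, newKeys) {
  for (let i = 0; i < LEVELS.length; i++) {
    if (!lastKeys || lastKeys[LEVELS[i]] !== newKeys[LEVELS[i]]) return i;
  }
  return LEVELS.length; // whole chain exists; just link the new sample
}

function buildBatch(lastKeys, newKeys, sampleValue, parentRid) {
  const miss = firstMissingLevel(lastKeys, newKeys);
  const lines = ['BEGIN;', `LET s = INSERT INTO Sample SET value = ${JSON.stringify(sampleValue)};`];
  let prev = 's';
  for (let i = LEVELS.length - 1; i >= miss; i--) {
    // Key under which this record's child is stored in its LINKMAP.
    const childKey = i === LEVELS.length - 1 ? 0 : newKeys[LEVELS[i + 1]];
    lines.push(`LET v${i} = INSERT INTO ${CLASSES[LEVELS[i]]} SET children = {"${childKey}": $${prev}};`);
    prev = `v${i}`;
  }
  // One upstream update on the deepest record that already exists.
  const linkKey = miss === LEVELS.length ? 0 : newKeys[LEVELS[miss]];
  lines.push(`UPDATE ${parentRid} SET children["${linkKey}"] = $${prev};`);
  lines.push('COMMIT RETRY 10;', 'RETURN $s;');
  return lines.join('\n');
}
```

For a sample arriving in a new month, this yields four `LET ... INSERT` statements (month through minute) plus the sample insert and one `UPDATE` on the existing year record.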

By the way, I would have preferred a JavaScript function to a client-generated SQL batch, but I got too frustrated trying to "discover" the JavaScript function environment and syntax by trial and error. As far as I can tell, there is no documentation beyond a few examples here and there on the objects and functions available in the JavaScript function environment. Are you aware of any?



scott molinari

Jun 6, 2016, 1:48:44 AM
to OrientDB
No, I'm not aware of any more docs on custom JS functions.

I'm still uncertain about not using edges and vertices. For instance, how would you do aggregations (if you ever need them) without being able to traverse upward in the hierarchy? You're basically limiting yourself, aren't you?

The only thing that comes to mind that might help is pre-allocating, say, a day's worth of records: create the time-based nodes before the time actually arrives (the day before?), so that once data comes in, it's only a matter of updating the nodes according to each sample's timestamp. This could be like your current batch job, only run earlier, to pre-build the time-series hierarchy. It seems plausible to me because the time-series data has a fixed, or at least known, structure. The only disadvantage is that if your data has holes in time (i.e. you don't have data for every minute of the day), you'd end up with nodes holding no data. But theoretically you could clean those up the day after, too. I don't know, just throwing out some ideas... :-)
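The pre-allocation idea could be sketched as a batch generator run the day before. Class and property names are assumptions carried over from earlier sketches, and in practice a statement list this size would likely need to be split into several smaller batches rather than sent as one:

```javascript
// Pre-build one day of the hierarchy: 1440 minute records and 24 hour
// records, each hour's LINKMAP populated with its minutes as they are
// created, all rolled up into a single day record.
function preallocateDay(year, month, day) {
  const lines = ['BEGIN;'];
  const hourVars = [];
  for (let h = 0; h < 24; h++) {
    const minuteVars = [];
    for (let m = 0; m < 60; m++) {
      const v = `m${h}_${m}`;
      lines.push(`LET ${v} = INSERT INTO Minute SET children = {};`);
      minuteVars.push(`"${m}": $${v}`);
    }
    const hv = `h${h}`;
    lines.push(`LET ${hv} = INSERT INTO Hour SET children = {${minuteVars.join(', ')}};`);
    hourVars.push(`"${h}": $${hv}`);
  }
  lines.push(`LET d = INSERT INTO Day SET date = "${year}-${month}-${day}", children = {${hourVars.join(', ')}};`);
  lines.push('COMMIT RETRY 10;', 'RETURN $d;');
  return lines;
}
```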

Scott  

Eric Lenington

Jun 6, 2016, 8:54:42 AM
to OrientDB
That's too bad. If anyone from ODB is reading this thread, please make documentation a higher priority! This is too complex a product to learn entirely by trial and error. In this particular case, my preference would have been custom JS functions, but after burning several days trying to guess how to make them work, I settled on a custom SQL function instead (and I'm a JS expert).

So far, I've not been able to identify any case where I need to traverse "upwards". To aggregate, all queries still start at the "root". Given that, I only need uni-directional links, but the links need to be "identified" with their time-unit value (i.e. if we're at the month level linking to days, I need to know which day each link refers to). That leaves me with either a LINKMAP (using the key as the time-unit value) or an edge with a property. So even the proposed, but as yet unimplemented, uni-directional links would still be a heavier solution than the LINKMAP.

Yes, I'm considering some form of pre-allocation, although I'm not yet convinced it's much better from a performance standpoint. What I do think will help is to pre-allocate space for the LINKMAP in each record when it's first created. This comes back to part of my other questions about how much space various property types actually use on disk. For example, to make the LINKMAP updates most efficient, I'd want to pre-allocate enough space that the many updates don't each end up moving the record (rather than updating it in place). As it stands, I'm having to determine this through experimentation rather than by just reading the docs (which are apparently not just incomplete but also out of date), so I don't have a solution yet.
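One way to sketch the LINKMAP pre-sizing idea: insert the record with its map already populated with every expected key, pointing at a sentinel record, so later updates overwrite entries in place instead of growing the map. This is purely an assumption; whether it actually prevents record relocation depends on how OrientDB serializes LINKMAP values, which is exactly what's being determined by experiment here:

```javascript
// Build a LINKMAP literal with keyCount placeholder entries, all pointing
// at a sentinel RID (e.g. a dedicated dummy record), to reserve space up
// front. Class name "Hour" and property "children" are illustrative.
function placeholderMap(keyCount, sentinelRid) {
  const entries = [];
  for (let k = 0; k < keyCount; k++) entries.push(`"${k}": ${sentinelRid}`);
  return `{${entries.join(', ')}}`;
}

// e.g. an Hour record pre-sized for 60 minutes:
const presized = `INSERT INTO Hour SET children = ${placeholderMap(60, '#9:0')};`;
```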


--

scott molinari

Jun 6, 2016, 9:33:24 AM
to OrientDB
Yes, most definitely pre-allocate for any data in the records. This is also something MongoDB (sort of) recommends when doing time-series data with their database (which is a pure document database).

http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb (see next to last paragraph before the conclusion).

Or here: http://learnmongodbthehardway.com/schema/chapter6/

:-)

Scott

Eric Lenington

Jun 6, 2016, 10:07:16 AM
to OrientDB
Thanks for the links. Looks like useful info.
