Re: [rhino-tools-dev] Rhino.Etl and nested collections

Nathan Palmer

May 10, 2013, 4:58:06 PM
to rhino-t...@googlegroups.com
A couple of things I can think of offhand:

1) Create an operation to yield out Frames. Then create an AbstractOperation that re-queries SQL for the corresponding Stacks and Tasks for each frame and wires those together into one document.

2) Flatten the Frame/Stack/Task relationship out of SQL and order it by Frame, Stack, Task. When the "FrameId" changes, create a new frame document, and as you loop through, append the stacks and tasks to their corresponding collections. You'll need the same "StackId changes" logic for stacks as well, but since the rows are ordered it should construct the document correctly.
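
A minimal sketch of option 2, written in plain Python rather than against the Rhino.Etl API so the grouping logic stands on its own. It assumes the rows arrive already sorted by FrameId, StackId, TaskId (for example from a LEFT JOIN with an ORDER BY); the function name `build_frames` and the `_id`, `stacks`, and `tasks` field names are illustrative choices, not anything from the original schema.

```python
from typing import Any, Dict, Iterable, Iterator

Row = Dict[str, Any]


def build_frames(rows: Iterable[Row]) -> Iterator[Row]:
    """Yield one nested Frame document per FrameId.

    Assumes rows are sorted by FrameId, StackId, TaskId.
    """
    frame = None
    stack = None
    for row in rows:
        # A change in FrameId closes the current frame and opens a new one.
        if frame is None or row["FrameId"] != frame["_id"]:
            if frame is not None:
                yield frame
            frame = {"_id": row["FrameId"], "stacks": []}
            stack = None
        # A LEFT JOIN yields NULLs for a frame that has no stacks.
        if row["StackId"] is None:
            continue
        # A change in StackId (or a fresh frame) opens a new stack.
        if stack is None or row["StackId"] != stack["_id"]:
            stack = {"_id": row["StackId"], "tasks": []}
            frame["stacks"].append(stack)
        if row["TaskId"] is not None:
            stack["tasks"].append({"_id": row["TaskId"]})
    # Emit the last frame, which no later row will close.
    if frame is not None:
        yield frame
```

Because only one frame is held at a time, memory use stays proportional to the largest single document rather than to the whole result set. Answers would need one more level of the same change-detection logic.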

Nathan Palmer


On Thu, May 9, 2013 at 4:04 PM, TJ Roche <tdr...@gmail.com> wrote:
Hello all,
I am using Rhino.Etl to handle a migration from SQL Server into MongoDB. I have a fairly complicated document that I am trying to pull out of several tables in SQL and put into Mongo. I am getting hung up on the individual processing of the items in the pipelines.

The structure looks similar to this:

A Frame has a collection of Stacks.
A Stack has a collection of Tasks.
A Task has a collection of Answers.

The whole Frame document will be inserted into mongo with the sub collections.

Do I need to create an abstract operation to fetch data and just yield the entire collection? Do I use a Join operation? How about a Nested Loops Join (I am not really 100% sure what this is used for)?


The insertion into Mongo is actually fairly seamless for most of the standard pieces, but I am struggling a bit here.


Any help would be greatly appreciated.
Thanks


Nathan Palmer

May 13, 2013, 3:06:24 PM
to TJ Roche, rhino-t...@googlegroups.com
Awesome. I'd love to see the finished version.

Nathan Palmer


On Mon, May 13, 2013 at 1:41 PM, TJ Roche <tdr...@gmail.com> wrote:
I think the solution is going to be to create a "FunnelingOperation" (opposite of a branching operation), and start at the furthest leaf of the collection tree and fill up from the bottom.  I have to finish writing it and testing it but once I am happy with it I will post the code.
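
One way to picture the "fill up from the bottom" idea, as a rough Python sketch and not the FunnelingOperation itself (which was never posted). It assumes each level has already been fetched as a list of dicts carrying an `id` and the id of its parent; the helper names `attach_children` and `funnel` and all key names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]


def attach_children(parents: List[Row], children: List[Row],
                    parent_key: str, field: str) -> List[Row]:
    """Group children by the parent id they carry, then hang each group
    on its parent under `field`. Mutates and returns `parents`."""
    grouped = defaultdict(list)
    for child in children:
        grouped[child[parent_key]].append(child)
    for parent in parents:
        parent[field] = grouped.get(parent["id"], [])
    return parents


def funnel(frames: List[Row], stacks: List[Row],
           tasks: List[Row], answers: List[Row]) -> List[Row]:
    """Assemble Frame documents starting from the deepest level."""
    tasks = attach_children(tasks, answers, "task_id", "answers")
    stacks = attach_children(stacks, tasks, "stack_id", "tasks")
    return attach_children(frames, stacks, "frame_id", "stacks")
```

Note that this sketch holds every level in memory at once before any frame can be emitted.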

TJ Roche

May 23, 2013, 11:25:45 AM
to rhino-t...@googlegroups.com, TJ Roche, em...@nathanpalmer.com
Well let me just say ETL is hard ;)  
The funneling operation turned out to be a resounding failure. Whether through my own lack of skill or knowledge, I just couldn't get it to be reasonably performant (at least by what I deem reasonable) on a data set of any real size.

So I ended up nixing the funneling operation and decided on a different tactic. I created something called FetchAsCollection, which passes the output of a DB command through a merge-rows function that takes a Row and an IEnumerable<Row>, allowing you to deposit the collection and parse it however you need.
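
The shape described above can be sketched roughly as follows. This is not the code from the gist and does not use the Rhino.Etl API; it is a Python approximation built on the standard `sqlite3` module, and the function name `fetch_as_collection` and its parameters (`sql`, `param_key`, `merge`) are assumptions made for illustration.

```python
import sqlite3
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def fetch_as_collection(
    conn: sqlite3.Connection,
    rows: Iterable[Row],
    sql: str,
    param_key: str,
    merge: Callable[[Row, List[Row]], Row],
) -> Iterator[Row]:
    """For each incoming row, run `sql` with row[param_key] as its single
    parameter, collect the whole result set, and hand both to `merge`."""
    for row in rows:
        cursor = conn.execute(sql, (row[param_key],))
        columns = [col[0] for col in cursor.description]
        children = [dict(zip(columns, record)) for record in cursor.fetchall()]
        # The merge function decides where the collection lands on the row.
        yield merge(row, children)
```

A merge function such as `lambda row, tasks: {**row, "tasks": tasks}` would deposit the fetched rows as a nested collection. The trade-off is one query per incoming row, with each "bucket" of child rows buffered before it moves down the pipeline.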

I may have reinvented the wheel for some pieces, but it appears to work fairly well for my uses. My next task involves creating an AntiJoin for SQL so that I can dedupe any existing records and use the SQL bulk insert.

Here is the code for FetchAsCollection, should anyone want it. Feel free to critique, change, improve, etc.: https://gist.github.com/anonymous/6c30878329d4c5817731

Nathan Palmer

May 23, 2013, 1:12:59 PM
to rhino-t...@googlegroups.com, TJ Roche
So first I'll say that I've personally used Rhino Etl with good speed on translations of up to 1 billion rows. I'm curious where the speed bump in your example is coming from. You are effectively caching each "bucket" of data and then passing that along. How many rows are you dealing with, and what are the operations doing with them afterward? The only time I've done this type of thing to speed it up was when I was dealing with a bottleneck on serialization. Even then, though, it made a fairly minimal difference.

Nathan Palmer