Timeouts with etl

47 views
Skip to first unread message

Wayne Douglas

unread,
May 5, 2013, 10:42:42 AM5/5/13
to rhino-t...@googlegroups.com
I'm getting timeout issues when doing a select on a huge data set - it times out in the select mid way through processing. 

Is there anyway to page through huge data sets to stop this from happening?

Nathan Palmer

unread,
May 5, 2013, 11:15:57 AM5/5/13
to rhino-t...@googlegroups.com
Depends. Since it's timing out mid way through processing I assume manipulating the command and/or connection timeout won't help you. I assume you are running into a MVCC timeout (depends on backend database or configuration.) The best thing might be to page based on content. Find a good way to split the data and create an operation that runs first to grab the distinct values. Then when you get to the main select you filter on that value. To give an example if we're dealing with addresses you can have this.

operation GetDistinctStates
operation GetData
operation ProcessData

GetDistinctStates will get a list of states within your dataset
GetData will process 1 state at a time (and yield it to the next operation)
ProcessData will do what it's doing today.

Nathan Palmer


On Sun, May 5, 2013 at 7:42 AM, Wayne Douglas <wa...@prjatk.com> wrote:
I'm getting timeout issues when doing a select on a huge data set - it times out in the select mid way through processing. 

Is there anyway to page through huge data sets to stop this from happening?

--
You received this message because you are subscribed to the Google Groups "Rhino Tools Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rhino-tools-d...@googlegroups.com.
To post to this group, send email to rhino-t...@googlegroups.com.
Visit this group at http://groups.google.com/group/rhino-tools-dev?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Wayne Douglas

unread,
May 7, 2013, 5:42:18 AM5/7/13
to rhino-t...@googlegroups.com, em...@nathanpalmer.com
Hi

Thanks for that, firstly, what is an MVCC timeout/

Also - I have now broken the data out by club but how can I feed that club id into the next input operation? Convention input operation doesn't have the ability to pass params to the sproc that is executed to fetch the data? 

I have created my own input operation which can do this but it is slow as hell. It accepts a single param as that is all I need for now. Here is the code: http://monobin.com/1568/

w

Nathan Palmer

unread,
May 7, 2013, 8:26:37 AM5/7/13
to rhino-t...@googlegroups.com
MVCC being Multi Version Concurrency Control. When it's enabled (and those systems that support it) will give you a consistent result based on the time you started the query. So if any rows in the resultset change while you streaming them out you will get the original value (as it was at the time of start.) However there is a time limit to how long it will hold onto it. It's generally really long though and I have only run into it myself on billion record tables when it gets near 24 hours of runtime.

Everything looks just fine on your operation. My guess is there is some slowdown executing the query. Does the table have an index on club? How much time is spent waiting for the initial set of results? What is the cardinality of club? You may want to find something that has less variation so it's only a few queries (versus a lot of queries.)

Nathan
Reply all
Reply to author
Forward
0 new messages