Rhino ETL and huge (2 billion+) data

TJ Roche

Jun 5, 2013, 7:14:03 PM
to rhino-t...@googlegroups.com
So I have around 2.5 billion records that I am trying to move from SQL Server on one system to SQL Server on another system.

Obviously this is slightly beyond the bounds of the standard input command -> batch/bulk command scenario.

Is there an accepted methodology to use here?

The approach that keeps tickling my brain is to split the query using some combination of: querying the system partition tables (sys.partitions) to get the row count, the paged SQL described here http://blog.sqlauthority.com/2013/04/14/sql-server-tricks-for-row-offset-and-paging-in-various-versions-of-sql-server/, and some kind of parallel PartialProcessOperation. That gets me decent throughput, 10 million rows in around 40 seconds, but when I point it at the full result set I get a SqlTimeoutException.
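Roughly, the split I have in mind looks like this (just a sketch; the table and column names are made up, and OFFSET/FETCH assumes SQL Server 2012+, the blog post above covers the ROW_NUMBER() variants for older versions):

// Sketch only: generate one paged SELECT per chunk so each chunk can be
// handled by its own read operation. dbo.BigTable and its columns are
// made-up names.
using System.Collections.Generic;

public static class PagedQuerySplitter
{
    public static IEnumerable<string> BuildPageQueries(long totalRows, int pageSize)
    {
        for (long offset = 0; offset < totalRows; offset += pageSize)
        {
            yield return string.Format(
                "SELECT Id, ColA, ColB FROM dbo.BigTable ORDER BY Id " +
                "OFFSET {0} ROWS FETCH NEXT {1} ROWS ONLY",
                offset, pageSize);
        }
    }
}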

Which reveals a fun little bit of Microsoft shenanigans: apparently the same timeout exception is also thrown if you have exceeded the number of available connections in the pool, or if they are all busy when a request comes in.

So I may be over-parallelizing the query. *sigh*
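For what it's worth, this is the shape of the throttle I'm experimenting with to keep from exhausting the pool (a sketch; the connection string and the downstream handling are placeholders, and since the default Max Pool Size is 100, the degree of parallelism needs to stay well under that):

// Sketch: cap the degree of parallelism below the pool size so the page
// reads can't exhaust the connection pool.
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

public static class ThrottledReader
{
    public static void Run(IEnumerable<string> pageQueries, string connStr)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };
        Parallel.ForEach(pageQueries, options, sql =>
        {
            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.CommandTimeout = 0; // 0 = no command timeout; use deliberately
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // push the row into the pipeline here
                    }
                }
            }
        });
    }
}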

Any help would be greatly appreciated. 

Nathan Palmer

Jun 6, 2013, 6:53:50 PM
to rhino-t...@googlegroups.com
Well, 2.5 billion is a large set :) 

When I was dealing with data of that size, the way I avoided timeouts was to disable parallel processing in certain parts and enable it in others. How best to do that depends on your data and structure, though. Is it a single table? I was dealing with multiple large tables, so I kept each table as a single stream but ran the tables concurrently. I also had to break the export up into chunks of 100k rows, since I was doing a geographical conversion (moving the data about 3,000 miles, so bandwidth was a consideration).
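Roughly the shape I ended up with, as a sketch rather than the actual code (it assumes each table has an integer identity key, and the names and helper methods are made up):

// Each table is one sequential stream of fixed-size chunks; the tables
// themselves run concurrently with each other.
using System.Threading.Tasks;

public static class PerTableStreams
{
    const int ChunkSize = 100000;

    public static void Run(string[] tables)
    {
        // tables run in parallel; rows within a table stay sequential
        Parallel.ForEach(tables, CopyTableInChunks);
    }

    static void CopyTableInChunks(string table)
    {
        long maxId = GetMaxId(table); // e.g. SELECT MAX(Id) FROM <table>
        for (long lo = 0; lo <= maxId; lo += ChunkSize)
        {
            CopyRange(table, lo, lo + ChunkSize - 1); // one 100k chunk
        }
    }

    static long GetMaxId(string table) { /* query the source */ return 0; }
    static void CopyRange(string table, long lo, long hi) { /* read + bulk write */ }
}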

Nathan Palmer


TJ Roche

Jun 10, 2013, 7:15:58 PM
to rhino-t...@googlegroups.com, em...@nathanpalmer.com
It is a single large table, fairly narrow, mostly ints. I have a couple of other tables hovering right around the half-billion-row mark, but I am reasonably okay with those; I have had reasonable luck splitting them out based on cardinality, or moving their processing into different parts of my migration routine.

I am only moving it in-house, so I am not necessarily worried about bandwidth; I am more concerned with speed and time.

I have a 36-hour downtime window, and if at all possible I would like to finish my migration within that window.

Nathan Palmer

Jun 11, 2013, 9:35:25 AM
to rhino-t...@googlegroups.com
I'm not sure I'd attempt the kind of paging shown on the SQLAuthority blog on a table as large as that. Is there an identity field (a sequential number) you can use to split it up instead? Also, have you tried disabling transactions? For LARGE sets that can usually speed things up and avoid timeouts; of course, you get partial data if it fails.
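To illustrate what I mean by splitting on the identity field, roughly (just a sketch, all the table/column names are made up): each chunk becomes a range seek on the key, so a chunk near the end of the table costs about the same as one near the start, whereas OFFSET-style paging has to skip over every preceding row first.

// Sketch of splitting on an identity range; the caller assigns a
// connection to each command before executing it.
using System.Collections.Generic;
using System.Data.SqlClient;

public static class RangeSplitter
{
    public static IEnumerable<SqlCommand> BuildRangeCommands(long maxId, long chunkSize)
    {
        for (long lo = 0; lo <= maxId; lo += chunkSize)
        {
            var cmd = new SqlCommand(
                "SELECT Id, ColA, ColB FROM dbo.BigTable " +
                "WHERE Id BETWEEN @lo AND @hi");
            cmd.Parameters.AddWithValue("@lo", lo);
            cmd.Parameters.AddWithValue("@hi", lo + chunkSize - 1);
            yield return cmd;
        }
    }
}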

Nathan Palmer

Jeffrey Becker

Jun 11, 2013, 9:42:43 AM
to rhino-t...@googlegroups.com, em...@nathanpalmer.com

Just out of curiosity, have you looked at your I/O metrics on the query side to figure out whether the parallelism is hurting you?

Vladimir Giszpenc

Aug 6, 2014, 3:42:27 PM
to rhino-t...@googlegroups.com, em...@nathanpalmer.com

Using SqlBulkCopy with a reader is up to the task if there is no transformation involved.
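The basic shape, roughly (connection strings, table and column names here are just placeholders):

// Stream rows from a source reader straight into the destination table
// with SqlBulkCopy; rows are never buffered in full in memory.
using System.Data.SqlClient;

public static class BulkCopyExample
{
    public static void Copy(string srcConnStr, string dstConnStr)
    {
        using (var src = new SqlConnection(srcConnStr))
        using (var cmd = new SqlCommand("SELECT Id, ColA, ColB FROM dbo.BigTable", src))
        {
            cmd.CommandTimeout = 0; // long-running read
            src.Open();
            using (var reader = cmd.ExecuteReader())
            using (var bulk = new SqlBulkCopy(dstConnStr, SqlBulkCopyOptions.TableLock))
            {
                bulk.DestinationTableName = "dbo.BigTable";
                bulk.BatchSize = 100000;     // commit every 100k rows
                bulk.BulkCopyTimeout = 0;    // no timeout on the write side
                bulk.EnableStreaming = true; // .NET 4.5+: don't buffer the reader
                bulk.WriteToServer(reader);  // pulls rows as it writes
            }
        }
    }
}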