How many connections to the Databases?

Shekhar Sharma

May 8, 2015, 9:55:06 PM

to lingua...@googlegroups.com

Lingual on Cascading uses an MR job to move data to and from the database, much like Sqoop. Does every mapper make its own connection to the database, or how does it work?

I believe not every mapper makes a connection, because if the data is huge it will span thousands of mappers, and we cannot have that many connections to the database.

Can you please explain in a bit more detail?

Regards,

Som

Joe Posner

May 9, 2015, 10:58:36 AM

to lingua...@googlegroups.com

The Redshift tap will save the data to a file on HDFS and then use a single connection to issue the COPY command to load that data into Redshift.

Joe Posner

May 9, 2015, 11:01:03 AM

to lingua...@googlegroups.com

Sorry, typo. The data is saved to files (plural), and then a single connection issues the COPY command to load those files.
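
To make the "many files, one connection" idea concrete, here is a minimal sketch of the single statement that would be issued once the mappers have finished writing their part files. It is not the tap's actual code. It assumes the staging directory is reachable by Redshift as an S3 prefix, and the class name, method name, table, bucket, and credential string are all hypothetical placeholders.

```java
// Illustrative only: builds the one COPY statement that loads every staged
// part file under a common prefix. Not taken from the Redshift tap's source.
class CopyStatementSketch {

    // All arguments are assumed to be trusted configuration values;
    // no SQL escaping is attempted in this sketch.
    static String buildCopy(String table, String stagingPrefix,
                            String credentials, String delimiter) {
        return "COPY " + table
             + " FROM '" + stagingPrefix + "'"
             + " CREDENTIALS '" + credentials + "'"
             + " DELIMITER '" + delimiter + "'";
    }

    public static void main(String[] args) {
        // Placeholder values, not real resources.
        String sql = buildCopy(
            "public.events",
            "s3://example-bucket/staging/run-1/",
            "aws_access_key_id=KEY;aws_secret_access_key=SECRET",
            ",");
        // A single JDBC connection would execute this one statement;
        // the mappers themselves never connect to the database.
        System.out.println(sql);
    }
}
```

Because COPY takes a prefix, the number of part files written by the job does not change the number of database connections needed for the load.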

Andre Kelpe

May 10, 2015, 9:23:20 AM

to lingua...@googlegroups.com

If you are referring to the cascading-jdbc provider for Lingual, then yes: every mapper will open a connection. Each mapper runs in its own JVM, a separate process, so there is no other way than opening one connection per mapper.
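
As a rough way to reason about the load this puts on the database: if each mapper holds its connection only while it is running, the peak number of simultaneous connections is bounded by how many mappers run at the same time, not by the total number of mappers in the job. The sketch below only illustrates that arithmetic. It assumes the connection is closed when the task ends and that the cluster caps concurrent map tasks; the class and parameter names are hypothetical.

```java
// Back-of-the-envelope estimate, not a measurement of any real cluster.
class ConnectionEstimate {

    // totalMappers: map tasks the job will launch overall.
    // concurrentMapSlots: how many map tasks the cluster runs at once (assumed cap).
    static int peakConnections(int totalMappers, int concurrentMapSlots) {
        // One connection per running mapper, so the peak is whichever is smaller.
        return Math.min(totalMappers, concurrentMapSlots);
    }

    public static void main(String[] args) {
        // Example figures chosen for illustration only.
        System.out.println(peakConnections(1000, 40));
        System.out.println(peakConnections(10, 40));
    }
}
```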

- André

--

André Kelpe
an...@concurrentinc.com
http://concurrentinc.com

Shekhar Sharma

May 11, 2015, 4:08:27 AM

to lingua...@googlegroups.com

If the number of mappers is large, even just 100, you cannot have that many connections to the database.

Joe, it doesn't make sense to me why it has to create the files on HDFS again and then make a connection. The files are already present, right?

Joe Posner

May 11, 2015, 12:57:10 PM

to lingua...@googlegroups.com

For large data sizes, Redshift's optimized HFS-based COPY loading process will be more efficient than JDBC, which is why the Tap stages the data and then loads it. As with any performance optimization, there are always cases where the general approach isn't appropriate for your specific case. For example, if you just want to insert a single row into a non-indexed table, it's faster to just use JDBC. And whether or not the data already exists depends entirely on where in your workflow you're using the Tap; the defaults assume that you're doing some data transformation in Cascading and not just loading existing data directly.

If you want the tap to use JDBC directly instead of the HFS-based COPY, set the "usedirectinsert" protocol property to "true".
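
To illustrate the switch described above, here is a small model of how a property like "usedirectinsert" could select between the two load paths. This is only a sketch of the decision, not the tap's implementation: the class, enum, and method names are hypothetical, and the default of "false" is an assumption based on the staging behaviour being described as the default.

```java
import java.util.Properties;

// Toy model of the choice between staging plus COPY and direct JDBC inserts.
class LoadPathSketch {

    enum LoadPath { STAGE_AND_COPY, DIRECT_JDBC_INSERT }

    static LoadPath choose(Properties protocolProperties) {
        // Assumed default: stage the data and load it with COPY.
        String value = protocolProperties.getProperty("usedirectinsert", "false");
        return Boolean.parseBoolean(value)
            ? LoadPath.DIRECT_JDBC_INSERT
            : LoadPath.STAGE_AND_COPY;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(choose(props));            // no property set
        props.setProperty("usedirectinsert", "true");
        System.out.println(choose(props));            // property set to "true"
    }
}
```

With direct inserts, the per-mapper connection behaviour described earlier in the thread applies again, so this setting trades fewer moving parts for more database connections.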
