All,
I have a Cascading job that tries to read from a Cassandra 2.0.1 CQL table. I use the following Cascading tap:
https://github.com/ifesdjeen/cascading-cassandra
I get the following error on EMR:
2013-10-06 19:09:36,136 ERROR org.apache.thrift.transport.TSocket (pool-14-thread-421): Could not configure socket.
java.net.SocketException: Too many open files
at java.net.Socket.createImpl(Socket.java:447)
at java.net.Socket.getImpl(Socket.java:510)
at java.net.Socket.setSoLinger(Socket.java:984)
at org.apache.thrift.transport.TSocket.initSocket(TSocket.java:116)
at org.apache.thrift.transport.TSocket.<init>(TSocket.java:107)
at org.apache.thrift.transport.TSocket.<init>(TSocket.java:92)
at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:40)
at org.apache.cassandra.hadoop.ConfigHelper.createConnection(ConfigHelper.java:560)
at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSubSplits(AbstractColumnFamilyInputFormat.java:272)
at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.access$200(AbstractColumnFamilyInputFormat.java:62)
at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat$SplitCallable.call(AbstractColumnFamilyInputFormat.java:222)
at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat$SplitCallable.call(AbstractColumnFamilyInputFormat.java:207)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
2013-10-06 19:09:36,139 ERROR org.apache.thrift.transport.TSocket (pool-14-thread-421): Could not configure socket.
java.net.SocketException: Too many open files
at java.net.Socket.createImpl(Socket.java:447)
The default file limit (ulimit -n) for Hadoop is 32768, which already seems high. Why is the Cascading tap creating so many sockets, and how do I reduce the number of socket connections from EMR to the Cassandra cluster?
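To check whether the limit seen by the JVM really is 32768 and how close the process gets to it, something like the following could be run on the node that computes the splits. This is only a rough sketch: it assumes a HotSpot/OpenJDK JVM on a Unix-like OS, where the platform MXBean implements com.sun.management.UnixOperatingSystemMXBean, and the class name FdUsage is made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdUsage {
    /**
     * Returns {open, max} file descriptor counts for this JVM process,
     * or {-1, -1} when the JVM does not expose them (e.g. non-Unix platforms).
     */
    public static long[] descriptorCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // Assumption: HotSpot/OpenJDK on Unix, where this extended bean is available.
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return new long[] { -1L, -1L };
    }

    public static void main(String[] args) {
        long[] counts = descriptorCounts();
        System.out.println("open file descriptors: " + counts[0] + " / limit: " + counts[1]);
    }
}
```

Logging these two numbers just before the job is submitted, and again when the error appears, would show whether the process is actually exhausting its own limit or running under a lower limit than expected.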
Any help is highly appreciated.
Thanks in advance.
-Prateek