too many open file handles with cascading cassandra and EMR


Prateek Gupta

Oct 6, 2013, 3:45:20 PM
to cascadi...@googlegroups.com
 All,
      I have a Cascading job that reads from a Cassandra 2.0.1 CQL table, using the cascading-cassandra tap (https://github.com/ifesdjeen/cascading-cassandra).

I get the following error on EMR:

2013-10-06 19:09:36,136 ERROR org.apache.thrift.transport.TSocket (pool-14-thread-421): Could not configure socket.
java.net.SocketException: Too many open files
        at java.net.Socket.createImpl(Socket.java:447)
        at java.net.Socket.getImpl(Socket.java:510)
        at java.net.Socket.setSoLinger(Socket.java:984)
        at org.apache.thrift.transport.TSocket.initSocket(TSocket.java:116)
        at org.apache.thrift.transport.TSocket.<init>(TSocket.java:107)
        at org.apache.thrift.transport.TSocket.<init>(TSocket.java:92)
        at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:40)
        at org.apache.cassandra.hadoop.ConfigHelper.createConnection(ConfigHelper.java:560)
        at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSubSplits(AbstractColumnFamilyInputFormat.java:272)
        at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.access$200(AbstractColumnFamilyInputFormat.java:62)
        at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat$SplitCallable.call(AbstractColumnFamilyInputFormat.java:222)
        at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat$SplitCallable.call(AbstractColumnFamilyInputFormat.java:207)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
2013-10-06 19:09:36,139 ERROR org.apache.thrift.transport.TSocket (pool-14-thread-421): Could not configure socket.
java.net.SocketException: Too many open files
        at java.net.Socket.createImpl(Socket.java:447)


The default file limit (ulimit -n) for Hadoop on EMR is 32768, which already seems like a high limit. The question is: why is the Cascading tap creating so many sockets, and how do I reduce the number of socket connections from EMR to the Cassandra cluster?

Any help is highly appreciated.

Thanks in advance.

-Prateek

Ken Krugler

Oct 6, 2013, 6:58:37 PM
to cascadi...@googlegroups.com
So you're sure it's the tap that's opening up lots of files? How did you confirm that?
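One quick way to check: log the open-descriptor count of the process that is computing the splits (the one whose log shows the TSocket errors) while the job is being planned/submitted. A minimal sketch, assuming an Oracle/OpenJDK JVM on Unix; FdCheck and logOpenFds are just illustrative names, not anything from Cascading or Cassandra:

import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

// Illustrative helper: prints how many file descriptors this JVM currently
// has open versus its per-process limit. Calling it periodically while the
// flow is being planned/submitted shows whether split calculation is what
// is eating descriptors.
public class FdCheck {
    public static void logOpenFds() {
        Object os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount()
                    + " / max: " + unix.getMaxFileDescriptorCount());
        }
    }

    public static void main(String[] args) {
        logOpenFds();
    }
}

If the count climbs into the tens of thousands during split calculation, then the tap (or rather the Cassandra input format underneath it) is the culprit.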

-- Ken


--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Manojit Sarkar

Oct 23, 2013, 10:11:59 PM
to cascadi...@googlegroups.com
I am seeing a similar issue.
On stderr I see the exceptions below; in the (EMR) syslog I see many occurrences of the exception reported by the original poster.
I checked the number of open files allowed on the Cassandra nodes. It is set to 200,000, which should be more than adequate.

Exception in thread "main" cascading.cascade.CascadeException: flow failed: br_data_etl_product_flow
    at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:1059)
    at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:998)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: cascading.flow.FlowException: unhandled exception
    at cascading.flow.BaseFlow.complete(BaseFlow.java:825)
    at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:1050)
    ... 5 more
Caused by: java.io.IOException: Could not get input splits
    at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSplits(AbstractColumnFamilyInputFormat.java:189)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSplits(AbstractColumnFamilyInputFormat.java:340)
    at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:199)
    at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:161)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1044)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1036)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:952)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:905)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:905)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:879)
    at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:104)
    at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:191)
    at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:145)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:120)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
    ... 4 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: failed connecting to all endpoints 10.179.26.169,10.42.103.228,10.214.249.178,10.123.74.248,10.4.197.53,10.165.12.88
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:188)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSplits(AbstractColumnFamilyInputFormat.java:185)
    ... 22 more
Caused by: java.io.IOException: failed connecting to all endpoints 10.179.26.169,10.42.103.228,10.214.249.178,10.123.74.248,10.4.197.53,10.165.12.88
    at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSubSplits(AbstractColumnFamilyInputFormat.java:303)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.access$200(AbstractColumnFamilyInputFormat.java:62)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat$SplitCallable.call(AbstractColumnFamilyInputFormat.java:222)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat$SplitCallable.call(AbstractColumnFamilyInputFormat.java:207)
    ... 4 more

Ken Krugler

Oct 23, 2013, 10:59:26 PM
to cascadi...@googlegroups.com
Hi Manojit,

I'm not sure what version of Cassandra is being used on the Hadoop side, as line numbers in the stack traces aren't matching up with 2.0.1 source.

But one thing I see in Cassandra's AbstractColumnFamilyInputFormat is that it creates threaded requests to calculate input splits.

And the number of simultaneous requests appears unbounded, based on whatever is returned by describe_ring & how it intersects the target range for the job.
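To make that concrete, the pattern is roughly the sketch below. This is a simplified illustration, not the actual Cassandra source; describeRing() and fetchSubSplits() are placeholder names. The point is that one callable is submitted per token range to an unbounded cached pool, and each callable opens its own Thrift socket, so with vnodes the number of simultaneous sockets scales with num_tokens * nodes rather than with the node count (which would line up with thread names like pool-14-thread-421 in the first stack trace).

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simplified sketch of the fan-out described above; describeRing() and
// fetchSubSplits() stand in for the real describe_ring / sub-split calls.
public class SplitFanOutSketch {

    static List<String> describeRing() {
        // Placeholder: with vnodes this can return num_tokens * node_count
        // ranges (e.g. 256 * 6 = 1536).
        return new ArrayList<String>();
    }

    static List<String> fetchSubSplits(String tokenRange) {
        // Placeholder: in the real code this opens a Thrift connection to one
        // of the range's replicas, which is where "Too many open files" and
        // "failed connecting to all endpoints" show up.
        return new ArrayList<String>();
    }

    public static void main(String[] args) throws Exception {
        // A cached thread pool has no upper bound, so while the callables are
        // blocked on I/O there is roughly one thread (and one socket) per range.
        ExecutorService executor = Executors.newCachedThreadPool();
        List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();

        for (final String range : describeRing()) {
            futures.add(executor.submit(new Callable<List<String>>() {
                public List<String> call() throws Exception {
                    return fetchSubSplits(range);
                }
            }));
        }

        List<String> splits = new ArrayList<String>();
        for (Future<List<String>> f : futures) {
            // A failure in any callable surfaces here, which is what
            // "Could not get input splits" wraps in the trace above.
            splits.addAll(f.get());
        }
        executor.shutdown();
        System.out.println("splits: " + splits.size());
    }
}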

Do you happen to have vnodes enabled on your cluster?

-- Ken

Ken Krugler

Oct 23, 2013, 11:02:48 PM
to cascadi...@googlegroups.com
And if your Cassandra cluster does have vnodes enabled, then you're probably running into this bug:


-- Ken


Ken Krugler

Oct 23, 2013, 11:31:46 PM
to cascadi...@googlegroups.com
Looks like there's been a recent fix for this same issue - see https://issues.apache.org/jira/browse/CASSANDRA-6169

So you could build Cassandra 1.2.11, then change the dependency in https://github.com/ifesdjeen/cascading-cassandra/blob/master/project.clj to match, and build a new version.
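For reference, the dependency change would look something like this in project.clj. This is a hypothetical, trimmed-down excerpt, not the actual file (the real one has its own project name/version and more entries), and it assumes the Cassandra client classes come in through the cassandra-all artifact; adjust to whatever the file actually pins:

;; Hypothetical, trimmed-down project.clj -- the real file at
;; ifesdjeen/cascading-cassandra has more dependencies and settings.
(defproject cascading-cassandra "x.y.z-SNAPSHOT"
  :dependencies [;; point this at your locally built/installed 1.2.11 jars
                 [org.apache.cassandra/cassandra-all "1.2.11"]])

After that, rebuild the tap (e.g. with lein install or lein uberjar) and put the resulting jar on the job's classpath in place of the released version.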

-- Ken


