Hi,
I am testing with a table like the following. The last column is an id
attribute. The first row is "a1,b1,c1,d1", and all other rows are
"a1,b1,c2,d1", each with a unique id.
------------------
a1,b1,c1,d1,1
a1,b1,c2,d1,2
a1,b1,c2,d1,3
a1,b1,c2,d1,4
a1,b1,c2,d1,5
a1,b1,c2,d1,6
... ... ... ...
------------------
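For anyone who wants to reproduce this, a file of the shape described above can be generated with a short sketch like the one below. The class and method names (TestTableGenerator, generateRows) are made up for illustration and are not part of my job; it only assumes the row layout shown in the table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original job: builds the test rows described above.
final class TestTableGenerator {

    // Row 1 is a1,b1,c1,d1 and every later row is a1,b1,c2,d1.
    // The last column is a unique id that starts at 1.
    static List<String> generateRows(int rowCount) {
        List<String> rows = new ArrayList<>();
        for (int id = 1; id <= rowCount; id++) {
            String c = (id == 1) ? "c1" : "c2";
            rows.add("a1,b1," + c + ",d1," + id);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Print a six-row sample; redirect to a file and raise the count for larger inputs.
        for (String row : generateRows(6)) {
            System.out.println(row);
        }
    }
}
```

Calling generateRows(100) or generateRows(1000) gives inputs with the same row counts as the two files mentioned below.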
I wrote some simple code as follows:
------------------------------------------------------------------
Properties properties = new Properties();
FlowConnector.setApplicationJarClass(properties, Main.class);
FlowConnector flowConnector = new FlowConnector(properties);

// Source: comma-delimited text file, no header row
Scheme sourceScheme = new TextDelimited(fields_cols, false, ",");
Tap source = new Hfs(sourceScheme, g_input_file);

Pipe assembly = new Pipe("CFD-violation");
// Filter tuples against the given CFD
assembly = new Each(assembly, fields_cfd, new FilterCFD(cfd_id, g_cfd_file));
// Group by the CFD left-hand-side fields, then aggregate each group
assembly = new GroupBy(assembly, fields_cfd_lhs);
assembly = new Every(assembly, fields_cfd, new AggreViolations(cfd_id, g_cfd_file), Fields.ALL);

// Sink: comma-delimited text, replacing any existing output
Scheme sinkScheme = new TextDelimited(Fields.ALL, false, ",");
Tap sink = new Hfs(sinkScheme, output_path, SinkMode.REPLACE);

Flow flow = flowConnector.connect("cfd-detect", source, sink, assembly);
flow.writeDOT("myCascade.dot");
flow.complete();
------------------------------------------------------------------
It works fine when the file is about 1.5K (100 rows). However, when I
generated a 15.5K file (1000 rows), the following errors occurred.
Please let me know why this happens.
------------------------------------------------------------------
11/04/19 20:17:12 FATAL conf.Configuration: error parsing conf file: java.io.FileNotFoundException: /disk/scratch/workspace/.metadata/.plugins/org.apache.hadoop.eclipse/hadoop-conf-5567698613354107885/core-site.xml (Too many open files)
11/04/19 20:17:12 WARN mapred.LocalJobRunner: job_local_0001
cascading.pipe.OperatorException: [CFD-violation]
[edinburgh.datacleaning.Main.CFD_Violations(Main.java:171)] operator Each failed executing operation
at cascading.pipe.Each$EachHandler.operate(Each.java:486)
at cascading.flow.stack.EachMapperStackElement.operateEach(EachMapperStackElement.java:94)
at cascading.flow.stack.EachMapperStackElement.collect(EachMapperStackElement.java:82)
at cascading.flow.stack.FlowMapperStack.map(FlowMapperStack.java:220)
at cascading.flow.FlowMapper.map(FlowMapper.java:75)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: /disk/scratch/workspace/.metadata/.plugins/org.apache.hadoop.eclipse/hadoop-conf-5567698613354107885/core-site.xml (Too many open files)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1162)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1030)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:980)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:436)
at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
at edinburgh.datacleaning.CFDHandler.loadCFDs(CFDHandler.java:18)
at edinburgh.datacleaning.FilterCFD.isRemove(FilterCFD.java:26)
at cascading.pipe.Each.applyFilter(Each.java:372)
at cascading.pipe.Each.access$300(Each.java:53)
at cascading.pipe.Each$EachFilterHandler.handle(Each.java:558)
at cascading.pipe.Each$EachHandler.operate(Each.java:478)
... 8 more
Caused by: java.io.FileNotFoundException: /disk/scratch/workspace/.metadata/.plugins/org.apache.hadoop.eclipse/hadoop-conf-5567698613354107885/core-site.xml (Too many open files)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:106)
at java.io.FileInputStream.<init>(FileInputStream.java:66)
at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70)
at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:653)
at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:772)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:235)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1079)
... 19 more
11/04/19 20:17:13 INFO mapred.LocalJobRunner: hdfs://hcrc1425n30.inf.ed.ac.uk/user/ntang/dcinput/1g.csv:0+15892
11/04/19 20:17:16 WARN flow.FlowStep: [cfd-detect] task completion events identify failed tasks
11/04/19 20:17:16 WARN flow.FlowStep: [cfd-detect] task completion events count: 0
11/04/19 20:17:16 WARN flow.Flow: stopping jobs
11/04/19 20:17:16 INFO flow.FlowStep: [cfd-detect] stopping: (1/1) Hfs["TextDelimited[[UNKNOWN]->[ALL]]"]["/user/ntang/output/CFD1"]"]
11/04/19 20:17:16 WARN flow.Flow: stopped jobs
11/04/19 20:17:16 WARN flow.Flow: shutting down job executor
11/04/19 20:17:16 WARN flow.Flow: shutdown complete
11/04/19 20:17:16 INFO hadoop.Hadoop18TapUtil: deleting temp path /user/ntang/output/CFD1/_temporary
Exception in thread "main" cascading.flow.FlowException: step failed: (1/1) Hfs["TextDelimited[[UNKNOWN]->[ALL]]"]["/user/ntang/output/CFD1"]"], with job id: job_local_0001, please see cluster logs for failure messages
at cascading.flow.FlowStepJob.blockOnJob(FlowStepJob.java:173)
at cascading.flow.FlowStepJob.start(FlowStepJob.java:138)
at cascading.flow.FlowStepJob.call(FlowStepJob.java:127)
at cascading.flow.FlowStepJob.call(FlowStepJob.java:39)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
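
One possible reading of the trace, not confirmed: CFDHandler.loadCFDs is reached from FilterCFD.isRemove, which runs once per tuple, so the CFD file may be opened again for every row without the handle being released. That would fit 100 rows passing and 1000 rows hitting the per-process open-file limit. Below is a minimal sketch of loading the file once and caching the result. It uses plain java.nio as a stand-in for the Hadoop FileSystem calls and needs Java 7 or later for try-with-resources. CachedCfdLoader, getCfds and getLoadCount are hypothetical names, not the real code behind the trace.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: read the CFD file once, close it, and reuse the parsed lines.
final class CachedCfdLoader {
    private final String cfdFile;
    private List<String> cachedLines; // null until the first successful load
    private int loadCount = 0;        // how many times the file was actually opened

    CachedCfdLoader(String cfdFile) {
        this.cfdFile = cfdFile;
    }

    // Safe to call once per tuple: only the first call touches the file system.
    synchronized List<String> getCfds() throws IOException {
        if (cachedLines == null) {
            List<String> lines = new ArrayList<>();
            // try-with-resources closes the reader even if reading fails,
            // so no file handle is left open.
            try (BufferedReader reader =
                     Files.newBufferedReader(Paths.get(cfdFile), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            cachedLines = lines;
            loadCount++;
        }
        return cachedLines;
    }

    synchronized int getLoadCount() {
        return loadCount;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CachedCfdLoader <path-to-cfd-file>
        CachedCfdLoader loader = new CachedCfdLoader(args[0]);
        for (int i = 0; i < 1000; i++) {
            loader.getCfds(); // simulates one call per tuple
        }
        System.out.println("file opened " + loader.getLoadCount() + " time(s)");
    }
}
```

If this reading is right, holding one such loader as a field of the filter and calling getCfds() from isRemove would keep the number of open files constant regardless of the row count.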