Not sure how much this has to do with Cascalog per se, but I have a
really confounding issue and maybe someone can help. I have a job that
is failing, and the stack trace in the job logs looks like this:
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: views.visit-facts, compiling:(views/visit_facts.clj:1)
    at clojure.lang.Compiler.analyze(Compiler.java:6235)
    at clojure.lang.Compiler.analyze(Compiler.java:6177)
    ...
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: views.visit-facts
    at clojure.lang.Util.runtimeException(Util.java:165)
    at clojure.lang.RT.classForName(RT.java:2017)
    ...
Caused by: java.lang.ClassNotFoundException: views.visit-facts
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    ...
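For reference, here is a rough sketch of how the namespace name in the exception (views.visit-facts) relates to the file path in the trace (views/visit_facts.clj), and how one could check a local copy of a jar for those entries. It assumes the usual Clojure convention that hyphens in a namespace become underscores and dots become path separators on the classpath. The helper names ns_to_paths and find_ns_in_jar are made up for illustration and are not part of Clojure, Cascalog, or Hadoop.

```python
import zipfile


def ns_to_paths(ns):
    """Map a Clojure namespace name to the classpath entries a require looks for.

    Assumption: hyphens become underscores and dots become '/'.
    """
    base = ns.replace("-", "_").replace(".", "/")
    # Source file first, then the AOT-compiled loader class.
    return [base + ".clj", base + "__init.class"]


def find_ns_in_jar(jar_path, ns):
    """Return the candidate entries for ns that are actually present in the jar."""
    with zipfile.ZipFile(jar_path) as jar:
        names = set(jar.namelist())
    return [path for path in ns_to_paths(ns) if path in names]


if __name__ == "__main__":
    print(ns_to_paths("views.visit-facts"))
    # prints ['views/visit_facts.clj', 'views/visit_facts__init.class']
```

Calling find_ns_in_jar on a local copy of the job jar with "views.visit-facts" returns an empty list if neither entry was packaged, which would point at the build rather than at HDFS.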
I suspect that the job jar was not being replicated correctly, and
looking at the daemon logs I see that the namenode has errors
replicating jobtracker.info:
INFO org.apache.hadoop.ipc.Server (IPC Server handler 6 on 9000): IPC Server handler 6 on 9000, call addBlock(/mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info, DFSClient_1731950709) from 10.194.15.165:51308: error: java.io.IOException: File /mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
On the datanode side I see errors receiving the job jar.
The namenode log says:
2012-04-28 19:05:55,582 INFO org.apache.hadoop.hdfs.StateChange (IPC Server handler 11 on 9000): DIR* NameSystem.completeFile: file /mnt/var/lib/hadoop/tmp/mapred/system/job_201204281904_0001/job.jar is closed by DFSClient_-387163361
The datanode log says:
INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace (PacketResponder 0 for Block blk_832986479003239919_1004): src: /10.195.89.124:53001, dest: /10.76.91.41:9200, bytes: 39825229, op: HDFS_WRITE, cliID: DFSClient_-387163361, srvID: DS-304531098-10.76.91.41-9200-1335639918679, blockid: blk_832986479003239919_1004
WARN org.apache.hadoop.hdfs.server.datanode.DataNode (org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer@1a5d08): DatanodeRegistration(10.76.91.41:9200, storageID=DS-304531098-10.76.91.41-9200-1335639918679, infoPort=9102, ipcPort=9201):Failed to transfer blk_138586677137070325_1009 to 10.37.67.149:9200 got java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
My first gut reaction was that maybe the jar is too big and too hard to
replicate. However, the oddest part of all this is that
(1) other scripts in this jar work fine; I don't see the replication
problems with jobtracker.info or the job.jar
(2) portions of the visit-facts script work fine as well; the
subqueries it depends on run without the above issues
So it seems that something specific to this script is affecting how
Hadoop replicates its jobtracker.info and job.jar, which does not make
a whole lot of sense to me.
I am running this on AWS EMR and get the same problem with Hadoop
versions 0.20 and 0.20.205.
Any insight or guesses on this issue are welcome.