Message from discussion
Unable to get anything to work in Scoobi 0.5.0 - class loader problem
Date: Thu, 2 Aug 2012 16:09:32 -0700 (PDT)
From: Ben Wing <b...@benwing.com>
To: scoobi-users@googlegroups.com
Message-Id: <fe141743-e0f8-4e12-804e-2adbf93b2fb3@googlegroups.com>
In-Reply-To: <e71eafaf-1632-4d18-8980-5ecee34df0d4@googlegroups.com>
References: <9e6a775c-866b-4b06-9f4b-e60106442503@googlegroups.com>
<CAHD7LM32EA7xnA6OxPQLJ0fSBQX5LT4v2xSVfxpQTESHd42j_A@mail.gmail.com>
<d2f18db3-0e07-4de4-9f32-860b9248db0e@googlegroups.com> <a772b62a-1820-4241-8bba-136722052201@googlegroups.com>
<8c2cf7cf-9e9d-43c3-8adc-ea06e0873252@googlegroups.com>
<CAHD7LM3MNV7BRRQt4oywen9fC1RXdssygK+42Kah9fPAALeiHQ@mail.gmail.com>
<3b6397f9-630c-4657-9626-615a83194acd@googlegroups.com>
<1b433342-b33b-46c8-95f0-10b860393955@googlegroups.com>
<e1017ac3-dbd1-438f-b7ff-38159a3a53e3@googlegroups.com>
<e71eafaf-1632-4d18-8980-5ecee34df0d4@googlegroups.com>
Subject: Re: Unable to get anything to work in Scoobi 0.5.0 - class loader
problem
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_Part_214_27434367.1343948972662"
------=_Part_214_27434367.1343948972662
Content-Type: multipart/alternative;
boundary="----=_Part_215_17957847.1343948972662"
------=_Part_215_17957847.1343948972662
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
On Wednesday, August 1, 2012 2:29:42 AM UTC-4, Eric Torreborre wrote:
>
> Hi Alex,
>
> > I create jar file of my application with "sbt package-hadoop" on one
> machine and then copy it on the gateway machine
>
> This is indeed probably the issue. You can then override the "jars" method
> to get the correct jar URLs from either a configuration file, or from the
> URLClassLoader if it happens to reference them.
>
> Just replace the existing filter: (Seq(".ivy2",
> ".m2").exists(url.getFile.contains) with something more relevant (at line
> com.nicta.scoobi.application.LibJars #32).
>
> Eric T.
>
>
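[For reference, a minimal sketch of the kind of filter Eric describes above. This is not Scoobi's actual code: the object name JarFilter and both method names are hypothetical, and only java.net classes are used. It collects the URLs from any URLClassLoader in the loader chain and keeps the jars whose path contains one of a list of markers, so the hard-coded ".ivy2"/".m2" pair can be swapped for directories that actually exist on the gateway machine.]

```scala
import java.net.{URL, URLClassLoader}

// Hypothetical helper, not part of Scoobi.
object JarFilter {

  // Walk up the class loader chain and collect the URLs of every
  // URLClassLoader found. Loaders that are not URLClassLoaders
  // (e.g. the application loader on Java 9+) contribute nothing.
  def classLoaderUrls(loader: ClassLoader): Seq[URL] = loader match {
    case null               => Seq.empty
    case u: URLClassLoader  => u.getURLs.toSeq ++ classLoaderUrls(u.getParent)
    case other              => classLoaderUrls(other.getParent)
  }

  // Keep only the .jar URLs whose path contains at least one marker,
  // e.g. Seq(".ivy2", ".m2") or a site-specific library directory.
  def selectJars(urls: Seq[URL], markers: Seq[String]): Seq[URL] =
    urls.filter { url =>
      val path = url.getFile
      path.endsWith(".jar") && markers.exists(m => path.contains(m))
    }
}
```

Something like JarFilter.selectJars(JarFilter.classLoaderUrls(getClass.getClassLoader), Seq("/opt/myapp/lib")) would then pick up jars from that (assumed) directory. How the result is wired into an override of the "jars" method depends on the Scoobi version in use.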
Eric, I have access to two very different Hadoop configurations.
The one I was using before to test Scoobi 0.5 is a fairly small cluster
with a long-term persistent HDFS file system, as well as a single job
tracker, a single name node, and 16 task nodes. I only have ssh access to
the job tracker, and AFAIK the other machines are firewalled from the
Internet and do not have access to my home directory on the job tracker --
i.e. the only shared file system is HDFS. I compile and launch the
application from the job tracker. The version of the 'hadoop' client
executable is Cloudera 0.20.2 cdh3u3; I'm not sure what the version of HDFS
or the servers is but I would guess the same or similar.
I don't understand what Alex's issue is, but I have to ask -- why did this
work before, and why doesn't it work now? I thought the whole point of
building an assembly/combined jar was precisely to include *all* the
necessary libraries in it. Why does Scoobi 0.5 screw around with trying to
upload libraries itself rather than relying on what's in the assembly, as
Scoobi 0.4 did?
Now, the other configuration is completely different. This system has an
enormous number (in the hundreds) of 8-core compute servers, managed by a
Sun Grid Engine, where you submit jobs with qsub. 48 of these are set
aside for Hadoop usage, but they don't form a normal Hadoop cluster; all
they really have is an extra local 2 TB disk mounted on
/hadoop. Instead, what I do is ssh to a login node and then request some
subset of compute servers (e.g. 8 nodes) for some amount of time (you get
exclusive use of the servers you request, but for a maximum of only 24
hours!!) using an appropriate qsub script. This is just a shell script
with some extra directives in it telling qsub how many machines you're
asking for, which type, for how long, etc., which gets run as soon as your
requested resources are available. It proceeds to set up one of the nodes
as a combined job tracker/name node and all the rest as task nodes, and
format a new HDFS using all the disks in /hadoop, storing the configuration
info in a subdirectory of my home directory. Then, I ssh into the job
tracker, copy my data into HDFS, and run my Hadoop tasks -- for a maximum
of 24 hours, which is all you get at a time. In this setup, my home
directory as well as a series of ginormous 1000+ TB Lustre partitions are
all available on all of the compute servers (as well as the login server),
and I can freely ssh into all the compute servers that I've requested and
have been given control over, and all of them can connect directly to the
Internet.
This second system uses an installation of Hadoop Cloudera 0.20.2 cdh3u2
sitting in my home dir. The same version is used both to start HDFS and
the various servers and as the client 'hadoop' executable. It took a good
deal of dicking around with the configuration and qsub script and such to
get it working, so I'd rather not touch it, although conceivably I could
update to a newer version, since (as mentioned above) I start a new HDFS
each time.
After Alex's and your comment, I tried getting things working on this
system, since my home dir is mounted on all the machines, so whatever issue
there is with accessing the .ivy2 and .m2 dirs, it shouldn't exist here.
Unfortunately:
(1) I get a warning on some of my code when compiling:
[warn] /home/01683/benwing/devel/poligrounder/src/main/scala/opennlp/textgrounder/util/hadoop.scala:69: method isDir in class FileStatus is deprecated: see corresponding Javadoc for more information.
[warn] get_file_system(filename).getFileStatus(new Path(filename)).isDir
[warn] ^
[warn] one warning found
This is not problematic except that it indicates that everything is being
compiled against Hadoop 0.21 or later, which doesn't sound good, and indeed:
(2) I immediately get an error when running, apparently due to an
incompatibility between 0.20 and 0.21:
Exception in thread "main" java.io.IOException: Input path input-naleo-jun-21 does not exist.
    at com.nicta.scoobi.io.text.TextInput$TextSource$$anonfun$inputCheck$1.apply(TextInput.scala:140)
    at com.nicta.scoobi.io.text.TextInput$TextSource$$anonfun$inputCheck$1.apply(TextInput.scala:136)
    at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
    at scala.collection.immutable.List.foreach(List.scala:76)
    at com.nicta.scoobi.io.text.TextInput$TextSource.inputCheck(TextInput.scala:136)
    at com.nicta.scoobi.impl.exec.Executor$$anonfun$prepare$4.apply(Executor.scala:79)
    at com.nicta.scoobi.impl.exec.Executor$$anonfun$prepare$4.apply(Executor.scala:79)
    at scala.collection.immutable.Set$Set1.foreach(Set.scala:86)
    at com.nicta.scoobi.impl.exec.Executor$.prepare(Executor.scala:79)
    at com.nicta.scoobi.application.Persister$.com$nicta$scoobi$application$Persister$$createPlan(Persister.scala:259)
...
So this leads to some more questions:
Am I going to have to upgrade to Hadoop 0.21 or later just to run Scoobi
0.5? Besides all the hassle involved, this seems like a bad idea, because
the Hadoop 0.21 series, whose latest release is Hadoop 2.0.0-alpha, is ...
well, alpha software. In general, why is Scoobi tracking the bleeding
edge like this? I understand that eventually we will need to upgrade, but
it seems premature in this case, particularly since there appear to be
significant backward-compatibility issues, and Hadoop 2 still is far from
being released in stable form.
Overall, I still don't understand the whole story behind Hadoop
configuration and such, but I wonder, why was it necessary to switch away
from just building and running a big assembly, and why was it necessary to
move to Hadoop 0.21?
Thanks,
ben
------=_Part_215_17957847.1343948972662--
------=_Part_214_27434367.1343948972662--