Issue with running HelloWorld

Eric Lange

Aug 20, 2012, 6:45:13 PM

to common...@googlegroups.com

Hi,

I am trying to get HelloWorld up and running and am beating my head against the wall. It works fine when I submit the job using Amazon EMR; however, I simply cannot get it to work locally. I am basing my custom job on HelloWorld, and it is a nightmare to debug on EC2, as it takes several minutes to upload the JAR to S3 and then wait for the process to spin up. I would very much like to run it locally, as the README file suggests should be possible.

The problem seems to be an intricate ballet between the commoncrawl, jets3t and hadoop jars. If I try to use the jets3t jar bundled with Hadoop 1.0.3 (version 0.6.1), it fails thusly:

Exception in thread "main" java.lang.VerifyError: (class: org/commoncrawl/hadoop/io/JetS3tARCSource, method: configureImpl signature: (Lorg/apache/hadoop/mapred/JobConf;)V) Incompatible argument to function
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:891)
    at org.commoncrawl.hadoop.io.ARCInputFormat.configure(ARCInputFormat.java:159)
    ...

If I instead remove that jar and use version 0.8.1 (the version included as part of HelloWorld), I get much further, but it still fails like this:

Exception in thread "main" java.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)V
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.fs.s3native.$Proxy1.initialize(Unknown Source)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:110)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
    at org.commoncrawl.tutorial.HelloWorld.main(HelloWorld.java:134)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

In the first case (using 0.6.1), it seems that Hadoop is happy with the jets3t version, but commoncrawl is not.

In the second case (using 0.8.1), commoncrawl is happy with the jets3t version, but Hadoop is not.
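
When two components disagree about a library version like this, it can help to confirm which jar each class is actually loaded from. The following is a minimal sketch, not part of HelloWorld or Hadoop: the class name WhichJar and the method locate are hypothetical, and it assumes the program is started with the same classpath the local job uses. It loads each class without initializing it and reports its code source.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where a class was loaded from.
final class WhichJar {

    // Returns the jar or directory URL that provides className, or a short
    // marker string when the class is missing or comes from the JDK itself.
    static String locate(String className) {
        try {
            // initialize=false: load the class but do not run static initializers.
            Class<?> c = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource source = c.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "(bootstrap class path)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not found)";
        } catch (LinkageError e) {
            // Covers verification or linkage failures raised while loading.
            return "(failed to link: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // Pass fully qualified class names as arguments.
        for (String name : args) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Passing org.jets3t.service.impl.rest.httpclient.RestS3Service and org.commoncrawl.hadoop.io.JetS3tARCSource as arguments would show which jar supplies each class on that classpath.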

Am I missing something? Can anyone get this example to run locally?

FYI, I am running Hadoop 1.0.3 on OS X Lion. I also tried Ubuntu 12.04 and am seeing similar problems there.

I also tried all versions in between 0.6.1 and 0.9.0 of jets3t.

Any help is greatly appreciated.

Thanks,

Eric

Ahad Rana

Aug 20, 2012, 7:27:27 PM

to common...@googlegroups.com

Hi Eric,

Sorry to see that you are having these problems. Unfortunately, the JetS3t version we use needs to track the version used in Hadoop, primarily because the Hadoop library path gets classpath precedence in the local job scenario. We are going to try to fix this ASAP.
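
The NoSuchMethodError earlier in the thread names the exact constructor that Hadoop's Jets3tNativeFileSystemStore links against: RestS3Service.<init>(AWSCredentials). One way to check whether the JetS3t jar on a given classpath still offers that signature is a reflection probe. This is a minimal sketch, not part of Common Crawl or Hadoop: ConstructorProbe and probe are hypothetical names, and it assumes the program runs with the same classpath as the local job.

```java
// Hypothetical diagnostic helper: checks for a public constructor with an
// exact parameter list, which mirrors how the JVM resolves a constructor
// by its descriptor at link time.
final class ConstructorProbe {

    // Returns "present", "missing", or a message naming the class that
    // could not be loaded.
    static String probe(String className, String... paramTypeNames) {
        ClassLoader loader = ConstructorProbe.class.getClassLoader();
        try {
            Class<?> target = Class.forName(className, false, loader);
            Class<?>[] params = new Class<?>[paramTypeNames.length];
            for (int i = 0; i < paramTypeNames.length; i++) {
                params[i] = Class.forName(paramTypeNames[i], false, loader);
            }
            // getConstructor requires an exact match on parameter types.
            target.getConstructor(params);
            return "present";
        } catch (ClassNotFoundException e) {
            return "class not found: " + e.getMessage();
        } catch (NoSuchMethodException e) {
            return "missing";
        } catch (LinkageError e) {
            return "failed to link: " + e;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace in the thread.
        System.out.println(probe(
            "org.jets3t.service.impl.rest.httpclient.RestS3Service",
            "org.jets3t.service.security.AWSCredentials"));
    }
}
```

A result of "missing" for a particular jar would be consistent with the NoSuchMethodError that Hadoop reports when that jar is on the classpath.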

Ahad.

Eric Lange

Aug 20, 2012, 7:33:49 PM

to common...@googlegroups.com

Thanks for the prompt response, Ahad. I am glad to see that this will be fixed soon. One question though: why does this work on EMR? It seems like it should fail there too. Is Amazon using a version of Hadoop that tracks JetS3t 0.8.1?

Chris Stephens

Aug 20, 2012, 8:41:17 PM

to common...@googlegroups.com

Hi Eric,

Have you taken a look at the Common Crawl Examples project:

http://github.com/commoncrawl/commoncrawl-examples

or the Common Crawl Quick Start AMI:

http://commoncrawl.org/get-started

The Common Crawl Examples project uses a new ARC reader that doesn't directly depend on S3.

- Chris

Eric Lange

Aug 21, 2012, 12:24:14 AM

to common...@googlegroups.com

Hi Chris,

Thanks for the reply. I have looked at the examples project, but I was pretty excited about being able to run it on my local Hadoop cluster for quick prototyping and debugging before having to put it up on EC2. Is the only option to run it from the AMI? It's fine if that's the case, but I wanted to know if there was an option to run it locally. Can I copy a handful of archive files to HDFS locally and run the examples? Or does it depend on the EC2 setup?

Thanks,

Eric

Eric Lange

Aug 21, 2012, 12:26:23 AM

to common...@googlegroups.com

Oh, I responded too soon. I read the readme more carefully. I will try the local option out. Thanks again.

Chris Stephens

Aug 21, 2012, 1:18:30 AM

to common...@googlegroups.com

I have run the examples locally on Hadoop 1.0.3. I think the build script assumes Hadoop jars are under "/usr/share/hadoop", but I think that is the only environment-specific dependency. I've built and run mostly using OpenJDK 1.6.

If you have any trouble building and running in your environment, please let us know! This project is not supposed to be environment dependent.

- Chris

Chris Stephens

Aug 21, 2012, 1:28:33 AM

to common...@googlegroups.com

One more thing: the "bin/ccRunExample" script is written for the Bash shell. The variable at the top of the script that specifies the local HDFS may need to be modified to match your environment.

- Chris

Eric Lange

Aug 21, 2012, 5:47:08 PM

to common...@googlegroups.com

Sheer awesomeness, thanks! The examples are running on my Mac after tweaking configs a bit.

I'm sure you're probably aware of this, but I installed Cloudera CDH4 on my Linux machine and it does not work there. It is using Hadoop 2.0.0, and the build libraries are in a different place and have different names. I was able to sort-of get it working by adding the include line below, just before the closing fileset tag, to build.xml:

    <include name="hadoop-*.jar"/>
</fileset>

However, it is still failing with a number of exceptions like this:

12/08/21 14:44:54 ERROR examples.ExampleMetadataDomainPageCount: Caught Exception
java.net.URISyntaxException: Illegal character in path at index 92: http://computers.pricegrabber.com/laptop/ethernet-200:230/p/13/form_keyword=Ethernet/popup61[]=200:230/retid[]=50
    at java.net.URI$Parser.fail(URI.java:2810)
    at java.net.URI$Parser.checkChars(URI.java:2983)
    at java.net.URI$Parser.parseHierarchical(URI.java:3067)
    at java.net.URI$Parser.parse(URI.java:3015)
    at java.net.URI.<init>(URI.java:577)
    at org.commoncrawl.examples.ExampleMetadataDomainPageCount$ExampleMetadataDomainPageCountMapper.map(ExampleMetadataDomainPageCount.java:83)
    at org.commoncrawl.examples.ExampleMetadataDomainPageCount$ExampleMetadataDomainPageCountMapper.map(ExampleMetadataDomainPageCount.java:65)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:393)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:263)

But this seems to work just fine on my Mac running Hadoop 1.0.3. In any event, I will stick with 1.0.3 for the time being.

Thanks again!

-Eric

Chris Stephens

Aug 24, 2012, 2:04:26 PM

to common...@googlegroups.com

Hi Eric,

Thank you for the update on how the examples work under CDH4.

The exception you posted looks like it is thrown from the standard Java runtime libraries (the "java.net.URI" class), not the Hadoop libraries. Maybe this is a Java 7 vs. Java 6 issue?
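
In the URL from the log, the character at index 92 is the "[" in "popup61[]", which the single-argument java.net.URI constructor does not accept unescaped in a path. If only the host name is needed, one option is to fall back to the more lenient java.net.URL parser when URI rejects the input. This is a minimal sketch under that assumption, not the code used in the examples project; TolerantHost and hostOf are hypothetical names.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper: extracts a host name from crawled URLs that may
// contain characters java.net.URI refuses to parse.
final class TolerantHost {

    // Returns the host, or null when neither parser can make sense of raw.
    static String hostOf(String raw) {
        try {
            // Strict parse first.
            return new URI(raw).getHost();
        } catch (URISyntaxException strict) {
            try {
                // java.net.URL does not validate path characters.
                // Note: this constructor is deprecated in recent JDKs.
                return new URL(raw).getHost();
            } catch (MalformedURLException lenient) {
                return null;
            }
        }
    }

    public static void main(String[] args) {
        String sample = "http://computers.pricegrabber.com/laptop/ethernet-200:230"
            + "/p/13/form_keyword=Ethernet/popup61[]=200:230/retid[]=50";
        System.out.println(hostOf(sample));
    }
}
```

Whether a fallback like this is appropriate depends on what the job does with the parsed value; skipping or counting unparseable records is an equally reasonable choice.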

We know lots of people will be using CDH4, so we'll try to get a CDH4 test environment set up. (We use Hadoop 0.20.205 right now because this is used by Amazon Elastic MapReduce.)

- Chris
