Trying to read arc.gz files using CommonCrawl Support Library

Laurier Rochon

Apr 18, 2014, 12:20:25 AM

to common...@googlegroups.com

Hi there,
I'm pretty comfortable with AWS, but less so with Java, jars, and Hadoop. I've been trying to get the simple example at https://github.com/commoncrawl/commoncrawl to work (by feeding an InputStream to the ARCFileReader), but I get the following error message:

$ ./bin/launcher.sh org.commoncrawl.util.shared.ARCFileReader --awsAccessKey xxxxxxxxxxxxxxx --awsSecret xxxxxxxxxxxxx --file s3n://aws-publicdatasets/common-crawl/parse-output/segment/1341690164240/1341819847375_4319.arc.gz

CCAPP_HOME:/home/ec2-user/commoncrawl/bin/..

CCAPP_CONF_DIR:/home/ec2-user/commoncrawl/bin/../conf

CCAPP_LOG_DIR:/home/ec2-user/commoncrawl/bin/../logs

Please build commoncrawl jar

JAVA_HOME:/usr/lib/jvm/java

Unable to locate hadoop core jar file. Please check your hadoop installation.

This is directly after cloning the git repo and running "ant" from the root. I've set the Hadoop path in the build.properties file (to /usr/bin/hadoop), but I'm not sure whether I missed something else. This is running on a small instance of the commoncrawl AMI.
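
Since the last message says the hadoop core jar could not be located, one thing worth checking is whether such a jar exists under the directory I configured. Below is a minimal sketch of that check. It assumes the jar follows the usual `hadoop-core-*.jar` or `hadoop-*-core.jar` naming; I have not confirmed which name or location launcher.sh actually looks for, and the function name and the default directory are hypothetical.

```python
import fnmatch
import os
import sys

# Assumed file-name patterns for the Hadoop core jar.
# These are common conventions, not taken from launcher.sh.
CORE_JAR_PATTERNS = ("hadoop-core-*.jar", "hadoop-*-core.jar")


def find_hadoop_core_jars(root):
    """Return a sorted list of files under `root` matching the assumed patterns.

    Returns an empty list when `root` is not a directory, for example when
    it points at an executable such as /usr/bin/hadoop.
    """
    if not os.path.isdir(root):
        return []
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(fnmatch.fnmatch(name, pat) for pat in CORE_JAR_PATTERNS):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical default; pass the real Hadoop install directory instead.
    root = sys.argv[1] if len(sys.argv) > 1 else "/usr/lib/hadoop"
    jars = find_hadoop_core_jars(root)
    if jars:
        print("\n".join(jars))
    else:
        print("No hadoop core jar found under " + root)
```

If this finds nothing under the configured path, the path may be pointing at the `hadoop` executable rather than at the directory that contains the jars.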