Trying to read arc.gz files using CommonCrawl Support Library

Laurier Rochon

Apr 18, 2014, 12:20:25 AM
to common...@googlegroups.com
Hi there,
I'm pretty comfortable with AWS, but not so much with Java, JARs, and Hadoop. I've been trying to get the simple example at https://github.com/commoncrawl/commoncrawl to work (by feeding an InputStream to the ARCFileReader), but I get the following error message:

$ ./bin/launcher.sh org.commoncrawl.util.shared.ARCFileReader --awsAccessKey xxxxxxxxxxxxxxx --awsSecret xxxxxxxxxxxxx --file s3n://aws-publicdatasets/common-crawl/parse-output/segment/1341690164240/1341819847375_4319.arc.gz
CCAPP_HOME:/home/ec2-user/commoncrawl/bin/..
CCAPP_CONF_DIR:/home/ec2-user/commoncrawl/bin/../conf
CCAPP_LOG_DIR:/home/ec2-user/commoncrawl/bin/../logs
Please build commoncrawl jar
JAVA_HOME:/usr/lib/jvm/java
Unable to locate hadoop core jar file. Please check your hadoop installation.


This is right after cloning the git repo and running "ant" from the root. I've set up the path for Hadoop in the build.properties file (/usr/bin/hadoop), but I'm not sure whether I missed something else. This is running on a small instance of the commoncrawl AMI.
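
In case it helps with diagnosing this, here is a rough sketch of how one could check whether a Hadoop core jar is present under a given directory at all. The helper name find_hadoop_core_jars is made up for illustration, the name pattern is a guess based on the usual jar naming (hadoop-core-1.x.jar or hadoop-0.20.x-core.jar), and I don't know which directory launcher.sh actually searches, so the directory in the usage line is only an assumption.

```shell
# Hypothetical helper (not part of the commoncrawl repo): list files that
# look like a Hadoop core jar anywhere under the directory given as $1.
find_hadoop_core_jars() {
  dir="$1"
  # -L follows symlinks, since install directories are often symlinked.
  # The pattern matches both hadoop-core-<version>.jar and
  # hadoop-<version>-core.jar; errors (e.g. a missing directory) are hidden.
  find -L "$dir" -type f -name 'hadoop*core*.jar' 2>/dev/null
}
```

For example, `find_hadoop_core_jars /usr/lib/hadoop` (that directory is a guess). No output would mean no matching jar exists under that directory. Note that /usr/bin/hadoop is the launcher executable rather than an install directory, so the jar would not be found there.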

Thanks
