Dear all,
I am having trouble downloading commoncrawl-crawl-002.
I tried using:
https://github.com/commoncrawl/commoncrawl
https://github.com/matpalm/common-crawl/
but I did not understand how to download commoncrawl-crawl-002 (60 TB) with them.
I have EC2 access, and I tried the two repositories above on my EC2
instance (Ubuntu).
Could you explain how I can download the files stored in that bucket?
Any advice would be appreciated.
Thanks a lot.
P.S. I only need files in certain formats, and I want to download only
those files. The formats are: doc, docx, ppt, pptx, xls,
xlsx, svg, swf, pdf, js, xps, rtf, dot, dotx, xml, wps, xlt, xltx,
csv, dif, slk, ods, odt, potx, ppsx, plt, pltx, pps, pot, wmv,
gif, jpeg, bmp, tiff, dib, emf, wmf, odp, one, mht, pub, grv, xsn,
vsd, vss, vst, vsw, vdx, vsx, vtx, vsl, xfdf, fdf, 7z, ace, arj, cab,
gz, lzh, rar, tgz, zip, ecc, sfx.
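To show the kind of filter I mean, here is a minimal sketch in Python. It only checks the file extension in a URL's path. It does not read the ARC files or use either of the tools above. The function name has_wanted_extension and the example URLs are made up for illustration, and the extension set is shortened here; the full list is the one given above.

```python
import posixpath
from urllib.parse import urlparse

# Shortened, illustrative subset of the formats listed in the post.
WANTED_EXTENSIONS = frozenset(
    "doc docx ppt pptx xls xlsx svg swf pdf js rtf xml csv gif jpeg bmp tiff "
    "7z gz rar tgz zip".split()
)


def has_wanted_extension(url, wanted=WANTED_EXTENSIONS):
    """Return True if the URL's path ends in one of the wanted extensions.

    Hypothetical helper: matching is case-insensitive and ignores the
    query string and fragment, because only the path part is inspected.
    """
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower().lstrip(".")
    return extension in wanted


if __name__ == "__main__":
    # Example URLs are invented for this sketch.
    example_urls = [
        "http://example.com/reports/annual.PDF?download=1",
        "http://example.com/index.html",
        "http://example.com/backup/site.tar.gz",
        "http://example.com/",
    ]
    for url in example_urls:
        print(url, has_wanted_extension(url))
```

A check like this only looks at the URL, so it would miss documents served from URLs without an extension; filtering on the content type recorded for each crawled record might be more reliable, if that is available.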
P.P.S. With
https://github.com/commoncrawl/commoncrawl I get the following error:
$ ./bin/launcher.sh org.commoncrawl.samples.BasicArcFileReaderSample
commoncrawl-crawl-002 2010/01/07/18/1262876244253_18.arc.gz
CCAPP_HOME:/home/ubuntu/commoncrawl/bin/..
CCAPP_CONF_DIR:/home/ubuntu/commoncrawl/bin/../conf
CCAPP_LOG_DIR:/home/ubuntu/commoncrawl/bin/../logs
CCAPP_JAR:commoncrawl-0.1.jar
CCAPP_JAR_PATH:/home/ubuntu/commoncrawl/bin/../build
JAVA_HOME:/usr/lib/jvm/java-6-openjdk
HADOOP JAR IS:/usr/share/hadoop/hadoop-core-1.0.1.jar
HADOOP_JAR:/usr/share/hadoop/hadoop-core-1.0.1.jar
HADOOP_CONF_DIR:/usr/share/hadoop/conf
CLASSPATH:/home/ubuntu/commoncrawl/bin/../conf:/usr/share/hadoop/conf:/
home/ubuntu/commoncrawl/bin/../webapps:/home/ubuntu/commoncrawl/bin/../
tests:/usr/lib/jvm/java-6-openjdk/lib/tools.jar:/home/ubuntu/
commoncrawl/bin/../build/commoncrawl-0.1.jar:/home/ubuntu/commoncrawl/
bin/../lib/chardet.jar:/home/ubuntu/commoncrawl/bin/../lib/
dnsjava-2.0.3.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
activation-1.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
ant-1.6.5.jar:/home/ubuntu/commoncrawl/bin/../build/lib/ant-1.7.0.jar:/
home/ubuntu/commoncrawl/bin/../build/lib/ant-launcher-1.7.0.jar:/home/
ubuntu/commoncrawl/bin/../build/lib/avalon-framework-4.1.3.jar:/home/
ubuntu/commoncrawl/bin/../build/lib/aws-java-sdk-1.2.12.jar:/home/
ubuntu/commoncrawl/bin/../build/lib/commons-cli-1.2.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/commons-codec-1.3.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/commons-httpclient-3.1.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/commons-io-1.4.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/commons-lang-2.5.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/commons-logging-1.1.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/commons-logging-api-1.1.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/core-3.1.1.jar:/home/ubuntu/commoncrawl/
bin/../build/lib/dom4j-1.6.1.jar:/home/ubuntu/commoncrawl/bin/../build/
lib/gson-1.7.2.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
guava-10.0.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/hamcrest-
core-1.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/httpclient-4.2-
beta1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
httpcore-4.0.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/jackson-
core-asl-1.9.5.jar:/home/ubuntu/commoncrawl/bin/../build/lib/java-
xmlbuilder-0.4.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
jets3t-0.8.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
jetty-6.1.26.jar:/home/ubuntu/commoncrawl/bin/../build/lib/jetty-
util-6.1.26.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
jsp-2.1-6.1.14.jar:/home/ubuntu/commoncrawl/bin/../build/lib/jsp-
api-2.1-6.1.14.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
jsr305-1.3.9.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
junit-4.10.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
libthrift-0.7.0.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
log4j-1.2.14.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
logkit-1.0.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/mail-1.4.5-
rc1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/servlet-
api-2.5-20081211.jar:/home/ubuntu/commoncrawl/bin/../build/lib/servlet-
api-2.5-6.1.14.jar:/home/ubuntu/commoncrawl/bin/../build/lib/servlet-
api-2.5.jar:/home/ubuntu/commoncrawl/bin/../build/lib/slf4j-
api-1.5.8.jar:/home/ubuntu/commoncrawl/bin/../build/lib/stax-
api-1.0.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/xml-
apis-1.0.b2.jar:/usr/share/hadoop/hadoop-core-1.0.1.jar:/usr/share/
hadoop/lib/asm-3.2.jar:/usr/share/hadoop/lib/aspectjrt-1.6.5.jar:/usr/
share/hadoop/lib/aspectjtools-1.6.5.jar:/usr/share/hadoop/lib/commons-
beanutils-1.7.0.jar:/usr/share/hadoop/lib/commons-beanutils-
core-1.8.0.jar:/usr/share/hadoop/lib/commons-cli-1.2.jar:/usr/share/
hadoop/lib/commons-codec-1.4.jar:/usr/share/hadoop/lib/commons-
collections-3.2.1.jar:/usr/share/hadoop/lib/commons-
configuration-1.6.jar:/usr/share/hadoop/lib/commons-daemon-1.0.1.jar:/
usr/share/hadoop/lib/commons-digester-1.8.jar:/usr/share/hadoop/lib/
commons-el-1.0.jar:/usr/share/hadoop/lib/commons-httpclient-3.0.1.jar:/
usr/share/hadoop/lib/commons-lang-2.4.jar:/usr/share/hadoop/lib/
commons-logging-1.1.1.jar:/usr/share/hadoop/lib/commons-logging-
api-1.0.4.jar:/usr/share/hadoop/lib/commons-math-2.1.jar:/usr/share/
hadoop/lib/commons-net-1.4.1.jar:/usr/share/hadoop/lib/core-3.1.1.jar:/
usr/share/hadoop/lib/hadoop-capacity-scheduler-1.0.1.jar:/usr/share/
hadoop/lib/hadoop-fairscheduler-1.0.1.jar:/usr/share/hadoop/lib/hadoop-
thriftfs-1.0.1.jar:/usr/share/hadoop/lib/hsqldb-1.8.0.10.jar:/usr/
share/hadoop/lib/jackson-core-asl-1.8.8.jar:/usr/share/hadoop/lib/
jackson-mapper-asl-1.8.8.jar:/usr/share/hadoop/lib/jasper-
compiler-5.5.12.jar:/usr/share/hadoop/lib/jasper-runtime-5.5.12.jar:/
usr/share/hadoop/lib/jdeb-0.8.jar:/usr/share/hadoop/lib/jersey-
core-1.8.jar:/usr/share/hadoop/lib/jersey-json-1.8.jar:/usr/share/
hadoop/lib/jersey-server-1.8.jar:/usr/share/hadoop/lib/
jets3t-0.6.1.jar:/usr/share/hadoop/lib/jetty-6.1.26.jar:/usr/share/
hadoop/lib/jetty-util-6.1.26.jar:/usr/share/hadoop/lib/
jsch-0.1.42.jar:/usr/share/hadoop/lib/junit-4.5.jar:/usr/share/hadoop/
lib/kfs-0.2.2.jar:/usr/share/hadoop/lib/log4j-1.2.15.jar:/usr/share/
hadoop/lib/mockito-all-1.8.5.jar:/usr/share/hadoop/lib/oro-2.0.8.jar:/
usr/share/hadoop/lib/servlet-api-2.5-20081211.jar:/usr/share/hadoop/
lib/slf4j-api-1.4.3.jar:/usr/share/hadoop/lib/slf4j-log4j12-1.4.3.jar:/
usr/share/hadoop/lib/xmlenc-0.52.jar:/usr/share/hadoop/lib/jetty-ext/
*.jar
CCAPP_CLASS_NAME:org.commoncrawl.samples.BasicArcFileReaderSample
CCAPP_NAME:BasicArcFileReaderSample
Platform Name is:Linux-i386-32
LD_LIBRARY_PATH: /home/ubuntu/commoncrawl/bin/../lib/native/Linux-
i386-32:/usr/share/hadoop/lib/native/Linux-i386-32:/home/ubuntu/
bitblaze/temu/trunk/tracecap:/home/ubuntu/bitblaze/temu/trunk/shared/
hooks/hooks_plugins:/usr/lib:/usr/local/lib:/usr/lib/jvm/java-1.5.0-
gcj-4.4/lib/
Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.
P.P.P.S. With
https://github.com/matpalm/common-crawl/
I don't know how cc.jar is supposed to be compiled.
Thanks,
Alex