How can I download commoncrawl-crawl-002 (60T)?


Bazhanyuk Alex

Mar 14, 2012, 7:09:15 PM3/14/12
to Common Crawl
Dear all,

I am having trouble downloading commoncrawl-crawl-002!

I tried using:
https://github.com/commoncrawl/commoncrawl
https://github.com/matpalm/common-crawl/
but I didn't understand how to download commoncrawl-crawl-002 (60T).
I have EC2 access, and I tried both repositories on my EC2 instance (Ubuntu).

Could you explain how I can download files from that store?
Any advice....

Thanks a lot.

P.S. I only need files in specific formats. I want to download only
files with these extensions: doc, docx, ppt, pptx, xls,
xlsx, svg, swf, pdf, js, xps, rtf, dot, dotx, xml, wps, xlt, xltx,
csv, dif, slk, ods, odt, potx, ppsx, plt, pltx, pps, pot, wmv,
gif, jpeg, bmp, tiff, dib, emf, wmf, odp, one, mht, pub, grv, xsn,
vsd, vss, vst, vsw, vdx, vsx, vtx, vsl, xfdf, fdf, 7z, ace, arj, cab,
gz, lzh, rar, tgz, zip, ecc, sfx.

P.P.S. With https://github.com/commoncrawl/commoncrawl I hit this issue:

$ ./bin/launcher.sh org.commoncrawl.samples.BasicArcFileReaderSample
commoncrawl-crawl-002 2010/01/07/18/1262876244253_18.arc.gz
CCAPP_HOME:/home/ubuntu/commoncrawl/bin/..
CCAPP_CONF_DIR:/home/ubuntu/commoncrawl/bin/../conf
CCAPP_LOG_DIR:/home/ubuntu/commoncrawl/bin/../logs
CCAPP_JAR:commoncrawl-0.1.jar
CCAPP_JAR_PATH:/home/ubuntu/commoncrawl/bin/../build
JAVA_HOME:/usr/lib/jvm/java-6-openjdk
HADOOP JAR IS:/usr/share/hadoop/hadoop-core-1.0.1.jar
HADOOP_JAR:/usr/share/hadoop/hadoop-core-1.0.1.jar
HADOOP_CONF_DIR:/usr/share/hadoop/conf

CLASSPATH:/home/ubuntu/commoncrawl/bin/../conf:/usr/share/hadoop/conf:/
home/ubuntu/commoncrawl/bin/../webapps:/home/ubuntu/commoncrawl/bin/../
tests:/usr/lib/jvm/java-6-openjdk/lib/tools.jar:/home/ubuntu/
commoncrawl/bin/../build/commoncrawl-0.1.jar:/home/ubuntu/commoncrawl/
bin/../lib/chardet.jar:/home/ubuntu/commoncrawl/bin/../lib/
dnsjava-2.0.3.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
activation-1.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
ant-1.6.5.jar:/home/ubuntu/commoncrawl/bin/../build/lib/ant-1.7.0.jar:/
home/ubuntu/commoncrawl/bin/../build/lib/ant-launcher-1.7.0.jar:/home/
ubuntu/commoncrawl/bin/../build/lib/avalon-framework-4.1.3.jar:/home/
ubuntu/commoncrawl/bin/../build/lib/aws-java-sdk-1.2.12.jar:/home/
ubuntu/commoncrawl/bin/../build/lib/commons-cli-1.2.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/commons-codec-1.3.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/commons-httpclient-3.1.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/commons-io-1.4.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/commons-lang-2.5.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/commons-logging-1.1.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/commons-logging-api-1.1.jar:/home/ubuntu/
commoncrawl/bin/../build/lib/core-3.1.1.jar:/home/ubuntu/commoncrawl/
bin/../build/lib/dom4j-1.6.1.jar:/home/ubuntu/commoncrawl/bin/../build/
lib/gson-1.7.2.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
guava-10.0.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/hamcrest-
core-1.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/httpclient-4.2-
beta1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
httpcore-4.0.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/jackson-
core-asl-1.9.5.jar:/home/ubuntu/commoncrawl/bin/../build/lib/java-
xmlbuilder-0.4.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
jets3t-0.8.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
jetty-6.1.26.jar:/home/ubuntu/commoncrawl/bin/../build/lib/jetty-
util-6.1.26.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
jsp-2.1-6.1.14.jar:/home/ubuntu/commoncrawl/bin/../build/lib/jsp-
api-2.1-6.1.14.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
jsr305-1.3.9.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
junit-4.10.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
libthrift-0.7.0.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
log4j-1.2.14.jar:/home/ubuntu/commoncrawl/bin/../build/lib/
logkit-1.0.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/mail-1.4.5-
rc1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/servlet-
api-2.5-20081211.jar:/home/ubuntu/commoncrawl/bin/../build/lib/servlet-
api-2.5-6.1.14.jar:/home/ubuntu/commoncrawl/bin/../build/lib/servlet-
api-2.5.jar:/home/ubuntu/commoncrawl/bin/../build/lib/slf4j-
api-1.5.8.jar:/home/ubuntu/commoncrawl/bin/../build/lib/stax-
api-1.0.1.jar:/home/ubuntu/commoncrawl/bin/../build/lib/xml-
apis-1.0.b2.jar:/usr/share/hadoop/hadoop-core-1.0.1.jar:/usr/share/
hadoop/lib/asm-3.2.jar:/usr/share/hadoop/lib/aspectjrt-1.6.5.jar:/usr/
share/hadoop/lib/aspectjtools-1.6.5.jar:/usr/share/hadoop/lib/commons-
beanutils-1.7.0.jar:/usr/share/hadoop/lib/commons-beanutils-
core-1.8.0.jar:/usr/share/hadoop/lib/commons-cli-1.2.jar:/usr/share/
hadoop/lib/commons-codec-1.4.jar:/usr/share/hadoop/lib/commons-
collections-3.2.1.jar:/usr/share/hadoop/lib/commons-
configuration-1.6.jar:/usr/share/hadoop/lib/commons-daemon-1.0.1.jar:/
usr/share/hadoop/lib/commons-digester-1.8.jar:/usr/share/hadoop/lib/
commons-el-1.0.jar:/usr/share/hadoop/lib/commons-httpclient-3.0.1.jar:/
usr/share/hadoop/lib/commons-lang-2.4.jar:/usr/share/hadoop/lib/
commons-logging-1.1.1.jar:/usr/share/hadoop/lib/commons-logging-
api-1.0.4.jar:/usr/share/hadoop/lib/commons-math-2.1.jar:/usr/share/
hadoop/lib/commons-net-1.4.1.jar:/usr/share/hadoop/lib/core-3.1.1.jar:/
usr/share/hadoop/lib/hadoop-capacity-scheduler-1.0.1.jar:/usr/share/
hadoop/lib/hadoop-fairscheduler-1.0.1.jar:/usr/share/hadoop/lib/hadoop-
thriftfs-1.0.1.jar:/usr/share/hadoop/lib/hsqldb-1.8.0.10.jar:/usr/
share/hadoop/lib/jackson-core-asl-1.8.8.jar:/usr/share/hadoop/lib/
jackson-mapper-asl-1.8.8.jar:/usr/share/hadoop/lib/jasper-
compiler-5.5.12.jar:/usr/share/hadoop/lib/jasper-runtime-5.5.12.jar:/
usr/share/hadoop/lib/jdeb-0.8.jar:/usr/share/hadoop/lib/jersey-
core-1.8.jar:/usr/share/hadoop/lib/jersey-json-1.8.jar:/usr/share/
hadoop/lib/jersey-server-1.8.jar:/usr/share/hadoop/lib/
jets3t-0.6.1.jar:/usr/share/hadoop/lib/jetty-6.1.26.jar:/usr/share/
hadoop/lib/jetty-util-6.1.26.jar:/usr/share/hadoop/lib/
jsch-0.1.42.jar:/usr/share/hadoop/lib/junit-4.5.jar:/usr/share/hadoop/
lib/kfs-0.2.2.jar:/usr/share/hadoop/lib/log4j-1.2.15.jar:/usr/share/
hadoop/lib/mockito-all-1.8.5.jar:/usr/share/hadoop/lib/oro-2.0.8.jar:/
usr/share/hadoop/lib/servlet-api-2.5-20081211.jar:/usr/share/hadoop/
lib/slf4j-api-1.4.3.jar:/usr/share/hadoop/lib/slf4j-log4j12-1.4.3.jar:/
usr/share/hadoop/lib/xmlenc-0.52.jar:/usr/share/hadoop/lib/jetty-ext/
*.jar

CCAPP_CLASS_NAME:org.commoncrawl.samples.BasicArcFileReaderSample
CCAPP_NAME:BasicArcFileReaderSample
Platform Name is:Linux-i386-32
LD_LIBRARY_PATH: /home/ubuntu/commoncrawl/bin/../lib/native/Linux-
i386-32:/usr/share/hadoop/lib/native/Linux-i386-32:/home/ubuntu/
bitblaze/temu/trunk/tracecap:/home/ubuntu/bitblaze/temu/trunk/shared/
hooks/hooks_plugins:/usr/lib:/usr/local/lib:/usr/lib/jvm/java-1.5.0-
gcj-4.4/lib/
Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.
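(Editor's note: "Could not reserve enough space for object heap" means the JVM asked for more heap than the instance can provide, which is common on small or 32-bit EC2 instances; capping the heap usually gets past it. A minimal sketch, assuming the launcher script passes a Java-options variable through to java — the variable name below is a guess, so check bin/launcher.sh for the one it actually reads:)

```shell
# Cap the JVM heap so it fits in the instance's memory. JAVA_OPTS is an
# assumed variable name -- verify which variable bin/launcher.sh really uses.
export JAVA_OPTS="-Xmx512m -Xms256m"
echo "$JAVA_OPTS"
# Then rerun the sample:
# ./bin/launcher.sh org.commoncrawl.samples.BasicArcFileReaderSample \
#   commoncrawl-crawl-002 2010/01/07/18/1262876244253_18.arc.gz
```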

P.P.P.S. With https://github.com/matpalm/common-crawl/
I don't know how cc.jar is compiled.

Thanks,
Alex

Ahad Rana

Mar 14, 2012, 7:15:16 PM3/14/12
to common...@googlegroups.com
Hi Alex,

The location for the data has moved to s3://aws-publicdatasets/common-crawl/crawl-002/. 

Ahad.



--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To post to this group, send email to common...@googlegroups.com.
To unsubscribe from this group, send email to common-crawl...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/common-crawl?hl=en.


Alex Bazhanyuk

Mar 14, 2012, 7:18:13 PM3/14/12
to common...@googlegroups.com
Thanks,
But what command do I need to use if I want to download, for example, 2009/09/17/0/1253229598569_0.arc.gz?
How can I download a file from s3://aws-publicdatasets/common-crawl/crawl-002/ ?

Could you give me a command I can use?

Thanks a lot.

I tried using: hadoop distcp -pr s3n://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/9/1285409485343_9.arc.gz ./
But I get:

12/03/15 01:17:27 INFO tools.DistCp: srcPaths=[s3n://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/9/1285409485343_9.arc.gz]
12/03/15 01:17:27 INFO tools.DistCp: destPath=
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
    at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.fs.s3native.$Proxy1.initialize(Unknown Source)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:635)
    at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)



Thanks,
Alex

Ahad Rana

Mar 14, 2012, 7:38:08 PM3/14/12
to common...@googlegroups.com
To use s3n you have to set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey in your Configuration object. Be aware that the s3n that ships with Hadoop is pretty broken, and pretty much the only version that really works is the one that ships with Amazon's EMR hadoop distribution. 
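(Editor's note: since the stack trace above shows DistCp running through ToolRunner, the two properties can also be supplied inline with -D generic options instead of a Configuration object. A sketch — the key values are placeholders for your own credentials, not real ones:)

```shell
# Pass the S3 credentials as Hadoop properties on the command line
# (substitute your own AWS access key ID and secret access key):
hadoop distcp \
  -Dfs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY_ID \
  -Dfs.s3n.awsSecretAccessKey=YOUR_SECRET_ACCESS_KEY \
  s3n://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/9/1285409485343_9.arc.gz ./
```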

Since the bucket is not requester-pays anymore, you can now use a tool like s3cmd to retrieve the data: 
s3cmd get s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/9/1285409485343_9.arc.gz ./

Ahad.

qqz

Mar 15, 2012, 6:21:38 AM3/15/12
to common...@googlegroups.com
Hi Ahad,

I was wondering if you know what the problem is. I downloaded s3cmd and ran --configure, which set up my keys on an EC2 instance. But when I run
s3cmd get s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/9/1285409485343_9.arc.gz ./
it returns "file not found". However, if I run s3cmd ls on the exact same file, it returns:
Bucket 'aws-publicdatasets':
2012-01-05 19:41  10033248   s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/9/1285409485343_9.arc.gz
Any ideas?

thanks
-Qing



Ahad Rana

Mar 15, 2012, 7:49:25 AM3/15/12
to common...@googlegroups.com
Hi Qing,

I ran the same command on my machine and it worked. Try using curl or wget to retrieve it via http ( http://s3.amazonaws.com/aws-publicdatasets/common-crawl/crawl-002/2010/09/25/9/1285409485343_9.arc.gz ). s3cmd can be flaky at times and does not always return good error messages.
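(Editor's note: the s3://-to-HTTP mapping Ahad uses here can be derived mechanically, since objects in this public bucket are served from s3.amazonaws.com. A small sketch:)

```shell
# Turn a public s3:// path into the equivalent HTTP URL by swapping the
# scheme for the s3.amazonaws.com endpoint, then fetch it with wget or curl.
S3_PATH="s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/9/1285409485343_9.arc.gz"
HTTP_URL="http://s3.amazonaws.com/${S3_PATH#s3://}"
echo "$HTTP_URL"
# wget "$HTTP_URL"        # or: curl -O "$HTTP_URL"
```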

Ahad.


Nada Amin

Mar 15, 2012, 9:40:50 AM3/15/12
to common...@googlegroups.com
It's easy to get direct access to the data with this perl script:
perl aws get aws-publicdatasets/common-crawl/crawl-002/2010/09/25/18/1285398033394_18.arc.gz >1285398033394_18.arc.gz

Cheers,
~n

qqz

Mar 27, 2012, 3:55:14 PM3/27/12
to common...@googlegroups.com
Thanks Ahad and Nada, that worked!!  Do you by any chance know whether the crawled data contains dynamically generated content such as JavaScript or embedded images, etc.?

thanks!
-Qing