Hi,

I am working on using Behemoth to extract WARC files. I am following the tutorial and the MapReduce job completes, but I have a problem reading the resulting Behemoth documents. Here are the commands I have tried (I tried both the warc and warc.gz formats):

/opt/hadoop-1.0.4/bin/hadoop jar /opt/behemoth/io/target/behemoth-io-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.io.warc.WARCConverterJob hdfs://LucidN1:50001/input/virginiaEarthquake/ARCHIVEIT-2821-WEEKLY-WYYNUE-20110906022146-00007-crawling205.us.archive.org-6680.warc.gz hdfs://LucidN1:50001/output_gz

/opt/hadoop-1.0.4/bin/hadoop jar /opt/behemoth/io/target/behemoth-io-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.io.warc.WARCConverterJob hdfs://LucidN1:50001/input/virginiaEarthquake_warc/ARCHIVEIT-2821-WEEKLY-WYYNUE-20110906022146-00007-crawling205.us.archive.org-6680.warc hdfs://LucidN1:50001/output/

The jobs complete with these commands, and I am using the command below to list the files and read the Behemoth documents (I am not sure whether this is the right way):

/opt/hadoop-1.0.4/bin/hadoop jar /opt/behemoth/core/target/behemoth-core-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusReader -i hdfs://LucidN1:50001/output_gz/part-*

Can you please point me to the right way to read a Behemoth document?

Thanks,
Kiran.
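PS: from poking around, a Behemoth corpus looks like a plain Hadoop SequenceFile with Text keys and BehemothDocument values, so presumably it can also be read directly with SequenceFile.Reader. A minimal sketch along those lines (the class name ReadBehemothCorpus and the choice of printed fields are my own, not from the tutorial):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import com.digitalpebble.behemoth.BehemothDocument;

// Hypothetical helper, not part of Behemoth: dumps the key and content type
// of each document in a single part file produced by WARCConverterJob.
public class ReadBehemothCorpus {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path part = new Path(args[0]); // e.g. hdfs://LucidN1:50001/output_gz/part-00000
        FileSystem fs = part.getFileSystem(conf);
        // Behemoth output is, as far as I can tell, a SequenceFile<Text, BehemothDocument>
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
        Text key = new Text();
        BehemothDocument doc = new BehemothDocument();
        try {
            while (reader.next(key, doc)) {
                System.out.println(key + "\t" + doc.getContentType());
            }
        } finally {
            reader.close();
        }
    }
}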
Running the CorpusReader command above fails with:

Exception in thread "main" java.lang.NullPointerException
    at com.digitalpebble.behemoth.util.CorpusReader.run(CorpusReader.java:102)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at com.digitalpebble.behemoth.util.CorpusReader.main(CorpusReader.java:46)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
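My guess is that the NullPointerException at CorpusReader.run() means the -i pattern matched no files: on Hadoop 1.x, FileSystem.globStatus() returns null when a glob matches nothing, and iterating over that result fails in exactly this way. A standalone sanity check along those lines (my own sketch, not Behemoth's actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical check: verify that an input pattern such as
// hdfs://LucidN1:50001/output_gz/part-* actually matches something.
public class CheckInputGlob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path pattern = new Path(args[0]);
        FileSystem fs = pattern.getFileSystem(conf);
        // On Hadoop 1.x, globStatus() returns null for a pattern with no
        // matches, so unchecked iteration over the result throws an NPE.
        FileStatus[] matches = fs.globStatus(pattern);
        if (matches == null || matches.length == 0) {
            System.err.println("No files match " + pattern);
            return;
        }
        for (FileStatus status : matches) {
            System.out.println(status.getPath());
        }
    }
}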
hadoop@LucidN1:/opt/behemoth$ /opt/hadoop-1.0.4/bin/hadoop jar /opt/behemoth/io/target/behemoth-io-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.io.warc.WARCConverterJob hdfs://LucidN1:50001/input/virginiaEarthquake/ARCHIVEIT-2821-WEEKLY-WYYNUE-20110906022146-00007-crawling205.us.archive.org-6680.warc.gz hdfs://LucidN1:50001/output_gz1
Warning: $HADOOP_HOME is deprecated.
13/01/21 16:33:20 INFO mapred.FileInputFormat: Total input paths to process : 1
13/01/21 16:33:20 INFO mapred.JobClient: Running job: job_201301182219_0280
13/01/21 16:33:21 INFO mapred.JobClient: map 0% reduce 0%
13/01/21 16:33:44 INFO mapred.JobClient: map 100% reduce 0%
13/01/21 16:33:52 INFO mapred.JobClient: Job complete: job_201301182219_0280
13/01/21 16:33:52 INFO mapred.JobClient: Counters: 20
13/01/21 16:33:52 INFO mapred.JobClient: Job Counters
13/01/21 16:33:52 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=21715
13/01/21 16:33:52 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/21 16:33:52 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/01/21 16:33:52 INFO mapred.JobClient: Launched map tasks=1
13/01/21 16:33:52 INFO mapred.JobClient: Data-local map tasks=1
13/01/21 16:33:52 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/01/21 16:33:52 INFO mapred.JobClient: File Input Format Counters
13/01/21 16:33:52 INFO mapred.JobClient: Bytes Read=0
13/01/21 16:33:52 INFO mapred.JobClient: File Output Format Counters
13/01/21 16:33:52 INFO mapred.JobClient: Bytes Written=139
13/01/21 16:33:52 INFO mapred.JobClient: FileSystemCounters
13/01/21 16:33:52 INFO mapred.JobClient: HDFS_BYTES_READ=13057974
13/01/21 16:33:52 INFO mapred.JobClient: FILE_BYTES_WRITTEN=23267
13/01/21 16:33:52 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=139
13/01/21 16:33:52 INFO mapred.JobClient: Map-Reduce Framework
13/01/21 16:33:52 INFO mapred.JobClient: Map input records=0
13/01/21 16:33:52 INFO mapred.JobClient: Physical memory (bytes) snapshot=79859712
13/01/21 16:33:52 INFO mapred.JobClient: Spilled Records=0
13/01/21 16:33:52 INFO mapred.JobClient: CPU time spent (ms)=4410
13/01/21 16:33:52 INFO mapred.JobClient: Total committed heap usage (bytes)=77070336
13/01/21 16:33:52 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2900570112
13/01/21 16:33:52 INFO mapred.JobClient: Map input bytes=0
13/01/21 16:33:52 INFO mapred.JobClient: Map output records=0
13/01/21 16:33:52 INFO mapred.JobClient: SPLIT_RAW_BYTES=188
13/01/21 16:33:52 INFO warc.WARCConverterJob: Conversion: done
/opt/hadoop-1.0.4/bin/hadoop jar /opt/behemoth/core/target/behemoth-core-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusReader -i hdfs://LucidN1:50001/output_gz2/part-00000 -c
13/01/21 17:02:19 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/01/21 17:02:19 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/01/21 17:02:19 INFO compress.CodecPool: Got brand-new decompressor
13/01/21 17:02:19 INFO compress.CodecPool: Got brand-new decompressor
13/01/21 17:02:19 INFO compress.CodecPool: Got brand-new decompressor
13/01/21 17:02:19 INFO compress.CodecPool: Got brand-new decompressor
./behemoth reader -i hdfs://LucidN1:50001/output_gz2/part-00000 -c
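The wrapper script above is just shorthand for the hadoop jar invocation of behemoth-core shown earlier: -i points CorpusReader at a concrete part file and, as I read the usage message, -c also prints each document's raw content; treat that flag description as my reading rather than something from the docs. One thing that stands out in the job counters further up is Map input records=0 and Bytes Read=0, so the converted corpus may simply be empty, which would also explain why the reader prints nothing beyond the codec messages.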