Good morning, everyone.
I used Nutch running on Hadoop to crawl some web pages, and downloaded the segments to my local machine with the "download crawl results" command from http://trac.nchc.org.tw/cloud/wiki/waue/2009/0409#a5.3:
bin/hadoop dfs -get search /opt/search
I wanted to take a look at the content of the crawled pages, so I ran the commands from https://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html#nutch-vs-lucene:
s=`ls -d crawl-tinysite/segments/* | head -1`
bin/nutch segread -dump $s
But the message Linux shows is that the path does not exist:
SegmentReader: dump segment: ls
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://XXXXX:9000/user/crawler/ls/crawl_generate
Input path does not exist: hdfs://XXXXX:9000/user/crawler/ls/crawl_fetch
Input path does not exist: hdfs://XXXXX:9000/user/crawler/ls/crawl_parse
Input path does not exist: hdfs://XXXXX:9000/user/crawler/ls/content
Input path does not exist: hdfs://XXXXX:9000/user/crawler/ls/parse_data
Input path does not exist: hdfs://XXXXX:9000/user/crawler/ls/parse_text
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)
The command itself was issued on the Linux side, yet the error complains about a missing HDFS path, which puzzles me.
Could any senior kindly point me in the right direction? Many thanks!
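For reference, here is a minimal sketch of what I expected the segment-path capture step to do, using a hypothetical local directory name (20090101000000 is made up for illustration; it is not my actual segment). With command substitution, $s should hold the first segment path rather than the literal word "ls" that appears in the error message above:

```shell
#!/bin/sh
# Hypothetical local layout standing in for my real crawl output
mkdir -p crawl-tinysite/segments/20090101000000

# Capture the first segment directory via command substitution.
# If substitution does not happen (e.g. plain quotes instead of
# backticks), $s would instead start with the literal word "ls",
# which is what SegmentReader would then treat as the segment name.
s=$(ls -d crawl-tinysite/segments/* | head -1)
echo "segment path: $s"
# prints: segment path: crawl-tinysite/segments/20090101000000
```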