Question about segment read


許育峰

Feb 22, 2014, 9:47:54 PM
to crawlzi...@googlegroups.com
Good morning, everyone,

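I crawled some web pages with Nutch running on Hadoop, then ran the following command from "http://trac.nchc.org.tw/cloud/wiki/waue/2009/0409#a5.3下載crawl結果":
bin/hadoop dfs -get search /opt/search
That downloaded the segments to my local machine. To look at the content of the crawled pages, I then ran the commands from "https://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html#nutch-vs-lucene":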
s=`ls -d crawl-tinysite/segments/* | head -1`
bin/nutch segread -dump $s


However, Linux reports that the path does not exist:

SegmentReader: dump segment: ls
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://XXXXX:9000/user/crawler/ls/crawl_generate
Input path does not exist: hdfs://XXXXX:9000/user/crawler/ls/crawl_fetch
Input path does not exist: hdfs://XXXXX:9000/user/crawler/ls/crawl_parse
Input path does not exist: hdfs://XXXXX:9000/user/crawler/ls/content
Input path does not exist: hdfs://XXXXX:9000/user/crawler/ls/parse_data
Input path does not exist: hdfs://XXXXX:9000/user/crawler/ls/parse_text
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)

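The commands themselves are run on the Linux side, yet the error complains that an HDFS path does not exist, which puzzles me.
Could anyone point me in the right direction? Thank you very much!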
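
One possible reading of the error, offered as a guess rather than a confirmed diagnosis: the line "SegmentReader: dump segment: ls" suggests that the variable s held the literal text of the ls command instead of its output, so segread received the word "ls" as the segment name and Hadoop resolved it against HDFS as /user/crawler/ls. That would happen if, for example, the backticks were typed as ordinary single quotes. The sketch below reproduces the difference on a throwaway local directory; the directory layout and the segment name 20140222000000 are made up for the demonstration and are not taken from the actual crawl.

```shell
# Hypothetical local layout, created only for this demonstration.
workdir=$(mktemp -d)
mkdir -p "$workdir/crawl-tinysite/segments/20140222000000"
cd "$workdir"

# Assumed mistake: straight single quotes store the command text itself;
# nothing is executed.
s_quoted='ls -d crawl-tinysite/segments/* | head -1'
set -- $s_quoted      # unquoted expansion splits the text into separate words
first_word=$1         # the word a command would receive first, here "ls"
echo "with quotes: $first_word"

# Command substitution runs ls and stores its output, i.e. the segment path.
s_subst=$(ls -d crawl-tinysite/segments/* | head -1)
echo "with substitution: $s_subst"
```

If the variable turns out to hold the right path and the error persists, a second thing to check is where the segment actually lives: the paths in the error message show that relative paths are being resolved against hdfs://XXXXX:9000/user/crawler, so a segment that exists only on the local disk would not be found under a relative name.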
