Hello to all the senior members of the forum

許育峰

Mar 19, 2013, 9:50:41 AM

to crawlzi...@googlegroups.com

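Hello to all the senior members of the forum,

I would like to ask whether the data crawled by Crawlzilla can be converted into other formats, for example human-readable text, instead of only being viewed through a web browser.
After finishing a crawl with Crawlzilla, I tried opening the files it stores on the system, but they showed up as unreadable garbage, so I would appreciate some guidance.
Is there a way to convert the data, or to inspect its contents?

Thank you!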

Jazz Yao-Tsung Wang

Mar 19, 2013, 11:25:58 AM

to crawlzi...@googlegroups.com

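Hello 許,

As far as I understand, Nutch (the layer underneath Crawlzilla) does not save the complete content of the website.
Instead, it converts the HTML into SequenceFile form and stores it in segments.
To get the content back out, you would need to look into the SequenceFile format.
(By analogy with Google Search, what is stored in segments is the "cached pages".)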
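
To illustrate why those files look like garbage in a text editor, here is a minimal Python sketch that reads only the header of a Hadoop SequenceFile and reports which Java classes the keys and values were serialized with. It assumes the version 6 header layout described in Hadoop's SequenceFile documentation; the function names (`read_vint`, `read_text`, `read_header`) are made up for this example and are not part of Nutch or Crawlzilla. It does not decode the records themselves.

```python
import io
import struct


def read_vint(stream):
    """Decode one Hadoop variable-length integer (the VInt/VLong encoding)."""
    first = struct.unpack("b", stream.read(1))[0]  # first byte, as a signed value
    if first >= -112:
        return first  # small values are stored directly in a single byte
    negative = first < -120
    # The first byte encodes the sign and how many payload bytes follow.
    size = (-120 - first) if negative else (-112 - first)
    value = int.from_bytes(stream.read(size), "big")
    return ~value if negative else value


def read_text(stream):
    """Read a length-prefixed UTF-8 string as Hadoop's Text class writes it."""
    length = read_vint(stream)
    return stream.read(length).decode("utf-8")


def read_header(stream):
    """Parse a SequenceFile header (assumption: version 6 layout only)."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic bytes")
    version = stream.read(1)[0]
    if version != 6:
        raise ValueError("this sketch only handles version 6, got %d" % version)
    key_class = read_text(stream)
    value_class = read_text(stream)
    compressed = stream.read(1) != b"\x00"
    block_compressed = stream.read(1) != b"\x00"
    # The codec class name is only present when compression is enabled.
    codec = read_text(stream) if compressed else None
    (meta_count,) = struct.unpack(">i", stream.read(4))
    metadata = {}
    for _ in range(meta_count):
        name = read_text(stream)
        metadata[name] = read_text(stream)
    sync_marker = stream.read(16)  # random marker repeated between records
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
        "codec": codec,
        "metadata": metadata,
        "sync_marker": sync_marker,
    }
```

One would open a data file from a segment in binary mode (`open(path, "rb")`) and pass it to `read_header`; the exact path inside a Crawlzilla installation is not given in this thread, so it is left to the reader. Everything after the header is binary records written by the reported Java classes, which is why a text editor shows noise. Nutch 1.x also ships a `readseg` command (`bin/nutch readseg -dump <segment_dir> <output_dir>`) that dumps a segment as plain text; whether it can be reached from a Crawlzilla installation is something I have not checked.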

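If what you want is to inspect the index (much like the index at the back of a textbook,
except that this index records which source URLs each keyword appears in),
you can use Luke ( http://code.google.com/p/luke/ ).
Crawlzilla 1.x uses the Luke API to pull the Top 50 URLs
and Top 50 keywords out of the index.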

I once had a student convert the index contents into SQLite using the Luke API.
I am just not sure whether that would be of any help to you.
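
To give a rough idea of what index contents in SQLite could look like, here is a small Python sketch. It is not the student's actual schema and does not call Luke: the `postings` table, its columns, and the helper names are hypothetical, and the rows would have to come from whatever tool reads the index. It only shows the keyword-to-URL mapping described above and a "top keywords" query over it.

```python
import sqlite3


def build_index_db(postings, path=":memory:"):
    """Store (term, url, freq) tuples in a hypothetical 'postings' table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE postings ("
        " term TEXT NOT NULL,"
        " url TEXT NOT NULL,"
        " freq INTEGER NOT NULL,"
        " PRIMARY KEY (term, url))"
    )
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", postings)
    conn.commit()
    return conn


def urls_for(conn, term):
    """Return the URLs containing a keyword, most frequent occurrences first."""
    rows = conn.execute(
        "SELECT url FROM postings WHERE term = ? ORDER BY freq DESC, url",
        (term,),
    )
    return [row[0] for row in rows]


def top_terms(conn, limit=50):
    """Return (term, total frequency) pairs, highest totals first."""
    rows = conn.execute(
        "SELECT term, SUM(freq) AS total FROM postings"
        " GROUP BY term ORDER BY total DESC, term LIMIT ?",
        (limit,),
    )
    return rows.fetchall()
```

With the rows loaded, `urls_for(conn, "some keyword")` answers "which pages mention this word", and `top_terms(conn)` gives a ranking similar in spirit to the Top 50 keywords list, although Crawlzilla's own ranking may be computed differently.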

- Jazz
