API for accessing crawled content

B R

Aug 26, 2009, 11:54:24 AM
to hounder
Hi,

I wish to access the content crawled by Hounder for use in a custom
search application. Is there an API for reading the database of all pages
downloaded by Hounder? I understand it is possible to use a crawler
module to process the data at crawl time; however, I would like to
process all the pages after the crawl has completed, or at any time in
the future.

Thanks.

Jorge Handl

Aug 26, 2009, 12:18:44 PM
to hou...@googlegroups.com
Hi,

If all you need is the URLs of the crawled pages, you can get them through the db.sh script, using the "list" command. If you need the page contents and you have the cache module running, you can get them with this command:

java -cp ../lib/hounder-trunk.jar:../lib/hounder-trunk-deps.jar com.flaptor.util.cache.FileCache

- Jorge

B R

Aug 27, 2009, 5:42:42 AM
to hounder
I tried using FileCache as above and the "Cache list <dir>" command;
however, I do not know which directory is to be used as input for the
list command.

Please advise.

Thank you.

Jorge Handl

Aug 27, 2009, 7:43:23 AM
to hou...@googlegroups.com
The original page is stored at "cache/page" and the extracted text is stored at "cache/text", both within the crawler directory. You can find the cache configuration in the "conf/cacheModule.properties" file.
- Jorge

B R

Aug 27, 2009, 11:01:02 AM
to hounder
I tried running the following command:

java -cp lib/hounder-trunk.jar:lib/hounder-trunk-deps.jar com.flaptor.util.cache.FileCache Cache list /opt/softwares/hounder/crawler/cache/page

However, nothing happens and no error is reported. Have I missed
something?

Thanks.

Jorge Handl

Aug 27, 2009, 11:09:30 AM
to hou...@googlegroups.com
Try this:

java -cp lib/hounder-trunk.jar:lib/hounder-trunk-deps.jar com.flaptor.util.cache.FileCache list /opt/softwares/hounder/crawler/cache/page
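
If the goal is to call this from a custom application rather than from a shell, one option is to launch the same command as a child process and read its output. The following is only a rough sketch: the class name FileCacheCli and its methods are made up for illustration, the jar names and main class are the ones from the command above, and the output format of "list" is not described in this thread, so the lines are returned unparsed.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hounder.
class FileCacheCli {

    // Builds: java -cp <jars> com.flaptor.util.cache.FileCache <command> <args...>
    static List<String> buildCommand(String libDir, String command, String... args) {
        String classpath = libDir + "/hounder-trunk.jar"
                + File.pathSeparator
                + libDir + "/hounder-trunk-deps.jar";
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-cp", classpath, "com.flaptor.util.cache.FileCache", command));
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    // Runs the command and returns everything it printed, one entry per line.
    static List<String> run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return lines;
    }

    public static void main(String[] args) {
        // Only prints the command; call run(...) from the Hounder directory to execute it.
        System.out.println(String.join(" ",
                buildCommand("lib", "list", "/opt/softwares/hounder/crawler/cache/page")));
    }
}
```

Starting a JVM per call is slow, so this suits occasional batch processing better than per-query lookups.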

B R

Aug 28, 2009, 8:36:33 AM
to hounder
Removing the word "Cache" worked.

I can now list pages and text in the cache and use the getobj command
successfully. The get command, however, fails with:

Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to [B
at com.flaptor.util.cache.FileCache.main(FileCache.java:726)

Could you explain the usage of the getobjprop command?

Thanks.

Jorge Handl

Aug 28, 2009, 12:34:13 PM
to hou...@googlegroups.com
The get command only works when the cached object is a byte array. The crawler stores a String object in the text cache and a DocumentCacheItem in the page cache.

The getobj command works for both because they both implement the toString() method.
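
To make the difference concrete, here is a small stand-alone illustration in plain Java. It does not use any Hounder classes; CastDemo and its two methods are invented names that only mimic the two behaviours described above (casting to a byte array versus relying on toString()).

```java
import java.nio.charset.StandardCharsets;

// Illustration only: plain Java, no Hounder classes involved.
class CastDemo {

    // Mimics a "get"-style read: treats the stored object as a byte array.
    static String readAsBytes(Object cached) {
        byte[] raw = (byte[]) cached; // throws ClassCastException if cached is a String
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Mimics a "getobj"-style read: relies only on toString().
    static String readAsObject(Object cached) {
        return String.valueOf(cached);
    }

    public static void main(String[] args) {
        Object textEntry = "extracted page text"; // the text cache holds a String
        System.out.println(readAsObject(textEntry));
        try {
            readAsBytes(textEntry);
        } catch (ClassCastException e) {
            System.out.println("get-style read failed: " + e.getMessage());
        }
    }
}
```

The "[B" in the error message is the JVM's internal name for byte[], which is why the exception reads "String cannot be cast to [B".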

The getobjprop command is for more generic uses of the file cache: it shows the output of calling an arbitrary method of the stored object. For example, if you store an object that implements the "getName()" method, you would use "FileCache getobjprop <dir> getName <key>". The DocumentCacheItem class has a getMimeType() method, so you could get a list of all MIME types by calling "FileCache getobjprop cache/page getMimeType".
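
Calling a method chosen by name at run time is normally done with Java reflection. The sketch below shows that general technique only; it is not taken from the FileCache source, and FakeCacheItem and PropertyReader are hypothetical names standing in for a cached object and for whatever does the lookup.

```java
import java.lang.reflect.Method;

// Hypothetical helper showing the reflection technique; not Hounder code.
class PropertyReader {

    // Looks up a public no-argument method by name, calls it on the stored
    // object, and returns the result as text.
    static String readProperty(Object stored, String methodName) {
        try {
            Method method = stored.getClass().getMethod(methodName);
            return String.valueOf(method.invoke(stored));
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot call " + methodName + "()", e);
        }
    }

    public static void main(String[] args) {
        Object stored = new FakeCacheItem("text/html");
        System.out.println(readProperty(stored, "getMimeType"));
    }
}

// Hypothetical stand-in for a cached object; NOT Hounder's DocumentCacheItem.
class FakeCacheItem {
    private final String mimeType;

    FakeCacheItem(String mimeType) {
        this.mimeType = mimeType;
    }

    public String getMimeType() {
        return mimeType;
    }
}
```

Any public no-argument method can be named this way, which is what makes the approach generic across stored object types.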

Hope this helps.

- Jorge

B R

Aug 31, 2009, 2:36:43 AM
to hounder
Thanks a lot.
