Extracting crawled page's text


Nico

Jul 11, 2008, 2:53:40 PM
to hounder
Hi, is there a way to extract the text of the crawled pages from the
command line?

Thanks

Nicolas Bottarini

jhandl

Jul 11, 2008, 3:59:34 PM
to hounder
Nico, the quick and dirty way is to extract the text from the index
using the idx script as follows:

cd indexer
idx list indexes/index 0

The best way, though, is to write a crawler module that extracts the
parsed text to a file or database, or sends it directly via RPC to
whatever post-processing you might want to do with it.

Hope this helps.

-- Jorge

Nico

Jul 11, 2008, 4:13:03 PM
to hounder
Thanks. I executed that command and obtained the text. Do you know why
there are encoding problems?
I get things like: "producto extra�do desde vertientes naturales"

Do I have to configure something?

Thank you very much for your help!

jhandl

Jul 11, 2008, 4:26:46 PM
to hounder
Can you send me the url of the page?

Nico

Jul 11, 2008, 5:01:05 PM
to hounder

jhandl

Jul 11, 2008, 5:09:29 PM
to hounder
Nico, make sure you have the LANG and LC_ALL environment variables set
to "en_US.UTF-8".
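
A quick way to check and set them, as a minimal sketch assuming a POSIX
shell and that the en_US.UTF-8 locale is installed on the machine:

```shell
# Show the current values (nothing after "=" means the variable is unset).
echo "LANG=${LANG:-}"
echo "LC_ALL=${LC_ALL:-}"

# Set both for the current shell session before running idx.
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
```

The exports only last for the current session, so they need to be repeated
(or added to the shell profile) in any new terminal.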

-- Jorge

On Jul 11, 6:01 pm, Nico <nicolasbottar...@gmail.com> wrote:
> Obviously, the URL is: http://blogsearch.google.com/blogsearch?as_q=coca+cola+dasani&num=100...

Nico

Jul 11, 2008, 5:21:03 PM
to hounder
Both variables are set to en_US.UTF-8.

jhandl

Jul 11, 2008, 9:03:33 PM
to hounder
Nico, we found a bug in the way the crawler treated ISO-8859-1 encoded
pages.
We fixed it, and a new version of Hounder is ready for download at
http://hounder.org/downloads/hounder-1.0-binary_installer.tgz
Once you download the new version, just run "ant jar" and copy
output/hounder-trunk.jar to the lib directory where Hounder is installed.
You will have to re-crawl to get the pages correctly encoded, though.
Hope this fixes the problem.
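
For context, the kind of output quoted earlier in the thread is what
typically appears when ISO-8859-1 bytes are read as if they were UTF-8.
The sketch below reproduces that with iconv; it is only an illustration
and not Hounder's code. The sample word comes from the output above, and
the temporary file is an assumption made for the example:

```shell
# "extraído" encoded as ISO-8859-1: the í is the single byte 0xED (octal 355).
latin1_file=$(mktemp)
printf 'extra\355do\n' > "$latin1_file"

# A lone 0xED byte is not a valid UTF-8 sequence, so strict UTF-8 decoding fails.
# (Lenient decoders substitute a replacement character instead.)
if iconv -f UTF-8 -t UTF-8 "$latin1_file" >/dev/null 2>&1; then
  valid_utf8=yes
else
  valid_utf8=no
fi
echo "valid UTF-8: $valid_utf8"

# Declaring the real source encoding turns the byte into the two-byte UTF-8 form of í.
fixed=$(iconv -f ISO-8859-1 -t UTF-8 "$latin1_file")
echo "$fixed"

rm -f "$latin1_file"
```

This is also why setting LANG and LC_ALL does not help here: the text is
already damaged when it is stored, so the pages have to be crawled again.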

--Jorge