Extracting crawled page's text

Nico

Jul 11, 2008, 2:53:40 PM

to hounder

Hi, is there a way to extract the text of the crawled pages from the
command line?

Thanks

Nicolas Bottarini

jhandl

Jul 11, 2008, 3:59:34 PM

to hounder

Nico, the quick and dirty way is to extract the text from the index
using the idx script as follows:

cd indexer
idx list indexes/index 0

The best way, though, is to write a crawler module that extracts the
parsed text to a file or database, or sends it directly via RPC to
whatever post-processing you want to do with it.

Hope this helps.

-- Jorge

Nico

Jul 11, 2008, 4:13:03 PM

to hounder

Thanks. I executed that command and obtained the text. Do you know why
there are encoding problems?
I get things like: "producto extraï¿½do desde vertientes naturales"

Do I have to configure something?

Thank you very much for your help!

jhandl

Jul 11, 2008, 4:26:46 PM

to hounder

Can you send me the url of the page?

Nico

Jul 11, 2008, 5:01:05 PM

to hounder

Obviously, the URL is:
http://blogsearch.google.com/blogsearch?as_q=coca+cola+dasani&num=100&hl=en&ctz=180&c2coff=1&btnG=Search+Blogs&as_epq=&as_oq=&as_eq=&bl_pt=&bl_bt=&bl_url=&bl_auth=&as_drrb=q&as_qdr=a&as_mind=1&as_minm=1&as_miny=2000&as_maxd=10&as_maxm=7&as_maxy=2008&lr=lang_es&safe=active

jhandl

Jul 11, 2008, 5:09:29 PM

to hounder

Nico, make sure you have the LANG and LC_ALL environment variables set
to "en_US.UTF-8".
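
For reference, a minimal sketch of setting and checking both variables in a POSIX shell session. It assumes the en_US.UTF-8 locale is installed on the machine, and that the idx script is then started from the same shell so that it inherits them:

```shell
# Set both variables for the current shell session.
export LANG="en_US.UTF-8"
export LC_ALL="en_US.UTF-8"

# Print the values back to confirm what child processes will inherit.
printf 'LANG=%s\nLC_ALL=%s\n' "$LANG" "$LC_ALL"

# Ask the system which character set the active locale uses. This should
# report UTF-8 when the en_US.UTF-8 locale is actually installed.
locale charmap 2>/dev/null || true
```

To make the setting permanent, the two export lines can go in the shell's startup file.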

-- Jorge

On Jul 11, 6:01 pm, Nico <nicolasbottar...@gmail.com> wrote:
> Obviously, the URL is: http://blogsearch.google.com/blogsearch?as_q=coca+cola+dasani&num=100...

Nico

Jul 11, 2008, 5:21:03 PM

to hounder

Both variables are set to en_US.UTF-8.

jhandl

Jul 11, 2008, 9:03:33 PM

to hounder

Nico, we found a bug in the way the crawler treated ISO-8859-1 encoded
pages.
We fixed it and a new version of Hounder is ready for download at
http://hounder.org/downloads/hounder-1.0-binary_installer.tgz
Once you download the new version, just run "ant jar" and copy
output/hounder-trunk.jar to the lib directory where Hounder is installed.
You will have to re-crawl to get the pages correctly encoded though.
Hope this fixes the problem.
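
To illustrate the kind of problem involved, here is a small shell sketch. It is not Hounder's code; it only assumes a POSIX shell with the standard iconv and od tools, and the function names are made up for the example. It shows the ISO-8859-1 bytes for "extraído", what they become when converted to UTF-8 with the source encoding declared correctly, and how the UTF-8 bytes of the Unicode replacement character (U+FFFD) render as "ï¿½" when read as ISO-8859-1, which matches the garbled output reported earlier in this thread:

```shell
# "extraído" as ISO-8859-1 bytes: the í is the single byte 0xED (octal 355).
latin1_sample() { printf 'extra\355do'; }

# The Unicode replacement character U+FFFD as UTF-8 bytes (EF BF BD).
replacement_char_utf8() { printf '\357\277\275'; }

# Hex dump helper: prints the bytes read from stdin as one hex string.
hex() { od -An -tx1 | tr -d ' \n'; echo; }

# Raw ISO-8859-1 bytes of the sample word.
latin1_sample | hex

# Correct handling: declare the source encoding when converting to UTF-8.
latin1_sample | iconv -f ISO-8859-1 -t UTF-8; echo

# Misreading the replacement character's UTF-8 bytes as ISO-8859-1
# produces the three characters seen in the garbled text.
replacement_char_utf8 | iconv -f ISO-8859-1 -t UTF-8; echo
```

Once a byte has been replaced by U+FFFD the original character is lost, which is consistent with the pages needing a re-crawl rather than a repair of the existing index.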

--Jorge
