Extracting crawled page's text
From: Nico <nicolasbottar...@gmail.com>
Date: Fri, 11 Jul 2008 11:53:40 -0700 (PDT)
Local: Fri, Jul 11 2008 2:53 pm
Subject: Extracting crawled page's text
Hi, is there a way to extract the text of the crawled pages from the
command line?

Thanks

Nicolas Bottarini


From: jhandl <jha...@gmail.com>
Date: Fri, 11 Jul 2008 12:59:34 -0700 (PDT)
Local: Fri, Jul 11 2008 3:59 pm
Subject: Re: Extracting crawled page's text
Nico, the quick and dirty way is to extract the text from the index
using the idx script as follows:

cd indexer
idx list indexes/index 0
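
If idx prints the listing to standard output (an assumption; the thread does not say so explicitly), the text can be captured in a file for later processing. A minimal sketch, where the function name dump_index_text and the output file name are made up for this example, and the trailing 0 is copied unchanged from the command above:

```shell
# Sketch only: assumes idx can be invoked as shown above (from the
# indexer directory) and writes the extracted text to standard output.
dump_index_text() {
    index_dir=$1    # e.g. indexes/index
    out_file=$2     # hypothetical output file, e.g. crawled_text.txt
    idx list "$index_dir" 0 > "$out_file"
}

# Example call, left commented out because it needs a Hounder install:
# dump_index_text indexes/index crawled_text.txt
```

Writing to a file first makes it easy to inspect the text with other tools without re-reading the index each time.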

The best way, though, is to write a crawler module that sends the
parsed text to a file or database, or passes it directly via RPC to
whatever post-processing you want to do with it.

Hope this helps.

-- Jorge

From: Nico <nicolasbottar...@gmail.com>
Date: Fri, 11 Jul 2008 13:13:03 -0700 (PDT)
Local: Fri, Jul 11 2008 4:13 pm
Subject: Re: Extracting crawled page's text
Thanks. I ran that command and got the text. Do you know why
there are encoding problems?
I get things like: "producto extra�do desde vertientes naturales"

Do I have to configure something?

Thank you very much for your help!

From: jhandl <jha...@gmail.com>
Date: Fri, 11 Jul 2008 13:26:46 -0700 (PDT)
Local: Fri, Jul 11 2008 4:26 pm
Subject: Re: Extracting crawled page's text
Can you send me the url of the page?

From: Nico <nicolasbottar...@gmail.com>
Date: Fri, 11 Jul 2008 14:01:05 -0700 (PDT)
Local: Fri, Jul 11 2008 5:01 pm
Subject: Re: Extracting crawled page's text
Of course, the URL is:
http://blogsearch.google.com/blogsearch?as_q=coca+cola+dasani&num=100...

From: jhandl <jha...@gmail.com>
Date: Fri, 11 Jul 2008 14:09:29 -0700 (PDT)
Local: Fri, Jul 11 2008 5:09 pm
Subject: Re: Extracting crawled page's text
Nico, make sure you have the LANG and LC_ALL environment variables set
to "en_US.UTF-8".
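
In a Bourne-style shell, that could look like the following sketch. It assumes the en_US.UTF-8 locale is installed on the machine and that the variables are set in the same shell session that later runs the crawler and idx:

```shell
# Set both variables for the current shell session and its child processes.
export LANG="en_US.UTF-8"
export LC_ALL="en_US.UTF-8"

# Print the values now in effect.
printf 'LANG=%s LC_ALL=%s\n' "$LANG" "$LC_ALL"
```

Running the locale command afterwards is a quick way to check the resulting settings.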

-- Jorge

From: Nico <nicolasbottar...@gmail.com>
Date: Fri, 11 Jul 2008 14:21:03 -0700 (PDT)
Local: Fri, Jul 11 2008 5:21 pm
Subject: Re: Extracting crawled page's text
Both variables are set to en_US.UTF-8.

From: jhandl <jha...@gmail.com>
Date: Fri, 11 Jul 2008 18:03:33 -0700 (PDT)
Local: Fri, Jul 11 2008 9:03 pm
Subject: Re: Extracting crawled page's text
Nico, we found a bug in the way the crawler handled ISO-8859-1 encoded
pages.
We fixed it, and a new version of Hounder is ready for download at
http://hounder.org/downloads/hounder-1.0-binary_installer.tgz
Once you have downloaded the new version, run "ant jar" and copy
output/hounder-trunk.jar to the lib directory where Hounder is installed.
You will have to re-crawl to get the pages correctly encoded, though.
Hope this fixes the problem.
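
For anyone seeing the same symptom: the garbled word reported earlier is presumably "extraído". In ISO-8859-1 the accented letter is the single byte 0xED, which is not valid UTF-8 by itself, so text handled with the wrong charset shows a replacement character instead. The sketch below is not Hounder's code; it only illustrates the conversion with printf and the standard iconv tool, and the names latin1_word and fixed are chosen for the example:

```shell
# "extraído" as ISO-8859-1 bytes: \355 (octal) is 0xED, the Latin-1 code for í.
latin1_word() {
    printf 'extra\355do'
}

# Convert the Latin-1 bytes to UTF-8 so a UTF-8 terminal displays them correctly.
fixed=$(latin1_word | iconv -f ISO-8859-1 -t UTF-8)
printf '%s\n' "$fixed"    # prints: extraído
```

Printing the output of latin1_word directly on a UTF-8 terminal, without the iconv step, shows the kind of broken character reported above.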

--Jorge
