On Jul 11, 6:21 pm, Nico <nicolasbottar...@gmail.com> wrote:
> Both variables are set to en_US.UTF-8.
>
> On Jul 11, 6:09 pm, jhandl <jha...@gmail.com> wrote:
>
> > Nico, make sure you have the LANG and LC_ALL environment variables set
> > to "en_US.UTF-8".
>
> > -- Jorge
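
A quick way to confirm the settings Jorge mentions is to inspect both variables from the same environment that runs the indexer. The sketch below is only an illustration; the function name `utf8_locale_problems` is made up here and is not part of the crawler.

```python
import os


def utf8_locale_problems(env=None):
    """Return the names of LANG / LC_ALL that are unset or not UTF-8."""
    env = os.environ if env is None else env
    problems = []
    for name in ("LANG", "LC_ALL"):
        value = env.get(name, "")
        # Locale names look like "en_US.UTF-8" or "en_US.UTF-8@modifier";
        # the part between the dot and any "@" is the codeset.
        codeset = value.partition(".")[2].partition("@")[0]
        if codeset.lower().replace("-", "") != "utf8":
            problems.append(name)
    return problems


if __name__ == "__main__":
    bad = utf8_locale_problems()
    print("OK" if not bad else "Not UTF-8: " + ", ".join(bad))
```

An empty list means both variables name a UTF-8 locale. Note that this checks only the variables themselves, not whether the locale is actually installed on the machine.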
>
> > On Jul 11, 6:01 pm, Nico <nicolasbottar...@gmail.com> wrote:
>
> > > Of course, the URL is:
> > > http://blogsearch.google.com/blogsearch?as_q=coca+cola+dasani&num=100...
>
> > > On Jul 11, 5:26 pm, jhandl <jha...@gmail.com> wrote:
>
> > > > Can you send me the URL of the page?
>
> > > > On Jul 11, 5:13 pm, Nico <nicolasbottar...@gmail.com> wrote:
>
> > > > > Thanks. I ran that command and got the text. Do you know why
> > > > > there are encoding problems?
> > > > > I get things like: "producto extra�do desde vertientes naturales"
>
> > > > > Do I have to configure something?
>
> > > > > Thank you very much for your help!
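
The garbled character in the sample above is U+FFFD, the Unicode replacement character. One common way to end up with it, assumed here since the thread does not confirm the cause, is that the page bytes are ISO-8859-1 (Latin-1) but get decoded as UTF-8. The following sketch reproduces that effect with the word "extraído", which is presumably what the original page contained:

```python
def decode_as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for invalid sequences."""
    return raw.decode("utf-8", errors="replace")


original = "extraído"
latin1_bytes = original.encode("latin-1")  # the "í" becomes the single byte 0xED
utf8_bytes = original.encode("utf-8")      # the "í" becomes the two bytes 0xC3 0xAD

# 0xED followed by "d" is not a valid UTF-8 sequence, so it is replaced.
garbled = decode_as_utf8(latin1_bytes)
# Bytes that really are UTF-8 round-trip without loss.
intact = decode_as_utf8(utf8_bytes)

print(repr(garbled))
print(repr(intact))
```

If this is what is happening, the fix lies in decoding the page with its real charset, not in the terminal settings.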
>
> > > > > On Jul 11, 4:59 pm, jhandl <jha...@gmail.com> wrote:
>
> > > > > > Nico, the quick and dirty way is to extract the text from the index
> > > > > > using the idx script as follows:
>
> > > > > > cd indexer
> > > > > > idx list indexes/index 0
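
If the text is needed in a file, the output of the command can be captured and decoded explicitly as UTF-8 instead of relying on the terminal's locale. This is a minimal sketch; `run_and_capture` and the output filename are hypothetical, and it assumes `idx` writes the text to standard output, as it appears to in this thread.

```python
import subprocess


def run_and_capture(command, out_path):
    """Run a command and save its standard output to out_path as UTF-8 text."""
    result = subprocess.run(command, stdout=subprocess.PIPE, check=True)
    # Decode explicitly rather than trusting the locale; invalid bytes
    # become U+FFFD instead of raising an error.
    text = result.stdout.decode("utf-8", errors="replace")
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(text)
    return text


# Hypothetical usage, run from the indexer directory:
# run_and_capture(["idx", "list", "indexes/index", "0"], "crawled_text.txt")
```

Because `check=True` is set, a non-zero exit status from the command raises `subprocess.CalledProcessError` rather than silently writing an empty file.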
>
> > > > > > The best way, though, is to write a crawler module that sends the
> > > > > > parsed text to a file or database, or passes it directly via RPC
> > > > > > to whatever post-processing you want to do with it.
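
As a rough illustration of the module approach, the sketch below writes each page's parsed text to its own UTF-8 file. It does not use the crawler's real module interface, which is not shown in this thread; the class name `TextDumpModule` and its `process(url, text)` method are hypothetical stand-ins for whatever hook the crawler provides.

```python
import hashlib
import os


class TextDumpModule:
    """Hypothetical crawler hook that saves each page's parsed text to a file."""

    def __init__(self, out_dir):
        self.out_dir = out_dir
        os.makedirs(out_dir, exist_ok=True)

    def process(self, url, text):
        # Name the file after a hash of the URL so that any URL maps to a
        # filesystem-safe, fixed-length filename.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".txt"
        path = os.path.join(self.out_dir, name)
        # Write UTF-8 explicitly so accented characters survive regardless
        # of the process locale. The first line records the source URL.
        with open(path, "w", encoding="utf-8") as out:
            out.write(url + "\n" + text)
        return path
```

Writing one file per URL keeps the example simple; a database or an RPC call could replace the file write without changing the rest.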
>
> > > > > > Hope this helps.
>
> > > > > > -- Jorge
>
> > > > > > On Jul 11, 3:53 pm, Nico <nicolasbottar...@gmail.com> wrote:
>
> > > > > > > Hi, is there a way to extract the text of the crawled pages
> > > > > > > from the command line?
>
> > > > > > > Thanks
>
> > > > > > > Nicolas Bottarini