Nico, the quick and dirty way is to extract the text from the index
using the idx script as follows:
cd indexer
idx list indexes/index 0
The best way, though, is to write a crawler module that sends the
parsed text to a file or database, or passes it directly via RPC to
any post-processing you might want to do with it.
Hope this helps.
-- Jorge
On Jul 11, 3:53 pm, Nico <nicolasbottar...@gmail.com> wrote:
> Hi, is there a way to extract from command line the text of the
> crawled pages?
Thanks. I executed that command and obtained the text. Do you know why
there are encoding problems?
I get things like: "producto extra�do desde vertientes naturales"
Do I have to configure something?
Thank you very much for your help!
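A quick way to see how widespread the damage is, sketched in Python. This is only an illustration: it assumes the output of the idx command was saved to a text file, and the names find_damaged_lines and scan_file are made up for this example, not part of Hounder.

```python
# U+FFFD is the Unicode replacement character, displayed as a question-mark
# glyph when a byte sequence could not be decoded.
REPLACEMENT = "\ufffd"


def find_damaged_lines(lines):
    """Return (line_number, text) pairs for lines that contain U+FFFD."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if REPLACEMENT in line
    ]


def scan_file(path):
    """Scan a saved dump (hypothetical file) and return its damaged lines."""
    # errors="replace" keeps the scan going even if the file itself
    # contains bytes that are not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_damaged_lines(handle)
```

If the list comes back empty after a re-crawl, the extracted text no longer contains replacement characters.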
On Jul 11, 4:59 pm, jhandl <jha...@gmail.com> wrote:
> Nico, the quick and dirty way is to extract the text from the index
> using the idx script as follows:
> cd indexer
> idx list indexes/index 0
> The best way, though, is to write a crawler module to extract the
> parsed text to a file or database, or directly via rpc to any post-
> processing you might want to do with it.
> Hope this helps.
> -- Jorge
> On Jul 11, 3:53 pm, Nico <nicolasbottar...@gmail.com> wrote:
> > Hi, is there a way to extract from command line the text of the
> > crawled pages?
> On Jul 11, 5:26 pm, jhandl <jha...@gmail.com> wrote:
> > Can you send me the url of the page?
> > On Jul 11, 5:13 pm, Nico <nicolasbottar...@gmail.com> wrote:
> > > Thanks. I executed that command and obtained the text. Do you know why
> > > there is encoding problems?
> > > I get things like: "producto extra�do desde vertientes naturales"
Nico, we found a bug in the way the crawler treated ISO-8859-1 encoded
pages.
We fixed it and a new version of Hounder is ready for download at
http://hounder.org/downloads/hounder-1.0-binary_installer.tgz
Once you download the new version, just run "ant jar" and copy
output/hounder-trunk.jar to the lib directory where Hounder is installed.
You will have to re-crawl to get the pages correctly encoded though.
Hope this fixes the problem.
--Jorge
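For anyone hitting the same symptom, here is a minimal Python sketch of how this kind of garbling typically arises. It assumes the page bytes were ISO-8859-1 and were decoded as UTF-8 somewhere along the way; the thread does not say what the crawler bug actually was, so treat this as an illustration of the general mechanism and not as a description of Hounder's code.

```python
# The sentence as it presumably appeared on the page.
original = "producto extraído desde vertientes naturales"

# Bytes as an ISO-8859-1 page would serve them: "í" is the single byte 0xED.
raw = original.encode("iso-8859-1")

# Assumed failure mode: decoding those bytes as UTF-8. 0xED is not a valid
# UTF-8 sequence here, so it is replaced with U+FFFD.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the charset the page really uses recovers the text.
right = raw.decode("iso-8859-1")

print(wrong)
print(right)
```

Once the replacement character has been written to the index, the original byte is gone, which is consistent with having to re-crawl to get correctly encoded pages.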
On Jul 11, 6:21 pm, Nico <nicolasbottar...@gmail.com> wrote: