Can I use "regex-normalize.xml" from nutch in Hounder?


Gustavo Arjones

Aug 3, 2009, 2:36:06 PM
to hounder
Hi,
I'm having problems with session IDs in query strings and with anchors,
which cause crawler loops and many duplicate pages.

Doing some research, I found that Nutch addresses this issue through a
"url normalization" configuration file. The link below contains a sample
of this file.

http://svn.apache.org/viewvc/lucene/nutch/trunk/conf/regex-normalize.xml.template?revision=627890

I would like to know whether Hounder can use a configuration file
like this and, if it can't, whether you could point me in the right
direction to add this feature.

Thanks a lot!
Gustavo Arjones

Jorge Handl

Aug 5, 2009, 5:48:39 PM
to hou...@googlegroups.com
Sure, you can use the Nutch regex-normalize plugin in Hounder.

Gustavo Arjones

Aug 9, 2009, 1:51:03 AM
to hounder
Hi Jorge,

Do I have to configure the crawler to tell it that this file is available?
I checked that my <INSTALL>/hounder/plugins/ directory contains the
urlnormalizer-* folders and files.

I have also created a regex-normalize.xml file at
<INSTALL>/hounder/crawler/conf/ containing the expressions to normalize.

I injected a real-world example from one site I'm crawling: the same URL
repeated with different session ID (sid) parameters.
------- pagedb.seeds ----
http://foro.powers.cl/viewtopic.php?f=1&p=3473271&sid=022f7824e5b8e82f65a85b73dfae30ce
http://foro.powers.cl/viewtopic.php?f=1&p=3473271&sid=042d5bf1459bedca6c1a6bf8471b9e9d
http://foro.powers.cl/viewtopic.php?f=1&p=3473271&sid=066029b66d5e5991f756f7626510e66f
http://foro.powers.cl/viewtopic.php?f=1&p=3473271&sid=068a3e158ed43280e62dc7181376e93b
http://foro.powers.cl/viewtopic.php?f=1&p=3473271&sid=077ab64f64464bbb61d8dbaa6d88b258
http://foro.powers.cl/viewtopic.php?f=1&p=3473271&sid=0ad42f3235077ddd914e8f94a3027317
http://foro.powers.cl/viewtopic.php?f=1&p=3473271&sid=0c02e56add7a3f574c33d270db18b4de
http://foro.powers.cl/viewtopic.php?f=1&p=3473271&sid=0c21f553c1a729b3e5e0c3bc5a9a881e
http://foro.powers.cl/viewtopic.php?f=1&p=3473271&sid=0d56e9837903ecdd6af5ea9ca8c65920
------- pagedb.seeds ----
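
To make the expected result concrete, here is a minimal Python sketch of the kind of rewrite a session-stripping rule should perform on these seeds. The pattern and the name `strip_sid` are assumptions made for illustration; they are not taken from the Nutch template or from Hounder, and Nutch itself applies Java regular expressions read from regex-normalize.xml.

```python
import re

# Hypothetical rule (an assumption, not the Nutch template's pattern):
# match a "sid=<32 hex chars>" query parameter together with the
# separator before it and whatever follows it ("&" or end of string).
_SID = re.compile(r"([?&])sid=[0-9a-fA-F]{32}(&|$)")


def strip_sid(url: str) -> str:
    """Return url with any sid=<32 hex chars> query parameter removed."""

    def fix(match: re.Match) -> str:
        # If more parameters follow, keep the leading "?" or "&" so the
        # rest of the query string stays valid; otherwise drop it too.
        return match.group(1) if match.group(2) == "&" else ""

    return _SID.sub(fix, url)


seeds = [
    "http://foro.powers.cl/viewtopic.php?f=1&p=3473271&sid=022f7824e5b8e82f65a85b73dfae30ce",
    "http://foro.powers.cl/viewtopic.php?f=1&p=3473271&sid=042d5bf1459bedca6c1a6bf8471b9e9d",
]

# Both seeds should collapse to the same URL once the sid is removed.
unique_urls = {strip_sid(u) for u in seeds}
```

With a rule like this in effect, all the seeds above would normalize to a single URL, so seeing several sid variants of one page in the pagedb suggests the normalizer is not being applied.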

After one cycle I checked the pagedb folder and found several repeated URLs
for the same page, differing only in the sid parameter.
So the normalization didn't work at all.

http://foro.powers.cl/viewtopic.php?p=3478850&sid=aa86ac8298a64cf1149e97852886a2f5
http://foro.powers.cl/viewtopic.php?p=3478850&sid=c1395ec4fe370c07779f3cad5ecd6a28
http://foro.powers.cl/viewtopic.php?p=3478850&sid=d5b9768a5187a1d6b242b48f008c89e7
http://foro.powers.cl/viewtopic.php?p=3478850&sid=e6da605307790fc2af7f35c66967363e
http://foro.powers.cl/viewtopic.php?p=3479038&sid=09313b3c9d92de58a201fa4ab986dcd1
http://foro.powers.cl/viewtopic.php?p=3479038&sid=0f39a4be35389013b44b8bcc9949d19c
http://foro.powers.cl/viewtopic.php?p=3479038&sid=20232e00d5f7d513dc65519aded55493
http://foro.powers.cl/viewtopic.php?p=3479038&sid=2ce51b31904d88baeea0888018922cd6

Any thoughts?
Thanks a lot!




Jorge Handl

Aug 9, 2009, 1:55:02 AM
to hou...@googlegroups.com
Gustavo, did you configure the urlnormalizer plugin in conf/nutch-site.xml?