URL Normalization / Canonicalization


Avi Hayun

Jun 14, 2015, 4:56:35 AM
to crawler...@googlegroups.com
Hi,

When crawling, URL normalization is quite important, because you don't want to crawl the same page two or more times just because its URL is written in different ways.
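
To make the idea concrete, here is a minimal sketch of the kind of rules a normalizer typically applies: lowercase the scheme and host, drop the default port, resolve "." and ".." path segments, use "/" for an empty path, and strip the fragment. The class name `UrlNormalizerSketch` and the exact rule set are assumptions chosen for illustration; this is not the implementation used by Nutch, Heritrix, crawler4j, or crawler-commons.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical illustration only; real crawlers apply more rules than this.
final class UrlNormalizerSketch {

    static String normalize(String url) {
        URI uri;
        try {
            // URI.normalize() removes "." segments and resolves ".." where possible.
            uri = new URI(url.trim()).normalize();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Expected an absolute URL with a host: " + url);
        }

        // Scheme and host are case-insensitive, so lowercase them.
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // Drop the port when it is the default for the scheme.
        int port = uri.getPort();
        boolean defaultPort = (port == 80 && scheme.equals("http"))
                || (port == 443 && scheme.equals("https"));

        // Keep the raw (still percent-encoded) path; an empty path becomes "/".
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(host);
        if (port != -1 && !defaultPort) {
            sb.append(':').append(port);
        }
        sb.append(path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        // The fragment and any user info are intentionally left out in this sketch.
        return sb.toString();
    }
}
```

With these rules, `HTTP://Example.COM:80/a/./b/../c?x=1#frag` and `http://example.com/a/c?x=1` map to the same string, so a crawler that keys its "already seen" set on the normalized form would fetch the page only once. Which rules are safe to apply (for example, sorting query parameters or removing session IDs) is a design choice that differs between crawlers.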



Upon looking at the source code of Nutch, Heritrix & crawler4j it seems that all of them have implemented their own URL Normalizer / Canonicalizer.



Do you think that a URL normalizer is a good candidate for crawler-commons?

Julien Nioche

Jun 15, 2015, 7:24:05 AM
to crawler...@googlegroups.com


Avi Hayun

Jun 15, 2015, 7:39:33 AM
to crawler...@googlegroups.com
Thanks Julien,

Can you point me to the BasicUrlNormalizer class on GitHub? I can't seem to find it (I get stuck here: https://github.com/apache/nutch/tree/trunk/src/java/org/apache/nutch/net ).





Julien Nioche

Jun 15, 2015, 8:15:18 AM
to crawler...@googlegroups.com

Avi Hayun

Jun 15, 2015, 8:22:42 AM
to crawler...@googlegroups.com
Nice


I will add it to the issue as a candidate for an initial URL normalizer.

Avi Hayun

Jun 15, 2015, 9:08:20 AM
to crawler...@googlegroups.com
Looking at the thread on storm-crawler (https://github.com/DigitalPebble/storm-crawler/issues/120), I will take it as a +1 vote for adding the URL normalizer from Ken and Julien, which answers my initial question.


Any further discussion of this topic should take place on issue #74.