Recent contributions to NUTCH

julien nioche
Oct 14, 2009, 6:46:29 AM
to DigitalPebble
Quite a few contributions from DigitalPebble have recently been
committed to Nutch.

NUTCH-754 : Use GenericOptionsParser instead of FileSystem.parseArgs()
NUTCH-679 : Fetcher2 implementing Tool
NUTCH-731 : Redirection of robots.txt in RobotRulesParser
NUTCH-702 : Lazy Instantiation of Metadata in CrawlDatum
NUTCH-756 : CrawlDatum.set() does not reset Metadata if it is null
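
To give a rough idea of what NUTCH-702 and NUTCH-756 are about, here is
a minimal sketch of lazy instantiation. It is not the Nutch code:
LazyDatum, putMeta, getMeta and hasMetadataAllocated are hypothetical
names, and a plain HashMap stands in for the real metadata container.
The map is only allocated when something is stored in it, and set()
clears it when the source record has none.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a crawl record; NOT Nutch's actual CrawlDatum.
final class LazyDatum {
    private long fetchTime;

    // Left null until a value is actually stored, so records without
    // metadata never pay for an empty map.
    private Map<String, String> metadata;

    void setFetchTime(long fetchTime) {
        this.fetchTime = fetchTime;
    }

    long getFetchTime() {
        return fetchTime;
    }

    void putMeta(String key, String value) {
        if (metadata == null) {
            metadata = new HashMap<>(); // allocate on first write only
        }
        metadata.put(key, value);
    }

    String getMeta(String key) {
        return metadata == null ? null : metadata.get(key);
    }

    boolean hasMetadataAllocated() {
        return metadata != null;
    }

    // Copies the state of another record into this one. When the other
    // record has no metadata, ours is reset as well, so stale entries
    // do not survive when an instance is reused.
    void set(LazyDatum other) {
        this.fetchTime = other.fetchTime;
        if (other.metadata == null) {
            this.metadata = null;
        } else {
            this.metadata = new HashMap<>(other.metadata);
        }
    }
}
```

With a large crawlDB most records carry little or no metadata, so
skipping the allocation saves one object per record in this sketch.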

The last two have an impact on the performance of Nutch when the
crawlDB gets large. More patches and contributions are waiting to be
reviewed and committed, in particular:

NUTCH-753 Prevent new Fetcher to retrieve the robots twice
NUTCH-719 fetchQueues.totalSize incorrect in Fetcher2
NUTCH-712 ParseOutputFormat should catch
java.net.MalformedURLException coming from normalizers
NUTCH-692 AlreadyBeingCreatedException with Hadoop 0.19
NUTCH-658 Add Counter for # of doc fetched in Reporter
NUTCH-655 Injecting Crawl metadata

NUTCH-719 has an impact on the performance of the fetchers, as it
prevents them from getting stuck waiting for a timeout.

Thanks to Andrzej and Dogacan for taking the time to review the
patches. Nutch is a great project to contribute to.