to DigitalPebble
Quite a few contributions from DigitalPebble have recently been
committed to Nutch:
NUTCH-754 : Use GenericOptionsParser instead of FileSystem.parseArgs()
NUTCH-679 : Fetcher2 implementing Tool
NUTCH-731 : Redirection of robots.txt in RobotRulesParser
NUTCH-702 : Lazy Instanciation of Metadata in CrawlDatum
NUTCH-756 : CrawlDatum.set() does not reset Metadata if it is null
The last two (NUTCH-702 and NUTCH-756) affect the performance of Nutch
once the crawlDB grows large. More patches and contributions are
waiting to be reviewed and committed, in particular:
NUTCH-753 : Prevent new Fetcher to retrieve the robots twice
NUTCH-719 : fetchQueues.totalSize incorrect in Fetcher2
NUTCH-712 : ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
NUTCH-692 : AlreadyBeingCreatedException with Hadoop 0.19
NUTCH-658 : Add Counter for # of doc fetched in Reporter
NUTCH-655 : Injecting Crawl metadata
NUTCH-719 also affects the performance of the fetchers, as it
keeps them from getting stuck waiting on a timeout.
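For anyone unfamiliar with the idea behind NUTCH-702 and NUTCH-756, here is a rough sketch of lazy instantiation of a metadata map, plus a set() that resets the metadata when the source record has none. This is not the actual CrawlDatum code: LazyDatum, putMeta, getMeta and hasMetadata are hypothetical names made up for illustration, and a plain HashMap stands in for whatever the real class uses.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a record with optional metadata (NOT Nutch's CrawlDatum).
class LazyDatum {
    // Stays null until the first write, so records without metadata allocate no map.
    private Map<String, String> metadata;

    boolean hasMetadata() {
        return metadata != null && !metadata.isEmpty();
    }

    void putMeta(String key, String value) {
        if (metadata == null) {
            metadata = new HashMap<>(); // created only when actually needed
        }
        metadata.put(key, value);
    }

    String getMeta(String key) {
        return metadata == null ? null : metadata.get(key);
    }

    // Copies state from another record. If the source has no metadata,
    // ours is reset as well, so stale entries do not survive a reused object.
    void set(LazyDatum other) {
        metadata = (other.metadata == null) ? null : new HashMap<>(other.metadata);
    }
}

class LazyDatumDemo {
    public static void main(String[] args) {
        LazyDatum reused = new LazyDatum();
        reused.putMeta("source", "inject");

        // Reusing the object for a record that carries no metadata.
        reused.set(new LazyDatum());
        System.out.println(reused.hasMetadata()); // prints "false"
    }
}
```

With millions of entries in a crawlDB, most of which carry no metadata, skipping the map allocation per record is where the saving would come from in a scheme like this.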
Thanks to Andrzej and Dogacan for taking the time to review the
patches. Nutch is a great project to contribute to.