Hey,
I'm considering a project where I crawl a certain academic blog for the sake of digital preservation.
Frontera looks interesting. Is it recommended, or are there other tools I should consider? How does it compare to Apache Nutch?
I am interested in a library that provides a sophisticated crawling strategy out of the box, so I don't have to reinvent the wheel. Ideally it would build a model of the site's structure as it crawls, so it can infer where to crawl next.
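To make concrete the kind of strategy I mean, here is a toy sketch (this is purely illustrative, not Frontera's or Nutch's actual API; the `Frontier` class and its scoring heuristic are hypothetical): a frontier that prioritizes URLs by how many links it has seen into each section of the site, a crude stand-in for "a model of the site structure".

```python
import heapq
from urllib.parse import urlparse

class Frontier:
    """Hypothetical priority frontier: URLs from frequently-linked
    site sections are crawled first."""

    def __init__(self):
        self._heap = []          # (-score, url): min-heap used as max-heap
        self._seen = set()
        self._path_counts = {}   # first path segment -> links seen into it

    def _segment(self, url):
        # Use the first path segment as a crude "section" of the site.
        return urlparse(url).path.strip("/").split("/")[0]

    def _score(self, url):
        # Prefer sections the crawl keeps discovering links to.
        return self._path_counts.get(self._segment(url), 0)

    def add(self, url):
        if url in self._seen:
            return
        self._seen.add(url)
        seg = self._segment(url)
        self._path_counts[seg] = self._path_counts.get(seg, 0) + 1
        heapq.heappush(self._heap, (-self._score(url), url))

    def next_url(self):
        # Return the highest-priority URL, or None when the frontier is empty.
        return heapq.heappop(self._heap)[1] if self._heap else None
```

A real strategy would of course rescore entries as the structural model changes rather than freezing the score at insertion time, which is exactly the kind of machinery I'd rather get from a library than build myself.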
Thanks,
Julius