Hi,
I've drafted a translator for the British Newspaper Archive (BNA), the largest database of British newspapers.
BNA is a subscription service (free in the British Library), but I think we can still test this usefully:
- BNA search is free and widely used by scholars without a subscription, so I have written the translator to scrape the search results page directly rather than query the linked newspaper scan pages. Consequently, scraping multiple items from search results can be tested without restriction.
- Any account can view three newspaper scan pages free of charge, and can view those same three pages an unlimited number of times (they can be bookmarked in your account for easy retrieval). A tester can therefore choose three articles and run the single-item scraper against those pages as often as needed.
Limitations of the translator:
- For my own use I have written code to attach a PDF of the page scan, but I omitted it from this draft as it may be considered superfluous. I can include it in the submitted version if desired.
- The OCR in the underlying database is poor and often hard to read. This includes the article 'titles', which are generally raw OCR and often do not match the actual article titles.
- Article titles generally appear in BLOCK CAPS on the search results page. The translator forces titles into Title Case when scraping multiple items, as I thought this was probably the least bad option. This is not necessary on the newspaper scan pages, since the BNA website already uses title case for the page titles of newspaper scans. Because most users don't have subscriptions, though, we can't follow the links with ZU.processDocuments to pick up the BNA's own title-cased versions. Instead the translator has to do the conversion itself.
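For reference, the Title Case conversion mentioned in the last point could look roughly like the sketch below. This is an illustration only, not the code in the draft: the helper name `toTitleCase` and the `SMALL_WORDS` list are made up for the example, and it makes no attempt to repair OCR errors or to handle words that begin with punctuation.

```javascript
// Hypothetical helper for illustration; not taken from the draft translator.
// Words kept in lower case unless they are the first word of the title.
const SMALL_WORDS = new Set([
  "a", "an", "and", "as", "at", "by", "for", "in", "of", "on", "or", "the", "to"
]);

function toTitleCase(rawTitle) {
  // Normalise whitespace and case first, since search results arrive in BLOCK CAPS.
  const words = rawTitle.trim().toLowerCase().split(/\s+/);
  return words
    .map((word, index) => {
      // Leave short function words in lower case, except at the start.
      if (index > 0 && SMALL_WORDS.has(word)) {
        return word;
      }
      // Capitalise the first character of every other word.
      return word.charAt(0).toUpperCase() + word.slice(1);
    })
    .join(" ");
}

console.log(toTitleCase("THE WAR IN THE CRIMEA"));
```

The small-words list is a design choice: without it every word is capitalised, which is simpler but less conventional for English headlines.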
Thanks,
Emma