Actually, that document looked like one of those prepared with whatever tool it is that creates three layers for every page. One of those layers is a text-only layer in greyscale with the background already removed (although it is inverted, white on black, which is easily fixed). You can extract those images from the file and keep every third one, which will be the text. I don't know which tool creates PDFs in this format, but it's similar to the way DjVu originally pioneered separating out the background and replacing it with a more compact version. I've seen it in files from both Google Books and archive.org.
In my current project, this was all I found necessary to apply to those extracted layers, basically just removing a little noise:
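# $1 is one extracted text-layer image.
# Keep a copy in memory (MPR:source), close small gaps, then erode
# with that copy as the clip mask to knock out isolated specks.
convert \
  "$1" \
  -write MPR:source \
  -morphology close rectangle:3x4 \
  -clip-mask MPR:source \
  -morphology erode:8 square \
  +clip-mask \
  scan_intermediate.jpg
# Shave 150 px off every edge, then trim whatever near-uniform border is left.
convert scan_intermediate.jpg -shave 150x150 -fuzz 20% -trim +repage "../images/$1"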
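For the extraction step itself (pull the images out, keep every third one, un-invert it), something like the following might do as a starting point. It is only a sketch: it assumes poppler's pdfimages is installed alongside ImageMagick, the function names and the output directory layout are made up for illustration, and which of the three layers is the text one (the second argument) is something you'd have to check by eye on a few pages.

```shell
#!/bin/sh
# Sketch only: assumes poppler's pdfimages and ImageMagick's convert.
# The function names and directory layout are hypothetical.

# Succeeds when the 1-based image number $1 belongs to layer $2 (1, 2 or 3),
# given that the images come out in repeating groups of three per page.
is_nth_layer() {
    [ $(( ($1 - 1) % 3 + 1 )) -eq "$2" ]
}

# extract_text_layers FILE.pdf LAYER OUTDIR
extract_text_layers() {
    pdf=$1
    layer=$2
    outdir=$3
    mkdir -p "$outdir/all" "$outdir/text"
    # Writes img-000.png, img-001.png, ... in document order.
    pdfimages -png "$pdf" "$outdir/all/img"
    i=0
    # Note: the glob sorts by name, so the order goes wrong past 999 images.
    for f in "$outdir"/all/img-*.png; do
        i=$((i + 1))
        if is_nth_layer "$i" "$layer"; then
            # -negate turns white-on-black into black-on-white.
            convert "$f" -negate "$outdir/text/$(basename "$f")"
        fi
    done
}
```

You'd call it as, say, extract_text_layers book.pdf 3 layers (file and directory names are placeholders), look at a few results in layers/text, and change the 3 to 1 or 2 if you got backgrounds instead of text.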
By the way, while I'm posting: here are some gotchas to look out for, which I've come across myself recently when OCRing and proofreading similar 18th and 19th C documents. Some of them were due to the typesetter substituting whatever was available for a less common character:
- The actual letter 'f' substituted for the long medial s.
- 'y' substituted for thorn: the old-style thorn that looks like a y or a gamma, not the Unicode þ that looks somewhat like a p, b or beta. Example: "for the using þe way of witchcraft of moudiwart's feet upon him in his purse given to him þe Satan for the cause that sa lang as he had them upon him he sould never want siller." This is why 'the' is frequently erroneously rendered (and mispronounced) as 'ye'.
- An apostrophe used in Scottish names like M`Donald in place of a superscript 'c'.
- Ligatures that you don't see much nowadays (eg ct), and more frequent use of ones you do (eg Æneas).
- Much more frequent use of superscripts for the final letters of an abbreviated word, where nowadays we'd use an apostrophe to mark the missing letters.
- u for v and vice versa.
- Qu for W.
- Thin spaces before some punctuation, caused by mechanical issues with the type, eg ' ;' which should be OCR'd as just ';'.
- The old-style '&' which looks more like the letters "Et".
- Accents that you might not be expecting and might dismiss as bad OCR, eg "We hairtlie thank thé Hevinlie Father".
- Vulgar fractions set with a horizontal bar. Unicode has no separate characters for these: its precomposed fractions (½ etc) are typically drawn with a diagonal bar, so the horizontal form can't be preserved in plain text.
- The old letter yogh, which is written with a descender and often rendered as (and similarly mispronounced as) 'z', as in the surname 'Menzies', which is pronounced 'meengis'. The name of American jazz musician Charlie Mingus actually preserves the pronunciation, but not the spelling, of the original name Menzies.
Few of these are caught by tesseract, so they will require manual proofreading.
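A couple of these can at least be tidied or flagged mechanically before the manual pass. This is a rough sketch, not something from my actual workflow: the function names are made up, it only handles ordinary ASCII spaces, and the 'ye' search will also turn up genuine uses of the pronoun, so treat its output as a list of places to look rather than a list of errors.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers.

# Remove stray ASCII spaces before ; : ! ? in OCR text read from stdin.
fix_punct_spacing() {
    sed -E 's/ +([;:!?])/\1/g'
}

# Print numbered lines containing the standalone word 'ye', which may be
# a thorn-based 'the' that needs a human decision.
list_ye_candidates() {
    grep -nw 'ye'
}
```

For example, fix_punct_spacing < page.txt > page_clean.txt followed by list_ye_candidates < page_clean.txt, where page.txt stands in for whatever your OCR output file is called.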
Good luck with your project.
Graham