ABBYY FineReader VII XIX For Fraktur (for OCR Old Rare Books .rar

Gracia Bradshaw
Aug 18, 2024, 4:00:00 PM
to croswaiwade

This overview will not cover commercially available OCR to any great extent, but will instead survey the currently available open-source software and explain how to use it. There are many commercial options, and it would be difficult to do them justice since they are not the main concern of this guide. Additionally, we prefer open-source solutions because proprietary software is not always easily available, and we find that open-source software is best for running batch OCR on the scale required for our purposes. For those interested in using commercial OCR software, ABBYY FineReader is a good place to start. But even with commercial software, one's mileage may vary significantly depending on the language of the texts one is confronting, the condition of the books one hopes to convert, and the font in which they are printed.

Because I am a Mac/Linux user, the directions for installing and running the software from the command line interface (CLI) are given in the Unix-style form that works on both Mac and Linux. Windows users who would like to know more about the command line can find relevant information in the Programming Historian tutorial.


The first thing one needs to ensure is that there is only one page per image file. If this is not the case, one will need to install the program ImageMagick. ImageMagick is a pretty nifty and powerful tool for manipulating image files and converting them into different formats in batches; it automates these processes and prevents a lot of unnecessary clicking. ImageMagick has compatible packages for all three major operating systems, and more information about that (as well as explanations of the various commands) can be found at the documentation website.
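As a rough illustration of the one-page-per-image requirement, the sketch below builds an ImageMagick command that cuts a two-page spread into a left and a right half. It assumes the spread is split evenly down the middle and that the ImageMagick executable is called convert (newer installations may call it magick instead). The function name and the file names are hypothetical.

```python
from pathlib import Path


def split_spread_command(image, out_dir):
    """Build an ImageMagick command that cuts a two-page spread in half.

    Assumption: the gutter sits exactly in the middle of the image.
    """
    image, out_dir = Path(image), Path(out_dir)
    # ImageMagick replaces %d with 0 (left half) and 1 (right half).
    target = out_dir / f"{image.stem}_%d{image.suffix}"
    # -crop 50%x100% tiles the image into two halves; +repage resets
    # the page offsets so each half is a standalone image.
    return ["convert", str(image), "-crop", "50%x100%", "+repage", str(target)]


if __name__ == "__main__":
    # Print the command rather than run it, so nothing is changed by accident.
    print(" ".join(split_spread_command("scans/0001.png", "pages")))
```

To actually run the command, one could pass the returned list to subprocess.run(..., check=True) once the output folder exists.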

Type the installation command into the terminal exactly as written (the terminal is space- and case-sensitive). Enter the account password for the computer admin when prompted to do so, and the installation will begin to run. When ImageMagick indicates how much space it is going to use, type 'Y' to confirm, and the program will install. It could very well be that ImageMagick is already there, especially if the computer is running Linux, since it is bundled with many Linux distributions; if so, the terminal will update the software if an update is available.

OCR software is error-prone. For most use-cases other than the most error-tolerant forms of text-mining, it is necessary to clean up the output of an OCR routine. If interested in comparing the OCR results to a dictionary in order to assist in clean-up, one might want to install Enchant. Enchant is a spell-checking program that can run from the command line. If the computer is running a recent version of Linux, Enchant may already be installed. To check, type its name at the command line; if the shell reports that the command is not found, it needs to be installed.
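The idea of comparing OCR output to a dictionary can be sketched without Enchant itself. The following is only a stand-in that uses a plain Python set as the word list. The function name and the sample words are hypothetical, and the word list is assumed to be lowercase.

```python
import re


def unknown_words(text, dictionary):
    """Return the words in text that are missing from dictionary.

    Words are reported once each, in order of first appearance.
    Assumption: dictionary is a set of lowercase words.
    """
    found = []
    # [^\W\d_]+ matches runs of letters only, including umlauts and ß.
    for word in re.findall(r"[^\W\d_]+", text):
        if word.lower() not in dictionary and word not in found:
            found.append(word)
    return found


if __name__ == "__main__":
    words = {"the", "quick", "brown", "fox"}
    print(unknown_words("The qnick brown fox", words))
```

A list like this only points to candidates for correction; proper names and archaic spellings will be flagged as well and need a human eye.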

If trying to OCR a language other than English or a particular kind of font, one may have to experiment or see if Tesseract or OCRopus has made additional language/font packages available. Many languages are accommodated in the standard installation, but those additional packages that were developed later must be installed as add-ons. Both programs allow users to create their own training data and to make their own packages for reading text files. This feature is advantageous if users need something more specific or find that the packages, as distributed, are not providing good results. Results from such customizations will certainly vary here.

The earliest versions of Tesseract were in development in the late 1980s and early 1990s. The project lay dormant for a time, but since 2006 Google has been maintaining it. More information and documentation are available on the project's pages.

Unlike Tesseract, OCRopus is officially available only for Linux users. In my experience, a straightforward workaround for Mac users wishing to install OCRopus is to install an Ubuntu Linux partition on the Mac (also known as dual-booting; it would also be possible to run Linux in a virtual machine). The installation is not very difficult, and the operating system is actually quite lovely if one prefers things on the simple side. Instructions may be found on the Ubuntu website, whose download page has options and instructions for Windows and Mac, including booting from a USB drive.

OCRopus is not so much a single software package as it is a compilation of various scripts that work together from the command line interface. The installation is therefore slightly more involved than that for Tesseract. The following works for Linux only.

The next step is to install the Python packages that OCRopus needs in order to run (OCRopus is written in the Python programming language). To do this, type the following into the command line:

One can install other packages for other languages and fonts, though there are not as many separate ones available for the latest version of OCRopus just yet. The Fraktur package (for many pre-WWII German texts) works a bit differently from the default English one. It is stored at a different web address and must then be moved to a different folder than the English package.

Those contemplating an OCR project for the first time sometimes forget what is obvious to even moderately experienced OCR users: Successful OCR requires uniform, rectilinear, high-quality scans. Securing such scans will save many headaches at future stages of the workflow. If running OCR on large batches of files (for example, multiple multi-volume adaptations of Jane Eyre in German), it is desirable to automate the workflow as much as possible, and having good scans that do not require a lot of curating can go a long way toward making that happen. This is not to say that a project is doomed if the only images available do not meet the criteria listed above; it simply means there will be a little more work to do in step two.

Much of the scanning depends, of course, on what kinds of books need to be scanned and what resources are available for accomplishing this task. Most books will probably fit easily on most basic face-down scanners. For dealing with oversized folios that will not fit on the average scanner, one would be better served by overhead scanners or other special scanning setups (sometimes available in libraries); these types of scanners either allow the book to lie face-up or use glass panes to press the pages flat to prevent page distortions. Flattening is important for getting good scans, but can be difficult if the books either have a very tight binding or are older and therefore more fragile. This problem is particularly pertinent when dealing with rare materials and does require negotiating the relation between the needs of digital preservation and those of physical preservation. One of the many reasons face-up scanning solutions are often preferable to traditional scanners is the fact that this form of scanning is usually less damaging to the book in the long run.

Scans that are skewed or otherwise imperfect may require some alterations before they can be read by the OCR software. Below are a few tips for making some basic alterations to the files by means of a batch process.

Important Note: It is good practice to make a copy of the original images, and work only with the copies, leaving the originals intact in case something goes wrong. Nobody wants to scan the same book twice. That is why the following commands are designed to make copies of the original files with the same file names that are then directed to go into a new folder. Before each change to the images, then, one should make a new folder for the files that will be created. One can of course do this from a graphical file manager, but one can also do it from the CLI with the command mkdir [name_of_folder]. If doing it from the CLI, just make sure to navigate to the proper directory where the folder should be added.
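The copy-first habit described in the note can be sketched in Python as well as with mkdir. The function below creates a new working folder and copies the images into it under their original names, leaving the originals untouched. The function name, the folder names, and the *.png pattern are hypothetical.

```python
import shutil
from pathlib import Path


def copy_to_working_folder(source_dir, working_dir, pattern="*.png"):
    """Copy matching images into a new folder, keeping the file names.

    The originals in source_dir are only read, never modified.
    """
    source_dir, working_dir = Path(source_dir), Path(working_dir)
    # Equivalent to mkdir on the CLI; no error if the folder already exists.
    working_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for original in sorted(source_dir.glob(pattern)):
        # copy2 also preserves timestamps where the platform allows it.
        destination = shutil.copy2(original, working_dir / original.name)
        copied.append(Path(destination))
    return copied


if __name__ == "__main__":
    for path in copy_to_working_folder("originals", "step1_deskewed"):
        print(path)
```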

It could be the case that there is no physical copy of the text available and one may be reliant upon a preservation site or database where the only format available is a PDF. PDFs are quite challenging because there can be so many variables in play (quality of original scan, quality of PDF, how the pages were scanned, etc.). They are perhaps less troublesome if the language is English and the font is a traditional modern font, but even PDFs with those attributes can be tricky. In short, I do not recommend working from them, but when this is the only choice, there are a few things one can do to ensure better quality OCR output.

The main objective is to split the PDF into individual page files while retaining enough image quality that the pages can be read by the OCR software. This is more difficult if the PDF contains two-page spreads as opposed to individual pages.
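One possible way to do the splitting is with ImageMagick, sketched below. The sketch assumes that ImageMagick can read PDFs on the machine in question (this normally requires Ghostscript) and that 300 dpi is enough for the OCR software; both the resolution and the file names are assumptions, and the function name is hypothetical.

```python
from pathlib import Path


def pdf_to_pages_command(pdf, out_dir, dpi=300):
    """Build an ImageMagick command that writes one PNG per PDF page."""
    pdf, out_dir = Path(pdf), Path(out_dir)
    # %04d becomes a zero-padded page counter: page_0000.png, page_0001.png, ...
    target = out_dir / "page_%04d.png"
    # -density must come before the input file so that it controls the
    # resolution at which the PDF pages are rasterized.
    return ["convert", "-density", str(dpi), str(pdf), str(target)]


if __name__ == "__main__":
    print(" ".join(pdf_to_pages_command("book.pdf", "pages")))
```

If the resulting pages turn out to be two-page spreads, they still need to be cut in half in a separate step.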

Some lower-quality images may not meet the minimum page scale of 12 that the software requires. The argument in brackets (the brackets themselves are omitted when actually running the command) tells the script to lower the minimum page scale.
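The optional argument can be sketched as follows. This assumes that the script in question is OCRopus's page segmenter, ocropus-gpageseg, and that the argument is spelled --minscale; neither name is confirmed by the text above, and the value 8 is purely illustrative.

```python
def page_segmentation_command(pages, minscale=None):
    """Build a page-segmentation command, optionally lowering the minimum scale.

    Assumptions: the script is ocropus-gpageseg and the flag is --minscale.
    """
    command = ["ocropus-gpageseg"]
    if minscale is not None:
        # Only pass the flag when the default minimum of 12 is too strict.
        command += ["--minscale", str(minscale)]
    return command + list(pages)


if __name__ == "__main__":
    print(" ".join(page_segmentation_command(["book/0001.bin.png"], minscale=8)))
```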

A step that has been recently added to the workflow is to go back and check for any strange lines that have either a lot of whitespace where the line was cut incorrectly or where multiple lines are cut into one line. These lines will not register properly in the OCR, and entire lines (or more than one line) can be lost at once. I went back and typed in some transcriptions by hand. This may not be necessary, depending on the overall goal of the project, but it is good to be aware that it can happen nonetheless.
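One way to find such lines without opening every file is to compare each line image's height to the typical height on the page. The sketch below is only a heuristic of my own, not part of OCRopus: it assumes the heights (in pixels) have already been collected into a dictionary, and the thresholds and file names are hypothetical.

```python
from statistics import median


def suspicious_lines(heights, low=0.5, high=1.8):
    """Return the names of line images with an unusual height.

    heights maps a file name to the image height in pixels. A line much
    taller than the median may contain several merged lines; a much
    shorter one may be a fragment of a badly cut line.
    """
    if not heights:
        return []
    typical = median(heights.values())
    return sorted(
        name
        for name, height in heights.items()
        if height < low * typical or height > high * typical
    )


if __name__ == "__main__":
    sample = {"010001.png": 40, "010002.png": 42, "010003.png": 95}
    print(suspicious_lines(sample))
```

The flagged files still have to be inspected by eye; the check narrows down where to look and does nothing more.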

The above searches for lines that end in letters with a hyphen and then reconnects the first word in the next line to the part of the word that precedes the hyphen. For a full tutorial of regular expressions refer to the Programming Historian. Of course, if there is a word that actually needs to be hyphenated at the end of a line, it will also be changed, but the occurrences of that are usually fewer than the appearances of hyphenated words.
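A hedged sketch of this kind of hyphen repair in Python follows. It is an illustration of the behaviour described, not the expression from the original workflow, and the function name is hypothetical.

```python
import re

# A word fragment, a hyphen at the end of the line, then the first
# chunk of the next line (including any trailing punctuation).
HYPHEN_BREAK = re.compile(r"(\w+)-\n(\S+)[ \t]*\n?")


def dehyphenate(text):
    """Rejoin words that were split by a hyphen at the end of a line.

    The rest of the word is pulled up from the next line, and the line
    break is placed after the rejoined word.
    """
    return HYPHEN_BREAK.sub(r"\1\2\n", text)


if __name__ == "__main__":
    print(dehyphenate("Ver-\nbindung ist gut"))
```

As noted above, a word that genuinely carries a hyphen at the end of a line is joined as well, so the result is worth spot-checking.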

The punctuation can also be quite a mess. Quotation marks are often mistaken for commas and apostrophes. This is not extremely problematic for the kinds of text mining that strip out punctuation anyway. But one can use the find and replace function to eliminate unnecessary or confusing punctuation if concerned about text file readability.
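For text mining that discards punctuation anyway, the find-and-replace step can be approximated in a single pass. The sketch below removes every character that is neither a letter, digit, underscore, nor whitespace; it is a blunt instrument and assumes that no punctuation needs to be kept. The function name is hypothetical.

```python
import re


def strip_punctuation(text):
    """Remove all punctuation, keeping letters, digits, and whitespace."""
    # [^\w\s] matches anything that is not a word character or whitespace,
    # which covers commas, apostrophes, and misread quotation marks.
    return re.sub(r"[^\w\s]", "", text)


if __name__ == "__main__":
    print(strip_punctuation(",,Guten Tag,' sagte er."))
```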
