I've been using Tesseract to convert documents into text. The quality of the documents varies wildly, and I'm looking for tips on what sort of image processing might improve the results. I've noticed that text that is highly pixelated - for example, the output of fax machines - is especially difficult for Tesseract to process - presumably all those jagged character edges confound the shape-recognition algorithms.
What sort of image processing techniques would improve the accuracy? I've been using a Gaussian blur to smooth out the pixelated images and have seen some small improvement, but I'm hoping there is a more specific technique that would yield better results: say, a filter tuned to black-and-white images that would smooth out irregular edges, followed by a filter that would increase the contrast to make the characters more distinct.
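The pipeline I have in mind - smooth, then stretch the contrast, then binarise - can be sketched in plain NumPy. This is only an illustration of the idea; the function names, the 3x3 box kernel, and the fixed threshold are my own choices, not anything Tesseract-specific:

```python
import numpy as np

def box_blur(img, k=3):
    """Smooth jagged edges with a separable k x k box filter."""
    pad = k // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    kernel = np.ones(k) / k
    # horizontal pass, then vertical pass
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    out = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, out)
    return out

def stretch_contrast(img):
    """Linearly rescale intensities to the full 0-255 range."""
    lo, hi = img.min(), img.max()
    if hi == lo:
        return np.zeros_like(img)
    return (img - lo) * 255.0 / (hi - lo)

def binarize(img, thresh=128):
    """Force a clean black-and-white image before handing it to the OCR engine."""
    return np.where(img >= thresh, 255, 0).astype(np.uint8)
```

In practice the blur kernel size and threshold would need tuning per document source (fax output probably wants a larger kernel than a clean 300 DPI scan).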
I've recently written a pretty simple guide to Tesseract. It should help you write your first OCR script and clear up some of the hurdles I hit where the documentation was less clear than I would have liked.
To some degree, Tesseract applies such preprocessing automatically. It is also possible to tell Tesseract to write an intermediate image for inspection, i.e. to check how well the internal image processing works (search for the tessedit_write_images config variable in the above reference).
Reading text from image documents with any OCR engine raises many issues that affect accuracy. There is no fixed solution for all cases, but here are a few things to consider to improve OCR results.
1) Presence of noise due to poor image quality, or unwanted elements/blobs in the background region. This requires some pre-processing, such as noise removal, which can easily be done with a Gaussian filter or a plain median filter; both are available in OpenCV.
3) Presence of lines: during word or line segmentation, the OCR engine sometimes merges words and lines together, and thus processes the wrong content and gives wrong results. There are other issues as well, but these are the basic ones.
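The noise-removal step from point 1) can be illustrated with a small NumPy median filter - a minimal sketch of the same idea the OpenCV functions implement, not a replacement for them:

```python
import numpy as np

def median_filter(img, k=3):
    """Remove salt-and-pepper speckle by replacing each pixel with the
    median of its k x k neighbourhood (edge pixels use replicated borders)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    H, W = img.shape
    # stack all k*k shifted views and take the per-pixel median across them
    windows = np.stack([padded[i:i + H, j:j + W]
                        for i in range(k) for j in range(k)])
    return np.median(windows, axis=0).astype(img.dtype)
```

An isolated black speckle on a white page is outvoted by its eight white neighbours, so it disappears, while solid strokes (wider than the kernel) survive - which is why the median filter suits impulse noise better than a Gaussian blur does.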
Text recognition depends on a variety of factors to produce good-quality output. OCR output depends heavily on the quality of the input image. This is why every OCR engine provides guidelines regarding the quality and size of the input image; following them helps the engine produce accurate results.
Sometimes I get better results with the legacy engine (using --oem 0), and sometimes with the LSTM engine (--oem 1). Generally speaking, I get the best results on upscaled images with the LSTM engine. The latter is on par with my earlier engine (ABBYY CLI OCR 11 for Linux).
Of course, the traineddata needs to be downloaded from GitHub, since most Linux distros will only provide the fast versions. The trained data that works with both the legacy and LSTM engines can be downloaded at -ocr/tessdata with some command like the following. Don't forget to download the OSD trained data too.
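A minimal driver for comparing the two engines on the same image might look like this. It assumes a tesseract binary of version 4.x or later (which understands --oem) on PATH; the output base names are arbitrary, and nothing is run unless the binary is actually present:

```python
import shutil
import subprocess

def tesseract_cmd(image, out_base, oem, lang="eng"):
    """Build a tesseract invocation for a given OCR engine mode
    (--oem 0 = legacy, --oem 1 = LSTM)."""
    return ["tesseract", image, out_base, "--oem", str(oem), "-l", lang]

def ocr_both_engines(image):
    """Run the same image through both engines so the two .txt outputs
    can be diffed. Skips entirely when tesseract is not installed."""
    if shutil.which("tesseract") is None:
        return None
    for oem in (0, 1):
        subprocess.run(tesseract_cmd(image, f"out_oem{oem}", oem), check=True)
    return ["out_oem0.txt", "out_oem1.txt"]
```

Note that --oem 0 fails unless legacy-capable traineddata (not the *fast* variants) is installed, which is exactly why the download above matters.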
I've ended up using ImageMagick as my image preprocessor, since it's convenient and can easily be scripted. You can install it with yum install ImageMagick or apt install imagemagick, depending on your distro flavor.
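A typical scripted preprocessing pass could be wrapped like this. The convert flags shown (-density 300, -colorspace Gray, -normalize, -sharpen 0x1) are common choices for scanned text, not a universal recipe, and the wrapper only runs when the convert binary is present:

```python
import shutil
import subprocess

def magick_prep_cmd(src, dst):
    """Build an ImageMagick preprocessing command for OCR: rasterise at
    300 DPI, convert to grayscale, normalise contrast, and sharpen."""
    return ["convert", "-density", "300", src,
            "-colorspace", "Gray", "-normalize", "-sharpen", "0x1", dst]

def preprocess(src, dst):
    """Run the convert pipeline if ImageMagick is installed."""
    if shutil.which("convert") is None:
        return False
    subprocess.run(magick_prep_cmd(src, dst), check=True)
    return True
```

Building the command as an argument list (rather than a shell string) also avoids quoting problems when file names contain spaces.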
Btw, some years ago I wrote the 'poor man's OCR server', which checks for changed files in a given directory and launches OCR operations on all files that have not already been OCRed. pmocr is compatible with tesseract 3.x-5.x and abbyyocr11. See the pmocr project on GitHub.
I am thinking of making an application that requires extracting text from an image. I haven't done anything similar and I don't want to implement the whole thing on my own. Is there any known library or open source code (supporting iOS / Objective-C) that can help me extract the text from the image? A basic source code sample will also do (I will try to modify it as per my needs).
Anyone have any ideas on how to go about this? I looked at most of the functions in Images.jl and its satellite packages, but I was not able to find any way to compose images or to add text. Happy to be pointed to something if I missed it!
You can render the text in Cairo, but when doing so I had to overcome a few pitfalls (e.g. for compatibility with Cairo, I just learnt you need to use the Images.RGB24 color type, and documentation for Cairo.jl seems largely absent).
I was thinking of this in relation to an ImageAnnotations.jl-based framework - to support both image classification visualisations, object detection visualisations etc. (as extension packages to ImageAnnotations.jl for MakieViews or LuxorViews etc.): Cf. -image-processing/topic/DL.20based.20tools/near/383544112
Using this API in a mobile device app? Try Firebase Machine Learning and ML Kit, which provide platform-specific Android and iOS SDKs for using Cloud Vision services, as well as on-device ML Vision APIs and on-device inference using custom ML models.
TEXT_DETECTION detects and extracts text from any image. For example, a photograph might contain a street sign or traffic sign. The JSON includes the entire extracted string, as well as individual words, and their bounding boxes.
DOCUMENT_TEXT_DETECTION also extracts text from an image, but the response is optimized for dense text and documents. The JSON includes page, block, paragraph, word, and break information.
You can use the Vision API to perform feature detection on a remote image file that is located in Cloud Storage or on the Web. To send a remote file request, specify the file's Web URL or Cloud Storage URI in the request body.
Both types of OCR requests support one or more languageHints that specify the language of any text in the image. However, an empty value usually yields the best results, because omitting a value enables automatic language detection. For languages based on the Latin alphabet, setting languageHints is not needed. In rare cases, when the language of the text in the image is known, setting a hint helps get better results (although it can be a significant hindrance if the hint is wrong). Text detection returns an error if one or more of the specified languages is not one of the supported languages.
If you choose to provide a language hint, modify the body of your request (request.json file) to provide the string of one of the supported languages in the imageContext.languageHints field as shown in the following sample:
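A request body along these lines would combine a remote image with a language hint (the image URL and the hint value are placeholders; the field names follow the request format described above):

```json
{
  "requests": [
    {
      "image": {
        "source": {
          "imageUri": "https://example.com/scan.png"
        }
      },
      "features": [
        {
          "type": "DOCUMENT_TEXT_DETECTION"
        }
      ],
      "imageContext": {
        "languageHints": ["en-t-i0-handwrit"]
      }
    }
  ]
}
```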
For example, the language hint "en-t-i0-handwrit" specifies English language (en), transform extension singleton (t), input method engine transform extension code (i0), and handwriting transform code (handwrit). This code says that the language is "English transformed from handwriting." You don't need to specify a script code because Latn is implied by the "en" language.
Cloud Vision offers you some control over where the resources for your project are stored and processed. In particular, you can configure Cloud Vision to store and process your data only in the European Union.
By default Cloud Vision stores and processes resources in a Global location, which means that Cloud Vision doesn't guarantee that your resources will remain within a particular location or region. If you choose the European Union location, Google will store your data and process it only in the European Union. You and your users can access the data from any location.
The Vision API supports a global API endpoint (vision.googleapis.com) and also two region-based endpoints: a European Union endpoint (eu-vision.googleapis.com) and United States endpoint (us-vision.googleapis.com). Use these endpoints for region-specific processing. For example, to store and process your data in the European Union only, use the URI eu-vision.googleapis.com in place of vision.googleapis.com for your REST API calls:
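The endpoint substitution is purely mechanical, so a small helper (the function name is mine) can compose the REST URL for the images:annotate method from an optional region prefix:

```python
def annotate_url(region=None):
    """Compose the REST URL for images:annotate.

    region is None for the global endpoint, or a region prefix such as
    "eu" or "us" for region-specific processing.
    """
    host = f"{region}-vision.googleapis.com" if region else "vision.googleapis.com"
    return f"https://{host}/v1/images:annotate"
```

For example, annotate_url("eu") yields the EU endpoint URL to use in place of the global one.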
The Vision API client libraries access the global API endpoint (vision.googleapis.com) by default. To store and process your data in the European Union only, you need to explicitly set the endpoint (eu-vision.googleapis.com). The following code samples show how to configure this setting.
Document image processing and handwritten text recognition have been applied to a variety of materials, scripts, and languages, both modern and historic. They are crucial building blocks in the on-going digitisation efforts of archives, where they aid in preserving archival materials and foster knowledge sharing. The latter is especially facilitated by making document contents available to interested readers who may have little to no practice in, for example, reading a specific script type, and might therefore face challenges in accessing the material.
The first part of this dissertation focuses on reducing editorial artefacts, specifically in the form of struck-through words, in manuscripts. The main goal of this process is to identify struck-through words and remove as much of the strikethrough artefacts as possible in order to regain access to the original word. This step can serve both as preprocessing, to aid human annotators and readers, as well as in computerised pipelines, such as handwritten text recognition. Two deep learning-based approaches, exploring paired and unpaired data settings, are examined and compared. Furthermore, an approach for generating synthetic strikethrough data, for example, for training and testing purposes, and three novel datasets are presented.
The second part of this dissertation is centred around applying handwritten text recognition to the stenographic manuscripts of Swedish children's book author Astrid Lindgren (1907 - 2002). Manually transliterating stenography, also known as shorthand, requires special domain knowledge of the script itself. Therefore, the main focus of this part is to reduce the required manual work, aiming to increase the accessibility of the material. In this regard, a baseline for handwritten text recognition of Swedish stenography is established, and two approaches for improving upon it are examined. Firstly, a variety of data augmentation techniques commonly used in handwritten text recognition are studied. Secondly, different target sequence encoding methods, which aim to approximate diplomatic transcriptions, are investigated. The latter, in combination with a pre-training approach, significantly improves the recognition performance. In addition to the two presented studies, the novel LION dataset is published, consisting of excerpts from Astrid Lindgren's stenographic manuscripts.