Mixed Language (Greek-Latin Alphabet) OCR: OCR of Scientific Documents


Karen G

Mar 15, 2023, 3:01:19 AM
to tesseract-ocr
I am new to OCR and also Tesseract OCR.

I need to do OCR of scientific documents, which contain a mix of both Latin letters (English language) and Greek letters (Greek language), and also mathematical symbols.

Our commercial vendor (Dassault Systemes, BIOVIA) recommends integrating with Tesseract OCR but we are finding issues using this for our purposes.
(We are not experts in OCR -- only Ph.D. scientists developing code to parse scientific instrument data files & reports.)
 
QUESTION: Can anyone please point us to any conversations or other references or projects that might help us optimize Tesseract OCR for this use case (mixed Greek-English-Math)?

Thank you in advance for your patience & any info you might be able to provide.

Ger Hobbelt

Mar 15, 2023, 9:19:04 AM
to tesser...@googlegroups.com
- For up-to-date Windows executables (and installers), see Uni of Mannheim (Stefan Weil). A Google search should dig those up quickly ("tesseract mannheim windows installer", I bet). See also: https://tesseract-ocr.github.io/tessdoc/Downloads.html

- tesseract supports multi-language OCR actions by specifying multiple languages on the command line using the `-l` command line parameter. Here's what I use at the moment: 

tesseract -l eng+rus+chi_sim+chi_tra+deu+fra+spa+jpn+hin+urd+vie+osd   ....etc...

which is, frankly, an almost insane combo, but that's what I'm feeding tesseract in my local tests.

You can get a list of languages from tesseract (once installed) when you run it with the `--list-langs` command line parameter.
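As a rough sketch of how the `-l` and `--list-langs` parameters fit together, the Python below checks whether every part of a language spec is actually installed before you run anything. The function names are made up for illustration, and the parser assumes that the first line of the `--list-langs` output is a header starting with "List of available languages" and that each following line is one model name; verify that against your own install.

```python
from __future__ import annotations

import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract model names from the text printed by `tesseract --list-langs`.

    Assumption: one header line starting with "List of available", then one
    model name per non-empty line.
    """
    names = set()
    for line in output.splitlines():
        line = line.strip()
        if not line or line.lower().startswith("list of available"):
            continue
        names.add(line)
    return names


def missing_models(lang_spec: str, available: set[str]) -> list[str]:
    """Return the parts of a `-l` spec (e.g. "eng+script/Greek") that are not installed."""
    return [name for name in lang_spec.split("+") if name not in available]


def installed_models() -> set[str]:
    """Ask the local tesseract binary what it can see (needs tesseract on PATH)."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, encoding="utf-8", check=True,
    )
    return parse_list_langs(result.stdout)


# Offline demo with a hand-written sample instead of a real tesseract call.
sample = "List of available languages (3):\neng\nosd\nscript/Latin\n"
print(missing_models("eng+script/Greek", parse_list_langs(sample)))
```

On a real machine you would pass `installed_models()` instead of the sample text; anything the check reports as missing has to be added to your tessdata directory first.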

See for a leg up:

Depending on your needs, you might also want to look into using the "generic script" models instead of the language-specific models: this is done by, for example, specifying

tesseract -l script/Latin+script/Greek    ....etc.....

or

tesseract -l eng+script/Greek    ....etc.....
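If you drive tesseract from your own code, a minimal sketch for trying these combinations side by side could look like the following. It assumes the `tesseract` executable is on your PATH and uses `stdout` as the output base so the text comes back on standard output. The function names are hypothetical, and `eng+ell` (the language-specific Greek model) is an extra candidate that was not mentioned above.

```python
from __future__ import annotations

import subprocess

# First two entries are the combinations from this message;
# "eng+ell" is an added candidate (assumption: the `ell` model is installed).
CANDIDATE_LANGS = ["script/Latin+script/Greek", "eng+script/Greek", "eng+ell"]


def ocr_command(image_path: str, langs: str) -> list[str]:
    """Build the argument list for one tesseract run that prints text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", langs]


def run_ocr(image_path: str, langs: str) -> str:
    """Run tesseract and return the recognized text (needs tesseract on PATH)."""
    result = subprocess.run(
        ocr_command(image_path, langs),
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


def compare_candidates(image_path: str, candidates=CANDIDATE_LANGS, runner=run_ocr) -> dict[str, str]:
    """OCR the same image once per language combination, keyed by the `-l` value."""
    return {langs: runner(image_path, langs) for langs in candidates}


# Show the commands without executing them.
for langs in CANDIDATE_LANGS:
    print(" ".join(ocr_command("page.png", langs)))
```

The `runner` parameter is only there so the comparison loop can be exercised without a tesseract install; in normal use you would call `compare_candidates("page.png")` and read the results for each combination.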

(Tip: a quick peek in the directory tree of your 'tessdata' language models will quickly show you what `--list-langs` reports in distilled form; I checked the `script` subdirectory to come up with the above.

[image.png: Windows Explorer snapshot of the `script` subdirectory in my local development environment]

Your directory tree will be located elsewhere, but those .traineddata files should be available on your machine once you have installed tesseract + tessdata.)


Re math OCRing: sorry, can't help you there. 
Haven't accomplished that myself yet, but one direction to investigate there would be to look into "legacy mode" as tesseract v3 had a dedicated "math mode" -- no idea how well that ever worked, so cave canem.

If I were you, I'd first attack the English (Latin) + Greek text problem and see if I could get tesseract to produce something sensible for a couple of such test files.


Be advised: "optimizing" may be desired, but make sure you first get decent results and a workflow you like; tuning an OCR engine is hairy business so leave that for last, so you'll have a dependable baseline to work from and compare your changes against.
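One way to make such a baseline concrete, sketched here under the assumption that you have hand-typed ground truth for a few test pages, is a character error rate: the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. This metric is a common choice, not something tesseract reports itself, and the function names are illustrative.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, truth: str) -> float:
    """Edit distance relative to the ground-truth length (0.0 means identical)."""
    # NFC normalization so accented Greek letters compare equal however they were composed.
    ocr_text = unicodedata.normalize("NFC", ocr_text)
    truth = unicodedata.normalize("NFC", truth)
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, truth) / len(truth)


# A Latin "a" read where the page has a Greek alpha: one wrong character out of seven.
print(character_error_rate("a-helix", "α-helix"))
```

Recording this number per test page and per language combination gives you something to compare against when you later start changing settings.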


This is generic info; you might get more detailed help from the mailing list when you provide more info about your setup and what you've been trying to accomplish so far. Your current info is a little "thin".  ;-)



Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

