Hi Nick,
Thanks for the heads up. As a colonial, I'm afraid I'm woefully unaware of the dates of the editions of the OED and, consequently, whether this edition is in the public domain. I'm guessing it's not, given the reference to CD-ROM -- a rather new-ish technology.
The overarching goal is to OCR, and then crowdsource corrections to, a public-domain edition of the Oxford English Dictionary. Of course, the devil is in the details. The OKFN's starting point is to use the Internet Archive's ABBYY FineReader transcription, with which I'm somewhat unimpressed. Nonetheless, I've been attempting to push that front as far as it will go. I have previous experience reading and interpreting FineReader XML, so it's a relatively easy path to start exploring.
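For what it's worth, pulling per-character confidences out of the FineReader XML is pretty mechanical. Here's a minimal sketch of the kind of thing I mean -- the file name, the charConfidence attribute, and the threshold are all just illustrative assumptions about what's in the IA's _abbyy.xml files:

    # Minimal sketch: list characters ABBYY was unsure about.
    # Assumes the IA file uses ABBYY's FineReader schema, where each OCR'd
    # character is a <charParams> element with a charConfidence attribute.
    import xml.etree.ElementTree as ET

    def low_confidence_chars(path, threshold=60):
        """Yield (char, confidence) pairs below the given confidence threshold."""
        for _, elem in ET.iterparse(path):
            # Tags are namespaced; match on the local name only.
            if elem.tag.endswith("charParams"):
                conf = int(elem.get("charConfidence", -1))
                if 0 <= conf < threshold:
                    yield elem.text or "", conf
                elem.clear()  # keep memory bounded on very large volumes

    if __name__ == "__main__":
        for ch, conf in low_confidence_chars("oed_vol1_abbyy.xml"):
            print(repr(ch), conf)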
The ultimate pipeline will exploit lots of lexical, typographic, layout, and semantic information to generate starting canonical word entries, to be corrected in some type of crowd-sourcing environment. I'm not an OCR expert, but I love the multi-layered, machine-plus-human approach this will require.
I've got hundreds of pages of vol. 1 as generated HTML, with color-coded highlighting of low OCR accuracy, cleaned-up layout, and a few other things. I've committed to the OKFN that I'll get them up somewhere for folks to review (probably github.io). Of course, I've only just scratched the surface of the analysis that will be required. When I look at low-accuracy segments, I find blocks of Persian and other languages, in addition to all other manner of variability.
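The color coding itself is nothing fancy -- roughly this sort of thing, though the class names and confidence cut-offs here are placeholders rather than what's actually in my generated pages:

    # Rough sketch of the color-coding step (CSS classes and cut-offs are
    # placeholders, not the ones used in my generated HTML).
    from html import escape

    def span_for(char, confidence):
        """Wrap a single OCR'd character in a span reflecting its confidence."""
        if confidence < 40:
            cls = "ocr-bad"       # e.g. red background
        elif confidence < 70:
            cls = "ocr-doubtful"  # e.g. yellow background
        else:
            return escape(char)   # leave confident text unmarked
        return f'<span class="{cls}" title="conf {confidence}">{escape(char)}</span>'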
I'd love to have more collaborators on this. I should write up a more formal description of my current plan of attack and possible alternatives. The OKFN isn't big on that kind of formality, though; they're more in the "Let's do this thing!" vein.
Thanks for your interest. I should tell you up front that this is a sideline to my sidelines, so it may not get tons of attention.
Tom