Problems recognizing mixed scripts in Tesseract 4 alpha

Brendan O'Kane

Aug 31, 2017, 3:01:30 AM
to tesseract-ocr
Hi all,

Running 'tesseract -l eng+chi_tra' on a scanned page of English text mixed with Chinese characters does not detect any Chinese characters at all: 

> The five chapters on fiction, memoirs, and other kinds of prose that
> follow offer as many approaches to our understanding of the transition
> between 1644 and 1700. Focusing on the lives of Mao Xiang § X (161-
> 93) and Yu Huai A1% (1616-96), Oki Yasushi develops portraits of these
> two "romantic Jiangnan loyalists," who clung to patterns of late Ming
> feeling and aestheticism long after the Ming had fallen. The image of
> loyalism as romantic is in striking contrast to starker images of loyalist
> experience. Both Mao and Yu are best known for their memoirs, which
> focus prominently on women, one of the new ways of figuring nos-
> talgia and resistance in male writings of the early Qing. Robert Hegel's
> "Dreaming the Past" is similarly concerned with the individual, fo-
> cusing on Chu Renhuo #ARE (ca. 1630-1705+), as well as his novel,
> Sui Tang yany: G B® #&, (ca. 1675), but it extends well beyond Chu and
> his work in contemplating how "the past" (the Tang past in particular)
> shaped imaginative literature in an era when the present offered little
> solace.

The characters are (mostly) correctly recognized when only 'chi_tra' is set as the OCR language, but at the cost of seriously degraded accuracy in English OCR:

> The fve chapters on fiction,menoirs, and other kinds of prose thar
> follow offer as nany approaches to our understanding ofthe transition
> between :644 and I7oo. Focusing on the |ives of Mao 文 iang 冒 裱 (I6II-
> 93andYuTiuai 余 懷 ((616-96), OkiYasushidevelops portraits ofthese
> two "ronantic Jiangnan loyalists"who clung to patterns of ]ate N{ing
> feeling and aestheticismn long after the Ming had fallen. The of
> loyalisn as ronantic is in striking contrast to starker 1nages of |oyalisr
> experience. Both Mao and Yu are best known fortheir memolrs, wˇhich
> focus Proninently on womnen, one of the new ways of figuring nox-
> talgia and resistance in male writings of the early Cuing. Roberr Tiegel's
> "1reaning the Past" is simnilarly concerned with the individual, fo-
> cusing on ChuRenhuo 褚 人 穫 (ca. I63o-I7oy+)}, as well a$ his novel,
> 5#77mzg5227 隋 唐 演 義 (Ca.I67y, butit extends well beyondChu and
> his work in contemplating how "the past" (the Tang Past in Particulan
> shaped imaginative ]iterature in an era when Lhe present offered |ittle
> $olace.

Is this a known issue? Am I doing something wrong here? 

--Brendan

ShreeDevi Kumar

Aug 31, 2017, 4:25:06 AM
to tesser...@googlegroups.com
Have you tried the best trained data for Chinese, which includes English in addition to Chinese as part of its training? That may be a better option than using eng+

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/0ed8e7da-72cb-4bb8-8f48-44f8fc76f7c2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brendan O'Kane

Sep 5, 2017, 7:29:36 PM
to tesseract-ocr
Aha! Worked like a charm -- thanks very much! Combining HanT+Japanese seems also to degrade recognition accuracy pretty significantly, but HanT from the best trained data works pretty well on its own for pages that are just English and Chinese.
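
For anyone landing on this thread later, here is a rough sketch of the two invocations being compared, written as a small Python helper that only builds the command line. The file names (page.png and the output bases) and the tessdata path are placeholders I made up, and the sketch assumes HanT.traineddata has been downloaded into that directory; adjust both to your setup. It is not the only way to call Tesseract, just one way to keep the model choice explicit.

```python
import subprocess
from typing import List, Optional


def build_tesseract_command(
    image_path: str,
    output_base: str,
    languages: List[str],
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Build a tesseract CLI argument list.

    Several models are joined with '+', as in 'eng+chi_tra'.
    """
    if not languages:
        raise ValueError("at least one language or script model is required")
    command = ["tesseract", image_path, output_base, "-l", "+".join(languages)]
    if tessdata_dir is not None:
        # Point tesseract at a specific directory of .traineddata files.
        command += ["--tessdata-dir", tessdata_dir]
    return command


def run_ocr(command: List[str]) -> None:
    """Run the command; requires tesseract to be installed and on PATH."""
    subprocess.run(command, check=True)


# Original attempt from the thread: two separate models combined.
combined = build_tesseract_command("page.png", "page_combined", ["eng", "chi_tra"])

# Suggested alternative: the single HanT script model, which per the reply
# above was trained with English as well as Chinese.
# The directory below is a placeholder, not a real default location.
single = build_tesseract_command(
    "page.png",
    "page_hant",
    ["HanT"],
    tessdata_dir="/path/to/tessdata_best/script",
)

if __name__ == "__main__":
    print(" ".join(combined))
    print(" ".join(single))
    # run_ocr(single)  # uncomment once the paths point at real files
```

Going by the results above, the single-model command is the one to prefer for pages that are just English and Chinese; stacking further models (as with HanT+Japanese) appeared to hurt accuracy.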