Tesseract training flags for RTL languages


Daniel

Jul 7, 2013, 5:45:18 AM
to tesser...@googlegroups.com
Hi everyone,

I'm working on a project for which I need to train Tesseract for RTL languages (Hebrew and Arabic).
After training, everything works great, except that the text is output as LTR text.
Is there a flag to set during training that tells Tesseract to treat the trained data file as an RTL language file, so that it outputs the text in the right order?

Thanks for helping!
Daniel

WHITE N.

Jul 7, 2013, 8:34:40 AM
to tesser...@googlegroups.com
Hi Daniel,

There is a direction flag in the unicharset file; I suspect it needs to be set correctly. See the man page: http://tesseract-ocr.googlecode.com/svn-history/r719/trunk/doc/unicharset.5.html

Other than that, my advice would be to unpack the official RTL training files using combine_tessdata -u and check whether the config file contains a flag that is needed.
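
As a rough illustration of the direction flag mentioned above, here is a minimal Python sketch that rewrites the direction field of a single unicharset line. It assumes the eight-field layout described in the unicharset(5) man page (character, properties, glyph_metrics, script, other_case, direction, mirror, normed_form) and that the direction value follows ICU's UCharDirection numbering (0 = left-to-right, 1 = right-to-left, 13 = right-to-left Arabic). The helper names and the sample line are hypothetical, and other Tesseract versions may use a different layout, so compare against a line from an official RTL unicharset before relying on it.

```python
import unicodedata

# Subset of ICU's UCharDirection enum, keyed by Unicode bidi class.
# Assumption: the unicharset direction field uses these numeric values.
ICU_DIRECTION = {
    "L": 0,    # left-to-right
    "R": 1,    # right-to-left (e.g. Hebrew letters)
    "EN": 2,   # European number
    "ES": 3, "ET": 4, "AN": 5, "CS": 6, "B": 7, "S": 8, "WS": 9, "ON": 10,
    "AL": 13,  # right-to-left Arabic
    "NSM": 17,  # non-spacing mark
}

DIRECTION_FIELD = 5  # assumed 0-based index of "direction" in an 8-field line


def icu_direction(text):
    """Return the ICU direction code for the first character of `text`."""
    bidi_class = unicodedata.bidirectional(text[0])
    return ICU_DIRECTION.get(bidi_class, 10)  # fall back to "other neutral"


def fix_direction(line):
    """Set the direction field of one unicharset line.

    Lines with fewer than eight fields (such as the count line or the
    NULL entry) are returned unchanged.
    """
    fields = line.split(" ")
    if len(fields) < 8 or fields[0] == "NULL":
        return line
    fields[DIRECTION_FIELD] = str(icu_direction(fields[0]))
    return " ".join(fields)


# Hypothetical sample line for the Hebrew letter alef (U+05D0),
# written with the direction field wrongly set to 0.
sample = "\u05d0 1 0,255,0,255,0,0,0,0,0,0 Hebrew 3 0 3 \u05d0"
print(fix_direction(sample))
```

Applying `fix_direction` to every line of an unpacked unicharset and repacking the traineddata would be the full workflow; that part is not shown here.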


From: tesser...@googlegroups.com [tesser...@googlegroups.com] on behalf of Daniel [dan...@iolite.co.il]
Sent: Sunday, July 07, 2013 10:45 AM
To: tesser...@googlegroups.com
Subject: tesseract training flags to rtl languages

--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
 
 
 

Shree Devi Kumar

Jul 7, 2013, 11:38:06 AM
to tesser...@googlegroups.com

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Daniel

Jul 14, 2013, 7:02:59 PM
to tesser...@googlegroups.com
Thanks, WHITE N. & sdk.

Both of you helped me so much! Thank you!

For anybody else looking for a solution to this problem (an incorrect unicharset file generated by unicharset_extractor):
I also ported the Python script that corrects the unicharset file to PHP, so if anyone needs the code, send me an email and I will send it to you.

karo...@yahoo.com

Oct 18, 2013, 4:02:37 PM
to tesser...@googlegroups.com
Dear Daniel,

Is there not an easier way to do this? I use a GUI when I work, and this is my problem:

I'm trying to train Tesseract for Kurdish. This would also work for Persian; Kurdish has some additional letters, but the writing system is the same as for Arabic or Farsi. The problem is that the final OCR result runs left to right instead of right to left, which means you can't read the text, although the letters are correct. I use qt-box-editor to edit the boxes, then Serak Tesseract Trainer V0.4 to train the OCR, and finally I put the traineddata file in the Tesseract directory. Everything goes well except that the right-to-left writing direction of Arabic script is missing.

So is there any way to change that unicharset file with a GUI instead of the command line?

Thanks a lot in advance,
Karo

On Monday, 15 July 2013 at 01:02:59 UTC+2, Daniel wrote:

moen....@gmail.com

Aug 9, 2018, 10:31:39 AM
to tesseract-ocr
Hello Daniel,
I am developing OCR for the Urdu language and am facing the same problem: the model works correctly, but the output is printed LTR. Would you please share the solution? Thank you in advance.

Shree Devi Kumar

Aug 9, 2018, 10:39:06 AM
to tesser...@googlegroups.com
There is an Urdu traineddata for Tesseract 4. Have you tried it?



You can also check script/Arabic which should also support Urdu.

Please provide feedback as to its accuracy for Urdu.


Mohammad Moin

Aug 9, 2018, 10:43:03 AM
to tesser...@googlegroups.com
It is not very accurate. I am trying to develop my own traineddata from scratch. I have completed everything, but the output is LTR in testing, and I don't know what went wrong in training. Can you please point it out? Thank you.




--
Regards : Mohammad Moin Ud Din

Shree Devi Kumar

Aug 9, 2018, 10:56:38 AM
to tesser...@googlegroups.com
Are you training for Tesseract 3 or Tesseract 4 (LSTM training)?

Mohammad Moin

Aug 10, 2018, 5:34:55 AM
to tesser...@googlegroups.com

Dror Musai

Feb 22, 2024, 1:31:57 AM
to tesseract-ocr
Hi

I'm using version 5.3 of Tesseract with the Hebrew language. I still don't understand why Adobe and Foxit cannot find words in the PDF after OCR.
With Google, find works fine.
Also, just viewing the file in Adobe and Foxit looks fine.
The revered issue is with searching for something.


On Sunday, 7 July 2013 at 12:45:18 UTC+3, Daniel wrote:

Ger Hobbelt

Feb 22, 2024, 7:04:41 AM
to tesseract-ocr


On Thu, 22 Feb 2024, 07:32 Dror Musai, <dro...@gmail.com> wrote:
Hi

I'm using version 5.3 of Tesseract with the Hebrew language. I still don't understand why Adobe and Foxit cannot find words in the PDF after OCR.

Pdf does not equal "text"! Pdf is a complex format where, more often than not, human-visible "text" is actually just a bunch of picture(s) instead of rendered glyphs: https://en.m.wikipedia.org/wiki/Glyph

Your line IMPLIES that the pdf(s) you struggle with are generated by/via tesseract. Lacking information, this is what I assume, for now.

OCR is complex machinery, and first order of business with diagnosing complex machinery is reducing the scope of error. For that, and hence for anyone possibly being able to assist you, you need to check and reduce.

Check: nobody human needs actual text to "read" (means: view on screen or on printed paper) pdf content. We look at images=pictures and that is what "pdf readers" produce - except specialized ware for blind people and the otherwise visually handicapped. 
As you mention "search" as the problem area, which DOES require machine text rather than basic pictures, first you must find out whether the OCR process actually does produce "text", and if so, what that text actually IS: pdf viewers hide "text overlays" by default, so you need specialized tools to uncover the text inside the pdf or, much easier, change the OCR output format.

For that it is strongly advised (I'd say mandatory) to adjust your OCR process to have it produce HOCR format, which is a kind of augmented HTML: you can open such a file in notepad and actually read the raw content. Some of us are okay with TEXT output format, because that is the simplest format, but it drops info that is available in HOCR and thus obscures/hides several problem types, hence my advice to find out how you can produce HOCR format directly from tesseract.
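
To make "read the raw content" concrete, here is a minimal Python sketch that lists the recognized words in an hOCR file using only the standard library. It assumes that words are wrapped in elements carrying the class `ocrx_word`, as in the hOCR files tesseract normally writes; the sample markup and the helper names are hypothetical, and real files contain more attributes.

```python
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect the text of every element whose class includes 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0  # > 0 while inside an ocrx_word element

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested markup inside a word (e.g. <strong>); assumes no
            # unclosed void tags such as a bare <br> inside words.
            self._depth += 1
        elif "ocrx_word" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self.words.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.words[-1] += data


def hocr_words(hocr_text):
    """Return the recognized words in document order."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return [word.strip() for word in parser.words]


# Hypothetical fragment with one Hebrew word and one Latin word.
sample = (
    "<p class='ocr_par' dir='rtl'><span class='ocr_line'>"
    "<span class='ocrx_word' title='bbox 10 10 50 30; x_wconf 93'>"
    "\u05e9\u05dc\u05d5\u05dd</span> "
    "<span class='ocrx_word' title='bbox 60 10 90 30; x_wconf 90'>"
    "<strong>abc</strong></span>"
    "</span></p>"
)
print(hocr_words(sample))
```

Printing the words one per line, together with their code points if needed, shows what text tesseract actually produced, independent of how any PDF viewer displays it.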

Reduce:
To enable anyone to possibly assist, you must reduce = boil down the issue to tesseract in a structure and mini process that makes it potentially reproducible; along the way you may find that the issue you have is not tesseract related but located elsewhere in your process/pipeline. Here we'll assume your issue is with tesseract or its immediate surroundings.



Required action

Here's what you need to do. Everyone has to, because there is a plethora of processes around, before, after and on top of tesseract out there, and those only make things easy as long as everything goes exactly as advertised. You, on the other hand, have an issue, so you will have to divide and conquer, i.e. reduce your problem zone/area/scope, or you will forever be unable to discover where the problem originates. Reduce your (OCR) process to this and report:

>>>>>>>>>>> (Checklist)

- you use the tesseract CLI (aka "tesseract executable/binary with its command line interface"); this is not a python script, not anything "script"-ish otherwise; you execute tesseract directly in bash/cmd and specify the precise command line (tesseract + argument set). This command line is also needed by anyone else out there to possibly reproduce your issue and help diagnose & fix.

- you feed tesseract a (page) image, preferably PNG format. If your original source is jpeg, use the jpeg.

- your tesseract commandline is such that tesseract outputs HOCR format (my preference) or plain text; this already empowers you to diagnose your issue deeper yourself as you can easily check yourself whether tesseract then produces desired/expected output or something else. Which is also useful to know as you're looking for the root cause here.

In your particular case, with the minimal information handed over, three general main problem sources are to be expected, and reduction must be applied to discover which of these is yours:

1. errors in pdf text embedding process (part of OCR postprocess); failure to correctly and compatibly embed text in pdf 

2. failure to produce a page image that is ready for OCR by tesseract. (OCR preprocess) Lots of issues are due to this.

3. unexpected/faulty OCR results for the given input image (the OCR process itself: tesseract)

- for reporting, anyone will need your tesseract commandline, the input image(s) used and the results you get (error+info console output; output text/file(s)) plus the tesseract version/build info, which can, for example, be obtained by running 

tesseract -v


<<<<<<<<<(Checklist ends)
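
The checklist above boils down to one reproducible command line. As a minimal sketch, the Python helper below builds that command as an argument list, assuming the usual `tesseract IMAGE OUTPUTBASE -l LANG [CONFIG]` form, where the config name `hocr` selects hOCR output. The file name `page.png`, the output base `page` and the language code `heb` are placeholders, and the helper name is hypothetical.

```python
import shlex

# Command that prints the tesseract version/build info for a report.
VERSION_COMMAND = ["tesseract", "-v"]


def tesseract_command(image, output_base, lang, output_format="hocr"):
    """Build a minimal tesseract CLI invocation as an argument list."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if output_format != "txt":
        # Plain text is the default output, so only other formats
        # need a config name such as "hocr" appended.
        cmd.append(output_format)
    return cmd


# Placeholders: substitute your own image, output base and language.
cmd = tesseract_command("page.png", "page", "heb")
print(shlex.join(VERSION_COMMAND))
print(shlex.join(cmd))
```

The printed lines are what you would paste into a report; passing the same list to `subprocess.run(cmd)` executes it without going through a shell.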


With Google, find works fine.

:-S   to have a pdf indexed and searchable by Google, you need to publish the pdf online and the Google index bot must go and find and access it; that is a nontrivial process, so I wonder... Besides, once Google gets to your pdf, it will judiciously run its own OCR process internally before indexing your pdf content, which makes this a non-starter for diagnostics purposes regarding your own process/pipeline.... At the very least, this is way off into any postprocessing pipeline and definitely not instantaneous for anyone; Google indexing is arbitrary in time.

This is also indicative that you might want to seek additional, local, technical support while diagnosing your issue.

Also, just viewing the file in Adobe and Foxit looks fine.

As I stated near the beginning: these are pdf viewers and they are happy to show you page scans or any other picture format/potpourri in your pdf, next to possible text glyphs. Pdf is a very complex format, you don't need machine text to show text and "text overlays" are not shown on screen or in print.

Meanwhile, SEARCHING in a pdf requires TEXT (machine text) plus pdf search permissions (pdfs can be "secured" against search, copy-paste, etc. to complicate those pdf text search issues even further).

Hence the advice to REDUCE your problem surface area; currently, also due to the minimal provided information, it is... without bounds.

The revered issue is with searching for something.

I'm sure you meant something else than "revered" here.  ;-)
Perhaps a Google Translate to English automaton mistake?

Cheers,

Ger



On Sunday, 7 July 2013 at 12:45:18 UTC+3, Daniel wrote:
Hi everyone,

I'm working on a project for which I need to train Tesseract for RTL languages (Hebrew and Arabic).
After training, everything works great, except that the text is output as LTR text.
Is there a flag to set during training that tells Tesseract to treat the trained data file as an RTL language file, so that it outputs the text in the right order?


PS: you quoted this message from a long time ago, but, given your own message, this is only potentially an issue (much) further down the road (after much reducing!) and while related to search issues, not the top contender.
You must first discover what is actually produced and analyse that. Only once that is cleared, might you possibly run into rtl vs ltr, etc.
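
As a small illustration of "analyse what is actually produced", the Python sketch below classifies how an expected word shows up in text that has already been extracted (from hOCR, plain text output, or a PDF text dump obtained by some other tool). Reversed character order is only one possible symptom of an RTL vs LTR mix-up and is an assumption here, not a diagnosis of this particular issue; the function name is hypothetical.

```python
def classify_match(extracted_text, expected_word):
    """Report how `expected_word` occurs in `extracted_text`.

    Returns "logical" if the word is present as typed, "reversed" if it
    is only present with its code points in reverse order, and "missing"
    otherwise. Reversal is done per code point, so words containing
    combining marks may not be detected.
    """
    if expected_word in extracted_text:
        return "logical"
    if expected_word[::-1] in extracted_text:
        return "reversed"
    return "missing"


# The Hebrew word "shalom", written with escapes to avoid display issues.
word = "\u05e9\u05dc\u05d5\u05dd"
print(classify_match("abc " + word, word))
print(classify_match("abc " + word[::-1], word))
print(classify_match("abc", word))
```

A result of "missing" for every word points back to the earlier checks (is there any machine text at all?), while "reversed" would suggest looking at text direction next.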

Thanks for helping!
Daniel


Dror Musai

Feb 22, 2024, 9:08:18 AM
to tesseract-ocr
Thanks for the long answer, but to be honest, I didn't understand what I actually need to do.
So I made a short video that describes the process (in the zip file).
As far as I have read, this is a known and unsolved issue.



On Thursday, 22 February 2024 at 14:04:41 UTC+2, g...@hobbelt.com wrote:
BH.zip

Tom Morris

Feb 22, 2024, 1:19:59 PM
to tesseract-ocr
I only skimmed Ger's long reply, but didn't see a link to the issue, which I think is the important bit of information:


It's a long-standing (and complex) problem in which behavior varies across different PDF viewers.

Tom