Is it possible to use the latest source from git to train Arabic?


ShreeDevi Kumar

Nov 20, 2014, 9:42:12 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com, Ray Smith
Well, I just saw the Arabic config file in langdata (uploaded on Aug 12th by Ray), and I am not sure whether training will be possible with the existing tools available to us.


It says:
# We do not yet have Tesseract for Arabic, so use OEM_CUBE_ONLY
# (see OcrEngineMode enum in third_party/tesseract/ccmain/tesseractclass.h).
tessedit_ocr_engine_mode 1

Other than that, in order to use jTessBoxEditor or the command-line tools for training, you will need font_properties, word lists, etc.

----

Ray, is it possible to use the latest source from git to train Arabic?
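For anyone reading the quoted config later: a small sketch of how a `name value` config like the one above can be read, and what the value 1 stands for. The enum names and numbers are what I recall from the Tesseract 3.x OcrEngineMode enum, so treat the table as an assumption and check it against your own headers. `parse_config` and `engine_mode_name` are hypothetical helper names, not part of Tesseract.

```python
# Engine-mode values as I recall them from the Tesseract 3.x OcrEngineMode
# enum (an assumption; verify against the headers of your own build).
OCR_ENGINE_MODES = {
    0: "OEM_TESSERACT_ONLY",
    1: "OEM_CUBE_ONLY",
    2: "OEM_TESSERACT_CUBE_COMBINED",
    3: "OEM_DEFAULT",
}


def parse_config(text):
    """Return {variable: value} from 'name value' lines, skipping # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def engine_mode_name(settings):
    """Map tessedit_ocr_engine_mode to its enum name, or None if it is not set."""
    raw = settings.get("tessedit_ocr_engine_mode")
    if raw is None:
        return None
    return OCR_ENGINE_MODES.get(int(raw), "unknown")


sample = """\
# We do not yet have Tesseract for Arabic, so use OEM_CUBE_ONLY
tessedit_ocr_engine_mode 1
"""

print(engine_mode_name(parse_config(sample)))
```

With the two config lines quoted above, the lookup lands on the Cube-only engine, which matches the comment in the config file.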


ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

On Thu, Nov 20, 2014 at 7:37 PM, iram akbar <irama...@gmail.com> wrote:
It seems this is a known issue with Serak. I have created an "ara" folder containing the same kinds of files as the "vie" folder in jTessBoxEditor, as you can see in the attachment. After that, I set the box file path in the "Tesseract executable" and "Training data" fields of jTessBoxEditor for "ara", as attached. When I click the "Run" button, I get the attached error. I don't know what is going wrong here.
Question: Am I giving the wrong file in the "Tesseract executable" and "Training data" paths, i.e. the ara box file? Or is something else going wrong?
Note: I have not supplied any data for the words_list, frequent_words, and font_properties files.


On 20 November 2014 17:32, ShreeDevi Kumar <shree...@gmail.com> wrote:
I have not used Serak, but the issues page there indicates problems with RTL languages. See https://code.google.com/p/serak-tesseract-trainer/issues/detail?id=6

Why are you not using jTessBoxEditor's trainer or the command-line programs? I think the binaries are bundled with JTess...



ShreeDevi

On Thu, Nov 20, 2014 at 4:26 PM, iram akbar <irama...@gmail.com> wrote:
Hello Shree,

I am having an issue while training Arabic in Serak (for box file generation I am using jTessBoxEditor). I am doing some testing: I assigned English letters to a single Arabic word and created the box file as attached (jtessbox file). After following the whole training process in Serak, I got the OCR result as attached. As you can see, the box file contains four letters, "A, B, C, D", so I was expecting the OCR result to be ABCD, but the result is BDBBAABBBBA, as attached (serak result).
Question: Why am I getting that result? Is something wrong with how I made the box file in jTessBoxEditor, or with the training in Serak?
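One way to check the box file itself before blaming the trainer is to read it back and look at the labels and their positions. A rough sketch, assuming the usual Tesseract 3 box format of `symbol left bottom right top page` per line, with the origin at the bottom-left of the image. The coordinates in `sample` are made up for illustration and are not taken from the attached file, and `parse_box_line`, `read_boxes` and `reading_order` are hypothetical helper names.

```python
from collections import namedtuple

# One entry per line of a box file: symbol left bottom right top page
Box = namedtuple("Box", "symbol left bottom right top page")


def parse_box_line(line):
    """Parse one 'symbol left bottom right top page' line into a Box."""
    parts = line.strip().split(" ")
    if len(parts) != 6:
        raise ValueError("expected 6 space-separated fields: %r" % line)
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def read_boxes(text):
    """Parse every non-empty line of a box file."""
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]


def reading_order(boxes, rtl=False):
    """Join the labels sorted by horizontal position (right-to-left if rtl)."""
    ordered = sorted(boxes, key=lambda b: b.left, reverse=rtl)
    return "".join(b.symbol for b in ordered)


# Hypothetical coordinates, for illustration only.
sample = """\
D 10 20 40 60 0
C 45 20 80 60 0
B 85 20 120 60 0
A 125 20 160 60 0
"""

boxes = read_boxes(sample)
print("file order:   ", "".join(b.symbol for b in boxes))
print("left to right:", reading_order(boxes))
print("right to left:", reading_order(boxes, rtl=True))
```

If the labels only spell the expected string when read right to left, that at least tells you which order the box file was written in. It is not a diagnosis of the Serak result by itself, just something to rule out.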

On Monday, 10 November 2014 15:30:21 UTC+5, shree wrote:
Look under jtessboxeditor/samples/vie folder

and create similar files for your language

ShreeDevi

On Mon, Nov 10, 2014 at 1:10 PM, iram akbar <irama...@gmail.com> wrote:
Quan,
I am able to generate some files with jTessBoxEditor, but I have an issue: when I select "Train with existing box" or "Train from Scratch" under the Trainer tab, I get the attached message.
Question: How can I generate the Arabic.font_properties, Arabic.frequent_word_list and Arabic.words_list files using jTessBoxEditor?
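These three inputs are plain UTF-8 text files, so they can also be written by hand or with a short script. Below is a rough sketch, assuming the Tesseract 3 conventions as I understand them: each font_properties line is `fontname italic bold fixed serif fraktur` with 0/1 flags, and the word lists hold one word per line. The file names, the `arial` font name, the whitespace tokenizer and the helper names are all assumptions for illustration, so compare the output with the files in the samples folder before training.

```python
from collections import Counter
from pathlib import Path


def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build one 'fontname italic bold fixed serif fraktur' line (0/1 flags)."""
    flags = (italic, bold, fixed, serif, fraktur)
    return "{} {}".format(font, " ".join(str(int(f)) for f in flags))


def word_lists(text, top_n=100):
    """Return (all distinct words sorted, the top_n most frequent words).

    Uses naive whitespace tokenization; punctuation is not stripped.
    """
    counts = Counter(text.split())
    frequent = [word for word, _ in counts.most_common(top_n)]
    return sorted(counts), frequent


def write_inputs(lang, out_dir, font, corpus_text):
    """Write <lang>.font_properties, <lang>.words_list, <lang>.frequent_words_list."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    all_words, frequent = word_lists(corpus_text)
    files = {
        lang + ".font_properties": [font_properties_line(font)],
        lang + ".words_list": all_words,
        lang + ".frequent_words_list": frequent,
    }
    for name, lines in files.items():
        # Force Unix line endings and UTF-8 regardless of platform.
        with open(out / name, "w", encoding="utf-8", newline="\n") as handle:
            handle.write("\n".join(lines) + "\n")
    return sorted(files)


corpus = "كتاب قلم كتاب بيت كتاب قلم"
all_words, frequent = word_lists(corpus, top_n=2)
print(font_properties_line("arial"))
print(len(all_words), frequent)
```

Calling something like `write_inputs("ara", "ara_training", "arial", corpus)` would then write the three files into a folder of your choice. The font name has to match the one used in your TIFF/box file names, and a real corpus should be much larger than this toy string.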

On Friday, 7 November 2014 19:42:37 UTC+5, Quan Nguyen wrote:
Look in the samples folder for a working example. You can start out from a UTF-8 text file about two pages long, generate TIFF/Box from it, and prepare the other necessary input files for training. You can train entirely in jTessBoxEditor.

On Thursday, November 6, 2014 6:19:53 AM UTC-6, iram akbar wrote:
Thank you for your help, but my issue still exists. If I need to generate the TIFF from an image of text, I am unable to, as it only asks to load a text file, not an image file. Second, if I have a lot of documents, I need to copy and paste each one first and then generate the TIFF. Can anyone else help me with this?
Question: How can I input an Arabic text image in jTessBoxEditor to generate the TIFF (as attached)?

On Thursday, 6 November 2014 16:38:25 UTC+5, shree wrote:
Click on the 'generate' box. With some Devanagari fonts I have found that the text does not display but the TIFF/box files are still generated. It may be the same for the Arabic font you are using. Give it a try.

You can also try to copy and paste the text, sometimes that works.


ShreeDevi


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d7396d3d-c4d1-4fcc-a58d-6cc02927989c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

