Diagonal space in Urdu Nastaleeq font


Muhammad Azeem

Apr 21, 2019, 1:30:24 AM
to tesseract-ocr
Hi,
I am trying to train Tesseract 4 using the ocrd-train Makefile for everything except box file generation. The box files are generated with the following command:

text2image --text "/traintext.txt"  --outputbase "/traintext"  --fontconfig_tmpdir "/fontconfig"  --fonts_dir "/usr/share/fonts" --font "Jameel Noori Nastaleeq" --leading 32

I set the number of iterations to 25000 by editing the Makefile, and used the following command to generate the trained data:

sudo make training MODEL_NAME=urd1 START_MODEL=urd TESSDATA_REPO=_best WORD_LIST=urd1.worldlist.clean

The start model is the existing Urdu model from the _best repository.

Please find all related files below.
The "Jameel Noori Nastaleeq" font is ligature based. After training completed successfully, I ran the following code to perform OCR on the image below, and I am facing an issue with the spacing between a few words:


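import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class OcrTest {
    public static void main(String[] args) {
        File imageFile = new File("testing.png");
        ITesseract instance = new Tesseract();
        instance.setDatapath("tessdata"); // directory containing urd1.traineddata

        try {
            instance.setLanguage("urd1"); // model produced by the training run above
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}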

testing3.png
OCR result:
ہفتوںکی منصوبہ بندی حقیقتکا روپ دھار نےلگی

This is the same line that was used during training. The OCR output is fine except for the missing space between a few words, i.e.:

result.png
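To pin down exactly which spaces are lost, the OCR result can be compared against the training line programmatically. The following is only a rough sketch: the class name SpaceDiff and the method missingSpaces are made up for illustration, and it assumes the two strings differ only in their spaces (it throws otherwise).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tesseract or Tess4J.
final class SpaceDiff {

    /**
     * Returns the indices (into expected) of every space that is present
     * in the ground-truth line but absent from the OCR output.
     * Assumption: the two strings are identical apart from spaces.
     */
    static List<Integer> missingSpaces(String expected, String actual) {
        List<Integer> missing = new ArrayList<>();
        int i = 0; // position in expected
        int j = 0; // position in actual
        while (i < expected.length() && j < actual.length()) {
            char e = expected.charAt(i);
            char a = actual.charAt(j);
            if (e == a) {
                i++;
                j++;
            } else if (e == ' ') {
                missing.add(i); // space dropped by the OCR engine
                i++;
            } else if (a == ' ') {
                j++; // extra space inserted by the OCR engine, skip it
            } else {
                throw new IllegalArgumentException(
                        "Texts differ in a non-space character at index " + i);
            }
        }
        return missing;
    }
}
```

For example, SpaceDiff.missingSpaces("ab cd ef", "abcd ef") reports index 2, the position of the dropped space. Passing the training line as expected and the doOCR result as actual would list the merged word boundaries in the same way.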


I believe this is because, in the Urdu Nastaleeq writing style (with kerning), the space between words is diagonal rather than vertical:

diagonalspace.PNG


Is there any way to resolve this issue in Tesseract 4?
