Hi,
I am trying to train Tesseract 4 with the ocrd-train Makefile, except for box file generation. The box files are generated with the following command:
text2image --text "/traintext.txt" --outputbase "/traintext" --fontconfig_tmpdir "/fontconfig" --fonts_dir "/usr/share/fonts" --font "Jameel Noori Nastaleeq" --leading 32
I set the number of iterations to 25000 by editing the Makefile, and used the following command to generate the traineddata:
sudo make training MODEL_NAME=urd1 START_MODEL=urd TESSDATA_REPO=_best WORD_LIST=urd1.worldlist.clean
The start model is the existing Urdu model from the _best repository.
Please find below all related files
"Jameel Noori Nastqleeq" font is ligature based. After successful training when I try to use following code to perform OCR on below image, I am facing an issue related to space between few words
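import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class UrduOcrTest { // class name added only to make the snippet compile
    public static void main(String[] args) {
        File imageFile = new File("testing.png");
        ITesseract instance = new Tesseract();
        instance.setDatapath("tessdata"); // folder containing urd1.traineddata
        try {
            instance.setLanguage("urd1"); // the newly trained model
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}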

OCR result:
ہفتوںکی منصوبہ بندی حقیقتکا روپ دھار نےلگی
This is the same line that was used during training. The OCR output is fine except for the missing space between a few words, e.g. ہفتوںکی and حقیقتکا in the result above.
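To pin down exactly where the spacing goes wrong, the OCR line can be compared with the training line by looking only at space positions. Below is a minimal sketch of that comparison in plain Java. It is not part of Tesseract or Tess4J; the class SpaceDiff and its methods are hypothetical helpers, and the ground-truth string in main is my assumption of what the training line looks like, reconstructed from the OCR result.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper (not a Tesseract/Tess4J API): compares two lines that
// should contain the same letters and reports where their spacing differs.
class SpaceDiff {

    // The line with all whitespace removed, used to check that both lines
    // differ only in spacing.
    static String letters(String line) {
        return line.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    // For every space in the line, the number of non-space code points that
    // precede it. Leading and trailing whitespace is ignored.
    static SortedSet<Integer> spaceOffsets(String line) {
        SortedSet<Integer> offsets = new TreeSet<>();
        String trimmed = line.trim();
        int seen = 0;
        for (int i = 0; i < trimmed.length(); ) {
            int cp = trimmed.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                offsets.add(seen);
            } else {
                seen++;
            }
            i += Character.charCount(cp);
        }
        return offsets;
    }

    // Offsets where the ground truth has a space but the OCR output does not.
    static SortedSet<Integer> missingSpaces(String truth, String ocr) {
        SortedSet<Integer> missing = spaceOffsets(truth);
        missing.removeAll(spaceOffsets(ocr));
        return missing;
    }

    // Offsets where the OCR output has a space but the ground truth does not.
    static SortedSet<Integer> extraSpaces(String truth, String ocr) {
        return missingSpaces(ocr, truth);
    }

    public static void main(String[] args) {
        // ASSUMPTION: ground truth reconstructed from the OCR result in this post.
        String truth = "ہفتوں کی منصوبہ بندی حقیقت کا روپ دھارنے لگی";
        String ocr = "ہفتوںکی منصوبہ بندی حقیقتکا روپ دھار نےلگی";
        if (!letters(truth).equals(letters(ocr))) {
            System.out.println("Lines differ in more than spacing; offsets may not line up.");
        }
        System.out.println("Missing spaces after letter index: " + missingSpaces(truth, ocr));
        System.out.println("Extra spaces after letter index: " + extraSpaces(truth, ocr));
    }
}
```

Running this over all training lines should show whether the missing spaces always follow the same ligatures, which would support the kerning explanation.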

I believe this is because, in the Urdu Nastaleeq writing style (with kerning), the space between words is diagonal rather than vertical.

Is there any way to resolve this issue in Tesseract 4?