Diagonal space in Urdu Nastaleeq font

Muhammad Azeem

Apr 21, 2019, 1:30:24 AM

to tesseract-ocr

Hi,

I am trying to train Tesseract 4 using the ocrd-train Makefile for everything except box file generation. The box files are generated with the command below:

text2image --text "/traintext.txt"  --outputbase "/traintext"  --fontconfig_tmpdir "/fontconfig"  --fonts_dir "/usr/share/fonts" --font "Jameel Noori Nastaleeq" --leading 32

I set the number of iterations to 25000 by editing the Makefile, and used the following command to generate the trained data:

sudo make training MODEL_NAME=urd1 START_MODEL=urd TESSDATA_REPO=_best WORD_LIST=urd1.worldlist.clean

The start model is the existing Urdu model from the _best repository.

Please find below all related files

The "Jameel Noori Nastaleeq" font is ligature-based. Training completes successfully, but when I use the following code to run OCR on the image below, the space between a few words is missing:

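import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

// Class name is a placeholder; the original post shows only the main method.
public class OcrTest {
	public static void main(String[] args) {
		File imageFile = new File("testing.png");
		ITesseract instance = new Tesseract();
		// Directory that contains urd1.traineddata
		instance.setDatapath("tessdata");

		try {
			// Use the newly trained model
			instance.setLanguage("urd1");
			String result = instance.doOCR(imageFile);
			System.out.println(result);
		} catch (TesseractException e) {
			System.err.println(e.getMessage());
		}
	}
}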

OCR result:

ہفتوںکی منصوبہ بندی حقیقتکا روپ دھار نےلگی

This is the same line that was used during training. The OCR output is fine except for the missing space between a few words; for example, the first two words are joined in the result above.

I believe this is because the space between words in the Urdu Nastaleeq writing style (with kerning) is diagonal rather than vertical.
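
To pin down exactly which word boundaries are lost, the OCR output can be compared with the ground-truth line. The following is only a minimal sketch and is not part of Tesseract or Tess4J: the class name SpaceDiff and its methods are hypothetical, it assumes the two lines are identical apart from spacing, and it uses Latin placeholder text rather than the real Urdu training line.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not a Tesseract/Tess4J API): lists the word pairs
// of the ground-truth line whose separating space is absent in the OCR line.
class SpaceDiff {

    // Offsets, counted in non-space characters, at which each word of the line ends.
    static Set<Integer> boundaryOffsets(String line) {
        Set<Integer> offsets = new HashSet<>();
        int offset = 0;
        for (String word : line.trim().split("\\s+")) {
            offset += word.length();
            offsets.add(offset);
        }
        return offsets;
    }

    // Assumption: both lines contain the same characters once whitespace is removed.
    static List<String> missingSpaces(String expected, String actual) {
        String e = expected.replaceAll("\\s+", "");
        String a = actual.replaceAll("\\s+", "");
        if (!e.equals(a)) {
            throw new IllegalArgumentException("Lines differ in more than spacing");
        }
        Set<Integer> found = boundaryOffsets(actual);
        String[] words = expected.trim().split("\\s+");
        List<String> missing = new ArrayList<>();
        int offset = 0;
        for (int i = 0; i + 1 < words.length; i++) {
            offset += words[i].length();
            // A boundary expected here but not present in the OCR line.
            if (!found.contains(offset)) {
                missing.add(words[i] + " " + words[i + 1]);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Placeholder strings; substitute the training line and the OCR result.
        System.out.println(missingSpaces("weeks of planning", "weeksof planning"));
    }
}
```

Running this over every training line and its OCR result would show whether the dropped spaces cluster around particular ligature pairs, which would help confirm or rule out the diagonal-spacing explanation.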