Hi All,
I'm using Tesseract 3.02.02 on a Windows 7 computer, via the gImageReader GUI front-end (so I don't have to go into the black command-prompt window, MS-DOS style).
It works well, except for the same problem everyone else has: the character sequences fi and fl are replaced by the Unicode characters U+FB01 and U+FB02 (LATIN SMALL LIGATURE FI and LATIN SMALL LIGATURE FL).
The solution given in a few other threads is to put a blacklist in the config file, but I've tried that and not succeeded. How do you actually do that on Windows?
Firstly: there is no config file as such. Tesseract is not "installed"; its files are simply copied into this directory:
C:\Users\rob\AppData\Local\Tesseract-OCR
Deeper down there are 3 more directories:
1. C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata
which has the files:
eng.traineddata
eng.cube.fold
eng.cube.lm_
eng.cube.word-freq
eng.cube.size
eng.cube.nn
eng.cube.params
eng.cube.bigrams
eng.cube.lm
eng.tesseract_cube.nn
osd.traineddata
plus 2 directories:
2. C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata\configs
which has the files:
ambigs.train
api_config
bigram
box.train
box.train.stderr
digits
hocr
inter
kannada
linebox
logfile
makebox
quiet
rebox
strokewidth
unlv
3. C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata\tessconfigs
which has the files:
batch
batch.nochop
matdemo
msdemo
nobatch
segdemo
Is one of these the "configuration" file I need to edit?
Note also that the standard Windows editor is Notepad, which gives you the option to save text as ANSI, UTF-8, Unicode or Unicode big-endian. Which is the correct one to use? ANSI is the default, but it won't let you save the ligatures, so it must be one of the others. I've tried them all, both editing existing files and adding new files. It always failed.
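On the encoding question, one thing worth ruling out is a byte-order mark at the start of the file, which some editors add when saving as UTF-8. Below is a minimal sketch, assuming Python is available (it is not part of the setup described above), that writes a one-line config as UTF-8 without a byte-order mark. The variable name tessedit_char_blacklist is the one the other threads refer to; the file name "noligatures" is a hypothetical choice, and whether this blacklist actually suppresses the ligatures in 3.02.02 is not something the sketch can confirm.

```python
from pathlib import Path

# Hypothetical config file name; any name without an extension should do.
CONFIG_NAME = "noligatures"


def blacklist_config_bytes(chars="\ufb01\ufb02"):
    """Build the config line as UTF-8 bytes, with no byte-order mark.

    The default blacklists U+FB01 (fi ligature) and U+FB02 (fl ligature).
    """
    line = "tessedit_char_blacklist " + chars + "\n"
    return line.encode("utf-8")


def write_config(configs_dir, name=CONFIG_NAME):
    """Write the config file into the given configs directory."""
    path = Path(configs_dir) / name
    path.write_bytes(blacklist_config_bytes())
    return path


if __name__ == "__main__":
    # Directory taken from the listing above; adjust to the local install.
    configs = r"C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata\configs"
    if Path(configs).is_dir():
        print("wrote", write_config(configs))
    else:
        print("configs directory not found:", configs)
```

On the command line such a file would be named after the output base, for example "tesseract page.tif out -l eng noligatures" (page.tif and out are placeholder names). How to make gImageReader pick it up is a separate question that this sketch does not answer.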
More info: I know nothing about programming and have no compiler on my computer. I downloaded working executables from SourceForge or GitHub or Google Code or somewhere, and managed to get them going without too much fuss by following the instructions.
I never did any training of Tesseract - it came already trained, presumably.
But I can't find any simple configuration instructions to follow for getting rid of the Latin fi and fl ligatures by editing files on Windows. And I do want to get rid of them, converting each one to two standard English letters so that the files can be saved as plain English text.
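If no configuration change works, a fallback would be to fix the text after the OCR step rather than inside Tesseract. A minimal sketch, assuming Python is available and that the recognised text has been saved as a UTF-8 file; the file names are hypothetical:

```python
from pathlib import Path

# The two ligatures mentioned above and their two-letter replacements.
LIGATURES = {
    "\ufb01": "fi",  # U+FB01 LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # U+FB02 LATIN SMALL LIGATURE FL
}


def expand_ligatures(text):
    """Replace each ligature character with its plain two-letter form."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def expand_file(src, dst):
    """Read src as UTF-8, expand the ligatures, and write the result to dst."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(expand_ligatures(text), encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical file names for the saved OCR output and the cleaned copy.
    if Path("ocr_output.txt").exists():
        expand_file("ocr_output.txt", "ocr_output_plain.txt")
```

This leaves every other character untouched, so it can be run on text that contains no ligatures at all without changing it.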
Any help appreciated,
Regards,
Rob