Limit on number of whitelist characters

Reuben L.

Sep 9, 2014, 3:19:05 AM

to tesser...@googlegroups.com

Hi all experts,

I would like to clarify whether there is a limit on the number of whitelisted characters when using the tessedit_char_whitelist parameter in the config file. In my case, once the number of whitelisted characters exceeds roughly 1300, Tesseract throws the error "read_params_file: parameter not found" followed by the remaining characters. This suggests that, past about 1300 characters (multibyte ones), Tesseract is trying to parse the rest of the characters as a new parameter.
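Since the characters are multibyte, one way to narrow this down is to check whether the cutoff tracks bytes rather than characters. The sketch below measures the UTF-8 size of the config line and estimates how many characters would fit under a given byte limit. The function names are illustrative, and the 4096-byte figure used in the usage note is only an assumed value for a fixed-size line buffer, not something confirmed here.

```python
PARAM = "tessedit_char_whitelist"


def config_line_bytes(whitelist: str, param: str = PARAM) -> int:
    """UTF-8 byte length of a 'param value' config line, including the newline."""
    return len(f"{param} {whitelist}\n".encode("utf-8"))


def max_chars_that_fit(whitelist: str, limit_bytes: int, param: str = PARAM) -> int:
    """Count how many leading whitelist characters fit in limit_bytes.

    limit_bytes is a hypothetical per-line limit; the real value, if any,
    depends on how the config reader is implemented.
    """
    # Bytes left for the value after the parameter name, the space, and the newline.
    budget = limit_bytes - len(f"{param} \n".encode("utf-8"))
    used = 0
    count = 0
    for ch in whitelist:
        size = len(ch.encode("utf-8"))  # most kanji take 3 bytes in UTF-8
        if used + size > budget:
            break
        used += size
        count += 1
    return count
```

Calling max_chars_that_fit(whitelist, 4096) with the real whitelist string and comparing the result with the point where the error starts would show whether a byte limit of that size is a plausible explanation. If it is, the count should land close to the observed ~1300.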

It might sound strange that I have so many characters, but I need to limit Japanese kanji recognition to the 1900+ commonly used kanji instead of the full set, which is many times larger. A blacklist is also out of the question, as there are more characters to blacklist than to whitelist.

I've also tried specifying the tessedit_char_whitelist parameter twice, but only the second one was used. I have also tried passing it with the -c option on the command line, but that failed as well.

While I know it would be possible to train for only the limited set of kanji, we are already at a point where doing so would be very wasteful in terms of time.

Does anyone know of any other solution to this issue? Thanks in advance.
