Hi,
AFAICT Tesseract's OCR quality deteriorates a lot when it is fed 'inverted colors', i.e. white text on a black background. (I can't dig up the Tesseract blog post / article where I first saw this mentioned, and Google fails me right this minute, sorry.)
Second, from what I gather from all the applications/code I've investigated that feed images to Tesseract, the last preprocessing stage is always a [type of] 'threshold' stage where the text is converted to a simple black&white picture: Tesseract expects black text on a white background.
Given your purple+yellow "image test" image, a simple threshold pass would very probably render it as white text on a black background, which is the wrong way around if you want the best performance from Tesseract.
Hence a potential solution vector would be:
- find a way to 'preprocess' your images so that each one ends up as black text on a white background after the thresholding pass. (Do the thresholding yourself in your preprocessing step, so you have maximum control over the image you feed to Tesseract.)
(Quick initial thought: it might be good enough to count the pixels for each hue, find the two major 'bulges' in the color distribution, and code a quick filter which assigns the hues in the smaller bump to black and those in the larger one to white.
Another way would be to run a threshold filter first and then do the counting on the threshold /output/: pixels there can only be black or white, since a threshold pass outputs a monochrome image, so the counting code would be trivial: count the pixels and flip the colors if the black count happens to be larger than the white count. Just a rough idea, this.)
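To make that second idea concrete, here's a minimal, dependency-free sketch in Python. It assumes the image has already been decoded into a 2D list of grayscale values (0-255); the fixed threshold of 128 is an arbitrary assumption (real code would likely use a library like Pillow or OpenCV, and something adaptive like Otsu's method for the threshold):

```python
# Sketch of the "threshold, count, flip" idea.
# `pixels` is a hypothetical 2D list of grayscale values (0-255);
# the threshold of 128 is an arbitrary choice - tune for real scans.

def to_black_text_on_white(pixels, threshold=128):
    """Binarize to 0/255, then invert if black pixels outnumber white ones."""
    # Threshold: everything at or above `threshold` becomes white (255),
    # everything below becomes black (0).
    bw = [[255 if p >= threshold else 0 for p in row] for row in pixels]
    # Count black vs. white pixels in the monochrome result.
    black = sum(row.count(0) for row in bw)
    white = sum(row.count(255) for row in bw)
    # More black than white suggests white-on-black text: flip it.
    if black > white:
        bw = [[255 - p for p in row] for row in bw]
    return bw
```

The nice property is exactly the one described above: after thresholding there are only two pixel values, so "which color is the background?" reduces to a simple majority count, no hue analysis needed.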
I only did a quick scan of that one, but it sounds like it might be worth checking out further for you.
HTH
Met vriendelijke groeten / Best regards,