I tried TesseractOCR (https://github.com/Sicos1977/TesseractOCR) and Leptonica to convert a JPG receipt slip to text:
    using TesseractOCR;
    using TesseractOCR.Enums;

    byte[] imageStream = <jpg image>; // placeholder: bytes of the JPG receipt
    var img = TesseractOCR.Pix.Image.LoadFromMemory(imageStream);

    // Detect the page orientation with the OSD model.
    using Engine osdengine = new(@"./tessdata", Language.Osd, EngineMode.Default);
    using TesseractOCR.Page osdpage = osdengine.Process(img, PageSegMode.AutoOsd);
    int orie = -1;
    float conf = 0;
    osdpage.DetectOrientation(out orie, out conf);

    // Rotate the image upright if needed.
    // ConvertDegreesToRadians (definition not shown) converts degrees to radians.
    if (orie != 0)
        img = img.Rotate(ConvertDegreesToRadians(360 - orie));

    // Run OCR with the Estonian language model.
    using Engine engine = new(@"./tessdata", Language.Estonian, EngineMode.Default);
    using TesseractOCR.Page page = engine.Process(img, PageSegMode.SingleBlock);
    Console.WriteLine("Result " + page.Text);

The receipt image contains a background:
https://i.sstatic.net/XXlaWJcg.jpg
The recognized text contains random characters. If the background is removed manually:
https://i.sstatic.net/wimOc8CY.png
the text is mostly recognized, but the VAT sum 18,37
https://i.sstatic.net/Um7RLlmE.png
is not recognized.
How can I properly digitize this receipt? How can I remove the background from the image, or force the OCR engine to ignore it?
What pre-processing should be applied to receipt slips before OCR?