for (var pageIndex = 0; pageIndex < numberOfPages; pageIndex++)
{
    pdfImage.convertPage(pageIndex).then(function (pageImage) {
        Tesseract.recognize(pageImage, 'eng').then(({ data: { text } }) => {
            console.log(text);
            //perform other synchronous processing with the text before moving to the next page...
        });
    });
}
I can process a single page without the for loop in about 200ms, but when I try to loop, everything gets messed up, presumably because the loop kicks off every page at once instead of waiting for the previous one to finish. I'm not sure how to process these promises sequentially, one page at a time. I know promises are supposed to be more efficient, but sometimes order matters and resources are too limited for unchecked parallel processing, for example file type conversion and OCR for 300-500 page documents.
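To make the goal concrete, here is a minimal sketch of the one-page-at-a-time behavior described above, using `async`/`await` inside a plain `for` loop. The helper name `ocrPagesInOrder` and its two callback parameters are hypothetical, introduced only for illustration; the callbacks stand in for `pdfImage.convertPage` and `Tesseract.recognize` from the loop above, and this is untested against the real libraries.

```javascript
// Hypothetical helper: handles pages strictly in order, one at a time.
// convertPage(pageIndex) and recognize(pageImage) are supplied by the caller
// and are each expected to return a promise.
async function ocrPagesInOrder(numberOfPages, convertPage, recognize) {
  const texts = [];
  for (let pageIndex = 0; pageIndex < numberOfPages; pageIndex++) {
    // Each await pauses the loop, so page N+1 is not started
    // until page N has been converted and recognized.
    const pageImage = await convertPage(pageIndex);
    const text = await recognize(pageImage);
    texts.push(text);
  }
  return texts;
}

// Possible wiring with the calls from the loop above (assumption, not verified):
// ocrPagesInOrder(
//   numberOfPages,
//   (i) => pdfImage.convertPage(i),
//   async (img) => (await Tesseract.recognize(img, 'eng')).data.text
// ).then((texts) => texts.forEach((t) => console.log(t)));
```

Because only one page is in flight at any moment, memory and CPU use should stay roughly flat regardless of page count, at the cost of giving up any parallel speedup.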
As a nice-to-have, I would also like to figure out how to load and initialize tesseract.js once and then just call the recognize method for each page. I have tried the following code to achieve that, but I think it loads and initializes, then reloads and initializes again when it calls the recognize method. Controlling that behavior may not be possible, but I figured I'd throw it out there.
(async () =>
{
    await Tesseract.load();
    await Tesseract.loadLanguage('eng');
    await Tesseract.initialize('eng');
});
//then perform the convertPage then recognize as shown in the first code block above...
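For the load-once idea, a rough sketch of the shape being aimed for is a single worker that is set up one time and then reused for every page. The helper `ocrWithSharedWorker` and the `makeWorker` factory are hypothetical names. The commented factory assumes the 2.x-style worker API of tesseract.js (`createWorker`, then `load`, `loadLanguage`, `initialize` on the worker); other versions may differ, so treat it as an assumption rather than a confirmed recipe.

```javascript
// Hypothetical helper: sets up one worker, reuses it for every image,
// and shuts it down at the end.
// makeWorker() is supplied by the caller and should resolve to an object
// exposing recognize(image) and terminate().
async function ocrWithSharedWorker(makeWorker, pageImages) {
  const worker = await makeWorker(); // expensive setup happens exactly once
  const texts = [];
  try {
    for (const image of pageImages) {
      const { data: { text } } = await worker.recognize(image);
      texts.push(text);
    }
  } finally {
    // Release the worker even if recognition fails part-way through.
    await worker.terminate();
  }
  return texts;
}

// Possible factory, assuming the tesseract.js 2.x-style worker API (unverified here):
// async function makeWorker() {
//   const worker = Tesseract.createWorker();
//   await worker.load();
//   await worker.loadLanguage('eng');
//   await worker.initialize('eng');
//   return worker;
// }
```

Keeping the worker behind a factory means the setup cost is paid once per document rather than once per page, which is the behavior the snippet above was trying to get.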
Thank you!