Ah, a misunderstanding there.
Ok, the key message of those pages is: you must extract each "table cell" as a /separate/ image to help OCR, then, if needed, combine the text results for each of those smaller images to form the text of your page.
That's often referred to as "segmentation".
Tesseract has an algorithm for that built in AFAICT, but it is geared towards pages of text (reams of text, lines of text) and picking out the individual words in there. That analysis gets very confused when you feed it a table layout, which has all kinds of edges in the image that are /not/ text, but table cell /borders/.
So what those links are hinting at is that you need to come up with an image *preprocess* which can handle your type of table. This depends on your particular table layout, as there are many ways to "design / style" a table.
So you will have to write some script which will find and then cut out each table cell as an image to feed tesseract.
When you look for segmentation approaches on the net, leptonica and opencv get mentioned a lot.
Unfortunately most segmentation work you find when googling for it is about object and facial recognition. Not a problem per se: isn't a table cell an object too? Well, not really, not in the sense they're using it, as those algorithms approach image segmentation from the concept of each object being an area filled with color(s). This would be applicable if the table were styled as cells with alternating backgrounds, for instance, but yours is all white with just some thin black borders.
There's a couple of ideas for that:
1: conform the image to an (empty) form template, i.e. seek a way to make your scanned form overlay near perfectly on a template image. Then you have to define your areas of interest (box coordinates in the template) and clip those parts out, save them as individual files and feed those to tesseract. This is often done for government application forms: there is a reason you're supposed to only write within the boxes. 😉
That is what that first link alludes to. It's just one idea among many to try.
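For what it's worth, a rough sketch of idea 1 could look something like this (Python + OpenCV + pytesseract; the filenames and the cell coordinates are made-up placeholders you'd have to replace with measurements from your own template):

```python
import cv2
import numpy as np
import pytesseract

template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)   # empty form
scan = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)           # filled-in scan

# Match keypoints between scan and template, then estimate a homography
# that "conforms" the scan to the template's geometry.
orb = cv2.ORB_create(5000)
kp1, des1 = orb.detectAndCompute(scan, None)
kp2, des2 = orb.detectAndCompute(template, None)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
aligned = cv2.warpPerspective(scan, H, (template.shape[1], template.shape[0]))

# Areas of interest (x, y, w, h) in template coordinates: placeholders only,
# measure the real boxes on your own template.
cells = {"name": (100, 200, 400, 60), "date": (550, 200, 200, 60)}
for label, (x, y, w, h) in cells.items():
    crop = aligned[y:y + h, x:x + w]
    print(label, pytesseract.image_to_string(crop, config="--psm 7").strip())
```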
2: what if you cannot or must not apply idea 1? Can we perhaps detect those table borders through image processing and /then/ come up with something that can take that data and help us extract the cell images?
Here are a couple of fellows who have thought "out of the box" (pun intended) and gotten some results by phrasing the question in an entirely different way: instead of wondering how we can detect and extract those table cells, they try to answer the question "what if we are able to *remove* those cell borders visually?" Yes, we'll worry later about the text in the cells looking like a haphazard ream of text, and we should expect trouble discerning which bit of recognized text sat in which cell exactly (tesseract can output hOCR and other formats which deliver the text plus its placement coordinates); you may have to work on that *afterwards* when you do something like they're doing:
Looks promising to me. What I'd attempt next with their approach is to see if I can extend those detected borders and then extract each individual black area (a cell!) as a pixel *mask*, to be applied to my (conformed) page image so everything is thrown out except the pixels in that cell, thus giving me one image with one cell's worth of text. Repeat that for each black area (see the answers at
https://stackoverflow.com/questions/33949831/whats-the-way-to-remove-all-lines-and-borders-in-imagekeep-texts-programmatic for what I mean: the result image he gets, which is pure black with the table borders (lines) in white.)
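To give you an idea of what that boils down to, here's a minimal sketch (Python + OpenCV) of the "detect the borders, then erase them" step; the kernel lengths depend on your scan resolution and are just guesses:

```python
import cv2
import numpy as np

img = cv2.imread("table.png", cv2.IMREAD_GRAYSCALE)
# Invert + threshold so ink (text and borders) becomes white on black.
binary = cv2.adaptiveThreshold(~img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY, 15, -2)

# Long, thin structuring elements keep only horizontal / vertical strokes,
# i.e. the table borders, and drop the (curvy) text characters.
horiz = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                         cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
vert = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                        cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40)))
lines = cv2.bitwise_or(horiz, vert)      # white borders on black = border mask
cv2.imwrite("border_mask.png", lines)

# "Remove" the borders from the original by painting them white.
no_borders = img.copy()
no_borders[lines > 0] = 255
cv2.imwrite("table_no_borders.png", no_borders)
```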
/They/ tackle the same problem, but conceptually in a very different way from what I'm describing: they mask out the detected table borders in one go.
That can work very well and is much faster as they are not extracting subimages by masking or other means.
Their *potential* trouble will be deciding which bits of text belong together in which cell. That can be done with bounding-box analysis after OCR/tesseract has done its job (again, Google can provide hints, and again, it depends on your particular circumstances).
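If I were doing that bounding-box analysis, a rough sketch could use pytesseract's image_to_data output (word boxes) and assign each word to a cell rectangle; the cell rectangles below are placeholders you'd get from your own border detection:

```python
import cv2
import pytesseract
from pytesseract import Output

img = cv2.imread("table_no_borders.png")
data = pytesseract.image_to_data(img, output_type=Output.DICT)

# Hypothetical cell rectangles (x, y, w, h) recovered from the border mask.
cell_rects = [(0, 0, 300, 80), (300, 0, 300, 80)]

def owning_cell(cx, cy):
    for i, (x, y, w, h) in enumerate(cell_rects):
        if x <= cx < x + w and y <= cy < y + h:
            return i
    return None

cells = {}
for i, word in enumerate(data["text"]):
    if not word.strip():
        continue
    # The centre of the word's bounding box decides which cell it belongs to.
    cx = data["left"][i] + data["width"][i] // 2
    cy = data["top"][i] + data["height"][i] // 2
    idx = owning_cell(cx, cy)
    if idx is not None:
        cells.setdefault(idx, []).append(word)

for idx, words in sorted(cells.items()):
    print(idx, " ".join(words))
```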
My (very probable) trouble will be identifying the black cell areas individually: doing a simple flood fill with a color and then extracting everything covered by that color is troublesome, as the table border detection might very well not be perfect and thus cause my simple flood fill to color adjacent cells too. 😢 So, if I had your task, I'd be looking at ways to extract, say, each individual *minimum rectangle* which does not contain white pixels (uh-oh, need noise removal then!) OR perhaps a way where each detected line segment is described as a vector and those lines are then extended out across the page, so the rectangles in between become my cells. That gets bothersome when the table has cells spanning columns or rows. So more research needed before I'd code that preprocess.
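To illustrate what I mean by extracting each black area: a rough sketch, assuming you already have a border mask like the one produced above (white borders on black), could dilate the borders a bit to close small gaps (that flood-fill leak worry) and then enumerate the cell interiors as connected components; the thresholds are guesses:

```python
import cv2
import numpy as np
import pytesseract

img = cv2.imread("table.png", cv2.IMREAD_GRAYSCALE)
lines = cv2.imread("border_mask.png", cv2.IMREAD_GRAYSCALE)

# Thicken the detected borders a little so small gaps don't merge two
# neighbouring cells into one blob.
lines = cv2.dilate(lines, np.ones((3, 3), np.uint8), iterations=2)
cells_mask = cv2.bitwise_not(lines)      # cell interiors become white blobs

n, labels, stats, _ = cv2.connectedComponentsWithStats(cells_mask, connectivity=4)
page_area = img.shape[0] * img.shape[1]
for i in range(1, n):                    # label 0 is the background (the borders)
    x, y, w, h, area = stats[i]
    # Skip noise specks and the big blob outside the table.
    if area < 500 or area > 0.9 * page_area:
        continue
    crop = img[y:y + h, x:x + w]
    text = pytesseract.image_to_string(crop, config="--psm 6").strip()
    print(f"cell at ({x},{y}) size {w}x{h}: {text}")
```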
Another issue with the line detection + removal/zoning techniques would be making sure the lines are all either near-perfectly horizontal and vertical (*orienting*/*deskewing* the image will help some there) OR you must come up with an algorithm that's able to find angled lines (while ignoring the curvy text characters). Again, yet another area of further investigation if I were at it.
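For the deskewing part, one rough sketch I might try: estimate the skew angle from the near-horizontal border segments found by a Hough transform and rotate the page by that amount; all the parameters below are guesses to be tuned on real scans:

```python
import cv2
import numpy as np

img = cv2.imread("table.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)
segments = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                           minLineLength=200, maxLineGap=10)

# Collect the angles of the roughly horizontal segments (the table borders).
angles = []
if segments is not None:
    for x1, y1, x2, y2 in segments[:, 0]:
        a = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        if abs(a) < 20:
            angles.append(a)

skew = float(np.median(angles)) if angles else 0.0
h, w = img.shape
M = cv2.getRotationMatrix2D((w / 2, h / 2), skew, 1.0)
deskewed = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
cv2.imwrite("deskewed.png", deskewed)
```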
The key here is that you'll have to do some work on your images before you can call tesseract and expect success.
HTH.