So you tried all the easy parts and left the difficult parts to the forum :-)
First of all - yes - this is a table problem, so you need to do the page segmentation yourself before OCR. Tesseract is an OCR engine. It can handle simple page segmentation, such as scanned book pages, but for complex layouts you need to do the layout segmentation with something else.
Next, there are plenty of graphics - you will need to get rid of them (e.g. by not passing those regions to Tesseract at all).
If the text positions are stable, you can create/use a uzn file (search the forum) to OCR just the text areas.
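
As a rough sketch of the uzn approach: as far as I know, Tesseract looks for a file with the same base name as the image and a .uzn extension, with one zone per line written as "left top width height label" in pixels, measured from the top-left corner. The helper name write_uzn and the sample zones are made up for illustration, so check the format against your Tesseract version.

```python
from pathlib import Path


def write_uzn(image_path, zones):
    """Write a .uzn zone file next to the image.

    zones is a list of (left, top, width, height, label) tuples in pixels.
    Assumption: one zone per line, fields separated by spaces, and the
    label is a single token (no spaces).
    """
    uzn_path = Path(image_path).with_suffix(".uzn")
    lines = [f"{left} {top} {width} {height} {label}"
             for (left, top, width, height, label) in zones]
    uzn_path.write_text("\n".join(lines) + "\n")
    return uzn_path
```

For example, write_uzn("form.png", [(10, 20, 300, 40, "Name")]) would create form.uzn. The zone file is usually reported to be picked up only with a page segmentation mode that does not do its own column finding, e.g. `tesseract form.png out --psm 4`, but treat that as something to verify rather than a guarantee.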
If the text positions change, one solution could be to detect the position of an expected image part like "x" and calculate the text positions from it.
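
The "calculate the text positions" step could look something like the sketch below. It assumes you have already found the top-left corner of the anchor (for instance with template matching) and that you measured, once, where each text area sits relative to that anchor. The function name zones_from_anchor and the offsets are hypothetical.

```python
def zones_from_anchor(anchor_xy, offsets):
    """Shift zones defined relative to an anchor into absolute coordinates.

    anchor_xy: (x, y) of the detected anchor's top-left corner in the image.
    offsets:   list of (dx, dy, width, height, label), where dx/dy are the
               zone's top-left corner measured from the anchor.
    Returns a list of (left, top, width, height, label) in image pixels.
    """
    ax, ay = anchor_xy
    return [(ax + dx, ay + dy, width, height, label)
            for (dx, dy, width, height, label) in offsets]
```

The result is a list of absolute boxes that you could crop and OCR one by one, or write out as zones. This only handles a shift; if the scans are also rotated or scaled you would need more than one anchor.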
Or try a text detection tool such as OpenCV’s EAST text detector [1] or YOLO...