Hello,
I am new to Tesseract and could use some guidance on how an experienced user would tackle this issue. I have a PHP website where I can get the data out of a PDF without any issues, but the order of the data I am pulling is a mess. The return is one long string without any line breaks or other way to split it into parts. I was going to slice the PDF into several chunks and run each one through OCR separately, but I suspect Tesseract already has the power to do what I need. Also, users will be uploading new PDFs thousands of times, so fixed crops might not line up exactly the way I need them to.
My end goal is to update all of these values in my database in the order they are related. For a 4-generation pedigree that would be 31 different areas to pull data from. If the results came back in order with an X coordinate, I could use that and work my Y values down.
Even if all I had to work with was a \n character at the end of each line, I might be able to make that work.
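To make the X/Y idea concrete, here is a rough sketch of what I have in mind. It is in Python only for illustration (my site is PHP, but the logic should port directly). It assumes Tesseract is run with its tsv output config (for example `tesseract page.png out tsv`), which as far as I can tell writes one row per word with left/top pixel coordinates. The function names and the column_starts boundaries are made up by me, and the sample names in the demo are invented.

```python
import bisect
import csv
import io
from collections import defaultdict


def read_words(tsv_text):
    """Return the word rows (level 5) from Tesseract TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row.get("level") != "5" or not text:
            continue  # skip page/block/paragraph/line rows and empty words
        words.append({
            "line_key": (row["page_num"], row["block_num"],
                         row["par_num"], row["line_num"]),
            "left": int(row["left"]),
            "top": int(row["top"]),
            "text": text,
        })
    return words


def group_lines(words):
    """Join words that Tesseract placed on the same text line."""
    grouped = defaultdict(list)
    for word in words:
        grouped[word["line_key"]].append(word)
    lines = []
    for members in grouped.values():
        members.sort(key=lambda w: w["left"])  # left-to-right within a line
        lines.append({
            "left": members[0]["left"],
            "top": min(w["top"] for w in members),
            "text": " ".join(w["text"] for w in members),
        })
    return lines


def order_by_column(lines, column_starts):
    """Sort lines by column (from X), then top to bottom (from Y).

    column_starts is a hypothetical ascending list of the X pixel where
    each pedigree column begins; it would have to be measured or estimated.
    """
    def sort_key(line):
        column = bisect.bisect_right(column_starts, line["left"]) - 1
        return (column, line["top"], line["left"])
    return sorted(lines, key=sort_key)


if __name__ == "__main__":
    header = ["level", "page_num", "block_num", "par_num", "line_num",
              "word_num", "left", "top", "width", "height", "conf", "text"]
    rows = [  # invented sample rows, not real OCR output
        [5, 1, 1, 1, 1, 1, 40, 500, 60, 20, 95, "Rex"],
        [5, 1, 2, 1, 1, 1, 400, 800, 70, 20, 93, "Dam"],
        [5, 1, 2, 1, 1, 2, 480, 800, 70, 20, 93, "Bella"],
        [5, 1, 3, 1, 1, 1, 400, 200, 70, 20, 94, "Sire"],
        [5, 1, 3, 1, 1, 2, 480, 200, 70, 20, 94, "Max"],
    ]
    tsv = "\n".join("\t".join(str(c) for c in r) for r in [header] + rows)
    ordered = order_by_column(group_lines(read_words(tsv)), [0, 300])
    print("\n".join(line["text"] for line in ordered))
```

With the invented rows above, the demo prints Rex, then Sire Max, then Dam Bella, one per line, which is the kind of ordered, newline-separated result I am after. If there is a better way to get this directly from Tesseract, I would rather use that.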
On the 4-generation pedigree I tried cropping out the entire last (4th) generation. If I go that route, I would only need to make 6 crops (one for the dog, one for each of the two parents, and then one for each remaining generation). My users will have 3- or 4-generation pedigrees.
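For the cropping route, this is roughly how I would work out the crop rectangles. It is only a sketch under a big assumption: that the dog and each generation sit in equal-width columns from left to right, with the parents' column split into a top half (sire) and a bottom half (dam). The function name and the layout are hypothetical, and real pedigrees may not line up this neatly, which is exactly my worry.

```python
def crop_boxes(page_width, page_height, generations):
    """Return (label, left, top, width, height) crop boxes in pixels.

    Hypothetical layout: one column for the dog plus one per generation,
    all the same width; the parents' column is split top/bottom.
    """
    columns = generations + 1          # dog column + one per generation
    col_w = page_width // columns
    half_h = page_height // 2
    boxes = [
        ("dog", 0, 0, col_w, page_height),
        ("sire", col_w, 0, col_w, half_h),
        ("dam", col_w, half_h, col_w, page_height - half_h),
    ]
    # One full-height crop for each generation after the parents.
    for gen in range(2, generations + 1):
        boxes.append((f"generation {gen}", gen * col_w, 0, col_w, page_height))
    return boxes


if __name__ == "__main__":
    for box in crop_boxes(2500, 2000, generations=4):
        print(box)
```

Under that layout a 4-generation pedigree gives 6 boxes and a 3-generation pedigree gives 5, and each box would then be cropped and sent through OCR on its own.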
Any advice would be greatly appreciated.
Thanks
Daron

