I'm working on something related, and going about it a bit differently: you can get the bounding box of each word from the Google OCR (or Tesseract or whatever) response itself.
In the case of Google OCR, this is in jsonResponse.responses[0].textAnnotations, in every element of the array after the first one (the first element covers the whole page's text). A gist from the thing I'm working on (will share when ready, hopefully in a few days):
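As a rough illustration of reading those per-word boxes (a minimal Python sketch, not the actual code from my tool): it assumes the response has already been parsed into a dict, and that each annotation has a `description` and a `boundingPoly.vertices` list. The function name `word_boxes` and the sample data are made up for the example.

```python
def word_boxes(json_response):
    """Return (text, x_min, y_min, x_max, y_max) for each word annotation."""
    annotations = json_response["responses"][0].get("textAnnotations", [])
    boxes = []
    # annotations[0] covers the whole page text; individual words follow it.
    for annotation in annotations[1:]:
        vertices = annotation["boundingPoly"]["vertices"]
        # A coordinate equal to zero may be omitted from the JSON, so default to 0.
        xs = [v.get("x", 0) for v in vertices]
        ys = [v.get("y", 0) for v in vertices]
        boxes.append((annotation["description"], min(xs), min(ys), max(xs), max(ys)))
    return boxes


def _rect(x0, y0, x1, y1):
    """Build a four-vertex boundingPoly for the made-up sample below."""
    return {"vertices": [{"x": x0, "y": y0}, {"x": x1, "y": y0},
                         {"x": x1, "y": y1}, {"x": x0, "y": y1}]}


# Hypothetical sample shaped like a text-detection response.
sample = {"responses": [{"textAnnotations": [
    {"description": "Hello world", "boundingPoly": _rect(10, 5, 120, 25)},
    {"description": "Hello", "boundingPoly": _rect(10, 5, 60, 25)},
    {"description": "world", "boundingPoly": _rect(70, 5, 120, 25)},
]}]}

for box in word_boxes(sample):
    print(box)
```

Reducing each polygon to an axis-aligned rectangle loses any rotation, which is fine for roughly straight scans but would need rethinking for skewed pages.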
(I use this bounding box data to identify the lines of text, then manually select a group of lines and hit a button to group them into regions corresponding to the different texts. I'm fine doing it manually because it takes only a couple of seconds per region, and if it's a text I care about I may not mind glancing over it anyway. An earlier version of this was what I used to extract matching regions, verse and footnote, for this or this, for example: the same thing can be used for different texts or text vs commentary.)
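One simple way the "identify lines" step could work, as a hedged sketch rather than what my tool actually does: sort word boxes by vertical center and start a new line whenever a word's center is more than half the median word height away from the current line's average center. The `(text, x_min, y_min, x_max, y_max)` tuple layout, the function name `group_into_lines`, and the tolerance are all assumptions for the example, and it only makes sense for unrotated, single-column text.

```python
from statistics import median


def group_into_lines(boxes):
    """Group (text, x_min, y_min, x_max, y_max) word boxes into lines of text."""
    if not boxes:
        return []

    def center_y(box):
        return (box[2] + box[4]) / 2

    # Assumed tolerance: half the median word height.
    tolerance = median(b[4] - b[2] for b in boxes) / 2
    lines = []  # each entry is the list of boxes on one line
    for box in sorted(boxes, key=center_y):
        if lines:
            current = lines[-1]
            current_center = sum(center_y(b) for b in current) / len(current)
            if abs(center_y(box) - current_center) <= tolerance:
                current.append(box)
                continue
        lines.append([box])
    # Within a line, read the words left to right.
    return [" ".join(b[0] for b in sorted(line, key=lambda b: b[1]))
            for line in lines]


# Made-up boxes for two short lines, given out of order.
words = [
    ("world", 70, 6, 120, 26),
    ("Hello", 10, 5, 60, 25),
    ("Second", 10, 40, 70, 60),
    ("line", 80, 41, 110, 61),
]
print(group_into_lines(words))  # ['Hello world', 'Second line']
```

Grouping the resulting lines into regions is the part I do by hand, so there is nothing automatic to show for that step.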