Is hOCR the best route to convert a large number of repetitive forms into structured data?

maxim...@gmail.com

Jul 13, 2015, 1:23:09 AM

to tesser...@googlegroups.com

I'm working on converting a large number of tax forms into structured data. Is hOCR the best way to do this, or are there other ways? I would imagine this is a problem that is at least partially solved.

Thanks in advance! Tesseract is awesome :)

James Owers

Jul 14, 2015, 2:47:40 AM

to tesser...@googlegroups.com

You should consider also using the PAGE format. You can use this tool for conversion: http://www.primaresearch.org/tools/TesseractOCRToPAGE

Tom Morris

Jul 14, 2015, 2:35:19 PM

to tesser...@googlegroups.com

On Tuesday, July 14, 2015 at 2:47:40 AM UTC-4, James Owers wrote:

You should consider also using the PAGE format. You can use this tool for conversion: http://www.primaresearch.org/tools/TesseractOCRToPAGE

Most PAGE format tools aren't available as open source; they use a custom license specific to the lab that produces them. The primary thing that PAGE adds over hOCR (ground truth text) doesn't sound like it's needed here.

Tom

Janusz S. Bien

Jul 14, 2015, 3:21:42 PM

to tesser...@googlegroups.com

Quote - Tom Morris <tfmo...@gmail.com> (Tue 14 Jul 2015 08:35:19 PM CEST):

In what sense does PAGE add ground truth text over hOCR? In my opinion
hOCR is as good as PAGE for ground truth texts.

Personally I find the simple TSV format potentially quite useful. You
can find a sample output here:

http://teksty.klf.uw.edu.pl/12/
http://teksty.klf.uw.edu.pl/12/1/alice_1.png.hocr.tsv
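
For anyone who wants to produce a word-per-row TSV from hOCR themselves, here is a minimal Python sketch. The column layout (text, x0, y0, x1, y1) is an assumption made for illustration and is not necessarily the layout of the sample file linked above; the names `HocrWords` and `hocr_to_tsv` are hypothetical. It relies only on the `ocrx_word` class and the `bbox` property that hOCR defines.

```python
import csv
import io
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 10 20 90 50; x_wconf 96".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), *self._bbox))
            self._bbox = None


def hocr_to_tsv(hocr: str) -> str:
    """Return one TSV row per recognized word (assumed column layout)."""
    parser = HocrWords()
    parser.feed(hocr)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["text", "x0", "y0", "x1", "y1"])
    writer.writerows(parser.words)
    return out.getvalue()


# Hand-written hOCR fragment used only as demo input.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 800 600'>
<span class='ocr_line' title='bbox 10 20 300 50'>
<span class='ocrx_word' title='bbox 10 20 90 50; x_wconf 96'>Total</span>
<span class='ocrx_word' title='bbox 100 20 300 50; x_wconf 91'>1234.56</span>
</span></div>"""

if __name__ == "__main__":
    print(hocr_to_tsv(SAMPLE), end="")
```

A flat table like this is easy to load into a spreadsheet or a data-frame library, which is one reason a TSV view of the OCR output can be convenient.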

Regards

Janusz

--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

maxim...@gmail.com

Jul 14, 2015, 4:43:09 PM

to tesser...@googlegroups.com

Thanks, everyone, for the helpful pointers! These all appear to be different ways of describing the position of the identified words on the page? This definitely seems like it would help me produce structured data, because I can classify the words as belonging to certain attributes of a JSON object for each page based on their vertical and horizontal positions.
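
A minimal Python sketch of that idea, assuming the word boxes have already been extracted from the OCR output and that every page of a given form shares the same layout. The field names and pixel regions in `FIELD_REGIONS` are made up for illustration, and `Word`, `field_for`, and `page_to_record` are hypothetical helpers, not part of Tesseract.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


# Hypothetical field regions as (x0, y0, x1, y1) in page pixels.
# Real values would be measured once on a blank copy of the form.
FIELD_REGIONS = {
    "name": (0, 0, 400, 60),
    "total_income": (400, 100, 800, 160),
}


def field_for(word, regions):
    """Return the field whose region contains the word's center, else None."""
    cx = (word.x0 + word.x1) / 2
    cy = (word.y0 + word.y1) / 2
    for name, (x0, y0, x1, y1) in regions.items():
        if x0 <= cx < x1 and y0 <= cy < y1:
            return name
    return None


def page_to_record(words, regions):
    """Group word texts by field; words outside every region are dropped."""
    grouped = {name: [] for name in regions}
    # Naive reading order: top to bottom, then left to right.
    for word in sorted(words, key=lambda w: (w.y0, w.x0)):
        name = field_for(word, regions)
        if name is not None:
            grouped[name].append(word.text)
    return {name: " ".join(texts) for name, texts in grouped.items()}


if __name__ == "__main__":
    demo_words = [
        Word("Doe", 120, 10, 200, 40),
        Word("Jane", 10, 10, 110, 40),
        Word("52000", 420, 110, 520, 140),
        Word("Page", 10, 500, 60, 520),  # falls outside every region
    ]
    print(json.dumps(page_to_record(demo_words, FIELD_REGIONS), indent=2))
```

Testing the center of each box, not the whole box, tolerates words that slightly overlap a region boundary. Scans that are skewed or shifted would need to be aligned to the template first.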

Since I am so new to Tesseract and OCR in general, I am afraid I am missing important points or asking stupid questions, so unless you all suggest otherwise, I will spend quite a bit of time with the Tesseract source code on GitHub.

Helmut Wollmersdorfer

Jul 15, 2015, 3:41:53 AM

to tesser...@googlegroups.com

On Tuesday, July 14, 2015 at 10:43:09 PM UTC+2, maxim...@gmail.com wrote:

Thanks, everyone, for the helpful pointers! These all appear to be different ways of describing the position of the identified words on the page? This definitely seems like it would help me produce structured data, because I can classify the words as belonging to certain attributes of a JSON object for each page based on their vertical and horizontal positions.

There is another HTML-based option that positions text via CSS classes (i.e. valid HTML): pdf2htmlEX. See an example here:

http://coolwanglu.github.io/pdf2htmlEX/demo/geneve.html

Project:

https://github.com/coolwanglu/pdf2htmlEX
