Re: Tesseract ocr to XML with text positions (X and Y)

Nick White

Dec 1, 2012, 9:47:59 AM

to tesser...@googlegroups.com

Hi Benito,

Use the 'hocr' configuration option, like this:

tesseract image.png output hocr

See the tesseract manual page for more details.
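
For anyone who wants the X and Y positions out of that file afterwards, here is a minimal sketch using only the Python standard library. It assumes each word is a span with class `ocrx_word` (or `ocr_word` in older output) and carries `bbox x0 y0 x1 y1` in its `title` attribute, as hOCR specifies; the exact markup varies between Tesseract versions, so check your own output. `hocr_words` and `SAMPLE` are illustrative names, and the sample is hand-written, not real Tesseract output.

```python
import re
import xml.etree.ElementTree as ET

# hOCR stores the bounding box in the title attribute: "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
# Assumed word-level class names; adjust if your output differs.
WORD_CLASSES = {"ocr_word", "ocrx_word"}


def hocr_words(hocr_bytes):
    """Return a list of (text, x0, y0, x1, y1) tuples, one per word."""
    root = ET.fromstring(hocr_bytes)
    words = []
    for el in root.iter():
        if el.get("class") not in WORD_CLASSES:
            continue
        match = BBOX_RE.search(el.get("title", ""))
        if match is None:
            continue  # word element without a bbox; skip it
        text = "".join(el.itertext()).strip()
        words.append((text, *map(int, match.groups())))
    return words


# Hand-written sample in the general shape of hOCR (not real output).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" title="bbox 0 0 640 480">
   <span class="ocr_line" title="bbox 36 92 220 116">
    <span class="ocrx_word" title="bbox 36 92 96 116">Hello</span>
    <span class="ocrx_word" title="bbox 110 92 220 116">world</span>
   </span>
  </div>
 </body>
</html>"""

if __name__ == "__main__":
    for word in hocr_words(SAMPLE):
        print(word)
```

To use it on a real file, read the file Tesseract wrote in binary mode and pass the bytes to `hocr_words`.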

Nick

Nick White

Dec 2, 2012, 5:56:55 AM

to tesser...@googlegroups.com

On Sun, Dec 02, 2012 at 01:29:54AM -0800, Benito2313 wrote:
> Thank you for your reply. I can't find the manual page for Tesseract; could you post a link?

http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html

(there are links to it from the ReadMe and FAQ pages)

Presumably it's also included in the Windows Tesseract install, but
I don't know where it would be.

Nick

zdenko podobny

Dec 3, 2012, 2:48:20 AM

to tesser...@googlegroups.com

On Sun, Dec 2, 2012 at 11:56 AM, Nick White <nick....@durham.ac.uk> wrote:

> On Sun, Dec 02, 2012 at 01:29:54AM -0800, Benito2313 wrote:
> > Thank you for your reply. I can't find the manual page for Tesseract; could you post a link?
>
> http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
>
> (there are links to it from the ReadMe and FAQ pages)
>
> Presumably it's also included in the Windows Tesseract install, but
> I don't know where it would be.

Unfortunately, the manual pages are not included in the Windows install. Maybe in the next version...

--
Zdenko

Nick White

Dec 3, 2012, 3:22:13 AM

to tesser...@googlegroups.com

On Sun, Dec 02, 2012 at 08:04:50AM -0800, Benito2313 wrote:
> I got HTML output, so it's getting there. But is it possible to get hOCR to
> give XML output?

What is it that you're trying to do? HTML is an XML dialect, after
all (or can be, if XHTML). You should be able to parse it with all
XML tools.

The only way to get a different XML representation would be to
either delve into the API, or convert the hOCR to something more to
your liking. But hOCR is *the* XMLish OCR output standard; I don't
see why you'd want anything else.
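
If you do want to convert the hOCR into a simpler XML of your own, a rough sketch could look like the following. The target layout (`<page>` containing `<word x= y= width= height=>`) is made up for illustration and is not any standard; `hocr_to_plain_xml` and `SAMPLE` are hypothetical names. It assumes words are spans with class `ocrx_word` or `ocr_word` and a `bbox x0 y0 x1 y1` entry in `title`, which may differ between Tesseract versions.

```python
import re
import xml.etree.ElementTree as ET

# hOCR stores the bounding box in the title attribute: "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def hocr_to_plain_xml(hocr_bytes):
    """Convert hOCR into an illustrative <page><word .../></page> XML string."""
    src = ET.fromstring(hocr_bytes)
    page = ET.Element("page")
    for el in src.iter():
        # Assumed word-level class names; adjust if your output differs.
        if el.get("class") not in ("ocr_word", "ocrx_word"):
            continue
        match = BBOX_RE.search(el.get("title", ""))
        if match is None:
            continue
        x0, y0, x1, y1 = map(int, match.groups())
        word = ET.SubElement(
            page, "word",
            x=str(x0), y=str(y0),
            width=str(x1 - x0), height=str(y1 - y0),
        )
        word.text = "".join(el.itertext()).strip()
    return ET.tostring(page, encoding="unicode")


# Hand-written sample in the general shape of hOCR (not real output).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" title="bbox 0 0 640 480">
   <span class="ocrx_word" title="bbox 36 92 96 116">Hello</span>
   <span class="ocrx_word" title="bbox 110 92 220 116">world</span>
  </div>
 </body>
</html>"""

if __name__ == "__main__":
    print(hocr_to_plain_xml(SAMPLE))
```

The coordinates are pixels measured from the top-left corner of the image, so `x` and `y` here are the top-left corner of each word box.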

Nick

zdenko podobny

Dec 3, 2012, 4:24:40 AM

to tesser...@googlegroups.com

On Mon, Dec 3, 2012 at 9:19 AM, Benito2313 <benit...@hotmail.com> wrote:

> On Monday, 3 December 2012 08:48:20 UTC+1, zdenop wrote the following:
>
> Zdenko, are you replying about the manual page, or about the XML output?

manual page(s).

--
Zdenko

Nick White

Dec 3, 2012, 5:05:54 AM

to tesser...@googlegroups.com

On Mon, Dec 03, 2012 at 01:49:08AM -0800, Benito2313 wrote:
> > What is it that you're trying to do? HTML is an XML dialect, after
> > all (or can be, if XHTML). You should be able to parse it with all
> > XML tools.
>
> My program works with XML files.
> I can see the HTML source when I open it in Notepad. How can I tell
> whether it is XHTML?

I just checked the HTML output from Tesseract. It is XHTML, so it is
a proper dialect of XML. You can tell from the <?xml declaration on the
first line, plus the doctype and xmlns on the following lines.
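
A quick way to check this programmatically is to run the file through an XML parser and look at the root element. This is only a sketch with the Python standard library; `is_xhtml` is an illustrative name, and the check is deliberately loose (well-formed XML with an `html` root in the XHTML namespace), not a validation against the DTD.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def is_xhtml(data):
    """True if data is well-formed XML whose root is <html> in the XHTML namespace."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return False  # not well-formed XML, so not XHTML
    return root.tag == "{%s}html" % XHTML_NS


# Hand-written examples, not real Tesseract output.
GOOD = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><body><p>ok</p></body></html>"""

# Plain HTML habits such as unclosed tags are not well-formed XML.
BAD = b"<html><body><p>unclosed</body></html>"

if __name__ == "__main__":
    print(is_xhtml(GOOD))
    print(is_xhtml(BAD))
```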

Nick

Ajay

Apr 26, 2016, 10:24:36 AM

to tesseract-ocr

Hi Nick White,

I would like your help with a problem. Could you please let me know how to extract the text from a specified location in a scanned PDF using Tesseract OCR?

Also, could you help me get the coordinates of each word along with its text when reading the whole page?