I have scanned some index pages that I would like to OCR for rapid searching. I am using tesseract from the command line. The problem is that tesseract ignores the whitespace between columns and merges everything together, which fragments the contents. From the debug output I can see that no "columns" are detected. Probably more important, three "blocks" are detected: one around the first line, one around the last line, and one encompassing everything in between. Is there a way to train block detection, or are there parameters I can tweak to improve this?
I have attached the image merely as an abstract representation of the text layout, to show the types of columns I am dealing with. Ideally, I would also like to know whether tab stops can be trained and used to put each individual topic on one line. I could do that in postprocessing if I could get the tab stops printed.
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/5b4800f9-cead-4959-9260-52e98ee596b7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Which page segmentation mode (psm) did you try?
On 12 Oct 2016 11:21 p.m., "fuzzy7k" <kva...@gmail.com> wrote:
I have scanned some index pages that I would like to ocr for rapid searching. ...
Try psm 6, also 11, 12
https://github.com/tesseract-ocr/tesseract/issues/434
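For anyone who wants to compare the suggested modes side by side, here is a minimal sketch that only builds the command lines. The file names are hypothetical, and the flag spelling is an assumption about your build: newer tesseract releases use `--psm`, while older 3.x builds used `-psm`. The mode descriptions are paraphrased from tesseract's help text.

```python
import shutil
import subprocess

# Page segmentation modes suggested in this thread.
SUGGESTED_PSMS = {
    6: "assume a single uniform block of text",
    11: "sparse text, find as much text as possible in no particular order",
    12: "sparse text with orientation and script detection (OSD)",
}


def build_command(image_path, output_base, psm, flag="--psm"):
    """Return the argument list for one tesseract run with the given psm."""
    return ["tesseract", image_path, output_base, flag, str(psm)]


def run_suggested_modes(image_path, flag="--psm"):
    """Run tesseract once per suggested mode; return the commands that were run.

    Returns an empty list if no tesseract binary is found on PATH.
    """
    if shutil.which("tesseract") is None:
        return []
    commands = []
    for psm in SUGGESTED_PSMS:
        # Each run writes its own text file, e.g. out_psm6.txt (hypothetical names).
        cmd = build_command(image_path, "out_psm{}".format(psm), psm, flag)
        subprocess.run(cmd, check=True)
        commands.append(cmd)
    return commands
```

Calling `run_suggested_modes("index_page.png")` would leave one text file per mode, which makes it easy to diff the results and see which mode keeps the columns apart.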
On 13 Oct 2016 1:13 p.m., "fuzzy7k" <kva...@gmail.com> wrote:
I tried psm 0-3.
On Thursday, October 13, 2016 at 1:46:45 AM UTC-4, shree wrote: Which page segmentation mode (psm) did you try?
You can also experiment with hocr and tsv output modes to see if they help.
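As a rough illustration of how the TSV output might help with the column problem: each word row carries its bounding box, so lines that tesseract merged across columns can be split again wherever the horizontal gap between neighbouring words is unusually wide. The sketch below assumes the standard TSV header (`level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf`, `text`). The sample data, the function names, and the 50-pixel gap threshold are all made up for the example and would need tuning for real scans.

```python
import csv
import io

# Hypothetical sample in tesseract's TSV layout (level 5 rows are words).
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "5\t1\t1\t1\t1\t1\t10\t20\t60\t12\t95\tAnchors,\n"
    "5\t1\t1\t1\t1\t2\t75\t20\t20\t12\t96\t12\n"
    "5\t1\t1\t1\t1\t3\t300\t20\t70\t12\t94\tBearings,\n"
    "5\t1\t1\t1\t1\t4\t375\t20\t20\t12\t93\t48\n"
)


def split_line_into_columns(words, min_gap):
    """Split one line's words into columns at gaps wider than min_gap pixels."""
    columns, current, prev_right = [], [], None
    for word in sorted(words, key=lambda w: w["left"]):
        if prev_right is not None and word["left"] - prev_right > min_gap:
            columns.append(" ".join(current))
            current = []
        current.append(word["text"])
        prev_right = word["left"] + word["width"]
    if current:
        columns.append(" ".join(current))
    return columns


def columns_from_tsv(tsv_text, min_gap=50):
    """Group word rows by tesseract's line ids, then split each line at wide gaps."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = {}
    for row in reader:
        # Skip page/block/paragraph/line rows and empty words.
        if row["level"] != "5" or not (row["text"] or "").strip():
            continue
        key = (
            int(row["page_num"]),
            int(row["block_num"]),
            int(row["par_num"]),
            int(row["line_num"]),
        )
        lines.setdefault(key, []).append(
            {"left": int(row["left"]), "width": int(row["width"]), "text": row["text"]}
        )
    return [split_line_into_columns(ws, min_gap) for _, ws in sorted(lines.items())]
```

With the sample above, `columns_from_tsv(SAMPLE_TSV)` gives `[["Anchors, 12", "Bearings, 48"]]`. On a real page the TSV would come from running tesseract with the `tsv` config (for example `tesseract index_page.png out tsv`, file name hypothetical), assuming your build includes that output mode.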