Not able to lift text from the attached file please help!!!


Shobhit Kapil

Apr 3, 2019, 10:28:14 AM
to tesseract-ocr
Hi Team,

I am using Tesseract version 4 with the page segmentation mode set to LSTM, and I am not able to lift text properly from the attached file. Please let me know what else needs to be done for this sort of file.
I have posted several questions about Tesseract issues before, but none of them were answered, so I am hoping for some answers this time.

Team, please help!

Thanks,
Shobhit 
4b78d0b4-dc14-4d31-83b6-c84aac0ca327.PDF
LiftedText.txt

Shobhit Kapil

Apr 3, 2019, 10:41:06 AM
to tesseract-ocr
Correcting myself: the EngineMode is LstmOnly and the PageSegMode is SparseText.

Shree Devi Kumar

Apr 3, 2019, 12:30:06 PM
to tesser...@googlegroups.com
Why are you using PageSegMode: SparseText?

I get much better results from the command line with the default psm. See the attached outputs in the various available formats: txt, hocr, tsv, alto.
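The exact command is not shown in this thread. A minimal sketch of such a run, assuming the input image is named TownOfOrange.bmp and the output base is shobhit (both placeholders), and assuming a Tesseract build whose tessdata/configs directory includes the alto config, might look like this:

```shell
#!/bin/sh
# Placeholder names (assumptions, not taken from the actual command used).
input="TownOfOrange.bmp"
outbase="shobhit"

# --oem 1 selects the LSTM-only engine.
# --psm 3 is the default page segmentation mode (fully automatic, no OSD);
# it is spelled out here only to contrast with sparse text (--psm 11).
# The trailing words are config names; each one enables one output format.
cmd="tesseract $input $outbase --oem 1 --psm 3 txt hocr tsv alto"

# Run only when tesseract and the input file are present; otherwise print the command.
if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
    $cmd
else
    echo "would run: $cmd"
fi
```

Comparing the txt output of this run with a second run using --psm 11 on the same image is a quick way to see whether sparse text is really needed.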





shobhit.txt
shobhit.html
shobhit.tsv
shobhit.xml

Shobhit Kapil

Apr 5, 2019, 6:27:01 AM
to tesseract-ocr
Hi,

Thanks for the reply. Let me elaborate on what I am actually doing, so that you have a clear picture and can guide me properly.

I am using the Tesseract DLL on Windows, with the code below to set the engine mode and page segmentation mode:

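        private static TesseractEngine _engine;

        // Lazily creates one shared LSTM-only engine and recreates it if it has been disposed.
        private static TesseractEngine Engine
        {
            get
            {
                if (_engine == null || _engine.IsDisposed)
                {
                    try
                    {
                        // Resolve tessdata relative to the executing assembly, not the working directory.
                        string baseDir = Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().Location);
                        _engine = new TesseractEngine(Path.Combine(baseDir, "tessdata"), "eng", EngineMode.LstmOnly);
                    }
                    catch (Exception ex)
                    {
                        // Record the failure instead of silently discarding it.
                        System.Diagnostics.Trace.TraceError(ex.ToString());
                    }
                }
                return _engine;
            }
        }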

Then, later, I call the engine to read the text from the image:

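        private string LiftText(Bitmap img)
        {
            string resultText = string.Empty;
            try
            {
                // A separate engine is created here; the static Engine property above is not used.
                // The using blocks dispose the page and the engine even if Process or GetText throws.
                using (TesseractEngine engine = new TesseractEngine("./tessdata", "eng", EngineMode.LstmOnly))
                using (Tesseract.Page mypage = engine.Process(img, PageSegMode.SparseText))
                {
                    resultText = mypage.GetText();
                }
            }
            catch (Exception ex)
            {
                // Log the exception type and message together with the stack trace.
                Exceptioninfo("LiftText()", ex.ToString());
            }

            return resultText;
        }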



Shobhit Kapil

Apr 9, 2019, 7:45:07 AM
to tesseract-ocr
I am using Sparse Text because some lines in the image run side by side with spacing between them (the image is attached), which is why I am forced to use Sparse Text.

Please advise.



TownOfOrange.bmp

Shobhit Kapil

Apr 10, 2019, 1:10:42 PM
to tesseract-ocr
Hi Shree,

Please share your inputs.

Thanks,
Shobhit

