produce delimited output using hOCR or by preserving original document spacing


Maureen Kole

Oct 6, 2014, 8:57:26 PM
to tesser...@googlegroups.com
Hello,

Thank you Tesseract developers! I really appreciate your work. I am running:

Ubuntu 14.04 LTS
Tesseract 3.03.03
Leptonica 1.70

I would like to import data stored in tables into R as a dataframe. This will be easiest if the output produced by Tesseract is a delimited file. It is not clear to me if using the hOCR option can produce a delimited file. If yes, how would one do this for my version? The other option I am looking at is preserving multiple spaces in the output txt file and using multiple spaces for the delimiter. If this is possible, how would one do this for my version?

I attached the PNG (ndomprod93), the output Tesseract produced without the hOCR option (out1), and the output it produced with the hOCR option (out2).

Cheers and kindly,
Maureen
out1.txt
ndomprod93.png.zip
out2.hocr

Andrew Defries

Oct 7, 2014, 1:34:36 PM
to tesser...@googlegroups.com
Hello,

For an R solution to importing text for text mining, use the tm package.

Check out lines 6-11 in the following repo:


Using tm you can import text and perform some operations:

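library(tm)  # stemDocument also needs the SnowballC package installed
# Assumption: the corpus is built from the plain-text OCR output, one document per line
MyCorpus <- Corpus(VectorSource(readLines("out1.txt")))
MyCorpus <- tm_map(MyCorpus, content_transformer(tolower))  # bare tolower breaks corpora in tm >= 0.6
MyCorpus <- tm_map(MyCorpus, stemDocument, language = "english")
MyCorpus <- tm_map(MyCorpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(MyCorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))
dtm <- removeSparseTerms(dtm, 0.1)  # or 0.2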


Also you can import text without a package like so:

LoadMe<-readLines("out2.txt")

#split each line on runs of non-word characters (spaces and punctuation)
wordList<-strsplit(LoadMe, "\\W+", perl=TRUE)

#change into vector from list
wordVector<-unlist(wordList)

Sven Pedersen

Oct 8, 2014, 8:16:02 AM
to tesser...@googlegroups.com
You should look at Tesseract's different page segmentation modes (PSM). Your data is in a table, so it needs to be processed differently from running text. hOCR output is HTML, so it cannot be used directly as CSV, but it does include per-word confidence values, so if you want to evaluate those and produce a CSV from them, you can.
--Sven
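
As a rough illustration of turning hOCR into a delimited file, here is a minimal Python sketch. It assumes the hOCR file marks each word with a span of class ocrx_word whose title attribute holds "bbox x0 y0 x1 y1" and "x_wconf N"; check this against your own out2.hocr. The names hocr_to_rows and rows_to_csv are made up for this example and are not part of Tesseract.

```python
import csv
import io
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WCONF_RE = re.compile(r"x_wconf (\d+)")
FIELDS = ["text", "x0", "y0", "x1", "y1", "conf"]


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one dict per finished word
        self._title = None    # title attribute of the word being read, if any
        self._chunks = []     # text fragments of the word being read

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "span" and "ocrx_word" in (attr.get("class") or "").split():
            self._title = attr.get("title") or ""
            self._chunks = []

    def handle_data(self, data):
        if self._title is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._title is None:
            return
        bbox = BBOX_RE.search(self._title)
        conf = WCONF_RE.search(self._title)
        coords = [int(n) for n in bbox.groups()] if bbox else [None] * 4
        self.rows.append(dict(zip(FIELDS, [
            "".join(self._chunks).strip(), *coords,
            int(conf.group(1)) if conf else None,
        ])))
        self._title = None


def hocr_to_rows(hocr_text):
    """Return a list of dicts, one per recognised word (hypothetical helper)."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialise the word dicts as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Feeding the contents of out2.hocr to hocr_to_rows and writing the result of rows_to_csv to a file gives one line per word, which R can load with read.csv. This yields a word list with positions, not yet the original table layout.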

Maureen Kole

Oct 13, 2014, 7:27:23 PM
to tesser...@googlegroups.com
Andrew,

I apologize for my delayed response. I just saw your post. Thank you for your response. I am still working on this issue. I will run your code and then post the results.


Cheers,
Maureen

Maureen Kole

Oct 13, 2014, 7:33:11 PM
to tesser...@googlegroups.com
Sven,

I apologize for my delayed response. I just saw your post. Thank you for your response. As I said in my post to Andrew, I am still working on this issue.

I looked into the PSM modes before posting my question here and found this page useful for its description of the PSM options:
https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html

Have you used any of these options to produce delimited output, CSV or otherwise?

Cheers,
Maureen

Sven Pedersen

Oct 14, 2014, 11:14:32 AM
to tesser...@googlegroups.com
Hi Maureen,
I generally use PSM 4 or 3. Tesseract cannot actually produce a CSV (or any other delimited file), but you can get a clean text file and make a CSV from that with a little editing. To create a CSV in an automated fashion you'd have to write custom code and use the API.

http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data
--Sven
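
The "clean text file to CSV" step could be automated along these lines. This is only a sketch: it assumes the text output keeps at least two spaces between table columns and single spaces inside a cell, which this thread does not confirm for Tesseract 3.03. The function name spaced_text_to_csv and the min_gap parameter are invented for the example.

```python
import csv
import io
import re


def spaced_text_to_csv(text, min_gap=2):
    """Split each non-blank line on runs of >= min_gap spaces and emit CSV.

    Assumption: columns are separated by at least min_gap spaces, while
    words inside one cell are separated by fewer.
    """
    gap = re.compile(r" {%d,}" % min_gap)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        writer.writerow(gap.split(line))
    return out.getvalue()
```

If the spacing in out1.txt has been collapsed to single spaces, this approach cannot recover the columns, and the word positions in the hOCR output are the better starting point.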


zdenko podobny

Oct 14, 2014, 11:32:37 AM
to tesser...@googlegroups.com
Just a hint: there is a fork that tries to output hOCR details to a TSV file [1].
I have not tested it :-), so I have no clue whether it fits the original request...


Zdenko
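
For the table case, word bounding boxes from hOCR can in principle be regrouped into rows and columns. The sketch below is not the fork's method, just one simple heuristic: words whose vertical centres are close share a row, and a large horizontal gap starts a new column. The function name words_to_tsv and the pixel thresholds row_tol and col_gap are hypothetical and would need tuning for a given scan.

```python
def words_to_tsv(words, row_tol=10, col_gap=40):
    """Rebuild a tab-delimited table from word boxes.

    words: iterable of (text, x0, y0, x1, y1) tuples in pixels.
    row_tol: max distance between vertical centres within one row (assumed).
    col_gap: min horizontal gap that separates two columns (assumed).
    """
    rows = []  # each row: {"y": centre of its first word, "words": [...]}
    for word in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        y_mid = (word[2] + word[4]) / 2
        if rows and abs(y_mid - rows[-1]["y"]) <= row_tol:
            rows[-1]["words"].append(word)
        else:
            rows.append({"y": y_mid, "words": [word]})

    lines = []
    for row in rows:
        cells = []
        prev_x1 = None
        # read the row left to right, merging close words into one cell
        for text, x0, _y0, x1, _y1 in sorted(row["words"], key=lambda w: w[1]):
            if prev_x1 is not None and x0 - prev_x1 < col_gap:
                cells[-1] += " " + text
            else:
                cells.append(text)
            prev_x1 = x1
        lines.append("\t".join(cells))
    return "".join(line + "\n" for line in lines)
```

A fixed gap threshold does not handle empty cells or skewed scans; aligning cells to column positions detected across all rows would be more robust.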

Rick Leir

Oct 17, 2014, 11:18:16 AM
to tesser...@googlegroups.com
If you like Perl you can parse values from the hOCR.  You will need to change this to suit:

use strict;
use warnings;
use HTML::TagParser;

# Write one "confidence word" line per hOCR span to $outStats and
# return the average word confidence and the number of words counted.
sub saveStats {
    my ( $outHcr, $outStats ) = @_;
    open( my $stfile, '>', $outStats ) or die "Cannot open $outStats: $!";

    # get just the x_wconf values from the hocr file:
    # write to a stats file with a wconf per line
    my $confsum   = 0;
    my $confcount = 0;

    my $html = HTML::TagParser->new($outHcr);
    my @list = $html->getElementsByTagName("span");
    foreach my $elem (@list) {
        my $innertext = $elem->innerText;

        # spans without an x_wconf value are reported as "none"
        my $titlevalue = $elem->getAttribute("title") // "";
        my $wconf = "none";
        if ( $titlevalue =~ /x_wconf ([0-9]+)/ ) {
            $wconf = $1;
            $confsum += $1;
            $confcount++;
        }
        print {$stfile} " $wconf $innertext \n";
    }

    # avoid divide by zero when no confidence values were found
    my $avg = $confcount ? $confsum / $confcount : 0;

    print {$stfile} " $avg average \n";
    close($stfile);
    return ( $avg, $confcount );
}