OCR and furigana

Oroszlany Balazs

May 29, 2014, 1:39:47 AM

to hon...@googlegroups.com

Dear all,

Three years have passed since the last discussion of OCR programs on this list.

Partly because of a forthcoming project, partly because I would like to eliminate all those photocopied pages filling the shelves, I started looking for an OCR solution.

Since the majority of the text I have to deal with contains furigana, I was about to follow Alys Lindholm's advice and buy 読んde!!ココ.

The only problem is that Epson stopped supporting the program, and it has not been available since 2012.

読取革命, at least as of ver. 14, still does not support furigana.

Have you had any good experiences with OCR and furigana recently? Would ABBYY or Adobe Acrobat's OCR be able to handle it?

Thank you,

Balazs Oroszlany

Matthew Schlecht

May 29, 2014, 8:54:29 AM

to Honyaku

On Thu, May 29, 2014 at 1:39 AM, Oroszlany Balazs <rori...@gmail.com> wrote:

Have you had any good experiences with OCR and furigana recently? Would ABBYY or Adobe Acrobat's OCR be able to handle it?

Adobe's onboard OCR engine will pick up furigana, with accuracy proportional to the legibility, but it often attempts to integrate the furigana characters into the main lines of text. It requires some post-editing.

ReadIRIS/Asian picks up furigana and usually makes it into a separate line. Some realignment is necessary.

I don't know of any program that offers truly satisfactory handling of furigana.

Matthew Schlecht, PhD
Word Alchemy
Newark, DE, USA
wordalchemytranslation.com

John Stroman

May 29, 2014, 11:04:02 AM

to hon...@googlegroups.com

Matt,

I seldom get image PDFs anymore, but I was wondering how ABBYY and ReadIris compare with Adobe's onboard OCR engine in terms of accuracy and ease of use. Any recommendations?

John Stroman

----------------

--
You received this message because you are subscribed to the Google Groups "Honyaku E<>J translation list" group.

Matthew Schlecht

May 29, 2014, 11:23:56 AM

to Honyaku

On Thu, May 29, 2014 at 11:03 AM, John Stroman <stromana...@gmail.com> wrote:

Matt,
I seldom get image PDFs anymore, but I was wondering how ABBYY and ReadIris compare with Adobe's onboard OCR engine in terms of accuracy and ease of use. Any recommendations?

I don't have any experience with ABBYY.

When the legibility is good enough, the onboard PDF OCR engine works fine. Below a certain level, the accuracy drops off for the PDF engine, faster than for ReadIRIS.

I really like OmniPage for European languages because of the spellchecking and dictionary options. A few years ago OP (Nuance) added an engine for Asian languages, but the accuracy was inferior to ReadIRIS. I know that OmniPage came out with a new version within the last year, but I don't know if it includes any upgrade to the kanji/kana capability.

When you write that you seldom receive PDFs any longer, does that mean most of what you get is in DOC or RTF? I've noticed that some clients run whatever they get through OCR to generate what looks like an original in DOC or RTF. The kludgy formatting often gives it away, though: mid-sentence line or page breaks, an odd collection of font attributes, and the rendering of stamps into punctuation gibberish.

Matthew Schlecht, PhD

John Stroman

May 29, 2014, 11:55:57 AM

to hon...@googlegroups.com

Matt,

I used ABBYY on my previous XP computer and was relatively happy with its performance. I particularly liked the preconversion editing option, and it does pick up furigana.

The only thing I didn't like was that it tended to insert Chinese characters into the spots where it considered recognition to be poor. I was unable to tweak the system (non-Unicode programs set to Japanese, etc.) to make it use Japanese kanji automatically even though I had a Japanese font selected, and I had to make the corrections manually from the list of options provided.

I wrote that I seldom get "image PDFs" anymore. Almost all my documents are regular Word or text-extractable PDFs, so I usually don't have to mess with seals and other ugliness.

Thanks for your insight.

John

----------------

Oroszlany Balazs

Jun 2, 2014, 2:58:02 AM

to hon...@googlegroups.com

Dear all,

Thank you for the suggestions.

It is good to see that there are more options than just 読取革命.

Now I have to test them all.

Balazs Oroszlany

