Convert an imagebook to a textbook (perhaps using OCR?)


Rudolph Rhein

Aug 25, 2023, 9:04:23 PM
My sister's next-month Great Books is Noel Coward's play comedy named
"Private Lives" from the 1930s. She's almost blind from complications.

She is not technical and only has an iPad and an iPhone. I have Android
& Windows, but she asked me to help her with iOS text-to-speech anyway.

She sent me the link to the PDF because text-to-speech won't read it out.
<https://ia801404.us.archive.org/12/items/in.ernet.dli.2015.210130/2015.210130.Private-Lives.pdf>

Looking at that PDF, it seems to be not a "textpdf" (whatever you'd call
it) but just a set of scanned images of the book (with no actual text).

I tried converting that PDF with Calibre on Windows to an EPUB format,
but the EPUB was nothing more than a set of the same images in a file.

What's a good way for me to convert that "imagebook" (whatever you call it)
to a "textbook" so that I can send it to her to use TTS on her iPad?

Eli the Bearded

Aug 25, 2023, 10:00:52 PM
Follow-ups set to comp.text.pdf.

In rec.photo.digital, Rudolph Rhein <Rudolp...@nospam.net> wrote:
> My sister's next-month Great Books is Noel Coward's play comedy named
> "Private Lives" from the 1930s. She's almost blind from complications.
...
> <https://ia801404.us.archive.org/12/items/in.ernet.dli.2015.210130/2015.210130.Private-Lives.pdf>
> Looking at that PDF, it seems to be not a "textpdf" (whatever you'd call
> it) but just a set of scanned images of the book (with no actual text).

It's archive.org. They have documents in multiple formats already.

https://archive.org/details/in.ernet.dli.2015.210130

DOWNLOAD OPTIONS
* ABBYY GZ download
* DAISY download For print-disabled users
* EPUB download
* FULL TEXT download
* ITEM TILE download
* KINDLE download
* PDF download
* PDF WITH TEXT download
* SINGLE PAGE PROCESSED JP2 ZIP

Their FULL TEXT and PDF WITH TEXT will be OCRed by them, so expect
typical OCR errors in it.

Elijah
------
does not know what all of the formats are

Paul

Aug 25, 2023, 11:29:15 PM
Noel Coward is a genius.

He picked the perfect font, to prevent OCR :-)

Italics font, with rough edges. The scanning team did a great job, but maybe
they should have tried OCR first, before cleanup.

*******

https://archive.org/stream/in.ernet.dli.2015.210130/2015.210130.Private-Lives_djvu.txt <=== try TTS on this

( https://archive.org/details/in.ernet.dli.2015.210130 )

Ocr ABBYY FineReader 11.0
Ppi 600 <=== Didn't look like 600 to me...

Each scanned page is 2800 x 4000 pixels, so it would
depend on the size of the printed page, as to whether
600 is true or not.
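
That check is plain arithmetic: printed size in inches is pixels divided by ppi. A minimal sketch, assuming the 2800 x 4000 figure above; the function name and the 300 ppi comparison value are only illustrative and do not come from the archive.org metadata.

```python
def page_size_inches(width_px, height_px, ppi):
    """Return the physical (width, height) in inches implied by a scan resolution."""
    return width_px / ppi, height_px / ppi

# Compare the claimed 600 ppi against a hypothetical 300 ppi scan.
for ppi in (300, 600):
    w, h = page_size_inches(2800, 4000, ppi)
    print(f"{ppi} ppi -> {w:.2f} x {h:.2f} in")
```

At 600 ppi the page works out to about 4.67 x 6.67 inches, and at 300 ppi to about 9.33 x 13.33 inches, so the 600 claim is only true if the printed book is a fairly small one.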

Windows apparently has an OCR library. Fat lot of good that does me.

https://blogs.windows.com/windowsdeveloper/2016/02/08/optical-character-recognition-ocr-for-windows-10/

If you watch how the OCR in the old Acrobat Distiller package
used to work, first it does layout analysis. It recognizes text columns
in a three-column layout. Then, it selects lines of text (pixmap sections)
and does OCR on them, and it associates the text with the column.

The Microsoft OCR library, at a guess, does not do layout analysis. It
takes whatever pixmap section you feed it, and makes a line of text
(with little or no punctuation or layout info). This is probably why
the sample image they fed it had only one line of text in it: with a
single line, the output looks the same whether or not a layout engine
is present. If the image had had just two lines of text, you would see
what its capabilities actually are.
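
For anyone wondering what "layout analysis" means in practice, here is a toy sketch of one classic approach, a vertical projection profile: count the ink in each pixel column and split the page wherever there is a wide enough run of empty columns. This is not a description of how Distiller or the Microsoft library work internally (that isn't known here); the names `find_columns` and `min_gap` are made up for the example.

```python
def find_columns(bitmap, min_gap=2):
    """Split a binary page image into text columns.

    bitmap: list of rows, each a list of 0 (white) or 1 (ink).
    Returns a list of (start, end) x-ranges, end exclusive.
    """
    width = len(bitmap[0])
    # Vertical projection: amount of ink in each pixel column.
    profile = [sum(row[x] for row in bitmap) for x in range(width)]

    columns, start, gap = [], None, 0
    for x, ink in enumerate(profile):
        if ink:
            if start is None:
                start = x          # first inked column of a new text column
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:     # gutter is wide enough: close the column
                columns.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:          # page ended while still inside a column
        columns.append((start, width - gap))
    return columns

# Two tiny "text columns" separated by a two-pixel gutter.
rows = ["11100111",
        "10100101"]
page = [[int(c) for c in r] for r in rows]
print(find_columns(page))  # [(0, 3), (5, 8)]
```

Each x-range would then be cut into lines the same way (a horizontal profile) before any character recognition runs. A recognizer without this step simply sees one big pixmap.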

I could easily feed the sample through some package running
Tesseract, but we all know how that will turn out.

Paul

Rudolph Rhein

Aug 26, 2023, 2:36:07 AM
Eli the Bearded <*@eli.users.panix.com> wrote:

> It's archive.org. They have documents in multiple formats already.

How the heck did you know that?

> https://archive.org/details/in.ernet.dli.2015.210130

That's a much better link (to send to the other Great Bookers!).

> DOWNLOAD OPTIONS
> * ABBYY GZ download
> * DAISY download For print-disabled users
> * EPUB download
> * FULL TEXT download

Even though I was aiming for a PDF, "full text" seems like the most
native input for a text-to-speech program, wouldn't you think?

> * ITEM TILE download
> * KINDLE download
> * PDF download
> * PDF WITH TEXT download
> * SINGLE PAGE PROCESSED JP2 ZIP

Usually I'm comfortable starting with an EPUB or Kindle for conversion.
But what's the difference between "PDF" and "PDF with text" anyway?

> Their FULL TEXT and PDF WITH TEXT will be OCRed by them, so expect
> typical OCR errors in it.

How do you know that?
Are you saying the EPUB/Kindle are the most faithful then?

> Elijah
> ------
> does not know what all of the formats are

Kindle:
<https://archive.org/download/in.ernet.dli.2015.210130/2015.210130.Private-Lives.mobi>

EPUB:
<https://archive.org/download/in.ernet.dli.2015.210130/2015.210130.Private-Lives.epub>
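
For the other Great Bookers, the links in this thread all seem to follow one pattern. A small sketch that rebuilds them from the item identifier and the file name stem; the pattern is inferred only from the URLs quoted here, so other archive.org items may well name their files differently, and `item_urls` is just an illustrative helper.

```python
BASE = "https://archive.org"

def item_urls(identifier, stem):
    """Build the URLs quoted in this thread for one archive.org item."""
    download = f"{BASE}/download/{identifier}/{stem}"
    return {
        "details": f"{BASE}/details/{identifier}",
        "epub": f"{download}.epub",
        "kindle": f"{download}.mobi",
        # Assumption: the OCR text lives under /stream/ with a _djvu.txt suffix,
        # as in the link Paul posted.
        "full_text": f"{BASE}/stream/{identifier}/{stem}_djvu.txt",
    }

urls = item_urls("in.ernet.dli.2015.210130", "2015.210130.Private-Lives")
for name, url in urls.items():
    print(f"{name:10} {url}")
```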

I opened that EPUB file in the Windows Calibre program.
It had a mixture of mostly text, but some scanned pages.

The disclaimer at the beginning said:
"This book was produced in EPUB format by the Internet
Archive. The book pages were scanned and converted to EPUB
format automatically. This process relies on optical
character recognition, and is somewhat susceptible to
errors. The book may not offer the correct reading
sequence, and there may be weird characters, nonwords, and incorrect
guesses at structure. Some page numbers and headers or footers may remain
from the scanned page. The process which identifies images might have found
stray marks on the page which are not actually images from the book. The
hidden page numbering which may be available to your ereader corresponds to
the numbered pages in the print edition, but is not an exact match; page
numbers will increment at the same rate as the corresponding print edition,
but we may have started numbering before the print book's visible page
numbers. The Internet Archive is working to improve the scanning process
and resulting books, but in the meantime, we hope that this book will be
useful to you."
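
Since stray page numbers and broken words are exactly what trips up a TTS reader, the full-text version could be tidied before sending it on. A rough sketch with heuristic rules only: it assumes page numbers sit alone on a line and that a hyphen at a line end is always a split word, which will occasionally merge a genuine compound such as "well-known". The function name is invented for the example.

```python
import re

def clean_ocr_text(text):
    """Tidy OCR output so a TTS reader stumbles less. Heuristic only."""
    # Drop lines that contain nothing but a page number.
    lines = [ln for ln in text.splitlines()
             if not re.fullmatch(r"\s*\d+\s*", ln)]
    text = "\n".join(lines)
    # Rejoin words hyphenated across a line break: "mo-\nment" -> "moment".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Unwrap single line breaks inside a paragraph; keep blank-line breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    return re.sub(r"[ \t]+", " ", text).strip()

sample = "A mo-\nment of\nsilence.\n12\n\nNext   scene."
print(clean_ocr_text(sample))
```

Misrecognized letters still need a human eye (or a spell checker); this only removes the mechanical debris.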

Using Calibre, I converted that 271KB EPUB into a 625KB PDF file instead.
Unlike before, the font is a normal font now, and it seems to be PDF text.

I think, thanks to you, that the mission was accomplished.
But I'll only know later when her iPad reads that PDF out as text.

Stan Brown

Aug 26, 2023, 11:50:43 AM
On Sat, 26 Aug 2023 09:37:00 +0300, Rudolph Rhein wrote:
> Usually I'm comfortable starting with an EPUB or Kindle for conversion.
> But what's the difference between "PDF" and "PDF with text" anyway?


The text is a second "layer". PDF-Xchange, among others, can OCR the
images and create that layer. The quality of the text rendering is
_highly_ dependent on the quality of the images.

--
Stan Brown, Tehachapi, California, USA https://BrownMath.com/
Shikata ga nai...