tesseract (OCR) for other languages

Yamen Saleh

Aug 18, 2020, 5:48:08 AM

to islandora-dev

Hello, when I upload an Arabic PDF the extracted text is empty; for English PDFs it works fine.

The strange thing is that when I upload an Arabic TIFF file, the extracted text is generated with no issues.

I tried the tesseract command line on the same PDF file I had uploaded, and it works when I specify the lang parameter.
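
For reference, the command-line check described above might look roughly like the following Python sketch. It assumes the `tesseract` binary and the Arabic language data (`ara`) are installed; the helper names and the file name `page.tif` are hypothetical. `-l` is tesseract's language option, and passing `stdout` as the output base prints the recognized text instead of writing a file.

```python
import subprocess


def build_tesseract_cmd(image_path, lang="ara", executable="tesseract"):
    """Build a tesseract command that prints OCR text for one image to stdout."""
    # Without -l, tesseract uses its default language (English), which is a
    # plausible reason for empty or garbled output on Arabic input.
    return [executable, image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="ara"):
    """Run tesseract and return the recognized text (needs tesseract installed)."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

Calling `ocr_image("page.tif")` would then return the Arabic text, provided the language data is available on the machine.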

Any idea how we can solve this?

One more thing: I need to create a microservice for audio-to-text. Has anyone accomplished that and can guide me on how to create the service, or is there a module for this that I can use?

Thank you and best regards

Tristan Chambers

Aug 18, 2020, 9:13:37 AM

to island...@googlegroups.com

Hey Yamen,

The solution probably depends on the solution pack. Which content
model were you using when you ingested the PDF?

Best regards,

Tristan

> --
> You received this message because you are subscribed to the Google Groups
> "islandora-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to islandora-de...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/islandora-dev/0894b0fd-0fa4-4b81-9987-15e9815e513bn%40googlegroups.com.
>

--
Digital Library Applications Administrator
Smith College Libraries
Northampton, MA - USA
Pronouns: he / him / his

Danny Lamb

Aug 18, 2020, 12:00:01 PM

to islandora-dev

Hi Yamen,

In Islandora 8, there are two different tools used to extract text: Tesseract for images and pdftotext for PDFs. It must be a setting in pdftotext. If you need additional command-line arguments passed along to the microservice, we do that for other scenarios and it can easily be added.

--

- Daniel Lamb

Tech Lead

Islandora Foundation

http://islandora.ca

Seth Shaw

Aug 18, 2020, 12:12:11 PM

to island...@googlegroups.com

Hypercube doesn't really have any configuration options, and the current code doesn't allow adding arguments for pdftotext vs. tesseract.

If I'm reading the documentation correctly, the simplest fix would be to change https://github.com/Islandora/Crayfish/blob/dev/Hypercube/src/Controller/HypercubeController.php#L81 to be something like `$cmd_string = $this->pdftotext_executable . " $args -enc UTF-8 - -";`
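
To try the same invocation outside Hypercube, here is a minimal Python sketch that mirrors the suggested command string. It is not the Crayfish code itself; the helper names are hypothetical, and it assumes a `pdftotext` build that accepts `-` for both input and output, as the command above does.

```python
import shlex
import subprocess


def build_pdftotext_cmd(args="", executable="pdftotext"):
    """Mirror the suggested command string: `<exe> <args> -enc UTF-8 - -`."""
    # shlex.split turns an extra-argument string into separate argv entries.
    # The two trailing dashes mean: read the PDF from stdin, write text to stdout.
    return [executable, *shlex.split(args), "-enc", "UTF-8", "-", "-"]


def extract_text(pdf_bytes, args=""):
    """Pipe PDF bytes through pdftotext and return the text (needs pdftotext installed)."""
    result = subprocess.run(
        build_pdftotext_cmd(args),
        input=pdf_bytes,
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

Passing the command as a list rather than a single string avoids shell quoting problems if extra arguments are ever made configurable.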

seth...@unlv.edu

Aug 18, 2020, 1:18:26 PM

to islandora-dev

I've made a PR to make pdftotext output UTF-8, https://github.com/Islandora/Crayfish/pull/103, which I believe is a reasonable setting. We can address making a better action configuration form later.

Yamen Saleh

Aug 18, 2020, 1:20:41 PM

to island...@googlegroups.com

Thank you very much. I will test it tomorrow and report back.

Thanks

--

Yamen Saleh

Danny Lamb

Aug 18, 2020, 1:27:45 PM

to islandora-dev

That's fantastic Seth 🙇 If baking that in yields something more robust than what we have now, we can deal with adding args in the action later (if ever).

Yamen Saleh

Aug 19, 2020, 5:32:57 AM

to islandora-dev

Hello Seth,

I have applied the change on my local installation and uploaded my file again, but for that particular file it didn't work. Then I took the file that was attached to the pull request to test, and it worked.

Maybe the issue is with my file, because it uses some kind of old artistic Arabic font. I will upload the file here as an attachment for you to have a look.

Thank you and best regards

1-5 Pages.pdf

Seth Shaw

Aug 19, 2020, 2:18:23 PM

to island...@googlegroups.com

Yamen,

I took a look at the PDF and it appears to be a scanned document without embedded text. Currently Islandora can't OCR an image-only PDF (although we should certainly consider that use case), but there are other tools that can, such as http://www.sakhr.com/index.php/en/solutions/ocr and https://www.rdi-eg.com. I would use one of those tools (or another) to generate a new PDF with the Arabic OCR text embedded and then upload that to Islandora, which can then pull out the Arabic text for indexing.
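
One rough way to spot an image-only PDF before ingesting it is to look at what pdftotext returns. The Python sketch below is a heuristic, not something Islandora does today; the function name and the `min_chars` threshold are hypothetical. It assumes the text has already been extracted (for example with `pdftotext -enc UTF-8`).

```python
def has_embedded_text(extracted, min_chars=1):
    """Return True if the extracted text contains at least min_chars visible characters."""
    # For pages without a text layer, the output is typically just whitespace
    # and form feeds (page breaks), so drop those before counting.
    visible = "".join(ch for ch in extracted if not ch.isspace())
    return len(visible) >= min_chars
```

A PDF for which this returns False is a candidate for OCR before upload.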

ezo...@asu.edu

Sep 25, 2020, 12:06:21 PM

to islandora-dev

Hi Yamen,

I ran into a similar issue where I had a PDF in Arabic. I got extracted text back from pdftotext, but it contains a large number of funky characters which Drupal is treating as �. Does that sound like an issue with the PDF itself or some setting I'm missing in pdftotext? Or even just Linux's handling of these non-Latin characters?
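
One way to narrow this down is to check whether the bytes coming out of pdftotext are valid UTF-8 at all. The Python sketch below is only a diagnostic aid with a hypothetical function name. A ratio of zero suggests the bytes are valid UTF-8 and the odd characters come from the PDF's own text layer or a later step; a high ratio suggests the output is in another encoding.

```python
def replacement_ratio(data):
    """Fraction of characters that are U+FFFD after decoding data as UTF-8."""
    # errors="replace" substitutes U+FFFD for every byte sequence that is not
    # valid UTF-8. A U+FFFD already present in the input is counted as well.
    text = data.decode("utf-8", errors="replace")
    if not text:
        return 0.0
    return text.count("\ufffd") / len(text)
```

For example, Arabic text encoded as UTF-8 gives 0.0, while the same text encoded as Windows-1256 gives a ratio above zero.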

Thanks,

Eli

Yamen Saleh

Sep 28, 2020, 4:24:01 AM

to islandora-dev

The issue occurred for me with scanned Arabic PDFs. I had to convert them to TIFF and then extract the text.
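
The message does not say which tool did the conversion. As one possible sketch, the Python below assumes poppler's `pdftoppm` for the PDF-to-TIFF step and `tesseract` with Arabic language data for the OCR step; all function names, the 300 dpi default, and the file prefix are hypothetical.

```python
import glob
import subprocess


def build_pdf_to_tiff_cmd(pdf_path, output_prefix, dpi=300, executable="pdftoppm"):
    """Build a pdftoppm command that renders each PDF page to a TIFF image."""
    return [executable, "-tiff", "-r", str(dpi), pdf_path, output_prefix]


def build_tesseract_cmd(image_path, lang="ara", executable="tesseract"):
    """Build a tesseract command that prints OCR text for one image to stdout."""
    return [executable, image_path, "stdout", "-l", lang]


def ocr_scanned_pdf(pdf_path, output_prefix="page", lang="ara"):
    """Render a scanned PDF to TIFFs, OCR each page, and return the joined text."""
    subprocess.run(build_pdf_to_tiff_cmd(pdf_path, output_prefix), check=True)
    pages = []
    # Assumption: page images are named "<prefix>-<number>.tif" with page
    # numbers zero-padded, so a plain sort keeps them in page order.
    for image in sorted(glob.glob(output_prefix + "-*.tif*")):
        result = subprocess.run(
            build_tesseract_cmd(image, lang), capture_output=True, check=True
        )
        pages.append(result.stdout.decode("utf-8"))
    return "\n".join(pages)
```

The per-page TIFFs could also be uploaded to Islandora directly, since TIFF uploads already produced extracted text earlier in this thread.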

Regards
