OCR for Syriac and Arabic


Gregory Kessel

Apr 21, 2017, 12:11:07 PM
to List Hugoye, NASCAS
Dear members of Hugoye and NASCAS,

I am sure this news is going to be important for many of you.
As you know, there have been several attempts to create OCR programs that can process the Syriac and Arabic scripts. It now seems that an effective solution has come from an unexpected quarter: Google Docs does it for both Syriac and Arabic!
Once you upload either an image or a PDF to Google Drive and then choose to open the file in Google Docs, you quickly get the recognised text! Information about applying this method to Arabic can be found online, but it also works for Syriac. The recognition can process Estrangelo and Serto, as well as some clearly written manuscripts, and it is able to recognise diacritical signs, punctuation and vocalisation!
It goes without saying that the method is not perfect and depends on the quality of the original material (and any formatting and footnotes are likely to get mixed up), but it offers a ready tool that can save time and money. Just try it out!

(Parenthetically, I should add that occasionally, and for no clear reason, the program fails to process some files, but it is definitely worth playing with.)
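
For anyone who would rather script the upload-and-open steps described above than click through them, here is a minimal sketch in Python. It assumes the Google Drive API v3 and the `google-api-python-client` package, and it assumes that asking Drive to store an upload as a Google Doc triggers the same recognition as opening the file in Google Docs by hand; none of this comes from the message itself. The function names and the `credentials` argument are hypothetical, and the optional `ocr_language` hint takes a two-letter ISO 639-1 code such as "ar" (Syriac has no such code, so leave it unset there).

```python
from pathlib import Path

# MIME type that asks Drive to convert the upload into a Google Doc.
GOOGLE_DOC_MIME = "application/vnd.google-apps.document"


def build_upload_metadata(path):
    """Return file metadata requesting conversion to a Google Doc."""
    return {"name": Path(path).stem, "mimeType": GOOGLE_DOC_MIME}


def ocr_via_drive(path, credentials, ocr_language=None):
    """Upload an image or PDF, let Drive convert it, and return plain text.

    `credentials` is assumed to be an authorised Google credentials object
    obtained elsewhere; obtaining it is outside the scope of this sketch.
    """
    # Imported lazily so the metadata helper works without the client library.
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    service = build("drive", "v3", credentials=credentials)
    request_args = {
        "body": build_upload_metadata(path),
        "media_body": MediaFileUpload(str(path)),
        "fields": "id",
    }
    if ocr_language:
        # Optional language hint for the OCR step (ISO 639-1 code).
        request_args["ocrLanguage"] = ocr_language

    created = service.files().create(**request_args).execute()

    # Export the converted document as plain text.
    exported = service.files().export(
        fileId=created["id"], mimeType="text/plain"
    ).execute()
    return exported.decode("utf-8") if isinstance(exported, bytes) else exported
```

Calling `ocr_via_drive("page_01.png", credentials)` would leave a converted document named "page_01" in the Drive account, so a batch run should expect to clean up afterwards.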

With best wishes
Grigory Kessel


------------------------
Dr Grigory Kessel
ERC Project “Transmission of Classical Scientific and Philosophical Literature from Greek into Syriac and Arabic”
Abteilung Byzanzforschung
Institut für Mittelalterforschung
Österreichische Akademie der Wissenschaften
Hollandstrasse 11-3/4, 1020 Wien
Tel.: +43-1-51581-3181
E-mail: grigory...@oeaw.ac.at


Florian Jäckel

Apr 21, 2017, 2:34:06 PM
to nas...@googlegroups.com, hugoy...@yahoogroups.com

Dear Dr. Kessel, dear all,

thank you for this helpful hint. May I add some further information:

(1) The software most likely working in the background of Google Docs here is Tesseract, whose development is also sponsored by Google. It is open source, though (more information here: https://en.wikipedia.org/wiki/Tesseract_(software)). The documentation on its usage and on trials for Arabic has been growing steadily, but I cannot find anything on Syriac.

(2) There is a tool for cropping multi-page PDF files, which might come in handy for processing the text and the footnotes separately, at least if the pages have a more or less similar layout: https://sourceforge.net/p/briss/wiki/Home/
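
As a rough illustration of point (1), Tesseract can also be run locally on a page image. The sketch below only wraps the `tesseract` command-line program from Python; it assumes the program is installed together with the `syr` and `ara` language data files, and it makes no claim about recognition quality for Syriac. The function names are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, languages=("syr", "ara")):
    """Build the command line for OCR on one image, printing text to stdout.

    Several languages are joined with "+", e.g. "syr+ara".
    """
    return ["tesseract", str(image_path), "stdout", "-l", "+".join(languages)]


def run_tesseract(image_path, languages=("syr", "ara")):
    """Run Tesseract on an image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("the tesseract program was not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, languages),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise if Tesseract reports an error
    )
    return result.stdout
```

Pages cropped beforehand, as suggested in point (2), could be passed to `run_tesseract` one image at a time, so that body text and footnotes end up in separate outputs.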

Looking forward to more development in that area!

Best

Florian Jäckel
-- 
This message is signed with my PGP key, which may be displayed
as an attachment "sig.asc".

RogerAkhrass

May 8, 2018, 2:40:25 PM
to North American Society for Christian Arabic Studies
Dear Grigory,
Do you have updates about OCR for Syriac?
It seems that Google Books is recognizing Syriac, even in old printed texts, but unfortunately Syriac is not among the languages that Google Docs OCR can automatically recognize, although you announced last year that it can (see: https://support.google.com/drive/answer/176692?co=GENIE.Platform%3DDesktop&hl=en).
Google Docs may convert only some letters of the PDF into Syriac (.docx), but not yet readable words or texts. Have you tried it successfully yourself?

Thank you,
Roger Akhrass

Gregory Kessel

May 15, 2018, 7:09:15 AM
to nas...@googlegroups.com
Dear Father Roger,

I discovered your message in the spam folder - so please forgive me for my belated reply.

I am afraid I can’t provide you with any news as I am not involved in this Google enterprise and simply follow the developments as a user.

I tried to recognise different Syriac texts with Google Docs, and in most cases it performed very well, but I couldn’t understand why the results vary so much.

Maybe somebody with a better understanding of the technical side of the process could help with that?

Sincerely
Grigory Kessel