Multiple language scans: defining accented characters for OCR


Z Z

Nov 13, 2019, 11:06:30 PM
to Paperwork
Hi there,

I'm testing out Paperwork and the OCR part of the program is really impressive, well done! I've been scanning some documents that mix French and English, and the only accent that is properly recognized is the lower-case acute accent (accent aigu) [é]; upper-case accented letters are not recognized either. All the other accents seem to be dropped, and are sometimes recognized as a period above the letter (especially the grave accent [à]). How can I get the OCR to properly recognize the other French accents?

I've tried going into the settings and changing the language, but English is the only language I have available. Is it possible to add new languages? I'm running Paperwork as a Flatpak and am not very familiar with how Flatpaks handle locales, languages, etc. Either way, for a mixed-language document, would choosing the language with the greater number of accented characters be the best way to capture the text?

Thanks for any information/help you may provide.

Kind Regards,
Z

Jerome Flesch

Nov 20, 2019, 4:11:41 AM
to paperw...@googlegroups.com, Z Z
Hello,


AFAIK, French includes all English characters, so you can just set the OCR
to French.
I'm not sure there is anything else you can do. In my experience
(I'm French myself), Tesseract has always had some issues with accents.
One solution might be to improve Tesseract's training, but that is quite
difficult and out of the scope of Paperwork.
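
For anyone who wants to test recognition outside Paperwork, here is a minimal sketch of calling the Tesseract command line directly with more than one language model. Tesseract's `-l` option accepts several language codes joined with `+` (for example `fra+eng`), provided the matching language data is installed. The function names and the `scan.png` path are hypothetical, and this is not how Paperwork itself invokes the OCR.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, languages):
    """Build the argument list for a Tesseract run (hypothetical helper)."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Several language models are combined with '+', e.g. "fra+eng".
    # "stdout" as the output base makes Tesseract print the text.
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def ocr_image(image_path, languages=("fra", "eng")):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, list(languages))
    result = subprocess.run(
        command, capture_output=True, text=True, encoding="utf-8", check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Prints the command only; call ocr_image("scan.png") to actually run it.
    print(build_tesseract_command("scan.png", ["fra", "eng"]))
```

Comparing the output for `["eng"]`, `["fra"]` and `["fra", "eng"]` on the same scan is one way to see whether the language choice alone fixes the missing accents.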

Anyway, when Paperwork indexes a document, it strips all the
accents. When you search for something, it also strips all the accents from
your query. So for searching, accents don't matter.
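
To illustrate why accents stop mattering once both sides are stripped, here is a minimal sketch of accent-insensitive matching using Python's standard `unicodedata` module. It is only an illustration of the idea; the function names are hypothetical and this is not Paperwork's actual indexing code.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks (accents, umlauts) from the text."""
    # NFD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query, document_text):
    """Accent-insensitive and case-insensitive substring search."""
    needle = strip_accents(query).casefold()
    haystack = strip_accents(document_text).casefold()
    return needle in haystack


if __name__ == "__main__":
    print(strip_accents("Déjà vu à l'été"))
    print(matches("ete", "L'été dernier"))
```

With this approach a document where the OCR produced "ete" instead of "été" is still found by either spelling of the query.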

Regarding languages in Flatpak: when installing an application, Flatpak
looks at the system locale and installs the corresponding files, but
nothing more. I don't know of any method to add additional locales.

If you need support for more languages, I would suggest installing
Paperwork "the manual way":
- Install Libinsane (https://doc.openpaper.work/libinsane/latest/libinsane/install.html)
- Install Paperwork (https://gitlab.gnome.org/World/OpenPaperwork/paperwork/blob/master/doc/install.debian.markdown#build-dependencies)
- You can then install all the tesseract-ocr-XXXX packages you need on
your system

The drawback is that it's much harder to uninstall cleanly.
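
After installing the tesseract-ocr-XXXX packages, one way to check which language models Tesseract can actually see is `tesseract --list-langs`. The sketch below wraps that command. It assumes each language is printed on its own line after a header line, and it reads both stdout and stderr since the stream used may differ between Tesseract versions. The function names are hypothetical.

```python
import re
import subprocess


def parse_language_list(output):
    """Extract language codes from the output of `tesseract --list-langs`."""
    languages = []
    for line in output.splitlines():
        line = line.strip()
        # Language codes look like "eng", "fra" or "chi_sim"; the header
        # line contains spaces and is skipped by this pattern.
        if re.fullmatch(r"[\w/-]+", line):
            languages.append(line)
    return languages


def installed_languages():
    """Ask the local Tesseract install which languages are available."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    # Assumption: the list may arrive on either stream depending on version.
    return parse_language_list(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    sample = "List of available languages (3):\neng\nfra\nosd\n"
    print(parse_language_list(sample))
```

If `fra` or `deu` is missing from the result, the corresponding language package is probably not installed or not on Tesseract's data path.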


Best regards,

Giacomo Catenazzi

Nov 25, 2019, 3:13:40 AM
to paperw...@googlegroups.com, Z Z
Hello,

This resolved Flatpak issue explains how to change/add languages: https://github.com/flatpak/flatpak/issues/1504

In any case, it would be nice if we could change the Tesseract training language inside OpenPaperwork.

I'm Swiss and have documents in various languages: German umlauts tend to be missed, while French is good. The problem with missing umlauts is that Tesseract may then find another match: ü often becomes ii or ij.

ciao
     cate


