AviChai-Shabat.fixed.pdf

YZahn

Nov 30, 2017, 1:42:02 PM
to opensiddur-tech
Hi,
I saw your request for help re-encoding the AviChai Siddur, and the challenge intrigued me. Here are the results, attached. I will do the other volume when I have a chance...

AviChai-Shabat.fixed.pdf

Aharon Varady

Nov 30, 2017, 1:43:41 PM
to Open Siddur Technical Discussion List
Thank you so much Yossi. Could you tell us more about how you fixed it?

--
You received this message because you are subscribed to the Google Groups "opensiddur-tech" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opensiddur-tech+unsubscribe@googlegroups.com.
To post to this group, send email to opensiddur-tech@googlegroups.com.
Visit this group at https://groups.google.com/group/opensiddur-tech.
For more options, visit https://groups.google.com/d/optout.



--
Aharon Varady, M.C.P., M.A.J.Ed.
Community Planner, Educator
Pronouns: He/him/his

Daniel Szymanski

Nov 30, 2017, 2:39:41 PM
to opensid...@googlegroups.com
I would be very much interested in how you went about this. My idea was to find a PDF parser, find the Font descriptions, substitute open source fonts, and write a new PDF. I was unsuccessful finding such a program, so I started to write one. This was a pretty rewarding experience, as I got to debug my way into a partial solution with the PDF of interest in one hand and a PDF specification in the other. Things got messy when most of the interesting text-related structures turned out to be compressed using an algorithm with which I was unfamiliar. Eventually the roadblocks got me down. For a while, I thought I could do something with PDFBox (?), but a little tip-toeing into the wonderful world of Linux scared me off.
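
For anyone who hits the same wall: the thread does not say which compression this particular PDF uses, but the most common stream filter in PDFs is FlateDecode, which is ordinary zlib/deflate. Below is a minimal sketch under that assumption. The function name `inflate_streams` is made up for illustration, and the regex shortcut ignores the `/Length` entry that a real parser would rely on, so treat it as a debugging aid rather than a proper PDF reader.

```python
import re
import zlib

# Simplification: capture everything between "stream" and "endstream".
# A real parser would use the stream dictionary's /Length value instead.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decompressed bodies of all streams that are valid zlib data.

    Streams using other filters (or no filter at all) are skipped.
    """
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        decompressor = zlib.decompressobj()
        try:
            body = decompressor.decompress(raw)
        except zlib.error:
            continue  # not zlib data, e.g. an image or an unfiltered stream
        # Trailing end-of-line bytes land in unused_data and are ignored;
        # eof is False if the zlib data was cut short.
        if decompressor.eof:
            results.append(body)
    return results
```

Calling `inflate_streams(open("some.pdf", "rb").read())` (the file name is a placeholder) returns a list of byte strings; page content streams and embedded CMaps show up as readable text once inflated.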

Ps. If you have some control over equivalent representations in PDF, it is possible to write the file in such a way that the first page gets loaded quickly, so the user has something to look at while the remaining 200+ pages get read in.

Daniel Szymanski

Dec 1, 2017, 5:53:43 AM
to opensid...@googlegroups.com
Let me offer you hearty congratulations on a job well done.  Although I do not understand any of the text, the document is a thing of beauty.  May you thoroughly enjoy bringing this task to a successful conclusion.

YZahn

Dec 2, 2017, 4:48:16 PM
to opensiddur-tech
OK, for those interested... First of all, the issue of wrongly encoded PDFs has been bugging me for a (very) long time. The core of the problem lies in the way text is encoded in PDFs. Whereas regular file formats contain the character codes of the text, a PDF contains only glyph IDs, which refer directly to glyphs in the embedded font. These have no inherent relationship to the underlying character codes. A well-formed PDF will contain, for each embedded font, a mapping of glyphs to Unicode characters, known as the 'ToUnicode' map. The issue is that many (older) Hebrew fonts are not encoded using Unicode but use their own encoding scheme, and different font vendors have used different encodings for Hebrew. The program creating the document must be aware of the encoding and can, in theory, create a correct mapping. If, however, a virtual printer is used, it has no way of knowing that the encoding is not Unicode and will create an incorrect 'ToUnicode' entry for the font.
The solution is to replace this mapping with a correct one. Originally I planned on doing it manually, but some intensive googling found me some programs that can help. First, there is "axesPDF QuickFix", which recently added support for correcting 'ToUnicode' maps (https://www.axes4.com/axespdf-quickfix-features.html). There is also Infix Pro (http://www.iceni.com/infix.htm), which has (also only recently) added that feature. Another helpful tool is the PDF debugger that is part of the PDFBox project; it shows a graphical representation of the mapping but cannot edit it.
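
To make the idea of a replacement mapping concrete, here is a rough sketch of what generating a corrected 'ToUnicode' CMap could look like. This is not what the tools above do internally (I don't know their internals), and the glyph codes in the example are invented: the real codes depend on the font's legacy encoding and have to be worked out per font. The function name `build_tounicode_cmap` is hypothetical, and the sketch assumes a font addressed with 2-byte codes.

```python
def build_tounicode_cmap(code_to_text):
    """Return the text of a ToUnicode CMap for a font with 2-byte codes.

    code_to_text maps an integer code (0..0xFFFF) to the Unicode string
    that the corresponding glyph should extract as.
    """
    for code in code_to_text:
        if not 0 <= code <= 0xFFFF:
            raise ValueError(f"code out of 2-byte range: {code}")

    def hex_utf16(text):
        # ToUnicode destination values are written as UTF-16BE hex strings.
        return text.encode("utf-16-be").hex().upper()

    items = sorted(code_to_text.items())
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    # A single bfchar section may hold at most 100 entries, so chunk them.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            lines.append(f"<{code:04X}> <{hex_utf16(text)}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


# Invented example: pretend glyph codes 3 and 4 draw alef and bet.
example_mapping = {0x0003: "\u05d0", 0x0004: "\u05d1"}
print(build_tounicode_cmap(example_mapping))
```

The resulting text would then be stored as the stream that the font's /ToUnicode entry points to, replacing the incorrect one. Writing it back into the PDF is the part the tools mentioned above take care of.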

On Thursday, November 30, 2017 at 9:39:41 PM UTC+2, Daniel Szymanski wrote:
> I would be very much interested in how you went about this. My idea was to find a PDF parser, find the Font descriptions, substitute open source fonts, and write a new PDF.

I doubt that would have been successful, given that the original Unicode text is lost.

> I was unsuccessful finding such a program, so I started to write one. This was a pretty rewarding experience, as I got to debug my way into a partial solution with the PDF of interest in one hand and a PDF specification in the other. Things got messy when most of the interesting text-related structures turned out to be compressed using an algorithm with which I was unfamiliar.
> Eventually the roadblocks got me down. For a while, I thought I could do something with PDFBox (?), but a little tip-toeing into the wonderful world of Linux scared me off.

PDFBox is written in Java and is therefore cross-platform; it should work on Windows as well. There are many libraries available for parsing PDF.

> Ps. If you have some control over equivalent representations in PDF, it is possible to write the file in such a way that the first page gets loaded quickly, so the user has something to look at while the remaining 200+ pages get read in.

Do you mean when viewing locally or from the web? If you are viewing locally, the file format is already optimized for that. For viewing from the web, the PDF must be linearized (AKA Fast Web View). This can be done using the free Adobe Reader: under Edit->Preferences->Document, enable "Save As optimizes for Fast Web View", and then use Save As.
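
If you want to check whether a given file has already been linearized, a quick heuristic is to look for the linearization dictionary, which the PDF specification requires to sit within the first 1024 bytes of the file. A minimal sketch (the function name `is_linearized` is made up, and this only detects the marker; it does not validate the hint tables):

```python
import re


def is_linearized(pdf_bytes):
    """Heuristic check for a /Linearized entry near the start of the file."""
    head = pdf_bytes[:1024]
    # The linearization parameter dictionary looks like
    # << /Linearized 1 /L ... >> and must appear in the first 1024 bytes.
    return re.search(rb"/Linearized\s+\d", head) is not None
```

Usage would be along the lines of `is_linearized(open("some.pdf", "rb").read(1024))`, where the file name is a placeholder.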
