I have a two-page PDF document with minimal formatting written in Farsi
(Persian), which uses an RTL Arabic script. I would like to extract the
text into a plain text file in whatever encoding -- UTF-8, UTF-16,
CP-1256, or ISO-8859-6 would be fine. I'm using KDE 3.2.2 on GNU/Linux.
I have access to Adobe Acrobat Reader and all the standard GhostScript
tools. Access to a machine running Windows 2000 could be arranged at some
inconvenience.
Opening the PDF in Acroread and selecting the text with the mouse or with
Edit->Select All and then pasting it into a Unicode-capable text editor
(Kate) doesn't seem to work. For example, for the title line I get the
following gibberish which doesn't translate to Farsi text in any of the
aforementioned encodings.
*'('.*F' Ė G1'( 1/ GĖF'Ė( 2004 '~H1' F'ED1'~ F'G, 13'13 1/ E3ĖD'Ė3H3 'Ė
Ė&'~H1' Ė1'/ GĖ'E13
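One possible reading of that gibberish, offered as a hypothesis and not something established in this thread: each printable ASCII character may be the low byte of a code point in the Arabic block (U+0600 to U+06FF), with the line stored in visual, left-to-right order. The sketch below tests that guess on the first token only. The function name is made up for illustration. Digits and the "Ė" characters are ambiguous under this reading, so it is a diagnostic aid, not a converter.

```python
def guess_arabic_from_low_bytes(garbled):
    """Hypothetical helper: treat each printable ASCII character as the low
    byte of a code point in the Arabic block (U+0600-U+06FF), then reverse
    the result to go from visual order to logical order."""
    restored = []
    for ch in garbled:
        code = ord(ch)
        if 0x21 <= code <= 0x7E:
            # Assumption: the high byte 0x06 was dropped somewhere on the way
            # to the clipboard. Real ASCII digits would be mangled by this.
            restored.append(chr(0x0600 + code))
        else:
            # Spaces and non-ASCII characters are passed through unchanged.
            restored.append(ch)
    return "".join(reversed(restored))


first_word = guess_arabic_from_low_bytes("*'('.*F'")
print([f"U+{ord(c):04X}" for c in first_word])
# -> ['U+0627', 'U+0646', 'U+062A', 'U+062E', 'U+0627', 'U+0628', 'U+0627', 'U+062A']
```

Under this guess the first token comes out as a run of ordinary Arabic letters and not noise, which at least suggests the text is being copied and then damaged in transit.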
I have also tried using pdf2ps | ps2ascii with no success. Any other
suggestions would be appreciated.
Regards,
Tristan
--
_
_V.-o Tristan Miller [en,(fr,de,ia)] >< Space is limited
/ |`-' -=-=-=-=-=-=-=-=-=-=-=-=-=-=-= <> In a haiku, so it's hard
(7_\\ http://www.nothingisreal.com/ >< To finish what you
In article <9969392.8...@ID-187157.News.Individual.NET>, Tristan
Miller wrote:
> I have a two-page PDF document with minimal formatting written in Farsi
> (Persian), which uses an RTL Arabic script. I would like to extract the
> text into a plain text file in whatever encoding -- UTF-8, UTF-16,
> CP-1256, or ISO-8859-6 would be fine. I'm using KDE 3.2.2 on GNU/Linux.
> I have access to Adobe Acrobat Reader and all the standard GhostScript
> tools. Access to a machine running Windows 2000 could be arranged at
> some inconvenience.
I had a friend of mine running Windows open the PDF in Acrobat, select the
text, and paste it into Microsoft Word. This worked, except that the
characters in each line were reversed. I suspect this has something to do
with the LTR/RTL text direction settings, but so far we've been unable to
correct the problem. Saving the resulting Word file as ISO-8859-6 text
and then using the "rev" Unix command doesn't reproduce the original
document -- probably something to do with different Arabic letterforms
being used for word-initial and word-terminal letters.
In the meantime, if anyone has further suggestions, please let me know!
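A minimal sketch of why a plain "rev" may fall short, assuming the pasted text is available as Unicode (for example saved as UTF-8 instead of ISO-8859-6) and that its letters were stored as positional presentation forms in visual order. Both assumptions are guesses about what Acrobat and Word produced. The function name is hypothetical.

```python
import unicodedata


def visual_to_logical(line):
    """Reverse one visually ordered line, then fold Arabic presentation
    forms (initial/medial/final/isolated shapes, ligatures) back to the
    base letters."""
    # Reverse first, so that a ligature such as lam-alef is expanded into
    # its letters in logical order by the normalization step.
    reversed_line = line[::-1]
    # NFKC applies compatibility mappings, which cover the Arabic
    # Presentation Forms blocks. It also rewrites other compatibility
    # characters, so use it only on text where that is acceptable.
    return unicodedata.normalize("NFKC", reversed_line)


# Three shaped letters stored left to right as they appear on the page.
shaped = "\ufe8f\ufe8e\ufe91"
print([f"U+{ord(c):04X}" for c in visual_to_logical(shaped)])
# -> ['U+0628', 'U+0627', 'U+0628']
```

This reverses whole lines, so any embedded LTR runs such as numbers or Latin words would come out backwards and need separate handling.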
Best Regards,
Paulo Soares
Tristan Miller <psych...@nothingisreal.com> wrote in message news:<60385706....@ID-187157.News.Individual.NET>...
> Any other
> suggestions would be appreciated.
Just out of curiosity, would you mind sending me this PDF?
I've no idea whether my attempts would be of any help, but it sounds interesting.
--
I am your root. If you see me laughing, you better have a backup.
Have you tried pdftotext (from my open source Xpdf package)? Linux
and Windows binaries are available from
http://www.foolabs.com/xpdf/download.html
It will output UTF-8 ("pdftotext -enc UTF-8 ..."), and it has
(somewhat rudimentary) support for right-to-left scripts.
I've run into some problems with Arabic PDF files, in which the
ToUnicode mappings are completely broken, i.e., whatever software is
creating the PDF files is including incorrect Unicode mapping info.
But if you were able to copy text using Acrobat, then pdftotext should
work.
- Derek
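For anyone scripting this, a small sketch that wraps the command quoted above. It assumes the pdftotext binary from Xpdf is installed and on the PATH; the function names and file names are placeholders invented for illustration.

```python
import subprocess
from typing import List


def build_pdftotext_command(pdf_path: str, txt_path: str) -> List[str]:
    """Build the argument list for 'pdftotext -enc UTF-8 <pdf> <txt>'."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, txt_path]


def extract_text(pdf_path: str, txt_path: str) -> None:
    """Run pdftotext; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)


# Example call (hypothetical file names):
#   extract_text("document.pdf", "document.txt")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces or non-ASCII characters.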
In article <slrncc9nic...@glyphandcog.com>, Derek B. Noonburg
wrote:
>> I have a two-page PDF document with minimal formatting written in Farsi
>> (Persian), which uses an RTL Arabic script. I would like to extract the
>> text into a plain text file in whatever encoding -- UTF-8, UTF-16,
>> CP-1256, or ISO-8859-6 would be fine.
>
> Have you tried pdftotext (from my open source Xpdf package)?
This worked almost perfectly -- thanks! The only problem was that in
some places in the original document where LTR text was embedded in an RTL
paragraph, the LTR text came out reversed. But there were only a couple of
instances of this, so they're easily fixed by hand.
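If there were more than a couple of instances, the hand fix could be roughed out along these lines. This is only a sketch: it assumes the reversed LTR fragments are plain ASCII letters and digits, and the pattern and function name are invented here.

```python
import re

# A run that starts and ends with an ASCII letter or digit and may contain
# spaces and a few common punctuation marks in between.
LATIN_RUN = re.compile(r"[A-Za-z0-9][A-Za-z0-9 .,:/@_-]*[A-Za-z0-9]")


def unreverse_latin_runs(line):
    """Reverse every ASCII run in the line, leaving other text untouched."""
    return LATIN_RUN.sub(lambda match: match.group(0)[::-1], line)


sample = "\u0627\u0628 xuniL/UNG \u062a"
print(unreverse_latin_runs(sample) == "\u0627\u0628 GNU/Linux \u062a")
# -> True
```

It flips every ASCII run it finds, including ones that were already correct, so it should be applied only to lines known to be affected.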
PDFs are NOTORIOUSLY POOR at handling non-Roman
text in a compatible manner.

I have been seeking a general solution to this for Arabic
for some time.
--
Herb Martin
"Tristan Miller" <psych...@nothingisreal.com> wrote in message
news:60385706....@ID-187157.News.Individual.NET...
In article <sThxc.42948$4x2....@fe2.texas.rr.com>, Herb Martin wrote:
> PDF's are NOTORIOUSLY POOR at handling non-Roman
> text in a compatible manner.
>
> I have been seeking a general solution to this for Arabic
> for some time.
If you're reading this thread in soc.culture.arabic, take a look at the
same in comp.text.pdf. A solution has been posted (pdftotext, part of the
Xpdf package) which works at least in my case -- perhaps it also works in
the general case.
I will give it a try -- on your recommendation -- but without
much optimism, as I have tried a variety of "to text" utilities,
code pages/encodings, and cut-and-paste methods and have failed
to copy most Arabic script from PDFs.
(I am trying to build a study sheet/word list feed to the "flash
card" program "Pauker".)
--
Herb Martin
"Tristan Miller" <psych...@nothingisreal.com> wrote in message
news:1213585.9...@ID-187157.News.Individual.NET...
Thank you but it did not work for me -- I doubt that it is
any fault of your excellent program but rather in the
category of "files with fonts mangled so badly that only
OCR can extract the text."
Thanks (truly)
Herb.
"Derek B. Noonburg" <der...@glyphandcog.com> wrote in message
news:slrncc9nic...@glyphandcog.com...