
Convert Farsi (Arabic) PDF to plain text


Tristan Miller

Jun 5, 2004, 6:22:09 PM
Greetings.

I have a two-page PDF document with minimal formatting written in Farsi
(Persian), which uses an RTL Arabic script. I would like to extract the
text into a plain text file in whatever encoding -- UTF-8, UTF-16,
CP-1256, or ISO-8859-6 would be fine. I'm using KDE 3.2.2 on GNU/Linux.
I have access to Adobe Acrobat Reader and all the standard GhostScript
tools. Access to a machine running Windows 2000 could be arranged at some
inconvenience.

Opening the PDF in Acroread and selecting the text with the mouse or with
Edit->Select All and then pasting it into a Unicode-capable text editor
(Kate) doesn't seem to work. For example, for the title line I get the
following gibberish which doesn't translate to Farsi text in any of the
aforementioned encodings.

*'('.*F' Ė G1'( 1/ GĖF'Ė( 2004 '~H1' F'ED1'~ F'G, 13'13 1/ E3ĖD'Ė3H3 'Ė
Ė&'~H1' Ė1'/ GĖ'E13

I have also tried using pdf2ps | ps2ascii with no success. Any other
suggestions would be appreciated.

Regards,
Tristan

--
_
_V.-o Tristan Miller [en,(fr,de,ia)] >< Space is limited
/ |`-' -=-=-=-=-=-=-=-=-=-=-=-=-=-=-= <> In a haiku, so it's hard
(7_\\ http://www.nothingisreal.com/ >< To finish what you

Tristan Miller

Jun 7, 2004, 8:16:53 AM
Greetings.

In article <9969392.8...@ID-187157.News.Individual.NET>, Tristan Miller wrote:
> I have a two-page PDF document with minimal formatting written in Farsi
> (Persian), which uses an RTL Arabic script. I would like to extract the
> text into a plain text file in whatever encoding -- UTF-8, UTF-16,
> CP-1256, or ISO-8859-6 would be fine. I'm using KDE 3.2.2 on GNU/Linux.
> I have access to Adobe Acrobat Reader and all the standard GhostScript
> tools. Access to a machine running Windows 2000 could be arranged at
> some inconvenience.

I had a friend of mine running Windows open the PDF in Acrobat, select the
text, and paste it into Microsoft Word. This worked, except that the
characters in each line were reversed. I suspect this has something to do
with the LTR/RTL text direction settings, but so far we've been unable to
correct the problem. Saving the resulting Word file as ISO-8859-6 text
and then using the "rev" Unix command doesn't reproduce the original
document -- probably something to do with different Arabic letterforms
being used for word-initial and word-terminal letters.
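
A minimal sketch of the repair attempted above, under two assumptions that the thread does not confirm for this file: the pasted text is stored in visual (left-to-right) order, and it uses Unicode Arabic presentation forms (U+FB50-U+FDFF, U+FE70-U+FEFF) rather than base letters. It works on the Unicode text before any conversion to ISO-8859-6. The function name `visual_to_logical` is made up for illustration.

```python
import unicodedata


def visual_to_logical(line: str) -> str:
    """Reverse a visual-order line, then fold presentation forms to base letters.

    The reversal comes first so that a ligature such as LAM WITH ALEF
    expands into lam + alef in the correct (logical) order.
    """
    reversed_line = line[::-1]
    # NFKC replaces initial/medial/final/isolated forms with base letters.
    return unicodedata.normalize("NFKC", reversed_line)


# Visual-order example: meem (isolated), lam-alef ligature (final), seen (initial).
visual = "\uFEE1\uFEFC\uFEB3"
logical = visual_to_logical(visual)
print([f"U+{ord(ch):04X}" for ch in logical])
```

Note that a plain reversal also flips embedded digits and Latin words and moves combining marks in front of their base letters, and that NFKC changes other compatibility characters too, so the result still needs checking against the original.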

In the meantime, if anyone has further suggestions, please let me know!

Paulo Soares

Jun 7, 2004, 1:50:22 PM
You may need the Acrobat Middle Eastern version for Arabic pasting to work.

Best Regards,
Paulo Soares

Tristan Miller <psych...@nothingisreal.com> wrote in message news:<60385706....@ID-187157.News.Individual.NET>...

Uli Wachowitz

Jun 7, 2004, 2:36:40 PM
Tristan Miller <psych...@nothingisreal.com> wrote:

> Any other
> suggestions would be appreciated.

Just out of curiosity, would you mind sending me this PDF?
I've no idea if my attempts would be of any help, but it sounds interesting.


--
I am your root. If you see me laughing, you better have a backup.

Derek B. Noonburg

Jun 7, 2004, 5:29:48 PM
In article <9969392.8...@ID-187157.News.Individual.NET>, Tristan Miller wrote:
> Greetings.
>
> I have a two-page PDF document with minimal formatting written in Farsi
> (Persian), which uses an RTL Arabic script. I would like to extract the
> text into a plain text file in whatever encoding -- UTF-8, UTF-16,
> CP-1256, or ISO-8859-6 would be fine. I'm using KDE 3.2.2 on GNU/Linux.
> I have access to Adobe Acrobat Reader and all the standard GhostScript
> tools. Access to a machine running Windows 2000 could be arranged at some
> inconvenience.

Have you tried pdftotext (from my open source Xpdf package)? Linux
and Windows binaries are available from
http://www.foolabs.com/xpdf/download.html

It will output UTF-8 ("pdftotext -enc UTF-8 ..."), and it has
(somewhat rudimentary) support for right-to-left scripts.
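
A small sketch of driving that command from a script, assuming the `pdftotext` binary is installed and on the PATH. The file names and the helper names `build_pdftotext_command` and `extract_text` are hypothetical; only the `-enc UTF-8` option comes from the post above.

```python
import subprocess
from typing import List


def build_pdftotext_command(pdf_path: str, txt_path: str) -> List[str]:
    """Return the argument list for: pdftotext -enc UTF-8 <pdf> <txt>."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, txt_path]


def extract_text(pdf_path: str, txt_path: str) -> None:
    """Run pdftotext; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)


# Example call (file names are placeholders):
#     extract_text("document.pdf", "document.txt")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces or non-ASCII characters.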

I've run into some problems with Arabic PDF files, in which the
ToUnicode mappings are completely broken, i.e., whatever software is
creating the PDF files is including incorrect Unicode mapping info.
But if you were able to copy text using Acrobat, then pdftotext should
work.
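
One rough way to spot the badly broken case is to measure how much of the extracted text falls in the Unicode Arabic blocks at all. This is only a heuristic sketch, and `arabic_ratio` is a made-up helper name: a wrong ToUnicode map can still produce Arabic-block characters that are simply the wrong letters, which this check cannot detect.

```python
def is_arabic(ch: str) -> bool:
    """True for code points in the Arabic block or the presentation-form blocks."""
    cp = ord(ch)
    return (
        0x0600 <= cp <= 0x06FF      # Arabic
        or 0xFB50 <= cp <= 0xFDFF   # Arabic Presentation Forms-A
        or 0xFE70 <= cp <= 0xFEFF   # Arabic Presentation Forms-B
    )


def arabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Arabic-script code points."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_arabic(ch) for ch in chars) / len(chars)


print(arabic_ratio("\u0633\u0644\u0627\u0645"))  # four Arabic letters
print(arabic_ratio("*'('.*F'"))                  # ASCII-only gibberish
```

A ratio near zero for a document known to be Arabic or Farsi suggests the extraction did not yield usable Unicode.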

- Derek

Tristan Miller

Jun 8, 2004, 6:18:52 AM
Greetings.

In article <slrncc9nic...@glyphandcog.com>, Derek B. Noonburg wrote:
>> I have a two-page PDF document with minimal formatting written in Farsi
>> (Persian), which uses an RTL Arabic script. I would like to extract the
>> text into a plain text file in whatever encoding -- UTF-8, UTF-16,
>> CP-1256, or ISO-8859-6 would be fine.
>

> Have you tried pdftotext (from my open source Xpdf package)?

This worked almost perfectly -- thanks! The only problem was that in
some cases where the original document had LTR text embedded in an RTL
paragraph, the LTR text came out reversed. But there were only a couple of
instances of this, so they're easily fixed by hand.
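
For a longer document, the by-hand fix could be scripted roughly as follows. This is a sketch under the assumption that the affected runs consist of ASCII letters, digits and a little punctuation; the name `reverse_ltr_runs` and the character set in the pattern are illustrative choices, not anything pdftotext provides.

```python
import re

# A run starts and ends with an ASCII letter or digit and may contain
# spaces and some common punctuation in between.
LTR_RUN = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9 .,:/@_-]*[A-Za-z0-9])?")


def reverse_ltr_runs(line: str) -> str:
    """Reverse every ASCII run in the line, leaving all other text untouched."""
    return LTR_RUN.sub(lambda m: m.group(0)[::-1], line)


# Two Arabic letters, a reversed English phrase, one Arabic letter.
sample = "\u0633\u0644 dlroW olleH \u0645"
print(reverse_ltr_runs(sample))
```

Since only some of the embedded runs were reversed, this should be applied only to the lines known to be affected; running it over correct lines would reverse their LTR text instead.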

Herb Martin

Jun 8, 2004, 8:00:24 AM
PDFs are NOTORIOUSLY POOR at handling non-Roman
text in a compatible manner.

I have been seeking a general solution to this for Arabic
for some time.

--
Herb Martin


"Tristan Miller" <psych...@nothingisreal.com> wrote in message
news:60385706....@ID-187157.News.Individual.NET...

Tristan Miller

Jun 8, 2004, 9:02:29 AM
Greetings.

In article <sThxc.42948$4x2....@fe2.texas.rr.com>, Herb Martin wrote:
> PDFs are NOTORIOUSLY POOR at handling non-Roman
> text in a compatible manner.
>
> I have been seeking a general solution to this for Arabic
> for some time.

If you're reading this thread in soc.culture.arabic, take a look at the
same in comp.text.pdf. A solution has been posted (pdftotext, part of the
Xpdf package) which works at least in my case -- perhaps it also works in
the general case.

Herb Martin

Jun 8, 2004, 11:15:20 AM
> If you're reading this thread in soc.culture.arabic, take a look at the
> same in comp.text.pdf. A solution has been posted (pdftotext, part of the
> Xpdf package) which works at least in my case -- perhaps it also works in
> the general case.


I will give it a try -- on your recommendation -- but without
much optimism, as I have tried a variety of "to text" utilities,
code pages/encodings, and cut-and-paste methods and failed
to copy most Arabic script from PDFs.

(I am trying to build a study sheet/word list feed to the "flash
card" program "Pauker".)

--
Herb Martin


"Tristan Miller" <psych...@nothingisreal.com> wrote in message

news:1213585.9...@ID-187157.News.Individual.NET...

Herb Martin

Jun 8, 2004, 12:08:46 PM
> Have you tried pdftotext (from my open source Xpdf package)? Linux
> and Windows binaries are available from
> http://www.foolabs.com/xpdf/download.html


Thank you but it did not work for me -- I doubt that it is
any fault of your excellent program but rather in the
category of "files with fonts mangled so badly that only
OCR can extract the text."


Thanks (truly)
Herb.

"Derek B. Noonburg" <der...@glyphandcog.com> wrote in message
news:slrncc9nic...@glyphandcog.com...

zbin...@gmail.com

Nov 6, 2015, 1:31:59 AM
You can try this free online PDF-to-text converter, http://www.online-code.net/pdf-to-word.html, to convert a PDF to a plain text file.