copy Chinese from PDF

Nathan Sturtevant

Apr 13, 2006, 1:46:45 AM
to chine...@googlegroups.com
I have a pdf file with Chinese in it which I'd like to copy out for easier reading. But, when I copy and paste the text, I essentially get garbage:

e8+4 . 6.EIjl ii k'4-4.8 *- , {1.4+
,,1i l! " ?A6ia1+|it+tL+ H,h:' ,\..U.l'.r+
vL+a ';h$ -l ,*.)L"

This is the same with both Apple's Preview and Adobe's Acrobat Reader. Does anyone know a way to get the text out intact?

Thanks,

Nathan

Eric Rasmussen

Apr 13, 2006, 2:10:46 PM
to chine...@googlegroups.com
That is rather strange-looking garbage. It doesn't look like the
usual result of an encoding problem.

It might stem from the type of font that is embedded in the PDF, or
the way it is embedded, or something like that. I don't know,
obviously. Is it an old document? Or something recent?

ER

Magnus Lewan

Apr 13, 2006, 3:01:01 PM
to chine...@googlegroups.com
I think that was an excellent question, because it taught me that you occasionally can copy Chinese text from PDFs. I have failed so often in the past that I had simply given it up completely.

My experiments today showed that if I print to PDF from TextEdit using the built-in Mac OS X Print to PDF function, I can copy and paste from the resulting file. However, if I print to PDF using the expensive Adobe Acrobat program for professionals, the pasted output is rubbish. The file displays fine. It prints fine. But copy and paste is a no-no. If I paste the text from my Adobe output and drag a sample character to the Character Palette, it tells me that I am in the Supplementary Private Use Characters-B area, and when Character Palette starts talking about Private Use areas, I throw in the towel.
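
The same check can be made without Character Palette by looking at the code points of the pasted text directly. The sketch below is only an illustration: the helper names `private_use_area` and `describe` are made up for this example, and the ranges are the three Private Use ranges of the Unicode Standard, which calls the last one "Supplementary Private Use Area-B".

```python
import unicodedata

# Private Use ranges as defined by the Unicode Standard.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF, "Private Use Area (BMP)"),
    (0xF0000, 0xFFFFD, "Supplementary Private Use Area-A"),
    (0x100000, 0x10FFFD, "Supplementary Private Use Area-B"),
]


def private_use_area(ch):
    """Return the name of the Private Use range containing ch, or None."""
    cp = ord(ch)
    for lo, hi, name in PRIVATE_USE_RANGES:
        if lo <= cp <= hi:
            return name
    return None


def describe(text):
    """Return a (code point, label) pair for each non-whitespace character."""
    report = []
    for ch in text:
        if ch.isspace():
            continue
        # Private Use characters have no Unicode name, so label them by range.
        label = private_use_area(ch) or unicodedata.name(ch, "unnamed")
        report.append((f"U+{ord(ch):04X}", label))
    return report


if __name__ == "__main__":
    # Replace this string with the text pasted from the PDF.
    for code_point, label in describe("中 \U00100041"):
        print(code_point, label)
```

If most of the pasted characters are reported as Private Use, the PDF probably carries no usable mapping from its glyphs back to real Unicode characters, which would explain why the text displays and prints correctly but does not survive copy and paste.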

So, sorry, Nathan, I have no solution to the problem.

Cheers
M

Eric Rasmussen

Apr 13, 2006, 3:31:57 PM
to chine...@googlegroups.com
On Apr 13, 2006, at 3:01 PM, Magnus Lewan wrote:
> I think that was an excellent question, because it taught me that
> you occasionally can copy Chinese text from PDFs. I have failed so
> often in the past that I had simply given it up completely.

I do know that the Text Tool in Preview works for the CBETA (Buddhist
canon) PDFs. So it is not something inherent in the PDF format.

It is possible for the author of a PDF document to use Acrobat's
document-security settings to prevent readers from copying its
content. But I don't think that results in garbage text; rather, it
results in not being able to select the text or copy it to the
clipboard in the first place.

Eric

Nien-Po Chen

Apr 13, 2006, 10:57:36 PM
to Chinese Mac
Nathan,

Can you look at the PDF file's properties and see its creator?
The information is available in Acrobat Reader under the 'File' menu.

I can create PDF files with this behavior by using LaTeX CJK and
pdflatex's subfont mechanism. It is due to a font-management
limitation in LaTeX: it can only handle 256 characters in a given
font. To handle the large character counts of Chinese fonts, it
divides the Chinese characters into several subfonts and assigns each
Chinese character to a slot in one of them. To remedy the situation,
use dvipdfmx instead of pdflatex in the LaTeX CJK workflow; it uses
the CID-keyed font mechanism and consequently yields a PDF file with
normal copy-and-paste behavior for Chinese characters. You can find
more detailed information on LaTeX CJK at
<http://edt1023.sayya.org/tex/mycjk/node3.html#SECTION00320000000000000000>
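
The effect of splitting a large character set into 256-character fonts can be sketched in a few lines. This is only a toy model under a stated assumption: it supposes the split follows Unicode order, so the high byte of the code point picks the subfont and the low byte picks the slot. The real layout depends on the encoding and fonts in use, and the function names `subfont_slot` and `naive_copy` are hypothetical.

```python
def subfont_slot(ch, chars_per_font=256):
    """Map a character to (subfont index, slot within that subfont).

    Assumption: subfonts are cut from the character set in Unicode order.
    """
    return divmod(ord(ch), chars_per_font)


def naive_copy(text):
    """Simulate copying text when only the slot number of each glyph is known.

    Without a table leading from (subfont, slot) back to the original
    character, all that is left is a number from 0 to 255, which comes
    out as an unrelated one-byte character.
    """
    return "".join(chr(subfont_slot(ch)[1]) for ch in text)


if __name__ == "__main__":
    for ch in "中文":
        index, slot = subfont_slot(ch)
        print(f"U+{ord(ch):04X} -> subfont 0x{index:02X}, slot 0x{slot:02X}")
    print(repr(naive_copy("中文")))
```

In this model the glyph still looks right on screen, because the subfont draws the correct shape in that slot, but the copied text keeps only the slot numbers.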

Your PDF file may or may not have been generated by the above method,
but the principle should be similar. I don't think there's an easy way
to remedy this from your side.

I consider this an unintentional copy-protection effect. :)

Nobumi Iyanaga

Apr 14, 2006, 1:23:54 AM
to chine...@googlegroups.com
Hello,

I don't know about LaTeX CJK, but using pLaTeX2e with the otfcjk
package, I can generate PDFs (with CJK) whose text can be copied...
(on the otfcjk package, you can have a look at
<http://oku.edu.mie-u.ac.jp/~okumura/texwiki/?xyzzy#xyutfasd>
[in Japanese]).

On the other hand, there is a utility named "PDF2RTFService.service";
with it, you can open PDF files with TextEdit (and other RTF
applications, like Jedit X) and edit them (you must save the result
in another RTF file, using Save As...). For
"PDF2RTFService.service", see
<http://www.devon-technologies.com/products/freeware/services.html>.
Note that "PDF2RTFService.service" is an invisible utility; you put it
in your ~/Library/Services/, and voila, without doing anything, you
will be able to open PDF, PostScript and Encapsulated PostScript
(EPS) files as paginated rich text documents...

Best regards,

Nobumi Iyanaga
Tokyo,
Japan

Magnus Lewan

Apr 14, 2006, 2:01:42 AM
to chine...@googlegroups.com
Nifty tool!

A PDF created using Mac OS X's Print to PDF opens fine in TextEdit with
all Chinese characters (but no formatting or pictures, of course).
However, the "corrupt" file I created from Adobe Acrobat cannot be
opened; it triggers the message "File filename.pdf could not be opened".

I also tried saving the "corrupt" file as RTF, DOC and HTML directly
from Adobe Acrobat Professional, and none of the Chinese characters
came out right.

It is somewhat frustrating. You can see the correct characters, so
you know they are there in the file, but there seems to be no way of
getting them out.

Cheers
M

On 14 Apr 2006, at 07:23, Nobumi Iyanaga wrote:
>
> On the other hand, there is a utility named
> "PDF2RTFService.service"; with it, you can open pdf files with
> TextEdit (and other rtf applications, like Jedit X), and edit them
> (you must save the result in another rtf file, using Save As...).
> For "PDF2RTFService.service", see
> <http://www.devon-technologies.com/products/freeware/services.html>.
> Note that
> "PDF2RTFService.service" is an invisible utility; you put it in
> your ~/Library/Services/, and voila, without doing anything, you
> will be able to open pdf files, PostScript and Encapsulated
> PostScript (EPS) files as paginated rich text documents...

http://lewan.chez-alice.fr/


Joe Wicentowski

Apr 14, 2006, 8:22:46 AM
to chine...@googlegroups.com
Hello Magnus,

Could it be that this PDF is a scanned image which Acrobat tried
(unsuccessfully) to perform OCR on? One of the options when you scan
a document with Acrobat is to perform OCR on the document (thus
allowing copy & paste from the PDF) while preserving the original
image. The file size might give you a hint; if the file is huge
compared to a PDF of the same length generated by, say, an MS-Word
document converted to PDF, then it's likely that this is the case.

- Joe

Magnus Lewan

Apr 14, 2006, 8:41:43 AM
to chine...@googlegroups.com
Nope. Created from a pure text document in TextEdit. Print and then use Adobe's print to PDF option. In the resulting file, the text can be selected using the "select text" tool, and it scales fine without any pixels appearing.

It's the same result if you generate it from MS-Word using the Adobe option, I think.

By the way, I tried to create a PDF in Adobe Acrobat Pro directly from the URL http://zh.wikipedia.org/ . It bluntly refused, probably because the redirected URL contains hanzi. Adobe are not doing their best to help their Chinese-speaking users.

Cheers
M
--
http://lewan.chez-alice.fr/

Joe Wicentowski

Apr 14, 2006, 9:19:39 AM
to chine...@googlegroups.com
Sorry, Magnus! I meant to address my question to Nathan, regarding the PDF he originally posted about. So, Nathan:

Could it be that this PDF is a scanned image which Acrobat tried (unsuccessfully) to perform OCR on?  One of the options when you scan a document with Acrobat is to perform OCR on the document (thus allowing copy & paste from the PDF) while preserving the original image.  The file size might give you a hint; if the file is huge compared to a PDF of the same length generated by, say, an MS-Word document converted to PDF, then it's likely that this is the case.

- Joe

Nathan Sturtevant

Apr 15, 2006, 1:41:56 AM
to chine...@googlegroups.com
To answer two different questions:

> Can you look at the particular PDF file's property and see its
> creator?
> The information is available in acrobat reader menu item 'file'.

The file has no author. The application listed is "Canon". Under
Fonts, it lists an ANSI encoding.

> Could it be that this PDF is a scanned image which Acrobat tried
> (unsuccessfully) to perform OCR on? One of the options when you
> scan a document with Acrobat is to perform OCR on the document
> (thus allowing copy & paste from the PDF) while preserving the
> original image. The file size might give you a hint; if the file
> is huge compared to a PDF of the same length generated by, say, an
> MS-Word document converted to PDF, then it's likely that this is
> the case.

That actually seems like the most likely possibility. Parts of the
document do look scanned, and when I zoom in the text isn't smooth.
Acrobat did a good enough job finding the characters that it had me
fooled when I could select them.

Thanks for the help!

Nathan
