It might stem from the type of font that is embedded in the PDF, or
the way it is embedded, or something like that. I don't know,
obviously. Is it an old document? Or something recent?
ER
I do know that the Text Tool in Preview works for the CBETA (Buddhist
canon) PDFs. So it is not something inherent in the PDF format.
It is possible for the author of a PDF document to use Acrobat's
document-security settings to prevent readers from copying its
content. But I don't think that results in garbage text; rather, it
results in not being able to select the text or copy it to the
clipboard in the first place.
Eric
Can you look at the PDF file's properties and see its creator? The
information is available under the 'File' menu in Acrobat Reader.
I can create PDF files with this behavior by using LaTeX CJK with
pdflatex's subfont mechanism. It is due to a font-management
limitation in LaTeX: it can handle only 256 characters in a given
font. To cope with the large character counts of Chinese fonts, it
divides the Chinese characters among several sub-fonts and assigns
each character to a slot in one of them, so the text stored in the PDF
no longer matches the original character codes. To remedy the
situation, use dvipdfmx instead of pdflatex in the LaTeX CJK workflow.
It uses the CID-keyed font mechanism and consequently yields a PDF
file with normal copy-and-paste behavior for Chinese characters. You
can find more detailed information on LaTeX CJK at
<http://edt1023.sayya.org/tex/mycjk/node3.html#SECTION00320000000000000000>
Your PDF file may or may not have been generated by the above method,
but the principle should be similar. I don't think there's an easy way
to remedy this from your side.
I consider this an 'unintentional' copy-protection effect. :)
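
To make the sub-font idea above concrete, here is a minimal Python
sketch. It assumes a simplified scheme in which the high byte of a
character's code selects the sub-font and the low byte is the slot
inside it; the function names are hypothetical, and the real LaTeX CJK
font tables may be organized differently. It only illustrates why text
copied without a reverse mapping comes out as garbage.

```python
def subfont_slot(ch):
    """Split a character into (sub-font index, slot), 256 slots per sub-font.

    Assumption: the sub-font is chosen by the high byte of the code point.
    """
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("this sketch only covers 16-bit character codes")
    return divmod(code, 256)


def naive_extract(slots):
    """Simulate copying text when only the one-byte slot numbers are known."""
    return bytes(slots).decode("latin-1")


text = "中文"
pairs = [subfont_slot(ch) for ch in text]
garbled = naive_extract([slot for _, slot in pairs])

print(pairs)          # which sub-font and slot each character landed in
print(repr(garbled))  # what a copy operation would yield in this model
```

In this model the sub-font index is lost at copy time, so two different
characters that share a slot number are indistinguishable. A CID-keyed
font avoids the problem because one font can address all of its
characters directly.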
I don't know about LaTeX CJK, but using pLaTeX2e with the otfcjk
package, I can generate PDFs (with CJK) whose text can be copied. (On
the otfcjk package, have a look at
<http://oku.edu.mie-u.ac.jp/~okumura/texwiki/?xyzzy#xyutfasd> [in
Japanese].)
On the other hand, there is a utility named "PDF2RTFService.service";
with it, you can open PDF files with TextEdit (and other RTF
applications, like Jedit X) and edit them (you must save the result
in another RTF file, using Save As...). For
"PDF2RTFService.service", see
<http://www.devon-technologies.com/products/freeware/services.html>.
Note that "PDF2RTFService.service" is an invisible utility; you put it
in your ~/Library/Services/, and voila, without doing anything else,
you will be able to open PDF files, PostScript and Encapsulated
PostScript (EPS) files as paginated rich text documents...
Best regards,
Nobumi Iyanaga
Tokyo,
Japan
A PDF created using Mac OS X's Print to PDF opens fine in TextEdit with
all Chinese characters (but no formatting or pictures, of course).
However, the "corrupt" file I created from Adobe Acrobat cannot be
opened; it triggers the message "File filename.pdf could not be opened".
I also tried saving the "corrupt" file as RTF, DOC and HTML directly
from Adobe Acrobat Professional, and none of the Chinese characters
came out right.
It is somewhat frustrating. You can see the correct characters, so
you know they are there in the file, but there seems to be no way of
getting them out.
Cheers
M
On 14 Apr 2006, at 07:23, Nobumi Iyanaga wrote:
>
> On the other hand, there is a utility named
> "PDF2RTFService.service"; with it, you can open pdf files with
> TextEdit (and other rtf applications, like Jedit X), and edit them
> (you must save the result in another rtf file, using Save As...).
> For "PDF2RTFService.service", see
> <http://www.devon-technologies.com/products/freeware/services.html>.
> Note that
> "PDF2RTFService.service" is an invisible utility; you put it in
> your ~/Library/Services/, and voila, without doing anything, you
> will be able to open pdf files, PostScript and Encapsulated
> PostScript (EPS) files as paginated rich text documents...
Could it be that this PDF is a scanned image which Acrobat tried
(unsuccessfully) to perform OCR on? One of the options when you scan
a document with Acrobat is to perform OCR on the document (thus
allowing copy & paste from the PDF) while preserving the original
image. The file size might give you a hint: if the file is huge
compared to a PDF of the same length produced from, say, an MS Word
document, then it's likely that this is the case.
- Joe
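
The file-size hint above can be turned into a rough check. This is only
a sketch in Python: the 100,000 bytes-per-page threshold and the
function names are my own assumptions, not anything Acrobat defines,
and the page count has to be read off the document yourself.

```python
def bytes_per_page(file_size_bytes, page_count):
    """Average number of bytes the PDF spends on each page."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return file_size_bytes / page_count


def probably_scanned(file_size_bytes, page_count, threshold=100_000):
    """Guess whether a PDF holds scanned page images.

    Assumption: text-only pages exported from a word processor stay well
    below `threshold` bytes each, while scanned pages exceed it.
    """
    return bytes_per_page(file_size_bytes, page_count) > threshold


# Example figures (made up for illustration): a 5 MB, 10-page file
# versus a 200 KB, 10-page file.
print(probably_scanned(5_000_000, 10))
print(probably_scanned(200_000, 10))
```

A text document with many embedded pictures would also trip this check,
so treat the result as a hint and zoom in on the text to confirm.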
> Can you look at the particular PDF file's property and see its
> creator?
> The information is available in acrobat reader menu item 'file'.
The file has no author. The application listed is "Canon". Under
Fonts, it lists an ANSI encoding.
> Could it be that this PDF is a scanned image which Acrobat tried
> (unsuccessfully) to perform OCR on? One of the options when you
> scan a document with Acrobat is to perform OCR on the document
> (thus allowing copy & paste from the PDF) while preserving the
> original image. The file size might give you a hint; if the file
> is huge compared to a PDF of the same length generated by, say, an
> MS-Word document converted to PDF, then it's likely that this is
> the case.
That actually seems like the most likely possibility. Parts of the
document do look scanned, and when I zoom in the text isn't smooth.
Acrobat did a good enough job finding the characters that it had me
fooled when I could select them.
Thanks for the help!
Nathan