I am using Python 3 on Windows 10 (though OS X is also available). I am attempting to extract the text from a large number of .pdf files, all in Chinese characters. I have had success with pdfminer and textract, except for certain files. These files are not images, but proper documents with selectable text. If I use Adobe Acrobat Pro X and export to .txt, my output looks like:
I have no idea why Adobe would display the text properly but not export it correctly or even let me copy-paste it. I thought maybe it was a font issue. The font is DFKaiShu sb-estd-bf, which I already have installed (it appears to come with Windows automatically).
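One way to find the problem files in a large batch is to measure how much of the extracted text is actually Chinese. The following is a minimal sketch, not part of pdfminer or textract: it assumes the text has already been extracted to a Python string by whichever tool is in use, and the function names and the 0.5 threshold are illustrative choices.

```python
# Code point ranges for CJK ideographs (unified, Extension A,
# compatibility ideographs, Extension B). Punctuation is not counted.
CJK_RANGES = (
    (0x3400, 0x4DBF),
    (0x4E00, 0x9FFF),
    (0xF900, 0xFAFF),
    (0x20000, 0x2A6DF),
)


def is_cjk(char):
    """Return True if the character is a CJK ideograph."""
    code = ord(char)
    return any(low <= code <= high for low, high in CJK_RANGES)


def cjk_ratio(text):
    """Fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if is_cjk(c)) / len(chars)


def looks_garbled(text, min_cjk=0.5):
    """Flag text whose CJK share is below the (assumed) threshold.

    Empty output is flagged too, since it also means extraction failed.
    """
    return cjk_ratio(text) < min_cjk
```

Files flagged this way can then be set aside for a different approach, such as OCR, while the rest go through the normal extraction path.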
I'm trying to help someone create a PDF from an MS Word 2011 document in OS X. She normally does the layout in Word, then, to generate the PDF, goes to the Print dialog to convert the file to PS, which she then opens in Distiller to make the PDF (generating the PDF directly from the Print dialog in Word creates visual oddities in the images, so she says she's kind of stuck doing it this way). For some reason, though, she's getting an error where Distiller crashes when it hits a specific font (Proxima-Nova, which she says she is NOT using in the file). How can this be resolved? In this forum they were trying to solve the issue by reinstalling the Proxima-Nova font, but it isn't clear whether that's definitely the answer, or whether there are any steps to take in the Distiller settings beyond just downloading and installing the font.
That error is very specific. The font is in fact embedded in the PostScript file that is being distilled, but it apparently has an error in the font definition itself. The font file itself installed on that Mac may be corrupted or the font itself might be crufty. Obviously, if the document's author doesn't believe she is using the font, one would need to thoroughly go through the Word document and try to find and eliminate the reference to the font.
Why in the world would they create PDF from Word, especially under MacOS, by creating PostScript and then distilling? Even without Acrobat installed, you can easily create PDF from Word both directly from Word and from the print dialog. Adding Acrobat provides a means of getting an Adobe Acrobat-optimized PDF file.
The only reason to go through the craziness of creating PostScript and then converting that to PDF would be if you have EPS graphics in your Word document, something that is exceptionally rare these days.
Note that one of the possibilities regarding the unexpected call for a font that one believes is not used in a document is that (1) such a font is used in the underlying Word template and/or (2) a non-printing area within the document such as a tab or paragraph ending is formatted in that font.
Just got another message from this user that they are having similar issues again. Wondering if you'd mind taking a look at this and providing some advice? Would love to find a permanent solution for this client.
The first concerns Distiller finding PDF/X-4-based joboptions. Distiller does not support PDF/X-4 and thus rejects that joboptions file. However, that is only a warning and has nothing to do with the failure to convert the file from PostScript to PDF.
This indicates an actual error in the font file that was inserted into the PostScript file, probably by a printer driver someplace in your workflow. The problem may be the original font itself or how the driver inserted the font (which is actually a font subset).
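To see which fonts the driver actually placed in the PostScript file, the file can be scanned for font markers before it goes to Distiller. This is a rough sketch under assumptions: it relies on the generator emitting the DSC comments `%%BeginFont:` or `%%BeginResource: font`, or a `/FontName /... def` line in the font dictionary, which not every driver does. The function name is made up for illustration.

```python
import re

# DSC comments that bracket an embedded font resource.
DSC_FONT = re.compile(r"^%%Begin(?:Font:|Resource:\s*font)\s+(\S+)", re.MULTILINE)
# The name definition inside a font dictionary.
FONTNAME_DEF = re.compile(r"/FontName\s+/(\S+)\s+def")


def embedded_font_names(ps_text):
    """Return the font names found in PostScript source, in first-seen order."""
    names = []
    for pattern in (DSC_FONT, FONTNAME_DEF):
        for match in pattern.finditer(ps_text):
            name = match.group(1)
            if name not in names:
                names.append(name)
    return names
```

Because PostScript files can contain binary data, reading the file with `open(path, "r", encoding="latin-1")` avoids decoding errors; `path` here stands for wherever the print dialog saved the .ps file.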
That having been said, the fact that the Distiller is croaking on font SMFIZI+ProximaNova-Regular means that a subset font was found in the PostScript and not inserted by the Distiller. Whatever was generating the PostScript (possibly MacOS drivers?) is the likely source of the problem assuming that there wasn't some font handling bug in that older version of Distiller.
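The `SMFIZI+` prefix is what marks the font as a subset: by convention a subset font name carries a tag of six uppercase letters followed by a plus sign in front of the real font name. A small sketch for splitting the two (the function name is illustrative):

```python
import re

# Six uppercase letters, a plus sign, then the base font name.
SUBSET_NAME = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_name(font_name):
    """Return (subset_tag, base_name); subset_tag is None for a full font."""
    match = SUBSET_NAME.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name
```

Applied to the name in the error, this separates the tag `SMFIZI` from the base name `ProximaNova-Regular`, which is the name to search for in the Word document and its template.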
Unless there was EPS artwork in the original Word documents, distilling PostScript to get PDF is absolutely not recommended in modern workflows. More recent versions of Acrobat interface with Word to directly create PDF from the Word document. That is what you should be using, not conversions of PostScript. Even the MacOS PDF creation is better than this PostScript route.
It would be easy to say that using InDesign would solve your problems for this particular situation, but if that person doesn't have InDesign and/or know how to use it (a long and very steep learning curve), you aren't doing them any favors by recommending it.
I'm not 100% sure I'm following all of this but I'm having the same issue. I created a document in Google Docs using the Montserrat font, downloaded it to my computer and am trying to make a PDF from the Word doc. (I need to use Word so others on my team can use the file - they don't have access to InDesign or Google Docs.) Every time I try to export, the formatting changes and it shifts text to different pages. The error that pops up is below. If I change the font the layout works, but I would like to use Montserrat if at all possible.
It's probably not the cause, but there is no situation in which the PDF/X-4:2008 job options should be used with Distiller. Distiller can't make PDF/X-4. This job option is for other apps. The error may be an issue in the font itself or in the PostScript driver.
Fonts begin where character sets end. The characters defined by the encodings inside your computer are abstract, whereas the glyphs defined by a font are concrete visual forms that can be rendered on screen or paper.
Outline fonts are fonts in which glyphs are described mathematically as "outlines," a series of line segments, arcs, and curves. They are fully scalable: to print or display a character, the outline is scaled to the desired size, then rendered by filling the outline with bits or pixels. The information provided here is limited to what the typical Chinese Mac user might want to know. If you want to learn more about font formats and printing technologies, Ken Lunde's CJKV Information Processing is very thorough on these topics.
Developed by Adobe, PostScript is a "page-description" language for printers. It supports both graphics and text, with built-in support for fonts. The most common PostScript font format is Type 1. Chinese PostScript fonts use the CID format, which uses Type 1 character descriptions tailored especially for East Asian writing systems. CID stands for "Character Identifier," which refers to the numbers that are used to index and access the characters in the font. OS X provides full support for all types of PostScript-based fonts.
In 1991, Microsoft adopted Apple's TrueType font format, but they used a different approach to storing the font data, so font files had to be converted between Windows and Macintosh. Regardless, all TrueType fonts contain "cmap" tables that map character codes in various encodings to the font's glyphs. With Mac OS X 10.5 (2007), Apple introduced full support for Windows TrueType font files, but the files must contain Unicode cmap tables. Most Windows 98 and later fonts have them, while most Windows 95 and earlier fonts do not.
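Whether a given font file has a Unicode cmap can be checked by reading the table directory at the start of the file. The sketch below follows the published TrueType/OpenType layout and treats platform 0 (Unicode), or platform 3 (Windows) with encoding 1 or 10, as a Unicode cmap. That definition and the function name are assumptions for illustration; the exact test OS X applies may differ.

```python
import struct


def has_unicode_cmap(font_bytes):
    """Check a single-font TrueType/OpenType file for a Unicode cmap subtable."""
    if font_bytes[:4] == b"ttcf":
        raise ValueError("collection files hold several fonts; check each one")
    # Header: 4-byte version, then the number of tables as a big-endian uint16.
    num_tables = struct.unpack(">H", font_bytes[4:6])[0]
    for i in range(num_tables):
        # Table records are 16 bytes each and start after the 12-byte header.
        rec = 12 + 16 * i
        tag, _checksum, offset, _length = struct.unpack(
            ">4sIII", font_bytes[rec:rec + 16]
        )
        if tag != b"cmap":
            continue
        _version, num_subtables = struct.unpack(">HH", font_bytes[offset:offset + 4])
        for j in range(num_subtables):
            # Encoding records are 8 bytes: platform, encoding, subtable offset.
            enc = offset + 4 + 8 * j
            platform_id, encoding_id, _sub_offset = struct.unpack(
                ">HHI", font_bytes[enc:enc + 8]
            )
            if platform_id == 0 or (platform_id == 3 and encoding_id in (1, 10)):
                return True
        return False
    return False
```

An older font that carries only a legacy subtable, for example platform 3 with the Big5 encoding ID, would return False here.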
OpenType is an open standard developed by Microsoft and Adobe in 1996 to absorb the underlying differences between the TrueType and PostScript formats. OpenType fonts also use cmap tables. There are two kinds of OpenType fonts: those that use PostScript Type 1 names and outlines and carry the .OTF extension, and those that use TrueType names and outlines and carry the .TTF (or .TTC) extension.
TrueType "collections" with the .TTC extension contain multiple fonts, usually different weights of the same font. They can also use the Unicode technology of glyph variants (supported in OS X 10.6 and above) to provide localized glyphs for users in China/Singapore (the "SC" locale), Hong Kong (the "HK" locale), and Taiwan (the "TC" locale).
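File extensions are not always reliable, so the two OpenType flavors and the collection format described above can also be told apart by the first four bytes of the file. This is a minimal sketch based on the published file layout; the function names and return labels are illustrative.

```python
import struct


def sfnt_flavor(font_bytes):
    """Classify a font file by its four-byte signature."""
    magic = font_bytes[:4]
    if magic == b"OTTO":
        return "cff"          # PostScript-style outlines, usually .otf
    if magic in (b"\x00\x01\x00\x00", b"true"):
        return "truetype"     # TrueType outlines, usually .ttf
    if magic == b"ttcf":
        return "collection"   # several fonts in one file, usually .ttc
    return "unknown"


def collection_font_count(font_bytes):
    """Return the number of fonts in a collection file."""
    if font_bytes[:4] != b"ttcf":
        raise ValueError("not a font collection")
    # Collection header: tag, major version, minor version, font count.
    _major, _minor, num_fonts = struct.unpack(">HHI", font_bytes[4:12])
    return num_fonts
```

Only the opening bytes are needed, so reading the first 12 bytes of the file is enough for both functions.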
Note: Formerly part of the TC ("Traditional Chinese") locale, the HK locale became necessary with HKSCS-2016. Previous editions of the HKSCS were compatible with Big Five, but the 2016 standard is Unicode-only and diverges by replacing 22 Big Five characters with variant forms from Unicode. See HKSCS.
One way for individuals to obtain reliable, high-quality Chinese fonts is in retail bundles from established foundries. There aren't many of these companies. The making of an original Chinese font is a huge undertaking, somewhat less so now with the advent of new approaches and advanced technologies, but producing a finished, unique font is still a monumental task, involving a team of people working for months, if not years.
The current model for distributing fonts is via annual subscriptions. Adobe led the way with what is now TypeKit, and the rest of the industry has, for the most part, followed their lead. [NEED MORE DETAILS HERE] [DISCUSSION OF WEB FONTS AND CSS3]
Hong Kong. Formerly DynaLab. Maker of the "DynaFont" [金蝶] line. They are the source of the current Apple fonts LiHei Pro and LiSong Pro in OS X, as well as most of Apple's fonts for Traditional Chinese in the Chinese Language Kit and OS 9. They also make the MingLiU/PMingLiU and DFKai-SB fonts that come with Windows. Most recently, their Shanghai, Hong Kong, and Taiwan divisions worked together with Apple to create PingFang, the new system font introduced in OS X 10.11 El Capitan.