Problem extracting text on linux

glor...@trainsolutions.com.ar

Dec 11, 2014, 2:35:44 PM
to pdfne...@googlegroups.com

We have a problem extracting text on Linux.

#pdfinfo test.pdf
Producer:       PDFTron PDFNet
CreationDate:   
ModDate:        
Tagged:         no
Pages:          1
Encrypted:      no
Page size:      595.44 x 841.68 pts
File size:      35733 bytes
Optimized:      no
PDF version:    1.4

#pdffonts test.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Arial-BoldMT                         CID TrueType      no  no  no       8  0
ArialMT                              CID TrueType      no  no  no      14  0

#/usr/bin/pdftotext -raw test.pdf test.txt
(no output)

#vim test.txt
^L

and

#/usr/bin/pdftotext -enc UNICODE  -raw test.pdf test.txt
Error: Couldn't find unicodeMap file for the 'UNICODE' encoding
Error: Couldn't get text encoding

#vim test.txt
^L

#gs -o test2.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress test2.pdf
GPL Ghostscript 9.05 (2012-02-08)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
   **** Warning: Ignoring bad ToUnicode CMap.
   **** Warning: Ignoring bad ToUnicode CMap.
   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> PDFTron PDFNet <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

# pdfinfo test2.pdf
Producer:       GPL Ghostscript 9.05
CreationDate:   Thu Dec 11 16:23:24 2014
ModDate:        Thu Dec 11 16:23:24 2014
Tagged:         no
Pages:          1
Encrypted:      no
Page size:      595.44 x 841.68 pts
File size:      60002 bytes
Optimized:      no
PDF version:    1.4

# /usr/bin/pdftotext test2.pdf test2.txt
Error: Missing language pack for 'Adobe-Japan1' mapping
Error: Missing language pack for 'Adobe-Japan1' mapping
Error: Unknown font tag 'R12'
Error (365): No font in show/space
Error: Unknown font tag 'R15'
Error (482): No font in show/space
Error (538): No font in show/space
Error (570): No font in show/space
Error (590): No font in show/space
Error: Unknown font tag 'R12'
...
Error (4517): No font in show/space
Error (4570): No font in show/space
Error (4598): No font in show/space
Error: Unknown font tag 'R15'
Error (4729): No font in show/space
Error (4822): No font in show/space
Error (4892): No font in show/space

Thanks for your reply.



Ryan

Dec 11, 2014, 9:13:11 PM
to pdfne...@googlegroups.com
It looks like you are running Ghostscript on a PDF created with PDFNet, and Ghostscript is complaining about a bad ToUnicode CMap?

If you open the PDF in a PDF reader such as Evince or Adobe Reader, can you select the text and copy and paste it into a text editor? Does it come out correct?

Did you create the PDF file? If so, what code did you use to create the font and the text run? You might want to take a closer look at the ElementBuilder sample.
https://www.pdftron.com/pdfnet/samplecode.html#ElementBuilder
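
As a quick check alongside the copy and paste test, the pdffonts listing from the first post can be scanned for fonts whose "uni" column is "no", meaning pdffonts did not find a usable ToUnicode map for them. That would fit the Ghostscript warning and the empty pdftotext result. The following is only a rough sketch: parse_pdffonts and FontRow are made-up helper names, the sample text is copied from the post above, and it assumes font names contain no spaces and that the last five columns are emb, sub, uni and the two object ID numbers.

```python
from typing import List, NamedTuple


class FontRow(NamedTuple):
    name: str
    font_type: str
    embedded: bool
    subset: bool
    has_to_unicode: bool


def parse_pdffonts(output: str) -> List[FontRow]:
    """Parse the table printed by pdffonts into one FontRow per font."""
    rows = []
    for line in output.splitlines():
        # Split from the right: emb, sub, uni, object number, generation.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue
        head, emb, sub, uni, _obj, _gen = parts
        # Header and separator lines do not carry yes/no flags, so skip them.
        if not all(flag in ("yes", "no") for flag in (emb, sub, uni)):
            continue
        # Assumption: the font name is the first whitespace-free token.
        name, _, font_type = head.strip().partition(" ")
        rows.append(
            FontRow(name, font_type.strip(), emb == "yes", sub == "yes", uni == "yes")
        )
    return rows


# Sample copied from the pdffonts output in the original post.
SAMPLE = """\
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Arial-BoldMT                         CID TrueType      no  no  no       8  0
ArialMT                              CID TrueType      no  no  no      14  0
"""

if __name__ == "__main__":
    for font in parse_pdffonts(SAMPLE):
        if not font.has_to_unicode:
            print(f"{font.name}: no ToUnicode map reported, text extraction may fail")
```

If every font in the file is flagged like this, the problem is more likely in how the font was created than in the extraction tool.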