Indexing PDF files

JLChip

Jan 4, 2011, 7:54:50 AM
to Constellio
I am indexing different types of PDF files: with and without metadata,
with and without security, and produced by different versions of Adobe
PDFWriter. So far I have only managed to index files in version 1.4.
Can anyone tell me what restrictions apply to PDF files?
Likewise, when a file has metadata and is password protected, the
title of the document is extracted incorrectly. (This does not happen
with the GSA.)
Thanks in advance for any help.

JLChip

Jan 12, 2011, 5:37:29 AM
to Constellio
As indicated, here are some PDF files that I have tried. Some of them
cause problems and some do not.
Some of the files are protected and some are not (password = password).
You can download them from:
http://www.mediafire.com/?2mieoz1pzx9wq
The results differ from file to file. This is what the search returns:

Ficheros PDF para Constellio
http://www1.intranet.gc/publico/prueba/
Free text tags : pdf; constellio; prueba; protegidos

http://www1.intranet.gc/publico/prueba/constellio-03P_Acrobat_6_pdfwriter_1_5.pdf
http://www1.intranet.gc/publico/prueba/constellio-03P_Acrobat_6_pdfwriter_1_5.pdf
Free text tags : None

CONSTELLIO-03P Versión 1.5 PDFWriter
http://www1.intranet.gc/publico/prueba/constellio-03A_Acrobat_6_pdfwriter_1_5.pdf
Free text tags : CONSTELLIO-03;Lorem ipsum;protegido

constellio-03P2_Acrobat_6_pdfwriter_1_4
http://www1.intranet.gc/publico/prueba/constellio-03A_Acrobat_6_pdfwriter_1_4.pdf
Free text tags : CONSTELLIO-03;Lorem ipsum

SI 'tEáb*—BÉ iïÞ€w8ôÜÅ>ûVAÅ Úuû3ŒÝ1»¡
http://www1.intranet.gc/publico/prueba/constellio-03P_Acrobat_6_pdfwriter_1_4.pdf
Free text tags : si'0 TeÁB —BÉv{üð‘`:¶ÔÁ ¸d ‹ßhæ"ŽÆ å


Thanks for the help.
Regards

Vincent Dussault

Jan 12, 2011, 8:46:39 AM
to const...@googlegroups.com
Hi JL,

Thank you for the files, we are testing them as I type. 

Regards,

Vincent Dussault


Vincent Dussault

Jan 12, 2011, 2:32:23 PM
to const...@googlegroups.com
Hi JL,

The problem comes from PDFBox, a library used by Apache Tika, which parses fetched content for Constellio. I tried updating both Tika and PDFBox to their latest versions.

Now the files without security (constellio-03A_Acrobat_6_pdfwriter_1_4.pdf and constellio-03A_Acrobat_6_pdfwriter_1_5.pdf) show up when I search for "ipsum".

That is not the case for the files with security: constellio-03P_Acrobat_6_pdfwriter_1_4.pdf and constellio-03P_Acrobat_6_pdfwriter_1_5.pdf.

The title is correctly extracted for the files without security.

I found the problem with the keywords; it will be fixed in the next build. They are now extracted from the files without security.

So content and metadata are extracted correctly when the files don't have security. 
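
For anyone sorting their own test files into the "with security" and "without security" groups described above, here is a minimal sketch using only the Python standard library. It is a byte-level heuristic, not a PDF parser and not what Tika or PDFBox do internally: it reads the version from the `%PDF-x.y` header and looks for an `/Encrypt` key anywhere in the raw bytes. The function name `inspect_pdf` and the two sample byte strings are made up for illustration.

```python
import re


def inspect_pdf(data: bytes) -> dict:
    """Report the header version and whether an /Encrypt key is visible."""
    # The header (for example b"%PDF-1.4") should sit near the start of the file.
    header = re.search(rb"%PDF-(\d+)\.(\d+)", data[:1024])
    version = f"{int(header.group(1))}.{int(header.group(2))}" if header else None
    # \b keeps /Encrypt from matching the longer key /EncryptMetadata.
    encrypted = re.search(rb"/Encrypt\b", data) is not None
    return {"version": version, "encrypted": encrypted}


# Tiny hand-written byte strings standing in for real files (hypothetical data).
plain = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R /Info 2 0 R >>\n%%EOF"
secured = b"%PDF-1.5\ntrailer\n<< /Root 1 0 R /Encrypt 3 0 R >>\n%%EOF"

print(inspect_pdf(plain))    # {'version': '1.4', 'encrypted': False}
print(inspect_pdf(secured))  # {'version': '1.5', 'encrypted': True}
```

For a real file, pass `open(path, "rb").read()`. A false positive is possible if the text `/Encrypt` happens to appear in uncompressed page content, so treat the result as a hint only.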

Here is the stack trace I get when I try to index the files with security:

 WARN [http-8080-4] (PDFParser.java:182) - Parsing Error, Skipping Object
java.io.IOException: Error: Expected an integer type, actual='ŽRª çíŠq2ä)/Title(SI 'tEáb*—BÉ iïހw8ôÜÅ>ûVAÅ Úuû3Œà 1»¡'
at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1384)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:499)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:881)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:846)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:150)
at com.doculibre.constellio.feedprotocol.FeedProcessor.asContentParse(FeedProcessor.java:282)
at com.doculibre.constellio.feedprotocol.FeedProcessor.asRecord(FeedProcessor.java:344)
at com.doculibre.constellio.feedprotocol.FeedProcessor.addRecord(FeedProcessor.java:209)
at com.doculibre.constellio.feedprotocol.FeedProcessor.processRecord(FeedProcessor.java:124)
at com.doculibre.constellio.feedprotocol.FeedProcessor.processFeed(FeedProcessor.java:92)
at com.doculibre.constellio.feedprotocol.FeedServlet.doPost(FeedServlet.java:118)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:637)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at com.doculibre.constellio.filters.LocalRequestFilter.doFilter(LocalRequestFilter.java:64)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Unknown Source)
 WARN [http-8080-4] (PDFParser.java:182) - Parsing Error, Skipping Object
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@199d3fa
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:530)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:881)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:846)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:150)
at com.doculibre.constellio.feedprotocol.FeedProcessor.asContentParse(FeedProcessor.java:282)
at com.doculibre.constellio.feedprotocol.FeedProcessor.asRecord(FeedProcessor.java:344)
at com.doculibre.constellio.feedprotocol.FeedProcessor.addRecord(FeedProcessor.java:209)
at com.doculibre.constellio.feedprotocol.FeedProcessor.processRecord(FeedProcessor.java:124)
at com.doculibre.constellio.feedprotocol.FeedProcessor.processFeed(FeedProcessor.java:92)
at com.doculibre.constellio.feedprotocol.FeedServlet.doPost(FeedServlet.java:118)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:637)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at com.doculibre.constellio.filters.LocalRequestFilter.doFilter(LocalRequestFilter.java:64)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Unknown Source)
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@18059e6
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:150)
at com.doculibre.constellio.feedprotocol.FeedProcessor.asContentParse(FeedProcessor.java:282)
at com.doculibre.constellio.feedprotocol.FeedProcessor.asRecord(FeedProcessor.java:344)
at com.doculibre.constellio.feedprotocol.FeedProcessor.addRecord(FeedProcessor.java:209)
at com.doculibre.constellio.feedprotocol.FeedProcessor.processRecord(FeedProcessor.java:124)
at com.doculibre.constellio.feedprotocol.FeedProcessor.processFeed(FeedProcessor.java:92)
at com.doculibre.constellio.feedprotocol.FeedServlet.doPost(FeedServlet.java:118)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:637)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at com.doculibre.constellio.filters.LocalRequestFilter.doFilter(LocalRequestFilter.java:64)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:946)
at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:108)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
... 26 more
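
The trace above contains two warnings about skipped objects followed by one exception that actually aborts the parse. As a reading aid, here is a small sketch that pulls the innermost `Caused by:` entry and its first frame out of a pasted Java stack trace. The helper name `root_cause` is hypothetical, and the sample text is a shortened excerpt of the trace above.

```python
def root_cause(trace: str):
    """Return (exception, first stack frame) for the innermost 'Caused by:'."""
    lines = [ln.strip() for ln in trace.splitlines() if ln.strip()]
    cause_idx = None
    for i, ln in enumerate(lines):
        if ln.startswith("Caused by:"):
            cause_idx = i  # keep overwriting: the last one is the innermost cause
    if cause_idx is None:
        return None  # no chained cause in this trace
    exception = lines[cause_idx][len("Caused by:"):].strip()
    # The first "at ..." line after the cause is where the exception was thrown.
    frame = next(
        (ln[3:] for ln in lines[cause_idx + 1:] if ln.startswith("at ")), None
    )
    return exception, frame


# Shortened excerpt of the trace above.
sample = """org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@18059e6
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
Caused by: java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:946)
"""

print(root_cause(sample))
```

On this excerpt the helper points at the NullPointerException thrown in `PDPageNode.getCount`, which is reached from `PDDocument.getNumberOfPages` during Tika's metadata extraction.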

Thank you for your help!

Regards,

Vincent Dussault

JLChip

Jan 13, 2011, 2:32:09 PM
to Constellio
Thank you very much for the effort.
It was important to solve the problem for files without security. For
me it is also important that the metadata can be read when security is
enabled, as is usual in my work environment. I had not mentioned that
the files with security use the PDF 1.5 (Acrobat 6) compatibility
level, which is supposed to encrypt all the contents except the
metadata. There is an option in Adobe Acrobat, when you enable security
on a file, that is supposed to guarantee this access. I have a GSA
indexing files with security enabled, and it reads the metadata
perfectly, as well as the content for version 1.5 PDF files.
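
A note on checking that option from the outside. As far as I understand the PDF specification, the "all contents except metadata" choice is recorded as `/EncryptMetadata false` in the file's encryption dictionary, and it covers the XMP metadata stream only; strings in the Info dictionary such as the title would still be encrypted. The sketch below is a raw byte search, not a parser, and the function name and sample dictionaries are hypothetical.

```python
import re


def metadata_stream_exempt(data: bytes) -> bool:
    """True if the raw bytes contain an '/EncryptMetadata false' entry."""
    # Assumption: the encryption dictionary is stored uncompressed,
    # so the key and its value are visible in the raw bytes.
    return re.search(rb"/EncryptMetadata\s+false\b", data) is not None


# Hand-written encryption dictionaries standing in for real files (hypothetical).
exempt = b"<< /Filter /Standard /V 4 /R 4 /EncryptMetadata false >>"
default = b"<< /Filter /Standard /V 2 /R 3 >>"

print(metadata_stream_exempt(exempt))   # True
print(metadata_stream_exempt(default))  # False
```

When the key is absent, metadata encryption is the default, so the second sample reports `False`.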

I hope the updated version will be available soon. Meanwhile I will
continue testing with other content and file sizes.
I am also studying Nutch, and when I have concrete information about
this issue, and only this issue, I will post it here if relevant.

On another note, my potential users would need to be able to use
Constellio in their own language (Spanish, Spain), which would also
make the evaluation of the tool more objective. I am translating the
user interface properties files and deploying them. Although I do not
know Wicket, and this may be difficult because those files are
scattered, I would like to contribute my work to the project if it is
of interest. How can I do that?
Sorry for my English.
Thanks
JLChip
