Errors with tika pdfparser in filter-media when extracting text

Brian Keese

May 22, 2025, 10:51:21 AM
to DSpace Technical Support
Hello,
We have a recurring issue with PDF submissions that cause filter-media to fail when extracting text. Maybe 10% of submissions produce errors like the one below. I tried to upgrade Apache Tika beyond 2.9.2, thinking there might have been a bug fix, but I can't get the build to finish because of dependency conflicts in Tika, and I don't know enough about Maven to get past them. Has anyone solved this, or can anyone suggest a solution?
Thanks,
Brian

2025-05-10 12:07:22.585 INFO filter-media - 139 @ The script has started
2025-05-10 12:07:22.587 INFO filter-media - 139 @ File: Wuli, Diana (DM Cello).pdf.txt
2025-05-10 12:07:22.775 ERROR filter-media - 139 @ ERROR filtering, skipping bitstream:
    Item Handle: 2022/33585
    Bundle Name: ORIGINAL
    File Size: 25537966
    Checksum: 26d2ff8f679b7e4ddca975f3390766fc (MD5)
    Asset Store: 0
    Internal ID: 139753933998022015522955092400058404315
2025-05-10 12:07:22.776 ERROR filter-media - 139 @ Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5c9b293e
    Caused by: java.lang.StringIndexOutOfBoundsException begin 66, end 24, length 90
2025-05-10 12:07:22.779 INFO filter-media - 139 @ SKIPPED: bitstream 67dabaac-db02-4371-88ea-1129e41e4e2e (item: 2022/33585) because 'Wuli, Diana (DM Cello).pdf.jpg' already exists
2025-05-10 12:07:22.789 INFO filter-media - 139 @ The script has completed
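
To put a number on how many bitstreams are affected, a log scan along these lines may help. This is only a rough sketch in Python and not part of DSpace: the function name `summarize_failures` is made up for illustration, and the patterns assume entries shaped like the excerpt above.

```python
import re
from collections import Counter

# One failure entry, as it appears in the excerpt above. \s* lets the fields
# sit on the same line as the message or on the lines that follow it.
FAILURE = re.compile(
    r"ERROR filtering, skipping bitstream:\s*"
    r"Item Handle:\s*(?P<handle>\S+)\s*"
    r"Bundle Name:\s*(?P<bundle>\S+)"
)
# The exception class named after "Caused by:".
CAUSE = re.compile(r"Caused by:\s*(?P<exc>[\w.$]+)")


def summarize_failures(log_text):
    """Return (item handles that failed, Counter of exception class names)."""
    handles = [m.group("handle") for m in FAILURE.finditer(log_text)]
    causes = Counter(m.group("exc") for m in CAUSE.finditer(log_text))
    return handles, causes


# Sample input copied from the log excerpt above (shortened).
sample = (
    "2025-05-10 12:07:22.775 ERROR filter-media - 139 @ ERROR filtering, "
    "skipping bitstream: Item Handle: 2022/33585 Bundle Name: ORIGINAL "
    "File Size: 25537966\n"
    "2025-05-10 12:07:22.776 ERROR filter-media - 139 @ Unexpected "
    "RuntimeException from org.apache.tika.parser.pdf.PDFParser@5c9b293e "
    "Caused by: java.lang.StringIndexOutOfBoundsException begin 66, end 24, "
    "length 90\n"
)

handles, causes = summarize_failures(sample)
print(handles)
print(causes.most_common())
```

For a real run, the sample string would be replaced by the text of whichever log file filter-media writes to on your installation; that location is site-specific and not shown here.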

mw...@iu.edu

May 22, 2025, 11:32:41 AM
to dspac...@googlegroups.com
You are not alone. We have hundreds of these. Other PDF tools have
no problem with those files.

--
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
library.indianapolis.iu.edu

Brian Keese

May 22, 2025, 3:04:17 PM
to DSpace Technical Support
More information: in a test on a single file just now, I changed "textextractor.use-temp-file = true" to "textextractor.use-temp-file = false" in dspace.cfg, and the PDF text was then parsed successfully. I'll dig into the temp file code to see if I can nail down the root cause. I'm guessing something about the parser plug-in interface has changed.

mw...@iu.edu

May 23, 2025, 10:38:01 AM
to dspac...@googlegroups.com
On Thu, May 22, 2025 at 07:04:17PM +0000, Keese, Brian W wrote:
> More information... in my test sample of one, just now, I changed "textextractor.use-temp-file = true" to "textextractor.use-temp-file = false" in dspace.cfg and then the pdf text was parsed successfully. I'll dig into the temp file code to see if I can nail down the root cause. I'm guessing something about the parser plug-in interface has changed.

Interesting. I may try that.

More data: I fetched tika-app 3.1.0 and opened one of the offending
files. It warns twice about "Empty COSName at offset blah" but has no
trouble reading the file or displaying content.
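
For anyone wanting to repeat that cross-check on a batch of files, here is a rough sketch in Python that drives the standalone tika-app jar in plain-text mode. The jar location, the file name, and both function names are placeholders for illustration; `--text` is tika-app's plain-text output option. Keep in mind this exercises a different Tika version than the one in the DSpace build, so a clean result only shows the file itself is readable.

```python
import subprocess


def tika_text_command(jar_path, pdf_path):
    """Build the command line for tika-app's plain-text extraction mode."""
    return ["java", "-jar", str(jar_path), "--text", str(pdf_path)]


def extract_text(jar_path, pdf_path, timeout=120):
    """Run tika-app on one file and return (exit code, stdout, stderr).

    Both output streams are captured so that any warnings or stack traces
    the tool writes can be inspected next to the extracted text.
    """
    result = subprocess.run(
        tika_text_command(jar_path, pdf_path),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr


# Example call (placeholder paths, needs Java and the jar to be present):
# code, text, messages = extract_text("tika-app-3.1.0.jar", "offending.pdf")
```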


Brian Keese

May 23, 2025, 12:14:28 PM
to DSpace Technical Support
I was not able to figure out what goes wrong in the parsing when use-temp-file is set to true. I did confirm that it doesn't matter whether max-chars is in effect (settings of 100000 and -1 yield the same results).

Maybe the best bet is to find a way to upgrade the Tika version in the build, but I don't know how to get past the dependency conflicts. I found this relevant (but older) ticket, though I don't know how to apply the fix: https://issues.apache.org/jira/browse/TIKA-2598

Brian Keese

May 25, 2025, 3:59:07 PM
to DSpace Technical Support

Brian Keese

May 25, 2025, 4:31:11 PM
to DSpace Technical Support
And this is the commit for 7x: https://github.com/DSpace/DSpace/commit/930565effff8b46cfda9e8b3906ecbbee32227f7

mw...@iu.edu

May 27, 2025, 1:23:07 PM
to dspac...@googlegroups.com
On Sun, May 25, 2025 at 08:31:11PM +0000, Keese, Brian W wrote:
> And this is the commit for 7x: https://github.com/DSpace/DSpace/commit/930565effff8b46cfda9e8b3906ecbbee32227f7

Well spotted! That fixed it here.
