!!! OutOfMemoryError !!! in filter-media for XLSX files DSpace 8.0

Manuela Ferreira

Dec 22, 2025, 2:08:55 PM
to DSpace Technical Support

Hello,

An OutOfMemoryError occurs during filter-media execution when processing ~150 MB XLSX files.  It seems that the error occurs while the system is attempting to extract the text from the files.

I have already enabled textextractor.use-temp-file = true and increased the Java (JVM) memory as shown below, but the issue persists.

Environment="JAVA_OPTS=-Xmx12096M -Xms6024M -XX:MaxMetaspaceSize=2024M -Dfile.encoding=UTF-8"


The filter-media configuration from my dspace.cfg is below:
#### Media Filter / Format Filter plugins (through PluginService) ####
# Media/Format Filters help to full-text index content or
# perform automated format conversions

#Names of the enabled MediaFilter or FormatFilter plugins
filter.plugins = Text Extractor
filter.plugins = JPEG Thumbnail
filter.plugins = PDFBox JPEG Thumbnail


# [To enable Branded Preview]: uncomment and insert the following into the plugin list
#                Branded Preview JPEG, \

# [To enable ImageMagick Thumbnail]:
#    remove "JPEG Thumbnail" from the plugin list
#    uncomment and insert the following line into the plugin list
#                ImageMagick Image Thumbnail, ImageMagick PDF Thumbnail, \
# [To enable ImageMagick Video Thumbnails (requires both ImageMagick and ffmpeg installed)]:
#    uncomment and insert the following line into the plugin list
#                ImageMagick Video Thumbnail, \
#    NOTE: pay attention to the ImageMagick policies and resource limits in its policy.xml
#          configuration file. The limits may have to be increased if a "cache resources
#          exhausted" error is thrown.

#Assign 'human-understandable' names to each filter
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.TikaTextExtractionFilter = Text Extractor
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.JPEGFilter = JPEG Thumbnail
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.BrandedPreviewJPEGFilter = Branded Preview JPEG
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.PDFBoxThumbnail = PDFBox JPEG Thumbnail
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter = ImageMagick Image Thumbnail
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter = ImageMagick PDF Thumbnail
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.ImageMagickVideoThumbnailFilter = ImageMagick Video Thumbnail

#Configure each filter's input format(s)
# NOTE: The TikaTextExtractionFilter can support any file formats that are supported by Apache Tika. So, you can easily
# add additional formats to your DSpace Bitstream Format registry and list them here. The current list of Tika supported
# formats is available at: https://tika.apache.org/2.3.0/formats.html
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = Adobe PDF
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = CSV
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = HTML
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = Microsoft Excel
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = Microsoft Excel XML
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = Microsoft Powerpoint
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = Microsoft Powerpoint XML
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = Microsoft Word
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = Microsoft Word XML
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = OpenDocument Presentation
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = OpenDocument Spreadsheet
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = OpenDocument Text
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = RTF
filter.org.dspace.app.mediafilter.TikaTextExtractionFilter.inputFormats = Text
filter.org.dspace.app.mediafilter.JPEGFilter.inputFormats = BMP, GIF, JPEG, PNG
filter.org.dspace.app.mediafilter.BrandedPreviewJPEGFilter.inputFormats = BMP, GIF, JPEG, PNG
filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, PNG, JPG, TIFF, JPEG, JPEG 2000
filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = Adobe PDF
filter.org.dspace.app.mediafilter.ImageMagickVideoThumbnailFilter.inputFormats = Video MP4
filter.org.dspace.app.mediafilter.PDFBoxThumbnail.inputFormats = Adobe PDF

#Publicly accessible thumbnails of restricted content.
#List the MediaFilter names that would get publicly accessible permissions
#Any media filters not listed will instead inherit the permissions of the parent bitstream
#filter.org.dspace.app.mediafilter.publicPermission = JPEGFilter


I need help with this.
Thanks in advance

Manuela Klanovicz Ferreira

Michael Plate

5:36 AM
to dspac...@googlegroups.com
Hi Manuela,

On 22.12.25 at 20:08, Manuela Ferreira wrote:
> Hello,
>
> An *OutOfMemoryError* occurs during *filter-media* execution when
> processing *~150 MB XLSX files*.  It seems that the error occurs while
> the system is attempting to extract the text from the files.
>
> I have already enabled textextractor.use-temp-file = true and increased
> the Java (JVM) memory as shown below, but the issue persists.
>
> Environment="JAVA_OPTS=-Xmx12096M -Xms6024M -XX:MaxMetaspaceSize=2024M -Dfile.encoding=UTF-8"
[…]
Try setting JAVA_OPTS directly in the bin/dspace script, just to verify whether the problem lies in the environment settings or not.

Michael
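
A quick way to run the check suggested above without editing the script is to export the options in the shell that starts filter-media. This is only a sketch under assumptions: the install path /dspace is hypothetical, and it assumes the bin/dspace launcher uses a JAVA_OPTS value already present in the environment rather than overwriting it. The heap values are the ones from the original post.

```shell
# Hypothetical install location; point this at your own [dspace] directory.
DSPACE_DIR="${DSPACE_DIR:-/dspace}"

# Same JVM options as in the original post, exported so that the JVM started
# by the command-line launcher can see them (assumption: bin/dspace reads
# JAVA_OPTS from the environment).
export JAVA_OPTS="-Xmx12096M -Xms6024M -XX:MaxMetaspaceSize=2024M -Dfile.encoding=UTF-8"

# Show the value the launcher will inherit.
echo "JAVA_OPTS=$JAVA_OPTS"

# Run only the text extraction filter, verbosely, if the launcher exists.
if [ -x "$DSPACE_DIR/bin/dspace" ]; then
  "$DSPACE_DIR/bin/dspace" filter-media -v -p "Text Extractor"
else
  echo "dspace launcher not found at $DSPACE_DIR/bin/dspace" >&2
fi
```

If the error disappears when run this way, the options set in the service's Environment= line were probably not reaching the JVM that runs filter-media.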
