[Dspace-tech] error running filter-media script


Jewel

Aug 25, 2015, 12:30:34 PM
to dspace-tech@lists.sourceforge.net Tech
I am running DSpace version 1.5.1 on a Windows 2003 box. We have loaded
very little into our collection. I can't make out what the error means.
Below is the error I receive after running:
dsrun org.dspace.app.mediafilter.MediaFilterManager
/
E:\dspace\bin>dsrun org.dspace.app.mediafilter.MediaFilterManager
Using DSpace installation in: E:\dspace
ERROR filtering, skipping bitstream:

Item Handle: 10425/53
Bundle Name: ORIGINAL
File Size: 11301578
Checksum: 4a6333832dc9b7ee8704b2c0ec735bbe (MD5)
Asset Store: 0
java.io.IOException: Invalid header signature; read 3759996809423114277, expected -2226271756974174256
java.io.IOException: Invalid header signature; read 3759996809423114277, expected -2226271756974174256
    at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:88)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
    at org.textmining.text.extraction.WordExtractor.extractText(WordExtractor.java:48)
    at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:97)
    at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:668)
    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:570)
    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:520)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:488)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:427)
    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:359)
SKIPPED: bitstream 53 because '2007-49southtexaslawreview451.pdf.txt' already exists
SKIPPED: bitstream 55 because '2007-internationaltravellawjournal13.pdf.txt' already exists
Updating search index:
/

--
Jewel


Mark H. Wood

Aug 25, 2015, 12:30:51 PM
to dspac...@lists.sourceforge.net
On Thu, Mar 05, 2009 at 03:58:52PM -0600, Jewel wrote:
> I am running DSpace version 1.5.1 on a Windows 2003 box. We have loaded
> very little into our collection. I can't make out what the error means.
> Below is the error I receive after running: dsrun
> org.dspace.app.mediafilter.MediaFilterManager
> /
> E:\dspace\bin>dsrun org.dspace.app.mediafilter.MediaFilterManager
> Using DSpace installation in: E:\dspace
> ERROR filtering, skipping bitstream:
>
> Item Handle: 10425/53
> Bundle Name: ORIGINAL
> File Size: 11301578
> Checksum: 4a6333832dc9b7ee8704b2c0ec735bbe (MD5)
> Asset Store: 0
> java.io.IOException: Invalid header signature; read 3759996809423114277,
> expected -2226271756974174256
> java.io.IOException: Invalid header signature; read 3759996809423114277,
> expected -2226271756974174256
> at
> org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:88)

It sure would be nice if the message indicated which bitstream had the
problem, no? It appears that one of the bitstreams attached to the item
with handle 10425/53 is either a corrupt MS Office document, or is not
an MS Office document at all but DSpace believes it is one. (POI is the
library that DSpace uses to extract text from MS Word documents.)

If there is only one Office document attached to that item, that is the
culprit. If there is more than one, examine each until you find the
problematic one. If there are no bitstreams that should be treated as
Office documents, check the associated format of each bitstream to see
if it matches the content type you would expect.
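
The two numbers in the exception can be decoded to see what the file
actually starts with. A minimal sketch in Python, assuming POI reports
the first eight bytes of the file as a little-endian signed 64-bit
integer (the helper name signature_bytes is made up for illustration):

```python
import struct

# The 8-byte magic number at the start of an OLE2 file (legacy .doc/.xls/.ppt).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def signature_bytes(value: int) -> bytes:
    """Convert a signed 64-bit number from the error message back to raw bytes.

    Assumption: the number is the first 8 bytes of the file interpreted
    as a little-endian signed long.
    """
    return struct.pack("<q", value)


read_sig = signature_bytes(3759996809423114277)       # "read" value from the error
expected_sig = signature_bytes(-2226271756974174256)  # "expected" value from the error

print(read_sig)                    # what the bitstream really begins with
print(expected_sig == OLE2_MAGIC)  # True if POI expected an OLE2 document
```

If the assumption holds, the bytes read decode to %PDF-1.4, which would
mean the bitstream is a PDF whose registered format is Microsoft Word,
i.e. the "not an MS Office document at all" case rather than a corrupt
file.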

--
Mark H. Wood, Lead System Programmer mw...@IUPUI.Edu
Friends don't let friends publish revisable-form documents.

Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]

Aug 25, 2015, 12:30:54 PM
to Mark H. Wood, dspac...@lists.sourceforge.net

If there is more than one document for this handle, you can identify which one it is by comparing the file size reported in the error with the "size_bytes" column in the bitstream table. Below is a SQL query I use to list information from the bitstream, bundle, item, and handle tables when I know one piece of information (say, the handle) and don't know the rest; you can modify the query as needed. It's useful in this situation for listing the rows in the bitstream table when you only know the handle. Hope this helps.

Sue

 

select bi.* from
    bitstream bi
  , bundle2bitstream b2b
  , bundle bu
  , item2bundle i2b
  , item it
  , handle ha
where ha.resource_id = it.item_id
  and it.item_id = i2b.item_id
  and i2b.bundle_id = bu.bundle_id
  and bu.bundle_id = b2b.bundle_id
  and b2b.bitstream_id = bi.bitstream_id
  and ha.handle = '2121/169402'
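
The same join can be narrowed to the failing bitstream by also matching
the file size from the error output. A rough sketch in Python, run
against an in-memory SQLite database rather than a real DSpace database:
the schema is a cut-down stand-in holding only the columns the query
touches, and the sample rows and bitstream IDs (100, 101) are made up.

```python
import sqlite3

# Stand-in schema: only the columns used by the join. Real DSpace tables
# have many more columns; this is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE handle (handle TEXT, resource_id INTEGER);
    CREATE TABLE item (item_id INTEGER);
    CREATE TABLE item2bundle (item_id INTEGER, bundle_id INTEGER);
    CREATE TABLE bundle (bundle_id INTEGER);
    CREATE TABLE bundle2bitstream (bundle_id INTEGER, bitstream_id INTEGER);
    CREATE TABLE bitstream (bitstream_id INTEGER, size_bytes INTEGER);

    -- Made-up sample data: one item with two bitstreams.
    INSERT INTO handle VALUES ('10425/53', 1);
    INSERT INTO item VALUES (1);
    INSERT INTO bundle VALUES (10);
    INSERT INTO item2bundle VALUES (1, 10);
    INSERT INTO bitstream VALUES (100, 11301578), (101, 2048);
    INSERT INTO bundle2bitstream VALUES (10, 100), (10, 101);
""")

# Same join as the query above, plus a filter on the reported file size.
QUERY = """
select bi.* from
    bitstream bi, bundle2bitstream b2b, bundle bu,
    item2bundle i2b, item it, handle ha
where ha.resource_id = it.item_id
  and it.item_id = i2b.item_id
  and i2b.bundle_id = bu.bundle_id
  and bu.bundle_id = b2b.bundle_id
  and b2b.bitstream_id = bi.bitstream_id
  and ha.handle = ?
  and bi.size_bytes = ?
"""

# Handle and file size are the values printed in the error output.
rows = conn.execute(QUERY, ("10425/53", 11301578)).fetchall()
print(rows)
```

Against a real installation the two extra lines of the where clause are
the only part that matters; the handle and size come straight from the
"Item Handle" and "File Size" lines of the error.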
