List corrupted PDFs (empty files) on DSpace server

Michelangelo Viana

Mar 30, 2016, 8:59:28 AM
to DSpace Technical Support
Hi fellows,

We are on DSpace 5.1 (http://repositorio.pucrs.br/dspace) and need to know how to list corrupted PDFs (empty files) on the DSpace server.
As far as I can see, the Curation Tasks do not offer this kind of report, and the checksum report, even if it lists this kind of error, is too confusing for me to understand.

We automatically load some items (metadata + PDF) into DSpace using the *import* command: the metadata is taken from the ALEPH500 system and the PDF from another system (TEDE2). However, when the PDF cannot be found during the process (e.g., the TEDE2 server is offline), the record is still created (imported) in DSpace with an "empty" PDF. In the UI, the PDF file is only 569 B (569 bytes), and we only learn about the error when users report it to us...

- Can someone give me a clue on how to list this kind of "empty" PDF file?

Thanks in advance,

Michelângelo Viana
PUCRS/Main Library/Brazil
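
For the import problem described above, one option is to check the PDFs before they are handed to the import command. The following is only a minimal sketch, not something proposed in the thread: the function name `list_suspect_pdfs` and the batch path are hypothetical, the 2000-byte default is borrowed from the size filter used later in this thread, and it assumes the placeholder written while TEDE2 is offline is either tiny or does not begin with the `%PDF-` signature.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print "size<TAB>path" for every
# *.pdf under a directory that is smaller than a threshold or lacks the
# "%PDF-" signature at the start of the file.
list_suspect_pdfs() {
    dir=$1
    min_bytes=${2:-2000}   # default borrowed from the thread's "< 2000" filter
    find "$dir" -type f -name '*.pdf' | while IFS= read -r f; do
        size=$(wc -c < "$f" | tr -d ' ')
        # Read the first 5 bytes; drop NUL bytes so the shell can hold them.
        magic=$(head -c 5 "$f" | tr -d '\000')
        if [ "$size" -lt "$min_bytes" ] || [ "$magic" != "%PDF-" ]; then
            printf '%s\t%s\n' "$size" "$f"
        fi
    done
}

# Example call (the path is a placeholder):
# list_suspect_pdfs /path/to/import/batch
```

If the function prints anything, the batch could be held back until the missing PDFs are fetched again, so that no "empty" bitstream is created in the first place.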

helix84

Mar 30, 2016, 9:55:10 AM
to Michelangelo Viana, DSpace Technical Support
Here's the query for DSpace 5:

SELECT metadatavalue.text_value, bitstream.*
FROM bitstream, metadatavalue
WHERE size_bytes = 569
AND metadatavalue.resource_type_id = 0
AND bitstream.bitstream_id = metadatavalue.resource_id
AND metadata_field_id = (
  SELECT metadata_field_id
  FROM metadatafieldregistry mfr, metadataschemaregistry msr
  WHERE mfr.metadata_schema_id = msr.metadata_schema_id
  AND short_id = 'dc'
  AND element = 'title'
  AND qualifier IS NULL
);



Regards,
~~helix84

Compulsory reading: DSpace Mailing List Etiquette
https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette

Monika Mevenkamp

Mar 30, 2016, 11:50:07 AM
to DSpace Tech
Michelângelo

If you are on a Unix system, you can use the find command to list the small files in your assetstore:

cd /dspace/assetstore 
find . -type f -and -size -900c -exec basename {} \;

The names that are printed correspond to the internal IDs of bitstreams.

Given those internal IDs, you can go to the database and look up the information with SQL queries. Personally, I dislike that approach, so I developed a small Ruby gem that interacts with the DSpace API classes, which lets me script these maintenance activities.

Using the gem, I put together a script that reads internal IDs from a file or stdin and prints information about the corresponding bitstreams, one line per internal ID:

96563514287524427390952035236210734474  99343   ITEM.80161  99999/fk4vq36h8m    COLLECTION.87   88435/dsp01x633f104k    COMMUNITY.67    88435/dsp01td96k251d

Have a look at:
- The CLI scripts that use the gem: https://github.com/akinom/dspace-cli
- The script that generates the bitstream report (tsp): dspace-cli/*/statistics/bitstreams_from_internalids.rb

You can run the script (once you have JRuby installed) as follows:

cd dspace-cli
export DSPACE_HOME=/path/to/dspace   # only needed if the install dir differs from /dspace
bundle exec statistics/bitstreams_from_internalids.rb  file_with_bitstream_internal_ids


I hope this helps 

Monika

----
Monika Mevenkamp
Digital Repository Infrastructure Developer
Princeton University
Skype: mo-meven




Michelangelo Viana

Mar 30, 2016, 1:23:08 PM
to DSpace Technical Support, mvia...@gmail.com, hel...@centrum.sk
Great, helix!

That is exactly what I was looking for!
I just changed the "size_bytes" condition from "= 569" to "< 2000" so that your query also lists other problematic files, and found 47 "empty" bitstreams (their sizes vary from 563 to 569 bytes).
To make it easier to open the record in the UI and replace the PDF file, how can I modify your query to also list the handle of the item that contains the bitstream?

All the Best!
Michelangelo

Michelangelo Viana

Mar 30, 2016, 1:26:43 PM
to DSpace Technical Support
Hi Monika,

Thanks for your tip! 
However, when I run your command on our assetstore folder, it lists every "small" file, not only the PDFs. Here it found 7281 files...
As I do not use Ruby here, helix's approach, using an SQL query, was sufficient.
Thanks!

Michelangelo
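
If the file-system route is still of interest, the match could be narrowed from "anything under 900 bytes" to the exact size range reported in this thread (563 to 569 bytes). This is only a sketch under that assumption: `list_by_size_range` is a hypothetical helper name, and other non-PDF files that happen to fall in the same range would still be listed, because the assetstore does not keep file names or extensions.

```shell
#!/bin/sh
# Hypothetical helper: print the base names (DSpace internal ids) of files
# whose size in bytes lies in the inclusive range [min, max].
list_by_size_range() {
    root=$1
    min=$2
    max=$3
    # find's "+N" and "-N" are strict comparisons, hence the -1 / +1.
    find "$root" -type f -size "+$((min - 1))c" -size "-$((max + 1))c" \
        -exec basename {} \;
}

# Example call using the sizes reported in the thread:
# list_by_size_range /dspace/assetstore 563 569
```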

Michelangelo Viana

Mar 30, 2016, 2:58:37 PM
to DSpace Technical Support, mvia...@gmail.com, hel...@centrum.sk

Found it!
Here it is:

SELECT 
handle.handle, metadatavalue.text_value, bitstream.*
FROM 
bitstream, metadatavalue, item2bundle, bundle2bitstream, handle
WHERE 
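handle.resource_id = item2bundle.item_id
AND handle.resource_type_id = 2 -- 2 = item; excludes collection/community handles that share the same numeric id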
AND item2bundle.bundle_id = bundle2bitstream.bundle_id
AND bundle2bitstream.bitstream_id = bitstream.bitstream_id
AND size_bytes < 2000
AND metadatavalue.text_value LIKE '%.pdf'
AND metadatavalue.resource_type_id = 0
AND bitstream.bitstream_id = metadatavalue.resource_id
AND metadata_field_id = (
  SELECT metadata_field_id
  FROM metadatafieldregistry mfr, metadataschemaregistry msr
  WHERE mfr.metadata_schema_id = msr.metadata_schema_id
  AND short_id = 'dc'
  AND element = 'title'
  AND qualifier IS NULL
);

Sample output:

   handle   |           text_value           | bitstream_id | bitstream_format_id | size_bytes |             checksum             | checksum_algorithm |               internal_id               | deleted | store_number | sequence_id
------------+--------------------------------+--------------+---------------------+------------+----------------------------------+--------------------+-----------------------------------------+---------+--------------+-------------
 10923/7372 | 000470274-Texto+Parcial-0.pdf  |        24188 |                   4 |        567 | de7dc67933af74b45852a2215d3aafbf | MD5                | 14750871164671937117165429606699072375  | f       |            0 |           1
 10923/7429 | 000471221-Texto+Completo-0.pdf |        24362 |                   4 |        569 | 24587ca381cdfd0fc569e716afaf764d | MD5                | 152719227089471130678544999837454837065 | f       |            0 |           1
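
To inspect one of the listed files on disk, the internal_id column can be turned into an assetstore path. The sketch below assumes the default DSpace 5 layout, in which the first three pairs of digits of the internal_id name three nested directories, and an assetstore at /dspace/assetstore as in Monika's example; both should be checked against the local installation. `assetstore_path` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: build the on-disk path for a bitstream internal_id,
# assuming the layout <root>/<digits 1-2>/<digits 3-4>/<digits 5-6>/<internal_id>.
assetstore_path() {
    id=$1
    root=${2:-/dspace/assetstore}   # assumed default location
    d1=$(printf '%s' "$id" | cut -c1-2)
    d2=$(printf '%s' "$id" | cut -c3-4)
    d3=$(printf '%s' "$id" | cut -c5-6)
    printf '%s/%s/%s/%s/%s\n' "$root" "$d1" "$d2" "$d3" "$id"
}

# Example with the first internal_id from the sample output:
# assetstore_path 14750871164671937117165429606699072375
# prints /dspace/assetstore/14/75/08/14750871164671937117165429606699072375
```

Looking at the first bytes of such a file (for example with `head -c 200`) should show what was actually stored in place of the PDF.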