Location of the extracted text bitstream file

Sean Carte

Nov 19, 2021, 7:40:16 AM
to DSpace Technical Support
I'm trying to retrieve the extracted text bitstreams associated with items. Is there a way to get a list of them from the database?

So far, I've only been able to generate a list of all bitstreams with:

SELECT i.item_id, i.last_modified, i.owning_collection, b.internal_id, t.text_value AS title
FROM item i
  JOIN item2bundle i2b
    ON i.item_id = i2b.item_id
  JOIN bundle2bitstream b2b
    ON b2b.bundle_id = i2b.bundle_id
  JOIN bitstream b
    ON b.bitstream_id = b2b.bitstream_id
  -- d: the date value used to restrict the range
  JOIN metadatavalue d
    ON d.resource_id = i.item_id
  -- t: the title value
  JOIN metadatavalue t
    ON t.resource_id = i.item_id
WHERE i.in_archive = 't' AND i.withdrawn = 'f' AND i.discoverable = 't'
  AND d.metadata_field_id = 11 AND d.text_value >= '2021-01' AND d.text_value < '2021-12'
  AND t.metadata_field_id = 64
ORDER BY i.owning_collection;

That gives me a list including the internal_id, which I can use to determine where the file is in the assetstore:
77274565375792968793874045792320511138 = /dspace/assetstore/77/27/45/77274565375792968793874045792320511138
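
The mapping above can be sketched as a small shell function. This is only an illustration of the pattern visible in the example (three two-digit directory levels taken from the start of the internal_id); the function name `assetstore_path` is made up here, and the `/dspace/assetstore` root is assumed from the path shown above.

```shell
#!/bin/sh
# Hypothetical helper: derive the assetstore path for a bitstream internal_id.
# Assumes three two-digit directory levels (as in the example above) and an
# assetstore rooted at /dspace/assetstore unless a second argument is given.
assetstore_path() {
    id="$1"
    root="${2:-/dspace/assetstore}"
    d1=$(printf '%s' "$id" | cut -c1-2)   # characters 1-2
    d2=$(printf '%s' "$id" | cut -c3-4)   # characters 3-4
    d3=$(printf '%s' "$id" | cut -c5-6)   # characters 5-6
    printf '%s/%s/%s/%s/%s\n' "$root" "$d1" "$d2" "$d3" "$id"
}

assetstore_path 77274565375792968793874045792320511138
```

For the internal_id above, this prints `/dspace/assetstore/77/27/45/77274565375792968793874045792320511138`.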

But I've noticed some gaps. For example, id 4117 has both a PDF and an extracted text bitstream, but its assetstore directory contains only the PDF:
$ ls /dspace/assetstore/77/27/45/
77274565375792968793874045792320511138
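
Since every row in the bitstream table carries its own internal_id, one rough way to check whether a file for a given internal_id exists anywhere under the assetstore is to search the whole tree by file name. The helper name `find_bitstream` is hypothetical, and the sketch assumes files are named exactly after their internal_id, as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: look for a file named after an internal_id anywhere
# under an assetstore root. Prints the full path of each match, or nothing.
# Example call (root assumed from the paths above):
#   find_bitstream /dspace/assetstore 77274565375792968793874045792320511138
find_bitstream() {
    root="$1"
    id="$2"
    find "$root" -type f -name "$id"
}
```

An empty result for an internal_id returned by the query would suggest the file is missing from that assetstore rather than merely stored in a different directory.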

How can I determine the location of the associated text extract bitstream for that item?

Sean

DSpace version:  CRIS-5.10.0-SNAPSHOT
  SCM revision:  67e7d010e7eda86925980b2a43581b9d4f4929a3
    SCM branch:  dspace-5_x_x-cris
            OS:  Linux(amd64) version 4.4.0-210-generic
  Applications:
     Discovery:  enabled.
           JRE:  Private Build version 1.8.0_292
   Ant version:  Apache Ant(TM) version 1.9.6 compiled on July 20 2018
 Maven version:  3.3.9