Matthew,
If Fedora GSearch and Solr are installed, you could try extracting the
datastream size from the FOXML and indexing it. Each datastream version
records its size, e.g.:
<foxml:datastreamVersion … MIMETYPE="image/tiff" SIZE="117917">
Something like the following would go in foxmlToSolr.xslt:
<xsl:if test="@ID = 'OBJ'">
  <field>
    <xsl:attribute name="name">
      <xsl:value-of select="concat('fgs_OBJ.', 'SIZE')"/>
    </xsl:attribute>
    <xsl:value-of select="foxml:datastreamVersion[last()]/@SIZE"/>
  </field>
</xsl:if>
Then you will need to re-index the repository. After that, you can query
the Solr index to extract the PID and fgs_OBJ.SIZE values for all objects,
and match your target PID list against the extracted one: create the
union, sort, and so on (MS Access could be used for these manipulations).
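The list manipulation could also be scripted instead of done in Access. A
minimal sketch, assuming you have already pulled the Solr results into a
list of records with PID, dc.identifier, and fgs_OBJ.SIZE (the field names
and the pick_keepers helper are illustrative, not part of GSearch):

```python
from collections import defaultdict

def pick_keepers(docs):
    """Group records by dc:identifier; keep the PID whose OBJ datastream
    is largest, and mark every other PID in the group for purging."""
    groups = defaultdict(list)
    for d in docs:
        groups[d["dc.identifier"]].append(d)
    keep, purge = [], []
    for ident, members in groups.items():
        # Largest datastream first; SIZE comes back from Solr as a string.
        members.sort(key=lambda d: int(d["fgs_OBJ.SIZE"]), reverse=True)
        keep.append(members[0]["PID"])
        purge.extend(d["PID"] for d in members[1:])
    return keep, purge
```

The purge list is then what you would feed to your delete script.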
Serhiy
On Wed, Dec 5, 2012 at 1:30 PM, Matthew Short <
msh...@niu.edu> wrote:
> Hi folks,
>
> We've had several thousand duplicate objects accidentally ingested, and I've
> been trying to figure out how to best identify them for purging. Putting
> together a list of all the PIDs has been the easy part, since the duplicates
> share the same dc:identifier, but I'm not sure how to differentiate between
> the object we want to keep and the duplicates. Ideally, we'd like to sort by
> MIME type and file size, keeping whichever object has the largest TIFF
> datastream, because most of the duplicates should have different file sizes.
> I know that this data is in the FOXML, but it doesn't seem to be stored in
> either the relational database or the triple store. Is there any way to
> extract the file size of all TIFFs, given a list of PIDs? I doubt that it
> will be of any help, but I've attached the result of my SPARQL query for all
> the dupes in one particular collection, including PID, identifier, and
> title.
>
> Thanks in advance for any suggestions! My head is tired from banging it
> against the wall.
>
>
> Matthew Short
> Metadata Librarian
> Northern Illinois University Libraries
> DeKalb, IL 60155
> 815.753.9868
> msh...@niu.edu