Re: [islandora] Extracting file size from FOXML

Serhiy Polyakov

Dec 5, 2012, 3:28:54 PM

to isla...@googlegroups.com

Matthew,

If Fedora GSearch and Solr are installed, you could extract the SIZE
attribute from the FOXML and index it:
<foxml:datastreamVersion … MIMETYPE="image/tiff" SIZE="117917">

Something similar to the following would go in foxmlToSolr.xslt:

<xsl:if test="@ID = 'OBJ'">
  <field>
    <xsl:attribute name="name">
      <xsl:value-of select="concat('fgs_OBJ.', 'SIZE')"/>
    </xsl:attribute>
    <xsl:value-of select="foxml:datastreamVersion[last()]/@SIZE"/>
  </field>
</xsl:if>

Then you will need to re-index the repository. After that you can query
the Solr index to extract all PID and fgs_OBJ.SIZE values, and compare
your target PID list against the extracted one: create a union, sort,
etc. (possibly using MS Access for the manipulations).
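
The final comparison step could also be scripted. Here is a minimal Python sketch, assuming the Solr results have already been joined with the dc:identifier list into (PID, identifier, size) rows. The function name, PIDs, and identifiers below are made up for illustration.

```python
from collections import defaultdict


def pids_to_purge(records):
    """Return the PIDs of every object except the largest one per identifier.

    records: iterable of (pid, identifier, size) tuples; size may be a string.
    """
    groups = defaultdict(list)
    for pid, identifier, size in records:
        groups[identifier].append((int(size), pid))

    purge = []
    for members in groups.values():
        # Largest size first; equal sizes fall back to PID order (descending).
        members.sort(reverse=True)
        # Keep the first (largest) member, mark the rest for purging.
        purge.extend(pid for _, pid in members[1:])
    return sorted(purge)


# Hypothetical rows: (PID, dc:identifier, fgs_OBJ.SIZE)
records = [
    ("demo:1", "id-001", "117917"),
    ("demo:2", "id-001", "90210"),
    ("demo:3", "id-002", "5000"),
    ("demo:4", "id-002", "7000"),
    ("demo:5", "id-003", "42"),
]
print(pids_to_purge(records))
```

Objects with identical sizes are not true "largest" winners, so it is worth reviewing such ties by hand before purging anything.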

Serhiy

On Wed, Dec 5, 2012 at 1:30 PM, Matthew Short <msh...@niu.edu> wrote:
> Hi folks,
>
> We've had several thousand duplicate objects accidentally ingested, and I've
> been trying to figure out how to best identify them for purging. Putting
> together a list of all the PIDs has been the easy part, since the duplicates
> share the same dc:identifier, but I'm not sure how to differentiate between
> the object we want to keep and the duplicates. Ideally, we'd like to sort by
> MIME type and file size, keeping whichever object has the largest TIFF
> datastream, because most of the duplicates should have different file sizes.
> I know that this data is in the FOXML, but it doesn't seem to be stored in
> either the relational database or the triple store. Is there any way to
> extract the file size of all TIFFs, given a list of PIDs? I doubt that it
> will be of any help, but I've attached the result of my SPARQL query for all
> the dupes in one particular collection, including PID, identifier, and
> title.
>
> Thanks in advance for any suggestions! My head is tired from banging it
> against the wall.
>
>
> Matthew Short
> Metadata Librarian
> Northern Illinois University Libraries
> DeKalb, IL 60155
> 815.753.9868
> msh...@niu.edu
>

Don

Dec 5, 2012, 10:17:49 PM

to isla...@googlegroups.com

Hi Matthew:

I've been in similar situations. I'm not sure if I'm answering your question, but I'll share what I would do. I often start by using http://localhost:8080/fedora/search to help identify my set. For instance, were the objects loaded on a particular date/time? What's unique about them: a content model, maybe, or a collection? If you did a batch ingest, the PIDs should be sequential, so you may be able to do a dump and determine the first and last PID in the batch.

Can you combine a date search with a content model or some other limiter?

I use iTQL for this (http://localhost:8080/fedora/risearch). A query like this will find all objects in your repo that have a content model of islandora:pageCModel with a create date after 2012-11-01 and before 2012-12-04 (if you know the time you could include that too). You get the idea.

select $object from <#ri>
where $object <info:fedora/fedora-system:def/model#hasModel> <info:fedora/islandora:pageCModel>
and $object <info:fedora/fedora-system:def/model#createdDate> $created
and $created <mulgara:after>'2012-11-01T00:00:00.000Z'^^<http://www.w3.org/2001/XMLSchema#dateTime> in <#xsd>
and $created <mulgara:before>'2012-12-04T00:00:00.000Z'^^<http://www.w3.org/2001/XMLSchema#dateTime> in <#xsd>
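
If you would rather script this than paste it into the risearch form, here is a rough Python sketch that builds the request URL for the same query. The parameter names (type, lang, format, query) are what I would expect the Fedora 3 risearch form to submit, so treat them as an assumption and check them against your install; the function name is made up.

```python
from urllib.parse import urlencode

# Endpoint as given in the message above.
RISEARCH = "http://localhost:8080/fedora/risearch"
XSD_DATETIME = "<http://www.w3.org/2001/XMLSchema#dateTime>"


def risearch_url(cmodel, after, before, base=RISEARCH):
    """Build a risearch URL for objects of one content model created in a date window."""
    query = (
        "select $object from <#ri> "
        "where $object <info:fedora/fedora-system:def/model#hasModel> "
        f"<info:fedora/{cmodel}> "
        "and $object <info:fedora/fedora-system:def/model#createdDate> $created "
        f"and $created <mulgara:after> '{after}'^^{XSD_DATETIME} in <#xsd> "
        f"and $created <mulgara:before> '{before}'^^{XSD_DATETIME} in <#xsd>"
    )
    # Assumed form parameters; CSV output is convenient for building a PID list.
    params = {"type": "tuples", "lang": "itql", "format": "CSV", "query": query}
    return base + "?" + urlencode(params)


url = risearch_url(
    "islandora:pageCModel",
    "2012-11-01T00:00:00.000Z",
    "2012-12-04T00:00:00.000Z",
)
print(url)
```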

Once you have your set, there are a number of different ways to delete them. You could use a drush or bash script; you may want to test with a few sample PIDs before doing the batch. Let me know if something isn't clear.

Thanks,
Don

Matthew Short

Dec 6, 2012, 1:05:29 PM

to isla...@googlegroups.com

Hi Serhiy,

This sounds promising! I've never worked with Solr before, so it wasn't something I even considered. I added this to the stylesheet that transforms our FOXML to Solr:

<xsl:template match="foxml:datastream[@ID='TIFF']">
  <xsl:param name="prefix">fgs_</xsl:param>
  <xsl:param name="suffix">_s</xsl:param>
  <field>
    <xsl:attribute name="name">
      <xsl:value-of select="concat($prefix,'tiff', $suffix)"/>
    </xsl:attribute>
    <xsl:value-of select="foxml:datastreamVersion/@SIZE"/>
  </field>
</xsl:template>

I ran it on a dozen of our FOXML files, and it seems to be working as expected (hopefully what I expected is what I actually need). Are there any guides to re-indexing Solr? I'm not sure what step to take next. I imagine it won't be a very speedy process.

Thanks,

Matt

Serhiy Polyakov

Dec 6, 2012, 2:18:39 PM

to isla...@googlegroups.com

Matt,

To update the index:
http://myhost:8080/fedoragsearch/rest?operation=updateIndex
Click “updateindex fromFoxmlFiles”. This will update all the objects in
the repository, but will not delete anything else from the index.

To rebuild the index completely:
Stop Tomcat.
In the file system, delete the directory with the old index (solr/data/*).
Start Tomcat.
Click “updateindex fromFoxmlFiles”.

Now, if the repository is in production, reindexing may not be
desirable and may take time. You may try an alternative: run the
extraction outside of the repository with Xalan or another XSLT
processor. For that:

Batch export all objects' FOXML into files.
Batch run the FOXML through a custom XSLT script (similar to the one
you tried) that will extract only PID and TIFF_SIZE.
Work with the resulting lists.

Here is an example for processing one file, but you will need to figure
out how to run it as a batch:

java -Xms512m -Xmx1024m -cp \
[DISTR_XALAN_PATH]/xalan-j_2_7_1/*: \
org.apache.xalan.xslt.Process \
-in [IN_PATH]/id_123.xml \
-xsl [XSLT_SCRIPT_PATH]/my_script.xslt \
-out [OUT_PATH]/id_123_out.xml
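
As a possible alternative to batching Xalan, here is a rough Python sketch that pulls the PID and the latest TIFF size straight out of exported FOXML using the standard library's ElementTree (so this swaps XSLT for a small parser script). It assumes the exported files keep the SIZE attribute on foxml:datastreamVersion, as in the example earlier in the thread. The function names, the sample object, and the default datastream ID are illustrative; pass dsid="TIFF" if that is what your objects use.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

FOXML_NS = {"foxml": "info:fedora/fedora-system:def/foxml#"}


def tiff_size(foxml_text, dsid="OBJ", mimetype="image/tiff"):
    """Return (PID, size of the latest version of the given datastream) or (PID, None)."""
    root = ET.fromstring(foxml_text)
    pid = root.get("PID")
    for ds in root.findall("foxml:datastream", FOXML_NS):
        if ds.get("ID") != dsid:
            continue
        versions = ds.findall("foxml:datastreamVersion", FOXML_NS)
        # The last datastreamVersion element is treated as the current one.
        if versions and versions[-1].get("MIMETYPE") == mimetype:
            size = versions[-1].get("SIZE")
            return (pid, int(size) if size is not None else None)
    return (pid, None)


def scan_dir(directory, **kwargs):
    """Yield (PID, size) for every *.xml file in a directory of exported FOXML."""
    for path in sorted(Path(directory).glob("*.xml")):
        yield tiff_size(path.read_text(encoding="utf-8"), **kwargs)


# Made-up sample object with two versions of the OBJ datastream.
SAMPLE = """<foxml:digitalObject xmlns:foxml="info:fedora/fedora-system:def/foxml#" PID="demo:1">
  <foxml:datastream ID="OBJ">
    <foxml:datastreamVersion ID="OBJ.0" MIMETYPE="image/tiff" SIZE="100"/>
    <foxml:datastreamVersion ID="OBJ.1" MIMETYPE="image/tiff" SIZE="117917"/>
  </foxml:datastream>
</foxml:digitalObject>"""

print(tiff_size(SAMPLE))
```

Calling scan_dir on the export directory would then give one (PID, size) pair per file, ready to be written out as a list.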

Serhiy Polyakov
University of North Texas


jy

Dec 6, 2012, 3:58:15 PM

to isla...@googlegroups.com

Matt/Serhiy,

I have a Tuque script that I use for re-indexing by collection PID. Assuming you have GSearch set up, you can just give it a PID and it will do the work for you. I don't have a GitHub account working yet, so if you're interested let me know and I'll post it.

John

Matthew Short

Dec 6, 2012, 5:22:43 PM

to isla...@googlegroups.com

John, I don't know what a tuque script is, but it sounds wonderful. I'd be very interested.

Thanks,

Matt

jy

Dec 7, 2012, 12:23:04 PM

to isla...@googlegroups.com

I have set up a GitHub repository. The script is in the scripts directory:

https://github.com/jyobb/islandora

John
