Querying items by metadata item via SOLR and REST


Ilja Sidoroff

Sep 1, 2016, 6:43:10 AM
to dspac...@googlegroups.com
Hello,

I am using DSpace 5.5.

Am I correct that SOLR queries return only items that are in
*collections* and not those in the *workflow*? At least my search
attempts indicate that.

In the REST API, it seems that GET /items likewise returns only
items that are in collections. However, with POST
/items/find-by-metadata-field I can get all items in DSpace, both
those in collections and those in the workflow?

What I need is a list of *all items* (both in the workflow and in
collections) that have a certain metadata field set, together with
*the value of that field*. I don't see any other way of doing that
except by a direct SQL query to the database. I have one for 5.x, but
I'm not happy with it, since I will need to update it for 6.x etc. Is
there any other way of doing this?

Also, it seems that

dspace import -d -m mapfile ...

does not delete items currently in the workflow? Is this intentional or a bug?

regards,

Ilja Sidoroff
University of Eastern Finland

Monika Mevenkamp

Sep 1, 2016, 11:30:36 AM
to Ilja Sidoroff, DSpace Tech
Hi Ilja 

I have a script that, given a metadata field, e.g. pu.workflow.state, produces a tab-separated list like this:

field   id      handle  value
pu.workflow.state       969     99999/fk4w099v32        approved
pu.workflow.state       903     null    emailed
pu.workflow.state       753     null    emailed
pu.workflow.state       752     null    emailed
pu.workflow.state       902     null    orphaned


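The script is written in JRuby and based on my dspace-jruby gem; see the script at https://github.com/akinom/dspace-cli/blob/master/metadata/list_values.rb
The gem and the script are both available from GitHub: the jrdspace gem at https://github.com/akinom/dspace-jruby and cli-dspace at https://github.com/akinom/dspace-cli, which has a bunch of other scripts.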

The script is quite small; its 'action' is in the doit method:

    def doit(metadata_field)
      # header row
      puts ['field', 'id', 'handle', 'value'].join("\t")
      dsos = DSpace.findByMetadataValue(metadata_field, nil, DConstants::ITEM)
      dsos.each do |dso|
        vals = dso.getMetadataByMetadataString(metadata_field).collect { |v| v.value }
        # items without a handle (e.g. still in the workflow) print "null"
        puts [metadata_field, dso.getID, dso.getHandle.nil? ? "null" : dso.getHandle, vals].join("\t")
      end
    end

If you want to try this out, there are instructions on GitHub. If you want to work in Java, look at the implementation of the DSpace.findByMetadataValue method, which contains the SQL statement: https://github.com/akinom/dspace-jruby/blob/master/lib/dspace/dspace.rb#L150-L171

Monika

Monika Mevenkamp
Digital Repository Infrastructure Developer
Princeton University
Skype: mo-meven



-- 
You received this message because you are subscribed to the Google Groups "DSpace Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dspace-tech...@googlegroups.com.
To post to this group, send email to dspac...@googlegroups.com.
Visit this group at https://groups.google.com/group/dspace-tech.
For more options, visit https://groups.google.com/d/optout.

Ilja Sidoroff

Sep 1, 2016, 11:48:41 AM
to Monika Mevenkamp, DSpace Tech
Thanks! That script would indeed do what I need, but I'm a bit concerned about scalability, since it will have to do one request per item, and if I have thousands of items, that might get a bit heavy? Or would it? I really don't know how long, for instance, 10,000 item/id/metadata requests would take.

Ilja

Monika Mevenkamp

Sep 1, 2016, 12:05:15 PM
to Ilja Sidoroff, DSpace Tech
Does speed matter? Is this something you'll have to do a lot, or is it one of those one-off scripts?

If you run this on the command line or from cron, it may not be so important. Especially with a cron job you may not care that much, as long as you can start it at midnight and it gets done by 7am.

Calling the JRuby script from the UI, i.e. calling it from Java, is possible, but I have not actually done that yet.

I don't believe that calling Java via JRuby adds much performance overhead.

A bigger issue, I see, is that DSpace.findByMetadataValue returns an array of matching DSpaceObjects - if speed matters this needs to be changed to return an iterator, which shouldn’t be too hard
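
The array-versus-iterator point can be illustrated in plain Ruby. This is only a sketch, not the gem's code: `each_match` and the in-memory `ROWS` are hypothetical stand-ins for a database cursor. With an Enumerator, rows are read only as the caller consumes them.

```ruby
# Hypothetical stand-in for rows coming back from a database query.
ROWS = [
  { id: 969, value: 'approved' },
  { id: 903, value: 'emailed' },
  { id: 753, value: 'emailed' },
  { id: 752, value: 'emailed' }
].freeze

# Yields matching rows one at a time instead of building a full array.
# `touched` records which rows were actually read, to make laziness visible.
def each_match(rows, wanted, touched = [])
  return enum_for(:each_match, rows, wanted, touched) unless block_given?

  rows.each do |row|
    touched << row[:id]
    yield row if row[:value] == wanted
  end
end

touched = []
first = each_match(ROWS, 'emailed', touched).first
puts first[:id]        # id of the first 'emailed' row
puts touched.inspect   # ids of the rows read so far
```

Asking for only the first match stops the iteration early, whereas an array-returning method would have read every row before returning.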

Why not just try it and see? Since the script only reads data and does not change anything, there is no danger of disturbing your instance. Plus you can run this anywhere, as long as you have access to the database.

Monika


Monika Mevenkamp
Digital Repository Infrastructure Developer
Princeton University
Phone: 609-258-4161
Skype: mo-meven



Monika Mevenkamp

Sep 1, 2016, 4:39:00 PM
to Ilja Sidoroff, DSpace Tech
I just ran a test  and timed execution time 

script  4681  items    -> 26.334u  1.829s 0:35.43 79.4%   0+0k  0+ 36io   0pf+0w
script 64065  items    -> 77.505u 16.817s 6:07.68 25.6%   0+0k  1+365io   0pf+0w
jruby+gem+start dspace -> 12.047u  0.525s 0:06.75 186.0%  0+0k 52+ 38io 393pf+0w
dspace database test   ->  6.616u  0.348s 0:03.44 202.0%  0+0k  2+ 15io   0pf+0w

Comparing the time of a regular database test with that of a comparable JRuby script that loads the dspace gem and connects to the DSpace instance (which involves more or less the same actions as testing the database) shows that this costs an extra 5.4 sec of user time and about 0.2 sec of system time.

The second script run processes about 13.7 times as many items as the first, but the real elapsed time (6 min versus 35 sec) is only about 10 times as long; just starting up the Ruby interpreter, loading the gem and starting the DSpace kernel takes almost 7 sec, which explains most of that 'imbalance'.
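
The arithmetic behind that comparison can be checked with a few lines of Ruby. This is a sketch that uses only the elapsed times from the table above; treating the 6.75 sec startup measurement as a fixed cost of each run is an assumption.

```ruby
# Figures copied from the timing table above (elapsed wall-clock seconds).
STARTUP_S = 6.75 # jruby + gem + DSpace kernel start
RUNS = [
  { items: 4_681,  elapsed_s: 35.43 },          # 0:35.43
  { items: 64_065, elapsed_s: 6 * 60 + 7.68 }   # 6:07.68
].freeze

# Ratio between the larger and the smaller run for a given measure.
def ratio(runs)
  small, large = runs.map { |r| yield(r) }
  large / small
end

item_ratio    = ratio(RUNS) { |r| r[:items].to_f }
elapsed_ratio = ratio(RUNS) { |r| r[:elapsed_s] }
work_ratio    = ratio(RUNS) { |r| r[:elapsed_s] - STARTUP_S }

puts format('items:             %.1fx', item_ratio)
puts format('elapsed:           %.1fx', elapsed_ratio)
puts format('elapsed - startup: %.1fx', work_ratio)
RUNS.each do |r|
  per_item_ms = (r[:elapsed_s] - STARTUP_S) / r[:items] * 1000
  puts format('%6d items: %.1f ms per item', r[:items], per_item_ms)
end
```

With the startup cost subtracted, the elapsed-time ratio moves from roughly 10.4 to roughly 12.6, much closer to the 13.7-fold increase in items, and the per-item cost comes out at about 6 ms in both runs.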

Monika


Monika Mevenkamp
Digital Repository Infrastructure Developer
Princeton University
Phone: 609-258-4161
Skype: mo-meven




Ilja Sidoroff

Sep 2, 2016, 4:16:15 AM
to Monika Mevenkamp, DSpace Tech
Yeah, the speed is not that crucial, as long as it stays somewhere in the order of minutes or even a few hours. What I'm doing is transferring items from a CRIS, which doesn't know which items DSpace already has, so I have to prune the records that are already in DSpace. This happens once a day (at night) via cron, so I can live with that speed. It's probably just the little computer scientist in me that had hoped for the most efficient solution.

Thanks for the numbers and testing!
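
The nightly pruning step described above might look roughly like this in Ruby. It is a sketch under assumptions: the listing is the tab-separated output format shown earlier in the thread, while the field name `local.cris.id`, the `cris_id` key and both method names are hypothetical.

```ruby
require 'set'

# Parse tab-separated "field id handle value" lines (header first) and
# return the set of values already present in DSpace.
def known_values(tsv)
  tsv.lines.drop(1).each_with_object(Set.new) do |line, seen|
    _field, _id, _handle, *values = line.chomp.split("\t")
    seen.merge(values)
  end
end

# Keep only the incoming records whose identifier is not yet known.
# `key` names the (hypothetical) hash key holding that identifier.
def prune(records, seen, key: :cris_id)
  records.reject { |rec| seen.include?(rec[key]) }
end

listing = <<~TSV
  field\tid\thandle\tvalue
  local.cris.id\t969\t99999/fk4w099v32\tcris-1
  local.cris.id\t903\tnull\tcris-2
TSV

incoming = [{ cris_id: 'cris-1' }, { cris_id: 'cris-2' }, { cris_id: 'cris-3' }]
puts prune(incoming, known_values(listing)).inspect
```

Using a Set keeps each membership check constant-time, so the pruning itself stays cheap even with tens of thousands of items; the expensive part remains producing the listing.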

Ilja

Terry Brady

Sep 2, 2016, 1:55:40 PM
to Ilja Sidoroff, DSpace Technical Support
Ilja, 

In DSpace 6, the REST API will provide additional query capabilities:

https://wiki.duraspace.org/display/DSDOC6x/REST+Based+Quality+Control+Reports

While this may not solve your immediate issues, it might provide a good future solution.

Terry


--
Terry Brady
Applications Programmer Analyst
Georgetown University Library Information Technology
425-298-5498 (Seattle, WA)

Ilja Sidoroff

Sep 3, 2016, 3:21:51 AM
to Monika Mevenkamp, dspac...@googlegroups.com
On Fri, Sep 02, 2016 at 09:52:54AM -0400, Monika Mevenkamp wrote:
> Ilja
>
> Yep - that little CS guy sometimes gets in the way
>
> I myself do more and more in Ruby: much faster turnaround, since it is an interpreted language, and without the strict typing and all that Java verbosity, many fewer lines of code. I actually wrote the little lister script you need in between other things yesterday. I even use JRuby when I need to develop Java code and want to figure out what all those mysterious semi-documented DSpace Java methods are really doing
>
> I am eager (as you probably can tell) to promote my dspace / ruby gem
> So if you are interested in trying this out - I am very interested in helping / supporting this

Monika,

I'm not sure if I will use your scripts directly, since I've already done other parts of my production pipeline in golang; using (j)ruby in just one part might confuse others. However, I'm starting to see the advantage of using a JVM-based scripting language; an acquaintance of mine is already using Jython with DSpace. I myself prefer Ruby over Python (or maybe I'm just bored with Python), so I will definitely keep your gem and script in mind.

Ilja

Ilja Sidoroff

Sep 3, 2016, 3:38:07 AM
to Terry Brady, DSpace Technical Support
Terry,

I looked very briefly into this on Friday, but I didn't quite get how to create and execute queries without using the interactive webpages. The endpoint GET /rest/filtered-items seemed promising, but in the limited time I had, I didn't see how to use it. I'll try to look into it a bit more.

Ilja

Ari

Sep 5, 2016, 1:34:51 AM
to DSpace Technical Support, Terry...@georgetown.edu, ilja.s...@uef.fi
Hi Ilja,

We are aiming for a fully REST-based solution (dumping both the workflow and the XMLUI admin UI), so we do a lot of REST tests.

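This should work for filtered-items (items whose author name contains "Matti"):

curl -g -H "Accept: application/json" "http://your_dspace.com:8080/rest/filtered-items?query_field[]=dc.contributor.author&query_op[]=contains&query_val[]=Matti&collSel[]=&limit=100&offset=0&expand=parentCollection,metadata&filters=none" | python -m json.tool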
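
For use from a script rather than the shell, a filtered-items query URL of this kind could be assembled in Ruby roughly as follows. This is a sketch: the parameter names follow the example query in this thread, while the method name `filtered_items_url`, its defaults and the host `dspace.example.org` are placeholders.

```ruby
require 'uri'

# Build a /rest/filtered-items query URL for one metadata condition.
# Parameter names follow the example query in this thread; the method
# name and defaults are illustrative only.
def filtered_items_url(base, field:, op:, value:, limit: 100, offset: 0)
  params = [
    ['query_field[]', field],
    ['query_op[]', op],
    ['query_val[]', value],
    ['collSel[]', ''],
    ['limit', limit],
    ['offset', offset],
    ['expand', 'parentCollection,metadata'],
    ['filters', 'none']
  ]
  # encode_www_form percent-encodes the brackets and the search value
  "#{base.chomp('/')}/rest/filtered-items?#{URI.encode_www_form(params)}"
end

puts filtered_items_url('http://dspace.example.org:8080/',
                        field: 'dc.contributor.author',
                        op: 'contains', value: 'Matti')
```

Because the square brackets are percent-encoded in the result, the URL can be passed to an HTTP client without the special handling that literal brackets need on the curl command line.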


If you need to be logged in, then:


1. First, authenticate yourself:
curl --data "email=your.email&password=your_pass" http://your_dspace.com:8080/rest/login -c cookies.txt


2. Test that authentication was successful:
curl -H "Accept: application/json" http://your_dspace.com:8080/rest/status -b cookies.txt


- this should return something like this:
{"okay":true,"authenticated":true,"email":"your.email","fullname":"ari ","sourceVersion":null,"apiVersion":null}



Hope this helps,
Ari



Ilja Sidoroff

Sep 5, 2016, 2:50:22 AM
to Ari, DSpace Technical Support, Terry...@georgetown.edu
Ari and Terry,

Thanks for the help! At least in brief testing, it seems the new API does what I need. One question: if I leave the parameters 'limit' and 'offset' out of the query, will I get all the items, or is there some upper limit on the number of results returned by one query?

Ilja

Ari

Sep 5, 2016, 3:27:43 AM
to DSpace Technical Support, ilja.s...@uef.fi
I'm not able to test this right now, since my instance currently has only 20 items. I *think* that the default limit is 100. But in general you cannot trust that the API will return all items. You just have to repeat the query, increasing the offset by the limit in each round, until there are no records left.
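
The paging loop described here could be sketched in Ruby as follows. The method name `fetch_all` is hypothetical, and the block stands in for the actual REST request; it receives the offset and limit and returns one page of records. The demo uses an in-memory array instead of a server.

```ruby
# Collects all records by asking for pages of `limit` records at increasing
# offsets until an empty page comes back. The block receives (offset, limit)
# and must return an array; in real use it would issue the REST request.
def fetch_all(limit: 100)
  records = []
  offset = 0
  loop do
    page = yield(offset, limit)
    break if page.empty?

    records.concat(page)
    offset += page.size # advance by what was actually returned
  end
  records
end

# In-memory stand-in for the server: 250 fake item ids.
items = (1..250).to_a
all = fetch_all(limit: 100) { |offset, limit| items[offset, limit] || [] }
puts all.size
```

Advancing the offset by the number of records actually returned, rather than by the requested limit, means no records are skipped if the server caps the page size below what was asked for.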