Direct SOLR query returns "0" for Item page views


Jeffrey Sheldon

Dec 22, 2015, 5:04:43 PM
to dspac...@googlegroups.com
Folks,

We've been using a handful of in-house scripts that query SOLR directly to monitor Item views and bitstream downloads and to generate various statistics reports. This worked fine up through DSpace 3.3. However, following our upgrade to DSpace 4.3 this past June, we discovered that SOLR is returning "0" for Item page views. At the same time, DSpace's built-in statistics on the Collection and Item levels still display as if all is well.

An example query would be like so:

http://localhost:8080/solr/statistics/select/?indent=on&wt=php&?version=2.2&facet=true&facet.field=id&facet.limit=20&fq=owningColl:359&fq=isBot:false&q=type:2&fq=(time:[2015-06-22T00:00:00.000Z%20TO%202015-12-01T00:00:00.000Z])

The results look like this:

array(
  'responseHeader'=>array(
    'status'=>0,
    'QTime'=>3,
    'params'=>array(
      'facet'=>'true',
      'indent'=>'on',
      'q'=>'type:2',
      'facet.limit'=>'20',
      'facet.field'=>'id',
      'wt'=>'php',
      'fq'=>array('owningColl:359',
        'isBot:false',
        '(time:[2015-06-23T00:00:00.000Z TO 2015-12-01T00:00:00.000Z])'),
      '?version'=>'2.2')),
  'response'=>array('numFound'=>0,'start'=>0,'docs'=>array()
  ),
  'facet_counts'=>array(
    'facet_queries'=>array(),
    'facet_fields'=>array(
      'id'=>array(
        '1'=>0,
        '100'=>0,
        '1000'=>0,
        '10000'=>0,
        '10001'=>0,
        '10002'=>0,
        '10003'=>0,
        '10004'=>0,
        '10005'=>0,
        '10006'=>0,
        '10007'=>0,
        '10008'=>0,
        '10009'=>0,
        '1001'=>0,
        '10010'=>0,
        '10011'=>0,
        '10012'=>0,
        '10013'=>0,
        '10014'=>0,
        '10015'=>0)),
    'facet_dates'=>array(),
    'facet_ranges'=>array()))

The 'id' array for facet_fields is what we're specifically interested in.

As a side note, I should point out that in order to buy ourselves more time with adopting Discovery into our theme, we're running DSpace 4 with Discovery disabled. I have a test DSpace 5 instance with everything converted over, but am still seeing the same results from the above SOLR query.

Thoughts?

-Jeff

helix84

Dec 22, 2015, 9:14:15 PM
to Jeffrey Sheldon, dspac...@googlegroups.com
Hi, just a few superficial observations:

1) The results match what I see if I specify a non-existing collection
ID. Are you sure that the collection ID in your owningColl matches the
collection you're expecting to see? The ID could have changed e.g. if
you migrated using AIP export/import.

Other notes, not directly relevant:
2) You may want to drop the version parameter; Solr will not
support very old versions of its protocol. There's also an extra "?"
in front of it, but the query will still work.

3) You should avoid using the php writer. With these small response
sizes, you won't see any performance benefit (you're using indent=on
anyway) and using direct evaluation means a chance of running
arbitrary injected php code. Just use json instead.
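
A minimal Python sketch of the same query with the version parameter dropped and the JSON writer used instead of the PHP writer. The host, core path, and field names come from the original query. The helper names (`build_stats_url`, `facet_counts`) are illustrative only, and the parsing assumes Solr's default JSON layout for field facets, a flat list that alternates value and count.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the query in the original post.
SOLR_SELECT = "http://localhost:8080/solr/statistics/select"


def build_stats_url(collection_id, start, end, limit=20):
    """Build an item-view facet query that asks for JSON rather than PHP output."""
    params = [
        ("q", "type:2"),
        ("fq", f"owningColl:{collection_id}"),
        ("fq", "isBot:false"),
        ("fq", f"time:[{start} TO {end}]"),
        ("facet", "true"),
        ("facet.field", "id"),
        ("facet.limit", str(limit)),
        ("wt", "json"),
    ]
    # urlencode escapes colons, brackets, and spaces so the URL is safe to send.
    return SOLR_SELECT + "?" + urlencode(params)


def facet_counts(response_text, field="id"):
    """Turn the flat [value, count, value, count, ...] facet list into a dict."""
    data = json.loads(response_text)
    flat = data["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))
```

The response body can be fetched with any HTTP client and passed to `facet_counts` as text. Because it is parsed as JSON data rather than evaluated, nothing in the response gets executed.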


Regards,
~~helix84

Compulsory reading: DSpace Mailing List Etiquette
https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette

Jeffrey Sheldon

Dec 23, 2015, 10:43:34 AM
to dspac...@googlegroups.com
Thanks for the feedback and time.

1. Each upgrade has been on an existing database. We do export AIPs, but have never been in a situation where we've had to restore data or migrate to another installation as a form of upgrade or recovery.

According to our database, the collection ID remains the same:

SELECT * FROM collection WHERE name='Library Publishing Forum 2014';

359 | Library Publishing Forum 2014

2. Noted. It appears that the queries were added to our stats scripts simply by cut & paste. I'll work on correcting that.

3. I introduced that argument simply for the sake of quickly copying data over to email, but your point is well made.

As far as I can tell, from looking through our configuration, everything should still be adding data to SOLR in the same manner as before.


-Jeff


Jeffrey Sheldon

Dec 29, 2015, 5:54:11 PM
to dspac...@googlegroups.com
A followup on this, I'm noticing that our logging has slimmed in quality and that many IPs are registering as localhost. It's possible that using Apache and modproxy might be affecting the results. I can find many instances where collection IDs are showing stats and exist validly in our system, but I can only query SOLR reliably for stats on item views, for instance, prior to our upgrade to DSpace 4.

I was able to import our SOLR data to DSpace 5 and use the new conversion utilities to export CSV files and verify the results there. It's odd to me that some stats render fine in DSpace's Usage Statistics while others are ignored by direct SOLR queries. It seems the basic information exists and should be readable.

Following the header, here's an example of a "working" hit followed by a "non-working" hit.

uid,rpp,userAgent,submitter,query,isBot,actor,type,owningComm,city,id,previousWorkflowStep,time,scopeType,page,longitude,scopeId,epersonid,workflowItemId,_version_,sortOrder,countryCode,dns,owningColl,ip,country,statistics_type,referrer,sortBy,owner,continent,owningItem,latitude,bundleName,workflowStep

fb1e8204-5069-4311-b683-c6c1c3234680,,"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",,,false,,2,"115,16,16",Hounslow,21352,,2015-03-08T04:56:49.864Z,,,-0.3500061,,,,1521873839944892422,,GB,host81-149-25-66.in-addr.btopenworld.com.,359,81.149.25.66,,view,,,,EU,,51.466705,,

e1196c75-5ec3-4253-80b2-943e11eedda4,,Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727),,,,,2,"115,124,16,124,16,16",,21356,,2015-07-15T05:27:37.622Z,,,,,,,1521875224684920832,,,,359,0:0:0:0:0:0:0:1,,view,http://www.ncbi.nlm.nih.gov/,,,,,,,


-Jeff


helix84

Dec 30, 2015, 3:33:30 AM
to Jeffrey Sheldon, dspac...@googlegroups.com
On Tue, Dec 29, 2015 at 11:54 PM, Jeffrey Sheldon <jshe...@ksu.edu> wrote:
> A followup on this, I'm noticing that our logging has slimmed in quality and that many IPs are registering as localhost. It's possible that using Apache and modproxy might be affecting the results.

If you use mod_proxy, all DSpace ever communicates with directly is
localhost, that's why you see the localhost address logged
(0:0:0:0:0:0:0:1). Make sure the proxy is sending the actual client's
IP in the X-Forwarded-For header (which should be the default
configuration [1]) and set "useProxies = true" in dspace.cfg so that
DSpace looks into X-Forwarded-For for the client's IP.

[1] https://httpd.apache.org/docs/2.4/mod/mod_proxy.html#x-headers
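
The idea behind the setting can be sketched in Python as follows. This is not DSpace's actual implementation; the function name and parameters are hypothetical, and taking the first entry of X-Forwarded-For as the original client is the usual convention rather than something stated in this thread.

```python
def resolve_client_ip(remote_addr, headers, use_proxies):
    """Pick the address to record for a request.

    remote_addr: address of the direct peer (the proxy, when one is in front)
    headers: mapping of request header names to values
    use_proxies: whether X-Forwarded-For should be consulted at all
    """
    if not use_proxies:
        # Without the setting, only the direct peer is visible: localhost behind a proxy.
        return remote_addr
    # Header names are case-insensitive, so normalise before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    forwarded = lowered.get("x-forwarded-for", "")
    # The header may list several hops; the first entry is conventionally the client.
    candidates = [part.strip() for part in forwarded.split(",") if part.strip()]
    return candidates[0] if candidates else remote_addr
```

With `use_proxies` off, a request relayed by a local proxy is recorded as 0:0:0:0:0:0:0:1. With it on, the forwarded client address is recorded instead. The header is only trustworthy when it is set by a proxy you control.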

Jeffrey Sheldon

Dec 30, 2015, 12:15:55 PM
to dspac...@googlegroups.com
Enabling "useProxies" resolves the issue. For new visits, instead of seeing IPv6 localhost entries in SOLR, we now see IPv4 addresses from the originator and stats are rendering correctly.

The challenge at this point will be to gather granular stats from months of logged data where the IPs were sourced from localhost. SOLR appears to ignore such traffic when providing query responses.

I'm curious though: Usage Statistics throughout the site shows activity, top hits, country visits, etc. Isn't all of that information also being pulled from SOLR? I've read that the older Administrative Statistics collects data directly from log files, an antiquated approach that still exists in DSpace since Usage Statistics doesn't include some of that info. Is Usage Statistics, in fact, pulling directly from SOLR? Are those stats then stored in a database table or always read through SOLR queries on the fly?


-Jeff


helix84

Dec 31, 2015, 3:40:26 AM
to Jeffrey Sheldon, dspac...@googlegroups.com
> The challenge at this point will be to gather granular stats from months of logged data where the IPs were sourced from localhost. SOLR appears to ignore such traffic when providing query responses.

You can fill in the gaps from logs (stats-log-converter, then
stats-log-importer). Just manually edit the converted files before
importing to Solr to make sure you're only importing the missing time
ranges and not creating duplicate events in time ranges that you
already have in Solr.

https://wiki.duraspace.org/display/DSDOC4x/Managing+Usage+Statistics#ManagingUsageStatistics-DSpaceLogConverter

This will miss some types of events and some data for existing events,
but at least you'll have the hits.
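
The manual editing step could be scripted along these lines. This is only a sketch: it assumes each line of a converted file contains a timestamp of the form YYYY-MM-DDTHH:MM:SS, which should be checked against the actual converter output before use. The function name and the `gaps` parameter are hypothetical.

```python
import re
from datetime import datetime

# Assumed timestamp shape; verify against the real converted files.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def keep_missing_ranges(lines, gaps):
    """Keep only lines whose first timestamp falls inside one of the gaps.

    gaps: list of (start, end) datetime pairs, start inclusive, end exclusive.
    Lines without a recognisable timestamp are dropped.
    """
    kept = []
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue
        when = datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M:%S")
        if any(start <= when < end for start, end in gaps):
            kept.append(line)
    return kept
```

The gaps would be the periods for which Solr has no usable events. Writing the kept lines to a new file before running stats-log-importer leaves the converter output untouched for a second pass if the ranges turn out to be wrong.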


> I'm curious though: Usage Statistics throughout the site shows activity, top hits, country visits, etc. Isn't all of that information also being pulled from SOLR? I've read that the older Administrative Statistics collects data directly from log files, an antiquated approach that still exists in DSpace since Usage Statistics doesn't include some of that info. Is Usage Statistics, in fact, pulling directly from SOLR? Are those stats then stored in a database table or always read through SOLR queries on the fly?

Usage events have never been stored in the DB. Older stats from log
files were processed from cron and stored in files
([dspace]/reports/). Newer stats have been both stored to and read
directly from Solr.