SOLR Date Indexing Using xslt-date-template.xslt


Brad Spry

Apr 6, 2015, 4:44:09 PM
to isla...@googlegroups.com
Solution needed for indexing dates in additional formats palatable for human consumption.   For example, adding a date index field which stores the date as it was originally cataloged, as is, not normalized using the Joda library.

Currently running a very basic xslt-date-template.xslt, which normalizes dates wonderfully for machine-readable uses:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Template to make the iso8601 date -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:java="http://xml.apache.org/xalan/java">

  <xsl:template name="get_ISO8601_date">
    <xsl:param name="date"/>
    <xsl:param name="pid">not provided</xsl:param>
    <xsl:param name="datastream">not provided</xsl:param>

    <xsl:value-of select="java:ca.discoverygarden.gsearch_extensions.JodaAdapter.transformForSolr($date, $pid, $datastream)"/>
  </xsl:template>
</xsl:stylesheet>


I've been reviewing others' xslt-date-template.xslt files and related schema.xml files posted to GitHub, most of which are WAY further developed than our nearly stock config:

https://github.com/lyrasis/islandora_transforms/blob/master/library/xslt-date-template.xslt
https://github.com/discoverygarden/ucla_gsearch_config/blob/master/gsearch_solr/islandora_transforms/library/xslt-date-template.xslt
https://github.com/jordandukart/gsearch_solr_config/blob/master/gsearch_solr/xslt-date-template.xslt
https://github.com/FLVC/islandora_transforms/blob/master/library/xslt-date-template.xslt


In summary, need to continue indexing Joda normalized dates for machine-readable uses, but looking for a solution to indexing dates in additional fields using originally cataloged formats, for human consumption.

Note: dc.date is an obvious solution, but when MODS contains both dateCreated and dateCaptured values, both dates are displayed by Metadata Display:

MODS:
<originInfo>
<dateCreated encoding="w3cdtf" keyDate="yes">1973-11-11</dateCreated>
<dateCaptured encoding="w3cdtf">2014-08-22</dateCaptured>
</originInfo>

dc.date:
1973-11-11 2014-08-22

I'd love to know how to fully control concatenation like that in SOLR, just in case I'd like to have only one of the values indexed in dc.date.

If one can use Metadata Display to only display one of the values present in dc.date, I'd love to know how...
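
The selection described above (index only one of the two dates instead of the concatenation) could be sketched as follows. This is only an illustration in plain Python, not GSearch or Metadata Display code; the helper names `pick_key_date` and `all_dates` are hypothetical, and the fragment is the namespace-free MODS snippet from this post (real MODS records carry a namespace that would have to be handled).

```python
import xml.etree.ElementTree as ET

# The MODS fragment from the post, without a namespace (an assumption;
# real MODS elements live in the http://www.loc.gov/mods/v3 namespace).
MODS_FRAGMENT = """
<originInfo>
  <dateCreated encoding="w3cdtf" keyDate="yes">1973-11-11</dateCreated>
  <dateCaptured encoding="w3cdtf">2014-08-22</dateCaptured>
</originInfo>
"""


def all_dates(origin_info_xml):
    """Join every child date with a space, mimicking the dc.date output."""
    root = ET.fromstring(origin_info_xml.strip())
    return " ".join(child.text for child in root)


def pick_key_date(origin_info_xml):
    """Return only the date marked keyDate="yes", or None if there is none."""
    root = ET.fromstring(origin_info_xml.strip())
    for child in root:
        if child.get("keyDate") == "yes":
            return child.text
    return None


print(all_dates(MODS_FRAGMENT))
print(pick_key_date(MODS_FRAGMENT))
```

In an actual GSearch setup the equivalent choice would be made in the indexing XSLT rather than in Python.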


The extent of my custom SOLR schema.xml work thus far:

<copyField source="mods_titleInfo_partNumber_ms" dest="mods_titleInfo_partNumber_int"/>
<field name="mods_titleInfo_partNumber_int" type="int" indexed="true" stored="true" multiValued="false"/>

This indexes "mods_titleInfo_partNumber_ms" as an integer field with a single value, for sorting purposes in SOLR Views :-)
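
To illustrate why the integer copy helps with sorting, here is a small sketch in plain Python (not Solr itself; the part numbers are made up). Values compared as strings sort character by character, while the same values compared as integers sort numerically, which is what the int field provides.

```python
# Hypothetical part numbers as they might appear in a string field.
part_numbers = ["10", "2", "1", "21", "3"]

# String sort: compares character by character, so "10" lands before "2".
as_strings = sorted(part_numbers)

# Integer sort: compares numeric values, like sorting on the int field.
as_integers = sorted(part_numbers, key=int)

print(as_strings)
print(as_integers)
```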


Thank You Ahead of Time,

Brad Spry
Atkins Library
UNC Charlotte

Donald Moses

Apr 7, 2015, 9:31:04 PM
to isla...@googlegroups.com
Not sure if this answers your question, but what if you use the qualifier attribute on any of the date fields? Not sure if that will change the way it is getting indexed, but in theory it should give you a new field you could use for query/display.

From the MODS guidelines - http://www.loc.gov/standards/mods/userguide/generalapp.html#qualifier

qualifier – The following values are used with the qualifier attribute:
  • approximate – This value is used to identify a date that may not be exact, but is approximated, such as "ca. 1972".
  • inferred – This value is used to identify a date that has not been transcribed directly from a resource, such as "[not before 1852]".
  • questionable – This value is used to identify a questionable date for a resource, such as "1972?".
<originInfo>
<dateIssued qualifier="questionable" point="start">1894?</dateIssued>
</originInfo>


Don

Brad Spry

Apr 8, 2015, 6:03:19 PM
to isla...@googlegroups.com
THANK YOU so much Don for putting thought into this and proposing a solution!

I need to rewind a little bit to detail my setup.

Using YUDL's basic Solr config:
https://github.com/yorkulibraries/basic-solr-config

...which requires discoverygarden's GSearch extensions:
https://github.com/discoverygarden/dgi_gsearch_extensions

The xslt-date-template.xslt I am utilizing was included with discoverygarden's GSearch extensions.

discoverygarden's GSearch extensions transform dates, making them compatible with SOLR.


Moving forward now, I dove into SOLR's date documentation:
https://cwiki.apache.org/confluence/display/solr/Working+with+Dates

...there I learned about SOLR's "TrieDateField" and how it "represents a point in time with millisecond precision".


Created some test records utilizing your suggestion of the qualifier attribute, using the value of "approximate", like so:

<originInfo>
<dateCreated encoding="w3cdtf" qualifier="approximate">2001-10-01</dateCreated>
</originInfo>

Sure enough, just like you said, SOLR indexes the metadata as a new field:
mods_originInfo_encoding_w3cdtf_qualifier_approximate_dateCreated_dt

Using fedoragsearch's admin client to view the index document for the record, here's how the field is indexed in SOLR:

mods_originInfo_encoding_w3cdtf_qualifier_approximate_dateCreated_dt: 1001894400000


Putting everything together now, "1001894400000" represents the number of milliseconds since the Epoch (00:00:00 UTC on Thursday, 1 January 1970), which works out to 2001-10-01T00:00:00Z, the date in my test record.

So that's how SOLR is really storing the date... It's just like Unix time, except in milliseconds.

This grasp of what's really happening behind the scenes is helpful to me, because I'm quite familiar with Unix time and conversions.
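
The conversion can be checked with a few lines of Python. This is just a sketch for verifying the arithmetic; the function name `epoch_millis_to_iso` is made up for illustration and is not part of Solr, GSearch, or Islandora.

```python
from datetime import datetime, timezone


def epoch_millis_to_iso(millis):
    """Convert milliseconds since 1970-01-01T00:00:00Z to an ISO 8601 UTC string."""
    # Split into whole seconds and leftover milliseconds to avoid float rounding.
    seconds, leftover_ms = divmod(millis, 1000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S") + ".%03dZ" % leftover_ms


print(epoch_millis_to_iso(1001894400000))  # 2001-10-01T00:00:00.000Z
```

Dividing by 1000 gives ordinary Unix time in seconds (1001894400), which is the form most date functions expect.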


My thoughts then led back to the islandora_solr_metadata module, so I grepped for *date* through all of its wares.

Surprisingly, grep hit on something very interesting within islandora_solr_metadata.install:

      'date_format' => array(
        'type' => 'varchar',
        'length' => 255,
        'not null' => FALSE,
        'description' => 'The date format used for dates.',
      ),

That led me to the "islandora_solr_metadata_fields" table, where I found the following field, note, and data:

`date_format` varchar(255) DEFAULT NULL COMMENT 'The PHP date format used for dates.',

`islandora_solr_metadata_fields`
(`configuration_id`, `solr_field`, `display_label`, `hyperlink`, `weight`, `permissions`, `date_format`, `structured_data`)

(7, 'mods_originInfo_encoding_w3cdtf_qualifier_approximate_dateCreated_dt', 'qualifier_approximate BRAD', 0, 28, NULL, NULL, NULL),

I saw that `date_format` was NULL, so just for the heck of it, edited the field and used 'Y-m-d' as the value.

Said a prayer, hit reload, but no joy...


In summary, it seems the islandora_solr_metadata module has the beginnings of a strategy to convert SOLR's millisecond-based Epoch time to a user-defined, PHP-style date() format, but I could find no function in its code that utilizes the `date_format` field.

It's a really, truly, beautiful idea however, to be able to define the date format output on a per field basis from within islandora_solr_metadata.

Next step was to review islandora_solr_metadata's JIRA for any similar issues.  Found:
https://jira.duraspace.org/browse/ISLANDORA-978

Please log in to JIRA and vote for this issue :-)

I've done a lot of date conversion programming before, seems like a great opportunity for me to give something back to Islandora...  Hmmmmmm...  No promises, but that's where my heart is right now.


Brad

Diego Pino

Apr 8, 2015, 7:24:11 PM
to isla...@googlegroups.com
Hi Brad,


strtotime() is used. It takes a valid date/time "string" (did I say string?) as input. The formats this function understands are listed here: http://php.net/manual/en/datetime.formats.php. A plain Unix timestamp is not listed there because it's an integer. Just out of curiosity, why don't you store the date inside a triefield in Solr using this format: YYYY-MM-DDThh:mm:ssZ?

So in theory (and with the approval of the community, of course) we could just add something as simple as this (at L310) to solve your problem.

if (is_numeric($value) && (int) $value == $value) {
  // Use the unix time stamp directly.
  return format_date($value, 'custom', $date_format[$field], 'UTC');
}
else {
  // Do as we are doing right now.
  return format_date(strtotime($value), 'custom', $date_format[$field], 'UTC');
}
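
For anyone who wants to try the branching logic outside Drupal, here is a rough Python sketch of the same idea. It is not the module's code: `format_solr_date` is a hypothetical name, the format string uses Python's strftime syntax rather than PHP's date() syntax, and it assumes numeric input is in epoch seconds (a millisecond value like the one seen in the index document would need dividing by 1000 first).

```python
from datetime import datetime, timezone


def format_solr_date(value, fmt="%Y-%m-%d"):
    """Format a value that is either an epoch timestamp or an ISO 8601 string.

    Assumption: numeric input is epoch seconds, as in the PHP proposal above.
    """
    text = str(value).strip()
    if text.lstrip("-").isdigit():
        # Numeric: use the timestamp directly.
        moment = datetime.fromtimestamp(int(text), tz=timezone.utc)
    else:
        # String: parse the YYYY-MM-DDThh:mm:ssZ form first.
        moment = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.strftime(fmt)


print(format_solr_date(1001894400))
print(format_solr_date("2001-10-01T00:00:00Z"))
```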

Hope that helps.

Cheers

Diego

Brad Spry

Apr 9, 2015, 11:14:33 AM
to isla...@googlegroups.com

Diego,

Thank you for pointing me towards understanding. 

This morning, I see `islandora_solr_metadata/theme/theme.inc` calls `module_load_include('inc', 'islandora_solr_search', 'includes/utilities');`


Attached is how the metadata in question appears in the index.


I am confused as to how the data is truly natively stored in SOLR...

The contents of `mods_originInfo_encoding_w3cdtf_qualifier_approximate_dateCreated_dt` appears to be a Unix timestamp with millisecond precision.

Directly below that, 'catch_all_fields_mt' contains triefield format: 2001-10-01T00:00:00.000Z


My question is: is SOLR natively storing in Unix timestamp or triefield format?

The contents of the attached `mods_originInfo_encoding_w3cdtf_qualifier_approximate_dateCreated_dt` lead me to believe that SOLR's true native storage format is Unix timestamp.


Brad

getIndexDocument-qualifier.png

Diego Pino

Apr 9, 2015, 3:49:02 PM
to isla...@googlegroups.com
Hi, the extension of a dynamic field is not enough to tell which type you are really using, so how is _dt defined in your schema? In my case, triedate is bound to the dynamic _tdt suffix, and _dt is bound to the deprecated date field.
How Solr "indexes" data and how it "stores" data are sometimes different things (in simple terms, one representation, or even several, may be optimal for searching, and another for displaying).
So I would suggest you take a look at the schema browser in Solr's core admin page to see how your data is displayed.

Best

Diego

Brad Spry

Apr 13, 2015, 4:48:18 PM
to isla...@googlegroups.com
Diego,

As far as indexing and storing, I'm just trying to understand what I'm looking at on the attached screenshot, which is getIndexDocument output from Fedora GSearch.

In the screenshot, the value of `mods_originInfo_encoding_w3cdtf_qualifier_approximate_dateCreated_dt` appears to be a Unix timestamp.

Below that, the value of 'catch_all_fields_mt' appears to be triefield format: 2001-10-01T00:00:00.000Z.

For my understanding, I'm wondering what's truly happening behind the scenes.  

If it's stored in Unix timestamp format, then is Islandora converting and displaying it as triefield format?

If it's stored in triefield format, then is it simply echoing triefield format, unchanged?

This curious mind just wants to understand.


Switching gears, I think the `date_format` field within the `islandora_solr_metadata_fields` table is completely different than the date_format() function found elsewhere.

I believe the intent of the function is to allow the end user to define the display of the date format using php date() style definitions.

Once user defined, I believe it would be integrated into the following line of code, which would utilize a custom date format for display if one is defined:
https://github.com/Islandora/islandora_solr_metadata/blob/7.x/theme/islandora-solr-metadata-display.tpl.php#L33


Brad

getIndexDocument-qualifier.png

Jared Whiklo

Apr 13, 2015, 5:49:16 PM
to isla...@googlegroups.com
Hi Brad,

I think the difference is probably just that one is a text field and
the other is a date field.

So in the case of the ..._dateCreated_dt if you check your schema.xml
it will be some form of date field. But the catch_all_fields_mt is a
text field.

My guess is that you see the internal representation in the date
field, but once it is copied to a text field Solr probably
automatically converts it to a formatted date string, which is why it
is a nicely formatted date.

cheers,
jared
> -- For more information about using this group, please read our
> Listserv Guidelines:
> http://islandora.ca/content/welcome-islandora-listserv --- You
> received this message because you are subscribed to the Google
> Groups "islandora" group. To unsubscribe from this group and stop
> receiving emails from it, send an email to
> islandora+...@googlegroups.com
> <mailto:islandora+...@googlegroups.com>. Visit this group
> at http://groups.google.com/group/islandora. For more options,
> visit https://groups.google.com/d/optout.

--
Jared Whiklo
jwh...@gmail.com
--------------------------------------------------
You know you're from Winnipeg when...You design your kid's Halloween
costume to fit over a snowsuit.

Adam Vessey

Apr 13, 2015, 7:33:42 PM
to isla...@googlegroups.com
What Solr stores is not so much the issue, as much as what it accepts and provides, yeah? For dates, the answer's the same for both: Full ISO 8601 in Zulu time (YYYY-MM-DDTHH:mm:ss.uuuZ). As this format is (or should be) parseable by PHP's strtotime(), we often use it to transform the date into a Unix timestamp, which PHP's date formatting functions can then play with...  Though a quick look through the code seems to indicate the "date_format" column does not get used anywhere, though it was used at one time.
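
The round trip described above (ISO 8601 string in, Unix timestamp, formatted date out) can be sketched in Python as a stand-in for PHP's strtotime() and date(). The function names are hypothetical, and the format strings are Python's strftime codes rather than PHP's.

```python
from datetime import datetime, timezone


def solr_date_to_timestamp(solr_date):
    """Parse an ISO 8601 Zulu date string into a Unix timestamp in seconds."""
    # Accept the form with fractional seconds and the form without.
    for pattern in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            parsed = datetime.strptime(solr_date, pattern)
        except ValueError:
            continue
        return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    raise ValueError("not an ISO 8601 Zulu date: %r" % solr_date)


def format_timestamp(timestamp, fmt):
    """Format a Unix timestamp (seconds) in UTC using a strftime pattern."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime(fmt)


stamp = solr_date_to_timestamp("2001-10-01T00:00:00.000Z")
print(stamp)
print(format_timestamp(stamp, "%Y-%m-%d"))
```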

If I recall correctly, the code was reverted because the associated modal forms ended up breaking many things, but I guess the column(s) managed to stick around.

If so inclined, one should be able to define a callback and implement hook_preprocess_islandora_solr_metadata_display() in order to inject the callback into the entry for the desired field as a "formatter", to allow additional field formatting to occur (be it date, thousands separating an integer, or whatever formatting one needs of a given field, really).

Something of an aside: Much of the functionality of islandora_solr_metadata can be performed by islandora_solr_views itself, by using "contextual filters" to grab the PID of the object being viewed.

- Adam

Brad Spry

Apr 14, 2015, 10:52:14 AM
to isla...@googlegroups.com
Thank You All so much for helping me gain a foothold of understanding.

I've come full circle now... 

The original goal was to retain the original, human cataloged dates for display purposes.

As my config stands today, the contents of MODS "dateCreated" and "dateCaptured" fields are being normalized prior to SOLR storage using JodaAdapter.

The original, human cataloged date is getting lost in the process and not getting stored in SOLR.

This morning, I am looking at "slurp_all_MODS_to_solr.xslt", wondering if I can add an additional date handling routine ABOVE the existing date routine.   Such an additional routine would store the dates as strings_ms style. 

The result would be the same dates being stored two different ways:

1. Text strings, original formatting
2. SOLR date, normalized by JodaAdapter.
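
As a rough picture of that dual storage, here is a Python sketch that emits both fields from one MODS date element. It is not the slurp_all_MODS_to_solr.xslt logic: the function name and the field-name prefix are hypothetical, and a simple YYYY-MM-DD parse stands in for JodaAdapter, which accepts far more formats.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def index_fields_for_date(element_xml, prefix="mods_originInfo_"):
    """Build a string field and, when parseable, a date field from one element."""
    element = ET.fromstring(element_xml)
    original = element.text.strip()

    # 1. Text string, original formatting, always kept.
    fields = {prefix + element.tag + "_ms": original}

    # 2. Normalized date, only when the value parses (stand-in for JodaAdapter).
    try:
        parsed = datetime.strptime(original, "%Y-%m-%d")
    except ValueError:
        return fields
    parsed = parsed.replace(tzinfo=timezone.utc)
    fields[prefix + element.tag + "_dt"] = parsed.strftime("%Y-%m-%dT%H:%M:%SZ")
    return fields


print(index_fields_for_date('<dateCreated encoding="w3cdtf">1973-11-11</dateCreated>'))
print(index_fields_for_date("<dateCreated>ca. 1972</dateCreated>"))
```

A value such as "ca. 1972" still lands in the string field even though it cannot be normalized, which is the point of keeping the original text.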


Brad

Adam Vessey

Apr 14, 2015, 12:47:30 PM
to isla...@googlegroups.com

Brad Spry

Apr 14, 2015, 3:05:23 PM
to isla...@googlegroups.com
Adam,

That XSL looks like what I need...

There have been a significant number of updates since I last pulled slurp_all_MODS_to_solr.xslt...

I will update and report my findings.

Thank you for taking the time to point these things out in precise detail.

<B

Brad Spry

Apr 14, 2015, 5:04:26 PM
to isla...@googlegroups.com
Sure enough, with an updated slurp_all_MODS_to_solr.xslt, the date indexing we needed was enabled!

Humans are happy now, with their original cataloged dates unaltered and available for display to other humans.

Computers are happy: SOLR still has its date field, normalized and indexed to millisecond precision.


At this very moment, computers and humans are singing Kumbaya and eating s'mores; it's quite the sight to see and hear :-)


Thank You All for taking precious time out of your day to point me in great directions.  I hope to find ways to repay all the favors.


Sincerely,