Adding author field does not work

wilhel...@gmail.com

Nov 28, 2017, 3:59:37 AM
to Datafari
Hi,

I wanted to implement adding an author field according to the documentation found here:

https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/3801090/Add+custom+fields+in+Datafari+and+display+them

So I mapped the authorname field in the request handler in solrconfig.xml:

<str name="fmap.meta_author">authorname</str>

Then I added the custom authorname field by using addCustomSchemaInfo.sh:

custom_fields.incl:

{
      "name":"authorname",
      "type":"text_en",
      "stored":true
}

Now the field shows up in the schema browser of the solr config UI for the FileShare core.

Then I added the qf query field in solrconfig.xml. The documentation says 'author', but I tried both: 'author' and 'authorname', as the latter seems to be correct to me.

After that I uploaded the configuration using Zookeeper.

Finally, I reset the seeding of a job that covers docx, doc and pdf files.

After the job finished, I went to the config UI of Solr and searched for authorname:* but nothing came up.

Searching for all entries of that job also showed that none of the indexed documents contained an authorname value.
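
For illustration, the same check can be scripted against Solr's select handler instead of the admin UI. This is only a minimal Python sketch, not part of Datafari: the base URL http://localhost:8983/solr is an assumption (adjust host and port), FileShare is the core named above, and build_field_check_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: Solr is reachable here; adjust host and port for your installation.
SOLR_BASE = "http://localhost:8983/solr"


def build_field_check_url(core, field, base=SOLR_BASE):
    """Return a select URL that counts documents having any value in `field`."""
    params = {
        "q": f"{field}:*",  # matches every document where the field is populated
        "rows": 0,          # only the hit count (numFound) is of interest
        "wt": "json",
    }
    return f"{base}/{core}/select?{urlencode(params)}"


url = build_field_check_url("FileShare", "authorname")
print(url)
```

Opening the printed URL in a browser or with curl returns a response whose numFound value is the number of documents with an authorname. A value of 0 would match the empty result seen in the UI.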

So it seems to me that there is already a problem with the extraction. Maybe the field name changed for ManifoldCF? Is the documentation provided in the link above still up to date?

Thanks in advance,

Wilhelm



wilhel...@gmail.com

Nov 28, 2017, 4:03:47 AM
to Datafari
I should add that I went through the complete procedure for both 'author' and 'authorname' as qf.

Furthermore, I use the local files crawler. But I think this should not matter, since the metadata is extracted via Tika?

wilhel...@gmail.com

Nov 28, 2017, 4:18:41 AM
to Datafari
I also changed the stored value of the dynamic ignored_ field to true, but I still get:

 {
        "id": "file:/xxx",
        "job": "xxx",
        "dcterms_created": "2017-03-22T04:10:28Z",
        "last_modified": "2017-03-22T04:10:28Z",
        "dcterms_modified": "2017-03-22T04:10:28Z",
        "last_save_date": "2017-03-22T04:10:28Z",
        "meta_save_date": "2017-03-22T04:10:28Z",
        "meta_creation_date": "2017-03-22T04:10:28Z",
        "creation_date": "2017-03-22T04:10:28Z",
        "resourcename": "xxx.pdf",
        "url": "file:xxx",
        "source": "file",
        "extension": "pdf",
        "language": "en",
        "title": [
          "xxx.pdf"
        ],
        "content_hl_en": [
          "(xxx)"],
        "_version_": 1579286619938095000,
        "deny_token_document": [
          "__nosecurity__"
        ],
        "deny_token_parent": [
          "__nosecurity__"
        ],
        "allow_token_document": [
          "__nosecurity__"
        ],
        "allow_token_share": [
          "__nosecurity__"
        ],
        "allow_token_parent": [
          "__nosecurity__"
        ],
        "deny_token_share": [
          "__nosecurity__"
        ]
      },

for each document. No 'ignored' values.


wilhel...@gmail.com

Nov 30, 2017, 9:02:48 AM
to Datafari
OK, after a lot of work, I found a solution. The author fields extracted by Tika are referenced as 'literal' fields, which can be extracted directly. Unfortunately, there are three of them: 'Author', 'Last-Author', and 'creator'.

So I created a field called 'author' in schema.xml that is multivalued, stored and indexed. Then I added these lines to the /update/extract requestHandler in solrconfig:

                        <str name="fmap.author">author</str>
                        <str name="fmap.creator">author</str>
                        <str name="fmap.last_author">author</str>

Hence, every value found in any of these fields is stored in 'author'.
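
To make the effect of the three fmap lines easier to see, here is a small Python simulation. It is a sketch and not Solr's code: it assumes lowernames=true is set on the handler (which the fmap.last_author key suggests, since Tika reports 'Last-Author'), and the helper names and the sample metadata are made up.

```python
import re

# The three fmap.* entries from above: source field name -> target field.
FMAP = {"author": "author", "creator": "author", "last_author": "author"}


def lower_name(name):
    """Approximate lowernames=true: lowercase, non-alphanumerics become '_'."""
    return re.sub(r"[^a-z0-9]", "_", name.lower())


def apply_fmap(tika_metadata, fmap=FMAP):
    """Collect metadata values under their mapped field names."""
    doc = {}
    for raw_name, value in tika_metadata.items():
        name = lower_name(raw_name)
        target = fmap.get(name, name)  # unmapped names are kept unchanged
        doc.setdefault(target, []).append(value)
    return doc


# Hypothetical metadata for one document; the values are made up.
sample = {"Author": "Jane Doe", "Last-Author": "John Roe", "creator": "Jane Doe"}
print(apply_fmap(sample))
```

All three source fields end up as separate values of 'author', which is why the field has to be multivalued at this stage.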

Furthermore, I introduced another field 'authorname' into schema.xml, which is not multivalued, but has a custom type 'author'.

The custom type 'author' is created using the custom add script found in the conf/custom_schema directory:

{
   "name":"author",
   "class":"solr.TextField",
   "positionIncrementGap":"100",
   "analyzer":{
      "charFilters":[
         {
            "class":"solr.PatternReplaceCharFilterFactory",
            "pattern":",.*?$",
            "replacement":""
         }
      ],
      "filters":[
         {"class":"solr.TrimFilterFactory"},
         {
            "class":"solr.PatternReplaceFilterFactory",
            "pattern":"(^\\b[a-zA-Z0-9/-_äöüÄÖÜ]+\\s)|(\\b[a-zA-Z0-9/-_äöüÄÖÜ]+(.?)\\s)",
            "replacement":""
         },
         {"class":"solr.TrimFilterFactory"},
         {"class":"solr.LowerCaseFilterFactory"},
         {
            "class":"solr.CapitalizationFilterFactory",
            "onlyFirstWord":"false"
         }
      ],
      "tokenizer":{
         "class":"solr.KeywordTokenizerFactory"
      }
   }
}

This is not strictly necessary, but it reduces the retrieved names to the family name, if one is provided.
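
The analyzer chain of the 'author' type can be approximated in plain Python to see what it does to typical names. This is a rough sketch only: Python's re module stands in for Solr's Java regex engine, the function name is hypothetical, and the sample names are made up. The two patterns are copied from the type definition above.

```python
import re

# Patterns copied from the field type definition (JSON escaping removed).
COMMA_TAIL = re.compile(r",.*?$")
LEADING_NAMES = re.compile(
    r"(^\b[a-zA-Z0-9/-_äöüÄÖÜ]+\s)|(\b[a-zA-Z0-9/-_äöüÄÖÜ]+(.?)\s)"
)


def approximate_authorname(raw):
    """Mimic the chain: char filter, keyword token, trim, replace, trim, case."""
    text = COMMA_TAIL.sub("", raw)         # drop everything from the first comma on
    token = text.strip()                   # whole input is one token, then trimmed
    token = LEADING_NAMES.sub("", token)   # drop every word followed by whitespace
    token = token.strip().lower()
    # Capitalize each remaining word, as onlyFirstWord is false.
    return " ".join(word[:1].upper() + word[1:] for word in token.split(" "))


for name in ["John Smith", "Smith, John", "J. R. Smith", "anna müller"]:
    print(repr(name), "->", repr(approximate_authorname(name)))
```

In this approximation, "John Smith", "Smith, John" and "J. R. Smith" all reduce to "Smith". Since analysis changes the indexed terms rather than the stored value, the reduced form is what a facet on the field would list.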

The 'datafari' update processor chain is then extended with:

                <processor class="solr.LastFieldValueUpdateProcessorFactory">
                        <str name="fieldName">author</str>
                </processor>

                <processor class="solr.TrimFieldUpdateProcessorFactory">
                        <str name="fieldName">author</str>
                </processor>

                <processor class="solr.CloneFieldUpdateProcessorFactory">
                        <str name="source">author</str>
                        <str name="dest">authorname</str>
                </processor>
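
What these three processors do to a document can be sketched in Python as follows. This is an illustration of their documented behaviour (keep the last value, trim whitespace, copy to another field), not Solr's implementation; the function names and the sample document are hypothetical.

```python
def keep_last_value(doc, field):
    """LastFieldValue: reduce a multivalued field to its last value."""
    values = doc.get(field)
    if values:
        doc[field] = [values[-1]]


def trim_values(doc, field):
    """TrimField: strip leading and trailing whitespace from string values."""
    if field in doc:
        doc[field] = [v.strip() if isinstance(v, str) else v for v in doc[field]]


def clone_field(doc, source, dest):
    """CloneField: add the values of `source` to `dest`."""
    if source in doc:
        doc.setdefault(dest, []).extend(doc[source])


def run_chain(doc):
    """Apply the three steps in the order they are configured above."""
    result = {name: list(values) for name, values in doc.items()}
    keep_last_value(result, "author")
    trim_values(result, "author")
    clone_field(result, "author", "authorname")
    return result


# Hypothetical document after the fmap step; the values are made up.
incoming = {"id": ["file:/a.pdf"], "author": [" Jane Doe ", "John Roe  "]}
print(run_chain(incoming))
```

Because only the last value survives the first step, 'authorname' receives exactly one value, which fits its non-multivalued definition.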

Furthermore, in the qf/pf fields in solrconfig, I added authorname.

Voilà! The rest has to be done as described in the documentation linked above. I also added a custom facet 'authors' to search.js, which displays them:

        Manager.addWidget(new AjaxFranceLabs.TableWidget({
                elm : $('#facet_author'),
                id : 'facet_author',
                field : 'authorname',
                name : window.i18n.msgStore['author'],
                pagination : true,
                selectionType : 'OR',
                sort : 'AtoZ',
                maxDiplay: 100,
                returnUnselectedFacetValues : true
        }));
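
To check which values such a facet would list without going through the UI, one can request facet=true&facet.field=authorname from the select handler and read the counts from the JSON response. In Solr's default JSON layout the counts arrive as one flat list of alternating values and counts. The Python sketch below pairs them up; the helper name is hypothetical and the response excerpt is made up.

```python
def pair_facet_counts(flat):
    """Turn a flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical excerpt of a JSON response; the names and counts are made up.
response = {
    "facet_counts": {
        "facet_fields": {"authorname": ["Doe", 3, "Roe", 1]}
    }
}

counts = pair_facet_counts(response["facet_counts"]["facet_fields"]["authorname"])
print(counts)
```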

Now it works!

cedric...@francelabs.com

Dec 5, 2017, 8:59:28 AM
to Datafari
Hi Wilhelm,

Glad to see you managed to solve it by yourself. We are very happy to see your commitment not only to making things work on your own, but also to sharing the result with the community. This is very valuable!

Regards,

Cedric