Rule for wildcard and indexing


grantaka36

Jul 3, 2019, 3:14:32 AM
to PLOS API Developers
Dear PLOS API developers,

Could you point me to a reference that explains the following search results? Thank you, as always, for your reply.


1. 
Query: N?N?N?N?N 
Match: N,N,N’,N’,N”
Result: "?" in this query matches "’,"
The characters "’," and "″," seem to be indexed as a single character.

2.
Query: N!N!N!N
Match: N,N,N,N
Match: N,N,N’,N
Result: "!" in this query matches "," and "’,"
The character "!" seems to be treated as "?".

3.
Query: N!!!N!N!N
Match: N,N,N,N
Match: N,N′,N″,N
Result: "!!!" in this query matches ",", "’," and "″,"
The string "!!!" seems to be treated as a single "?".


Regards,
grantaka36


Erik Hetzner

Jul 15, 2019, 2:43:01 PM
to plos-api-...@googlegroups.com, grantaka36
Hi grantaka36,

I believe it is expected that punctuation will be ignored. The primary purpose of the abstract search is to provide full text query capabilities. Are you trying to look for some other kind of data in the abstracts? You might be interested in https://github.com/PLOS/allofplos if you want to do more sophisticated analysis of the content.

The abstract is run through the following analyzer configuration in Solr:

<fieldType name="text" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="3" max="100"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="0" generateNumberParts="1" stemEnglishPossessive="1" splitOnCaseChange="0" generateWordParts="1" splitOnNumerics="0" catenateAll="0" catenateWords="0"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="3" max="100"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="0" generateNumberParts="1" stemEnglishPossessive="1" splitOnCaseChange="0" generateWordParts="1" splitOnNumerics="0" catenateAll="0" catenateWords="0"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
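
To make the effect of that chain a little more concrete, here is a rough Python sketch of what it does to the non-wildcard examples (2 and 3). It is only an approximation and not the actual Solr/Lucene code: the function name `analyze` is made up for illustration, the word-delimiter step is reduced to "split on anything that is not an ASCII letter or digit", and ASCII folding, stop words and Porter stemming are left out entirely. The wildcard query in example 1 is not modeled here.

```python
import re

def analyze(text, min_len=3, max_len=100):
    """Simplified stand-in for the analyzer chain above (hypothetical helper).

    Omitted on purpose: ASCII folding, stop words, possessive stripping,
    and Porter stemming.
    """
    tokens = []
    for raw in text.split():                      # WhitespaceTokenizerFactory
        tok = raw.lower()                         # LowerCaseFilterFactory
        if not (min_len <= len(tok) <= max_len):  # LengthFilterFactory min=3 max=100
            continue
        # WordDelimiterGraphFilterFactory, simplified: every run of
        # non-alphanumeric characters acts as one delimiter and is dropped.
        tokens.extend(p for p in re.split(r"[^0-9a-z]+", tok) if p)
    return tokens

if __name__ == "__main__":
    for s in ["N!N!N!N", "N,N,N’,N", "N!!!N!N!N", "N,N′,N″,N"]:
        print(s, "->", analyze(s))
```

Under these assumptions all four strings reduce to the same token sequence, four consecutive `n` tokens, which would be consistent with "!" or "!!!" in the query lining up with ",", "’," or "″," in the abstract.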

best, Erik

🐝 PLOS | OPEN FOR DISCOVERY
🐘 Erik Hetzner | Software Developer
📮 1160 Battery Street, Suite 225, San Francisco, CA 94111
📞 Main +1 415 624 1200

grantaka36

Jul 17, 2019, 8:41:12 PM
to PLOS API Developers
Thanks Erik,

> Are you trying to look for some other kind of data in the abstracts? 
No, I am only doing full-text search of the abstracts. Your explanation showed me which classes are being used, which is helpful. Thank you.

Regards,
grantaka36

