Extending JATE to other languages


pablocall...@gmail.com

Nov 2, 2018, 9:20:04 AM
to JATE2
Hi all, 

I am trying to extend JATE to other languages, specifically Spanish. I'm using an OpenNLP 1.5 model for Spanish PoS tagging (without any lemmatizer), but I get no results.

Has anyone tried to work with another language? Which are the basic filters that I have to take care of?

Thank you


Jie Gao

Nov 2, 2018, 9:59:16 AM
to JATE2
Hi,

Please check and provide some more details:

1. Which mode are you using? I guess it is embedded mode.

2. Has your corpus been indexed properly?

   For embedded mode, you don't need to index your corpus manually. For plugin mode, you have to index your corpus manually in a separate process first. Please check your Solr index.

3. Have you configured your Solr schema correctly?

If you haven't changed the default property file 'jate.properties', you need to configure the key fields correctly, e.g., the n-gram field ('jate_ngraminfo'), the candidate term field ('jate_cterms') and the domain term field ('jate_cterms'). For a Spanish corpus, in addition to the PoS tagger, you may also need to replace the two tokenisers (both the token tokeniser and the sentence tokeniser) and remove the default English lemmatiser. Moreover, depending on your settings, you also need to configure a correct language-specific stop word list and fine-tune the various filtering parameters in the pattern parser or the noun phrase chunker. Please send us your schema.xml file so that we can help check your configuration.

Hope it helps,
Jerry

pablocall...@gmail.com

Nov 6, 2018, 12:08:11 PM
to JATE2

Thanks for your answer. I have made some progress, and I am now able to retrieve some words and concepts.

I am using JATE in plugin mode, following the provided example (http://jerrygaolondon.github.io/jateSolrPluginDemo/) and the YouTube tutorial. JATE works very well for English corpora.

I'm using an external Java project (solr-solrj) to index the documents into the core, and I have not changed the jate.properties file.

The main changes in my schema.xml are in the two JATE field types: jate_text_2_ngrams and jate_text_2_terms.

The problem for Spanish is that there are no official models for OpenNLP 1.5. I found some on GitHub (for sentence splitting and PoS tagging). I have also configured my own Spanish stop word list.

My main problem has been with OpenNLPRegexChunkerFactory in jate_text_2_terms: this filter returns nothing. However, OpenNLPNounPhraseFilterFactory with the English chunker model (which I know is not correct) does retrieve some noun chunks.

I have to find a way to improve one of these two filters. Has anyone modified the JATE library classes to work with other languages, or used another type of filter for the same purpose?

Thanks a lot!

<fieldType name="jate_text_2_ngrams" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
          
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
          
                <tokenizer class="org.apache.lucene.analysis.jate.OpenNLPTokenizerFactory"
                           sentenceModel="es-sent.bin"
                           tokenizerModel="en-token.bin"/>
               
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                
                <filter class="org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory"
                        posTaggerClass="uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP"
                        posTaggerModel="opennlp-es-maxent-pos-es.bin"/>
               
                <filter class="org.apache.lucene.analysis.jate.ComplexShingleFilterFactory" minTokens="2" maxTokens="5"
                        maxCharLength="40" minCharLength="2" removeLeadingStopWords="true"
                        removeTrailingStopWords="true" removeLeadingSymbolicTokens="true"
                        removeTrailingSymbolicTokens="true"
                        stripAnySymbolChars="false"
                        stripLeadingSymbolChars="true" stripTrailingSymbolChars="true"
                        stopWords="stopwords.txt" stopWordsIgnoreCase="true"
                        outputUnigrams="true" outputUnigramsIfNoShingles="false" tokenSeparator=" "/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            </analyzer>
        </fieldType>

        <!--a configuration for PoS based candidate extraction-->
        <fieldType name="jate_text_2_terms" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                
                <tokenizer class="org.apache.lucene.analysis.jate.OpenNLPTokenizerFactory"
                           sentenceModel="es-sent.bin"
                           tokenizerModel="en-token.bin"/>
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                
                <filter class="org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory"
                        posTaggerClass="uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP"
                        posTaggerModel="opennlp-es-maxent-pos-es.bin"/>

                <!--filter class="org.apache.lucene.analysis.jate.OpenNLPRegexChunkerFactory"
                        patterns="aclrdtec.patterns"
                        minTokens="1" maxTokens="5"
                        maxCharLength="40" minCharLength="2" removeLeadingStopWords="true"
                        removeTrailingStopWords="true" removeLeadingSymbolicTokens="true"
                        removeTrailingSymbolicTokens="true"
                        stripAnySymbolChars="false"
                        stripLeadingSymbolChars="true" stripTrailingSymbolChars="true"
                        stopWords="stopwords.txt" stopWordsIgnoreCase="true"/-->
                <filter class="org.apache.lucene.analysis.jate.OpenNLPNounPhraseFilterFactory"
                        chunkerModel="en-chunker.bin"
                        minTokens="1" maxTokens="5"
                        maxCharLength="40" minCharLength="2" removeLeadingStopWords="true"
                        removeTrailingStopWords="true" removeLeadingSymbolicTokens="true"
                        removeTrailingSymbolicTokens="true"
                        stripAnySymbolChars="false"
                        stripLeadingSymbolChars="true" stripTrailingSymbolChars="true"
                        stopWords="stopwords.txt" stopWordsIgnoreCase="true"/>
                <filter class="solr.LowerCaseFilterFactory" />
              
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            </analyzer>
        </fieldType>        
        <!-- ###################### JATE End #############################-->
    </types>
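
When porting a schema like this to another language, it can help to list every OpenNLP model file the analyzers reference and see which ones are still English. The following is only a minimal sketch in Python, not part of JATE: the helper names `list_model_files` and `english_models` are hypothetical, the sample is a reduced copy of the field type above, and treating an `en-` file-name prefix as "English model" is an assumption based on the file names used in this thread.

```python
import xml.etree.ElementTree as ET

# Reduced, well-formed copy of the jate_text_2_terms field type from this thread.
SAMPLE = """
<types>
  <fieldType name="jate_text_2_terms" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="org.apache.lucene.analysis.jate.OpenNLPTokenizerFactory"
                 sentenceModel="es-sent.bin" tokenizerModel="en-token.bin"/>
      <filter class="org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFactory"
              posTaggerModel="opennlp-es-maxent-pos-es.bin"/>
      <filter class="org.apache.lucene.analysis.jate.OpenNLPNounPhraseFilterFactory"
              chunkerModel="en-chunker.bin"/>
    </analyzer>
  </fieldType>
</types>
""".strip()


def list_model_files(schema_xml):
    """Return (fieldType name, attribute, file) for every attribute ending in .bin."""
    root = ET.fromstring(schema_xml)
    found = []
    for field_type in root.iter("fieldType"):
        for element in field_type.iter():
            for attr, value in element.attrib.items():
                if value.endswith(".bin"):
                    found.append((field_type.get("name"), attr, value))
    return found


def english_models(found):
    """Keep only entries whose file name starts with 'en-' (assumed naming convention)."""
    return [entry for entry in found if entry[2].startswith("en-")]


if __name__ == "__main__":
    for name, attr, model in english_models(list_model_files(SAMPLE)):
        print(f"{name}: {attr} still points to {model}")
```

XML comments are skipped by the parser, so a filter that is commented out in the schema is not reported.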

ziqi zhang

Nov 9, 2018, 6:12:47 AM
to pablocall...@gmail.com, ja...@googlegroups.com
Hi

Apologies for our late reply. 

OpenNLPRegexChunkerFactory finds candidate phrases by matching PoS sequence patterns. Your configuration suggests you are using the default 'aclrdtec.patterns'. If you take a look at this file, you will see that it uses English PoS tags to define the sequence patterns.

Does your Spanish PoS tagger use the same tag set? Please check this first. If not, you need to change your pattern file (aclrdtec.patterns) so that it defines the PoS sequence patterns for Spanish. If it does, check whether your Spanish PoS tagger produces any tag sequences that match the patterns defined in the aclrdtec.patterns file.
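
To illustrate why a tag-set mismatch produces no candidates at all, here is a minimal sketch of PoS-sequence matching in Python. It is not JATE's implementation and does not use the aclrdtec.patterns syntax: the function `extract_candidates`, the regular expressions, and the example sentences are all hypothetical. The Spanish tags (DA, NC, AQ, SP) are EAGLES-style tags chosen for illustration only; the tags your model actually emits may differ.

```python
import re


def extract_candidates(tagged_tokens, tag_pattern):
    """Return token spans whose PoS tag sequence matches tag_pattern.

    tagged_tokens: list of (token, tag) pairs.
    tag_pattern: regex over a tag string in which every tag is followed
                 by one space, e.g. r"(JJ )*(NN\\S* )+".
    """
    tokens = [tok for tok, _ in tagged_tokens]
    tags = [tag for _, tag in tagged_tokens]
    tag_string = "".join(tag + " " for tag in tags)

    # Character offset at which each tag starts inside tag_string.
    offsets, pos = [], 0
    for tag in tags:
        offsets.append(pos)
        pos += len(tag) + 1

    regex = re.compile(tag_pattern)
    candidates = []
    i = 0
    while i < len(tags):
        # Only try to match at tag boundaries.
        m = regex.match(tag_string, offsets[i])
        n = m.group().count(" ") if m else 0  # number of tags consumed
        if n > 0:
            candidates.append(" ".join(tokens[i:i + n]))
            i += n
        else:
            i += 1
    return candidates


ENGLISH_PATTERN = r"(JJ )*(NN\S* )+"            # adjectives followed by nouns
SPANISH_PATTERN = r"NC (AQ )*(SP NC (AQ )*)*"   # noun, adjectives, optional "de + noun"

english = [("the", "DT"), ("automatic", "JJ"), ("term", "NN"), ("extraction", "NN")]
spanish = [("la", "DA"), ("extracción", "NC"), ("automática", "AQ"),
           ("de", "SP"), ("términos", "NC")]

if __name__ == "__main__":
    print(extract_candidates(english, ENGLISH_PATTERN))
    print(extract_candidates(spanish, ENGLISH_PATTERN))  # English tags never occur
    print(extract_candidates(spanish, SPANISH_PATTERN))
```

In this sketch the English pattern returns an empty list on the Spanish sentence because none of its tags occur in the input, and the adjective follows the noun rather than preceding it. A pattern written against the Spanish tags and word order does match.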

Hope that helps.


pablocall...@gmail.com

Nov 12, 2018, 6:15:20 AM
to JATE2

Thank you! The patterns were the problem. Spanish PoS taggers produce different tags (and, of course, the tag sequences come in a different order), so I created my own patterns, and now I can retrieve multiword terms.

I think I could push my core and conf to testdata/solr-testbed/. It is a small contribution and still needs to be checked (patterns, stop words, ...), but at least it is an alpha version for Spanish corpora.


Best, 

Pablo

ziqi zhang

Nov 12, 2018, 7:10:59 AM
to Pablo Calleja, ja...@googlegroups.com
Good to know it is working, and please do. We welcome any contributions; there may just be a delay in integrating them due to a lack of resources on this project.

Thanks

pablocall...@gmail.com

Nov 13, 2018, 12:17:13 PM
to JATE2

OK. I have attached the core to this thread. I think a pull request is better suited to issues.

Pablo
jateSpanishCore.rar

Jie Gao

Nov 14, 2018, 5:20:29 AM
to pablocall...@gmail.com, ja...@googlegroups.com
That's fine. 

Thanks
Jerry
