Problem with DKPro TreeTagger and the French preposition "au"

Yassin TOULLICHI

Jul 24, 2019, 10:00:47 AM
to dkpro-core-user
Hi,

I use DKPro with the two components CoreNlpSegmenter and TreeTaggerPosTagger in a Java Spring Boot application. Text analysis has a problem with the preposition "au": the text is split into two tokens, "a" tagged as a verb and "u" tagged as a verb, instead of a single token "au" tagged as a preposition. Could you suggest a solution to this problem?

Best regards
Yassin

Richard Eckart de Castilho

Jul 24, 2019, 10:04:07 AM
to dkpro-c...@googlegroups.com
Hi,

That sounds odd. My best guess is that an invisible Unicode character such as SOFT HYPHEN (or something similar) sits between the "a" and the "u", and that CoreNLP decides to split on that character. It would be best to use a binary file viewer to look at your text in detail and check whether there are additional bytes between the "a" and "u" characters.
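
If no binary viewer is at hand, the same check can be sketched in plain Java. This is only a minimal illustration using the JDK's `Character` class; the class and method names (`InvisibleCharCheck`, `findInvisible`) are made up for this example and are not part of DKPro or CoreNLP. It reports every code point in the Unicode "format" or "control" categories (SOFT HYPHEN, U+00AD, is a format character), apart from ordinary whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of DKPro or CoreNLP.
class InvisibleCharCheck {

    /** Lists format/control code points (e.g. U+00AD SOFT HYPHEN) with their index. */
    static List<String> findInvisible(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int type = Character.getType(cp);
            // Spaces, tabs and line breaks are expected in normal text, so skip them.
            boolean ordinaryWhitespace = cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r';
            if (!ordinaryWhitespace
                    && (type == Character.FORMAT || type == Character.CONTROL)) {
                hits.add(String.format(Locale.ROOT, "U+%04X at index %d", cp, i));
            }
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findInvisible("a\u00ADu")); // prints [U+00AD at index 1]
        System.out.println(findInvisible("au"));       // prints []
    }
}
```

An empty result for the problematic input would suggest that the split is not caused by a hidden character.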

Cheers,

-- Richard

Yassin TOULLICHI

Jul 24, 2019, 10:53:07 AM
to dkpro-c...@googlegroups.com
Hi,

Thanks for your response. I ran some tests and there is no invisible Unicode character.
The problem is the same: CoreNLP splits "au" into two characters.

Best regards
Yassin 

--
You received this message because you are subscribed to the Google Groups "dkpro-core-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dkpro-core-us...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dkpro-core-user/2F798D84-3664-4A30-A613-600C1990E707%40gmail.com.


--
Yassin TOULLICHI

Richard Eckart de Castilho

Jul 24, 2019, 12:05:43 PM
to dkpro-c...@googlegroups.com
Can you try with the original non-DKPro/non-UIMA CoreNLP package to see if this also happens there?

-- Richard

> On 24. Jul 2019, at 16:52, Yassin TOULLICHI <yassint...@gmail.com> wrote:
>
> Hi,
>

Richard Eckart de Castilho

Jul 24, 2019, 1:20:08 PM
to dkpro-c...@googlegroups.com
On 24. Jul 2019, at 16:52, Yassin TOULLICHI <yassint...@gmail.com> wrote:
>
> Thanks for your response. I ran some tests and there is no invisible Unicode character.
> The problem is the same: CoreNLP splits "au" into two characters.

If you are dealing with French text and configure the segmenter for French, then I could
imagine that splitting "au" into "a" and "u" is by design, since "au" seems to be a
contraction (à + le).

-- Richard

Yassin TOULLICHI

Jul 25, 2019, 4:16:58 AM
to dkpro-c...@googlegroups.com
Hi,

How can I do that?

Best regards


--
Yassin TOULLICHI

Richard Eckart de Castilho

Jul 25, 2019, 5:38:46 AM
to dkpro-core-user
On 25. Jul 2019, at 10:16, Yassin TOULLICHI <yassint...@gmail.com> wrote:
>
> How can I do that?

https://stanfordnlp.github.io/CoreNLP/cmdline.html

-- Richard

Yassin TOULLICHI

Jul 25, 2019, 9:46:49 AM
to dkpro-c...@googlegroups.com
Hi,

CoreNLP splits "au" into "à" and "le".

I don't use only CoreNLP; I also use TreeTagger, and the result isn't the same: the result is "a" and "u".

Best regards,  
Yassin 


--
Yassin TOULLICHI

Richard Eckart de Castilho

Jul 26, 2019, 6:51:35 AM
to dkpro-core-user
Hi,

> On 25. Jul 2019, at 15:46, Yassin TOULLICHI <yassint...@gmail.com> wrote:
>
> CoreNLP splits "au" into "à" and "le".
>
> I don't use only CoreNLP; I also use TreeTagger, and the result isn't the same: the result is "a" and "u".

The question is what kind of data the TreeTagger model was trained on. Was it trained on data that normalizes "au" to "à le", or does the training data actually use "au"?

My guess would be that it is probably trained on "au" (because I don't think the tokenizer scripts which come originally with TreeTagger would do normalization) - so you might want to switch to another segmenter.

Cheers,

-- Richard

Yassin TOULLICHI

Jul 29, 2019, 8:04:51 AM
to dkpro-c...@googlegroups.com
Hi,

Thanks for your suggestion. I chose another segmenter, BreakIterator, instead of the CoreNLP segmenter, and it works fine.
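
For reference, here is a minimal sketch of why a BreakIterator-based segmenter keeps "au" intact. It uses only the JDK's `java.text.BreakIterator`, on the assumption that DKPro's BreakIterator segmenter behaves like the JDK word iterator it is named after. The class name `BreakIteratorDemo`, the `tokenize` helper and the sample sentence are made up for this illustration and are not DKPro API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: plain JDK word segmentation, not the DKPro component itself.
class BreakIteratorDemo {

    /** Splits text at word boundaries and drops whitespace-only pieces. */
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The word iterator only looks for boundaries; it does not expand
        // contractions, so "au" comes out as a single token.
        System.out.println(tokenize("Il va au parc.", Locale.FRENCH));
    }
}
```

Because no contraction handling is involved, the tokens passed to the tagger are the surface forms as they appear in the text.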

Best regards
Yassin


--
Yassin TOULLICHI