Problem with DKPro TreeTagger and the French preposition "au"

Yassin TOULLICHI

Jul 24, 2019, 10:00:47 AM

to dkpro-core-user

Hi,

I use DKPro with the two components CoreNlpSegmenter and TreeTaggerPosTagger in a Java Spring Boot application. Analyzing text that contains the preposition "au" causes a problem: the text is split into two tokens, "a" (tagged as a verb) and "u" (tagged as a verb), instead of a single token "au" tagged as a preposition. Could you suggest a solution to this problem?

Best regards

Yassin

Richard Eckart de Castilho

Jul 24, 2019, 10:04:07 AM

to dkpro-c...@googlegroups.com

Hi,

that sounds odd. My best guess is that there is an invisible Unicode character, such as SOFT HYPHEN or something similar, between the "a" and the "u", and that CoreNLP decides to split on that character. It would be best to use a binary file viewer to look at your text in detail and check whether there are additional bytes between the "a" and the "u".
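
If a binary file viewer is not at hand, a small Java check can do a similar job. This is only a sketch: the class and method names (`InvisibleCharScan`, `findInvisible`) are made up for illustration, and it simply flags every code point in the Unicode format (Cf) or control (Cc) category that is not ordinary whitespace, which covers SOFT HYPHEN (U+00AD) and the zero-width characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DKPro or CoreNLP.
final class InvisibleCharScan {

    /** Returns one entry per invisible (format/control, non-whitespace) code point. */
    static List<String> findInvisible(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int type = Character.getType(cp);
            boolean invisible = type == Character.FORMAT || type == Character.CONTROL;
            // Tabs and line breaks are control characters too, but they are expected in text.
            if (invisible && !Character.isWhitespace(cp)) {
                hits.add(String.format("index %d: U+%04X", i, cp));
            }
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        // "au" with a SOFT HYPHEN hidden between the two letters.
        System.out.println(findInvisible("a\u00ADu")); // prints [index 1: U+00AD]
        // Clean input yields no hits.
        System.out.println(findInvisible("au"));       // prints []
    }
}
```

Running the check on the exact string that is passed to the segmenter (rather than on the source file) also rules out characters introduced while reading or decoding the input.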

Cheers,

-- Richard

Yassin TOULLICHI

Jul 24, 2019, 10:53:07 AM

to dkpro-c...@googlegroups.com

Hi,

Thanks for your response. I ran some tests and there is no invisible Unicode character.

The problem remains: CoreNLP splits "au" into two characters.

Best regards

Yassin

--
You received this message because you are subscribed to the Google Groups "dkpro-core-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dkpro-core-us...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dkpro-core-user/2F798D84-3664-4A30-A613-600C1990E707%40gmail.com.

Richard Eckart de Castilho

Jul 24, 2019, 12:05:43 PM

to dkpro-c...@googlegroups.com

Can you try with the original non-DKPro/non-UIMA CoreNLP package to see if this also happens there?

-- Richard

Richard Eckart de Castilho

Jul 24, 2019, 1:20:08 PM

to dkpro-c...@googlegroups.com

> On 24. Jul 2019, at 16:52, Yassin TOULLICHI <yassint...@gmail.com> wrote:
>
> Thanks for your response. I ran some tests and there is no invisible Unicode character.
> The problem remains: CoreNLP splits "au" into two characters.

If you are dealing with French text and have configured the segmenter for French, then I could
imagine that splitting "au" into "a" and "u" is by design, since "au" seems to be a
contraction (à + le).

-- Richard

Yassin TOULLICHI

Jul 25, 2019, 4:16:58 AM

to dkpro-c...@googlegroups.com

Hi,

How can I do that?

Best regards

Richard Eckart de Castilho

Jul 25, 2019, 5:38:46 AM

to dkpro-core-user

On 25. Jul 2019, at 10:16, Yassin TOULLICHI <yassint...@gmail.com> wrote:
>
> How can I do that?

https://stanfordnlp.github.io/CoreNLP/cmdline.html

-- Richard

Yassin TOULLICHI

Jul 25, 2019, 9:46:49 AM

to dkpro-c...@googlegroups.com

Hi,

CoreNLP on its own splits "au" into "à" and "le".

However, I don't use only CoreNLP, I also use TreeTagger, and there the result is different: "a" and "u".

Best regards,

Yassin

Richard Eckart de Castilho

Jul 26, 2019, 6:51:35 AM

to dkpro-core-user

Hi,

> On 25. Jul 2019, at 15:46, Yassin TOULLICHI <yassint...@gmail.com> wrote:
>
> CoreNLP on its own splits "au" into "à" and "le".
>
> However, I don't use only CoreNLP, I also use TreeTagger, and there the result is different: "a" and "u".

The question is what kind of data the TreeTagger model was trained on. Was it trained on data that normalizes "au" to "à le", or does the training data actually contain "au"?

My guess is that it was trained on "au" (because I don't think the tokenizer scripts that originally come with TreeTagger do any normalization), so you might want to switch to another segmenter.

Cheers,

-- Richard

Yassin TOULLICHI

Jul 29, 2019, 8:04:51 AM

to dkpro-c...@googlegroups.com

Hi,

Thanks for your suggestion. I switched from the CoreNLP segmenter to the BreakIterator segmenter, and it works fine.
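
For anyone hitting the same issue: as far as I understand, the BreakIterator segmenter builds on the JDK class `java.text.BreakIterator`, which only looks for word boundaries and does not expand contractions. The sketch below uses the plain JDK class, not the DKPro component, so it only approximates the behaviour; the class name `BreakIteratorTokens`, the helper `tokenize` and the sample sentence are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; it does not use DKPro or UIMA.
final class BreakIteratorTokens {

    /** Splits text at word boundaries and drops the whitespace-only segments. */
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String candidate = text.substring(start, end).trim();
            if (!candidate.isEmpty()) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "au" should come out as a single token, which is what a tagger
        // model trained on unsplit "au" would expect to see.
        System.out.println(tokenize("Il va au parc.", Locale.FRENCH));
    }
}
```

Whether this tokenization matches the TreeTagger French model in every case is an assumption; it is worth spot-checking other contractions such as "du" and "aux" in the same way.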

Best regards

Yassin
