Identifying French text

Rowan Jacobs

Aug 23, 2021, 5:41:11 AM
to nltk-users
I'm looking for a way to identify relatively short strings (between 1 and 50 words) in French in a list of otherwise English strings.

An example input might be (the French strings are 'Libellule' and the three entries beginning 'Sorte', 'Petit filet' and 'Petites chevrettes'):
['Damsel-fly', 'Dragonfly', 'Dragonfly', 'Larvae of mosquito', 'Dragonfly', 'Dragonfly (Anisoptera spp.)', 'Dragonfly', 'The horse-fly', 'Dragonfly', "Sorte d'insecte, de papillon hideux, dont le duvet des ailes s'en va en poussière", 'Petit filet pour une espèce de papillon; treillage', 'Petites chevrettes. Insecte (demoiselles). Insecte, libellule (Lch). Dragonfly (Atl).', 'A shore plant (Scirpus frondosus) whose leaves dry a bright orange', 'Dragonfly', 'Libellule', '(Dracontomelon vitiense)', '(Dracontomelon vitiense)', '(Dracontomelon vitiense)', 'Dracontomelon vitiense']

I realize that identifying strings of only one or two words is probably unfeasible, and I'd rather have false negatives than false positives.

Is this something that's possible to do with nltk?

Jordi Carrera

Aug 23, 2021, 3:01:37 PM
to nltk-users
You could definitely implement your own solution using functionality built into NLTK, but I think you'll need training data in any case. If it's not a problem for you to collect a reasonably large amount (e.g. at least a few thousand sentences) of French-language data as the positive class, and text in other languages as the negative class, then you could implement your own classifier using NLTK objects (the MaxEnt classifier, for instance). You could probably also use a simple character n-gram model trained on the same kind of data (using the `nltk.ConditionalFreqDist` class or a related one). You could apply character-level tokenization, compute character probabilities separately for the French data and for the non-French data, and then run any input word through both models: if the French model assigns a higher probability than the non-French model, the word is likely to be French.
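To make the two-model comparison concrete, here is a minimal sketch. It uses `collections.Counter` from the standard library rather than NLTK classes so that it runs without extra dependencies (an `nltk.FreqDist` or `nltk.ConditionalFreqDist` could hold the same counts). The function names, the `margin` parameter, and the six training sentences are all made up for illustration; a usable model would need the few thousand sentences per class mentioned above.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams, padded with spaces."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(sentences, n=3):
    """Count character n-grams over a list of training sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(char_ngrams(sentence, n))
    return counts


def avg_log_prob(text, counts, vocab_size, n=3):
    """Average add-one-smoothed log-probability of the text's n-grams."""
    total = sum(counts.values())
    grams = char_ngrams(text, n)
    log_prob = sum(math.log((counts[g] + 1) / (total + vocab_size)) for g in grams)
    return log_prob / len(grams)


def looks_french(text, fr_counts, other_counts, margin=0.0, n=3):
    """True if the French model beats the other model by more than margin."""
    # Shared vocabulary size so both models are smoothed the same way.
    vocab_size = len(set(fr_counts) | set(other_counts)) + 1
    fr_score = avg_log_prob(text, fr_counts, vocab_size, n)
    other_score = avg_log_prob(text, other_counts, vocab_size, n)
    return fr_score - other_score > margin


# Toy training data, invented for this example only.
FRENCH = [
    "le petit papillon est une espèce d'insecte",
    "les ailes de la libellule sont très fines",
    "une sorte de filet pour les insectes",
]
ENGLISH = [
    "the dragonfly is a kind of insect",
    "the leaves of the plant dry in the sun",
    "a small net for catching the horse-fly",
]

fr_counts = train(FRENCH)
en_counts = train(ENGLISH)

print(looks_french("une espèce de papillon", fr_counts, en_counts))
print(looks_french("the leaves of a plant", fr_counts, en_counts))
```

Raising `margin` above zero trades recall for precision, which fits the stated preference for false negatives over false positives.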

In case this sounds like too much work, or if you can't easily come by the data, I can think of two options: using `langdetect`, or using a workaround with any machine translation library, in this case `EasyNMT` (arguably an over-engineered solution, so I don't necessarily recommend it, but if you don't have a lot of data to process, it should do the trick).

You can install both with pip (EasyNMT, langdetect).

Here's an example usage of `langdetect` that would solve your problem with 1 false positive and 1 false negative, if I understood your example correctly:

```
>>> from langdetect import detect
>>> words = ['Damsel-fly', 'Dragonfly', 'Dragonfly', 'Larvae of mosquito', 'Dragonfly', 'Dragonfly (Anisoptera spp.)', 'Dragonfly', 'The horse-fly', 'Dragonfly', "Sorte d'insecte, de papillon hideux, dont le duvet des ailes s'en va en poussière", 'Petit filet pour une espèce de papillon; treillage', 'Petites chevrettes. Insecte (demoiselles). Insecte, libellule (Lch). Dragonfly (Atl).', 'A shore plant (Scirpus frondosus) whose leaves dry a bright orange', 'Dragonfly', 'Libellule', '(Dracontomelon vitiense)', '(Dracontomelon vitiense)', '(Dracontomelon vitiense)', 'Dracontomelon vitiense']
>>> for w in sorted(set(words)):
...    print(detect(w), '\t', w)
... 
it (Dracontomelon vitiense)
en A shore plant (Scirpus frondosus) whose leaves dry a bright orange
cy Damsel-fly
it Dracontomelon vitiense
cy Dragonfly
en Dragonfly (Anisoptera spp.)
fr Larvae of mosquito
et Libellule
fr Petit filet pour une espèce de papillon; treillage
fr Petites chevrettes. Insecte (demoiselles). Insecte, libellule (Lch). Dragonfly (Atl).
fr Sorte d'insecte, de papillon hideux, dont le duvet des ailes s'en va en poussière
en The horse-fly

``` 
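Since false negatives are preferable to false positives here, one option is to accept a string as French only when the detector is confident and the string is long enough to trust. The sketch below is one way to do that. The helper `keep_confident_french`, its thresholds `min_prob` and `min_words`, and the default values are hypothetical choices, not part of `langdetect`. The detector is passed in as a function so the filter does not depend on any particular library.

```python
def keep_confident_french(strings, detect_langs, min_prob=0.9, min_words=3):
    """Return the strings whose most likely language is French.

    detect_langs is any callable that maps a string to a list of
    (language_code, probability) pairs.
    """
    kept = []
    for s in strings:
        # Very short strings are skipped: detectors are unreliable on them.
        if len(s.split()) < min_words:
            continue
        guesses = detect_langs(s)
        if not guesses:
            continue
        lang, prob = max(guesses, key=lambda guess: guess[1])
        if lang == "fr" and prob >= min_prob:
            kept.append(s)
    return kept


# Possible usage with langdetect (assuming it is installed):
#
#   from langdetect import DetectorFactory, detect_langs
#   DetectorFactory.seed = 0  # langdetect is randomized; this makes runs repeatable
#
#   def detect_pairs(s):
#       return [(guess.lang, guess.prob) for guess in detect_langs(s)]
#
#   french = keep_confident_french(words, detect_pairs)
```

Whether this removes a given false positive depends on the probability the detector actually reports for that string, so the thresholds would need tuning on real data.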

And here's an example usage with `EasyNMT` that solves the task with 1 false negative:

```
from easynmt import EasyNMT

model = EasyNMT('opus-mt')

words = list(set(['Damsel-fly', 'Dragonfly', 'Dragonfly', 'Larvae of mosquito', 'Dragonfly', 'Dragonfly (Anisoptera spp.)', 'Dragonfly', 'The horse-fly', 'Dragonfly', "Sorte d'insecte, de papillon hideux, dont le duvet des ailes s'en va en poussière", 'Petit filet pour une espèce de papillon; treillage', 'Petites chevrettes. Insecte (demoiselles). Insecte, libellule (Lch). Dragonfly (Atl).', 'A shore plant (Scirpus frondosus) whose leaves dry a bright orange', 'Dragonfly', 'Libellule', '(Dracontomelon vitiense)', '(Dracontomelon vitiense)', '(Dracontomelon vitiense)', 'Dracontomelon vitiense']))

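# Translate every string into both French and English.
fr_translations = model.translate(words, target_lang='fr')
en_translations = model.translate(words, target_lang='en')

# A string left unchanged by translation into French but altered by
# translation into English is treated as French.
for unk, fr, en in zip(words, fr_translations, en_translations):
    if unk == fr and unk != en:
        print(('French', unk))
    else:
        print(('Not French', unk))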

# ('Not French', 'Damsel-fly')
# ('French', 'Petites chevrettes. Insecte (demoiselles). Insecte, libellule (Lch). Dragonfly (Atl).')
# ('Not French', 'A shore plant (Scirpus frondosus) whose leaves dry a bright orange')
# ('Not French', 'Dragonfly (Anisoptera spp.)')
# ('Not French', 'Larvae of mosquito')
# ('Not French', 'The horse-fly')
# ('Not French', 'Libellule')
# ('Not French', '(Dracontomelon vitiense)')
# ('Not French', 'Dracontomelon vitiense')
# ('French', "Sorte d'insecte, de papillon hideux, dont le duvet des ailes s'en va en poussière")
# ('Not French', 'Dragonfly')
# ('French', 'Petit filet pour une espèce de papillon; treillage')

```

Note that we're translating all strings into French and assuming that those that remain unchanged after translation were originally French. This caused some false positives with the Latin terms (which have no French translation and so also remain unchanged), so I added the English translations as a pivot: if the term also remains unchanged after translation into English, it's treated as a Latin term. A string is labelled French only when the two checks come out as `True AND False`, respectively, i.e. it is unchanged by translation into French but changed by translation into English.

Rowan Jacobs

Aug 24, 2021, 2:51:40 AM
to nltk-users
Thanks for the very thorough answer!

I looked at a few more libraries, as suggested by users on Stack Overflow.

langid:
('fr', -0.10433578491210938)      (Dracontomelon vitiense)
('en', -119.83710384368896)      A shore plant (Scirpus frondosus) whose leaves dry a bright orange
('en', 0.3048439025878906)      Damsel-fly
('fr', -0.10433578491210938)      Dracontomelon vitiense
('en', 9.061840057373047)      Dragonfly
('en', -14.312902450561523)      Dragonfly (Anisoptera spp.)
('en', -18.89906597137451)      Larvae of mosquito
('ro', 1.0045490264892578)      Libellule
('fr', -141.5236783027649)      Petit filet pour une espèce de papillon; treillage
('fr', -86.08180046081543)      Petites chevrettes. Insecte (demoiselles). Insecte, libellule (Lch). Dragonfly (Atl).
('fr', -264.84265518188477)      Sorte d'insecte, de papillon hideux, dont le duvet des ailes s'en va en poussière
('en', -15.21357250213623)      The horse-fly

textblob:
es      (Dracontomelon vitiense)
en      A shore plant (Scirpus frondosus) whose leaves dry a bright orange
en      Damsel-fly
es      Dracontomelon vitiense
en      Dragonfly
en      Dragonfly (Anisoptera spp.)
en      Larvae of mosquito
fr      Libellule
fr      Petit filet pour une espèce de papillon; treillage
fr      Petites chevrettes. Insecte (demoiselles). Insecte, libellule (Lch). Dragonfly (Atl).
fr      Sorte d'insecte, de papillon hideux, dont le duvet des ailes s'en va en poussière
en      The horse-fly

cld3:
LanguagePrediction(language='mk', probability=0.506585419178009, is_reliable=False, proportion=1.0)      (Dracontomelon vitiense)
LanguagePrediction(language='it', probability=0.44047126173973083, is_reliable=False, proportion=1.0)      A shore plant (Scirpus frondosus) whose leaves dry a bright orange
LanguagePrediction(language='az', probability=0.5689943432807922, is_reliable=False, proportion=1.0)      Damsel-fly
LanguagePrediction(language='mk', probability=0.506585419178009, is_reliable=False, proportion=1.0)      Dracontomelon vitiense
LanguagePrediction(language='en', probability=0.5282107591629028, is_reliable=False, proportion=1.0)      Dragonfly
LanguagePrediction(language='en', probability=0.5476078987121582, is_reliable=False, proportion=1.0)      Dragonfly (Anisoptera spp.)
LanguagePrediction(language='la', probability=0.9370142221450806, is_reliable=True, proportion=1.0)      Larvae of mosquito
LanguagePrediction(language='ro', probability=0.9477087259292603, is_reliable=True, proportion=1.0)      Libellule
LanguagePrediction(language='fr', probability=0.9999828338623047, is_reliable=True, proportion=1.0)      Petit filet pour une espèce de papillon; treillage
LanguagePrediction(language='fr', probability=0.849265456199646, is_reliable=True, proportion=1.0)      Petites chevrettes. Insecte (demoiselles). Insecte, libellule (Lch). Dragonfly (Atl).
LanguagePrediction(language='fr', probability=0.9999961853027344, is_reliable=True, proportion=1.0)      Sorte d'insecte, de papillon hideux, dont le duvet des ailes s'en va en poussière
LanguagePrediction(language='en', probability=0.9802001118659973, is_reliable=True, proportion=1.0)      The horse-fly

If I just want to take the English outputs and leave the rest, then I think langid and textblob have the best results. If I only want to remove French outputs and keep all other languages, then textblob does the best. I should test it on a larger input set, though.
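For testing on a larger input set, a small scoring helper makes the comparison less impressionistic. The sketch below counts true positives, false positives and false negatives for the label `fr`. The predictions are transcribed from the langid and textblob outputs above, in the same sorted order. The gold labels are an assumption: they treat 'Libellule' and the three long entries beginning 'Petit filet', 'Petites chevrettes' and 'Sorte' as French. The function name `french_confusion` is made up for this example.

```python
def french_confusion(predicted, gold_is_french):
    """Count true positives, false positives and false negatives for 'fr'."""
    tp = fp = fn = 0
    for lang, is_french in zip(predicted, gold_is_french):
        if lang == "fr" and is_french:
            tp += 1
        elif lang == "fr":
            fp += 1
        elif is_french:
            fn += 1
    return {"tp": tp, "fp": fp, "fn": fn}


# Assumed gold labels for the 12 unique strings, in sorted order.
GOLD = [False, False, False, False, False, False, False,
        True, True, True, True, False]

# Predicted language codes, copied from the outputs above.
LANGID = ["fr", "en", "en", "fr", "en", "en", "en", "ro", "fr", "fr", "fr", "en"]
TEXTBLOB = ["es", "en", "en", "es", "en", "en", "en", "fr", "fr", "fr", "fr", "en"]

print("langid  ", french_confusion(LANGID, GOLD))
print("textblob", french_confusion(TEXTBLOB, GOLD))
```

On these 12 strings langid gets 3 true positives, 2 false positives (the two Latin names) and 1 false negative ('Libellule'), while textblob gets all 4 French strings with no false positives.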

Manuel Souto Pico

Aug 25, 2021, 7:29:20 PM
to nltk-...@googlegroups.com
Hi Rowan,

From what I can gather from skimming the thread, I think you already have a very good answer from Jordi and a good approach to the task.

Just for the record, my two cents, in case they can still be useful for you or someone else.

My first thought is that guessing the language of a string is not the same as identifying whether it's French or not.

If you need to do the former, you could use a machine translation library's language-detection method. For example, with googletrans:
>>> from googletrans import Translator
>>> translator = Translator()
>>> translator.detect('이 문장은 한글로 쓰여졌습니다.')
# <Detected lang=ko confidence=0.27041003>
Doc: https://py-googletrans.readthedocs.io/en/latest/

Or, if you need to do the latter, you could spell-check each word. If it's recognized by a French dictionary, then it's French. There are at least a couple of modules you can use in Python for this.
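Whichever module is used, the spell-check idea boils down to asking what share of a string's words appear in a French word list. Here is a minimal sketch with a plain Python set standing in for the dictionary. The function name `french_token_ratio` and the tiny lexicon are made up for illustration; a real spell checker or word list would take the place of `TOY_LEXICON`.

```python
import re


def french_token_ratio(text, french_words):
    """Fraction of the alphabetic tokens in text found in french_words."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in french_words)
    return hits / len(tokens)


# Toy lexicon, invented for this example only.
TOY_LEXICON = {"petit", "filet", "pour", "une", "espèce", "de",
               "papillon", "treillage"}

for s in ["Petit filet pour une espèce de papillon; treillage",
          "Larvae of mosquito"]:
    print(round(french_token_ratio(s, TOY_LEXICON), 2), s)
```

Words spelled the same in both languages will count as French, so a fairly high cutoff on the ratio (and ignoring one-word strings) seems the safer choice when false positives are the bigger concern.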


I hope that helps.
Cheers, Manuel

