You could definitely implement your own solution using functionality built into NLTK, but I think you'll need training data for that in any case. If it's not a problem for you to collect a reasonably large amount of French-language data (a few thousand sentences at least) as the positive class, and text in any other languages as the negative class, then you could train your own classifier using NLTK objects (the MaxEnt classifier, for instance). You could probably also use a simple character n-gram model trained on the same kind of data (using the `nltk.ConditionalFreqDist` class or a related one): apply character-level tokenization, estimate character probabilities separately for the French data and for the non-French data, and then run any input word through both models. If the French model assigns the word a higher probability than the non-French model, the word is likely to be French.
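To make the character n-gram idea concrete, here is a minimal sketch of a two-model comparison. It is only an illustration under some assumptions: it uses `collections.Counter` from the standard library rather than NLTK (the same counts could be kept in an `nltk.ConditionalFreqDist`), it uses bigrams with add-one smoothing, and the names `char_bigrams`, `CharBigramModel` and `looks_french` as well as the toy training sentences are made up for this example.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Character-level tokenization: lowercase, pad, and pair adjacent characters."""
    padded = "^" + text.lower() + "$"  # ^ and $ mark the start and end of the text
    return list(zip(padded, padded[1:]))


class CharBigramModel:
    """Add-one smoothed estimate of P(next character | previous character)."""

    def __init__(self, sentences):
        self.pair_counts = Counter()
        self.context_counts = Counter()
        self.alphabet = set()
        for sentence in sentences:
            for prev, nxt in char_bigrams(sentence):
                self.pair_counts[(prev, nxt)] += 1
                self.context_counts[prev] += 1
                self.alphabet.update((prev, nxt))

    def log_prob(self, text, alphabet_size):
        """Sum of smoothed log-probabilities over the text's character bigrams."""
        return sum(
            math.log((self.pair_counts[(prev, nxt)] + 1)
                     / (self.context_counts[prev] + alphabet_size))
            for prev, nxt in char_bigrams(text)
        )


def looks_french(text, french_model, other_model):
    """True if the French model scores the text higher than the other model."""
    # Share one alphabet size (+1 for unseen characters) so the scores are comparable.
    alphabet_size = len(french_model.alphabet | other_model.alphabet) + 1
    return (french_model.log_prob(text, alphabet_size)
            > other_model.log_prob(text, alphabet_size))


# Toy training data (assumption: illustration only, far too small for real use).
french_model = CharBigramModel([
    "le papillon est petit",
    "une espèce de libellule",
    "la poussière des ailes",
])
english_model = CharBigramModel([
    "the dragonfly is small",
    "larvae of mosquito",
    "the leaves dry orange",
])

for term in ["papillon", "dragonfly"]:
    print(term, looks_french(term, french_model, english_model))
```

With only a handful of training sentences the scores are dominated by smoothing, so in practice you would train both models on the few thousand sentences mentioned above and probably try longer n-grams as well.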
In case this sounds like too much work, or if you can't easily come by the data, I can think of two options: using `langdetect`, or using a workaround with a machine translation library, in this case `EasyNMT` (arguably an over-engineered solution, so I don't necessarily recommend it, but if you don't have to process a lot of data, it should do the trick).
Here's an example usage of `langdetect` that would solve your problem with 1 false positive and 1 false negative, if I understood your example correctly:
```
>>> from langdetect import detect
>>> words = ['Damsel-fly', 'Dragonfly', 'Dragonfly', 'Larvae of mosquito', 'Dragonfly', 'Dragonfly (Anisoptera spp.)', 'Dragonfly', 'The horse-fly', 'Dragonfly', "Sorte d'insecte, de papillon hideux, dont le duvet des ailes s'en va en poussière", 'Petit filet pour une espèce de papillon; treillage', 'Petites chevrettes. Insecte (demoiselles). Insecte, libellule (Lch). Dragonfly (Atl).', 'A shore plant (Scirpus frondosus) whose leaves dry a bright orange', 'Dragonfly', 'Libellule', '(Dracontomelon vitiense)', '(Dracontomelon vitiense)', '(Dracontomelon vitiense)', 'Dracontomelon vitiense']
>>> for w in sorted(set(words)):
...     print(detect(w), '\t', w)
...
it (Dracontomelon vitiense)
en A shore plant (Scirpus frondosus) whose leaves dry a bright orange
cy Damsel-fly
it Dracontomelon vitiense
cy Dragonfly
en Dragonfly (Anisoptera spp.)
fr Larvae of mosquito
et Libellule
fr Petit filet pour une espèce de papillon; treillage
fr Petites chevrettes. Insecte (demoiselles). Insecte, libellule (Lch). Dragonfly (Atl).
fr Sorte d'insecte, de papillon hideux, dont le duvet des ailes s'en va en poussière
en The horse-fly
```
And here's an example usage with `EasyNMT` that solves the task with 1 false negative:
```
from easynmt import EasyNMT
model = EasyNMT('opus-mt')
words = list(set(['Damsel-fly', 'Dragonfly', 'Dragonfly', 'Larvae of mosquito', 'Dragonfly', 'Dragonfly (Anisoptera spp.)', 'Dragonfly', 'The horse-fly', 'Dragonfly', "Sorte d'insecte, de papillon hideux, dont le duvet des ailes s'en va en poussière", 'Petit filet pour une espèce de papillon; treillage', 'Petites chevrettes. Insecte (demoiselles). Insecte, libellule (Lch). Dragonfly (Atl).', 'A shore plant (Scirpus frondosus) whose leaves dry a bright orange', 'Dragonfly', 'Libellule', '(Dracontomelon vitiense)', '(Dracontomelon vitiense)', '(Dracontomelon vitiense)', 'Dracontomelon vitiense']))
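# Translate every term into French and into English.
fr_translations = model.translate(words, target_lang='fr')
en_translations = model.translate(words, target_lang='en')
for unk, fr, en in zip(words, fr_translations, en_translations):
    # French terms survive translation into French unchanged but change in English.
    if unk == fr and unk != en:
        print(('French', unk))
    else:
        print(('Not French', unk))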
# ('Not French', 'Damsel-fly')
# ('French', 'Petites chevrettes. Insecte (demoiselles). Insecte, libellule (Lch). Dragonfly (Atl).')
# ('Not French', 'A shore plant (Scirpus frondosus) whose leaves dry a bright orange')
# ('Not French', 'Dragonfly (Anisoptera spp.)')
# ('Not French', 'Larvae of mosquito')
# ('Not French', 'The horse-fly')
# ('Not French', 'Libellule')
# ('Not French', '(Dracontomelon vitiense)')
# ('Not French', 'Dracontomelon vitiense')
# ('French', "Sorte d'insecte, de papillon hideux, dont le duvet des ailes s'en va en poussière")
# ('Not French', 'Dragonfly')
# ('French', 'Petit filet pour une espèce de papillon; treillage')
```
Note that we're translating all terms into French and assuming that those that remain unchanged after translation were originally French. On its own, this caused some false positives with the Latin terms (which have no French translation and therefore also remain unchanged), so I added the English translations as a pivot: if a term also remains unchanged after translation into English, it's treated as a Latin term rather than a French one. A term is labelled French only when `unk == fr` is `True` and `unk == en` is `False`.
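If it helps, the decision rule can be pulled out into a small function that doesn't depend on the translation model, which makes it easy to check on its own. This is just a sketch: `classify_by_translation` is a hypothetical helper name, and the translations passed in below are hand-written stand-ins for illustration, not actual `EasyNMT` output.

```python
def classify_by_translation(original, french_translation, english_translation):
    """Label a term based on whether translation left it unchanged."""
    unchanged_in_french = original == french_translation
    unchanged_in_english = original == english_translation
    if unchanged_in_french and not unchanged_in_english:
        # Survives translation into French but not into English.
        return "French"
    if unchanged_in_french and unchanged_in_english:
        # Survives both translations, e.g. a Latin species name.
        return "Untranslatable"
    return "Not French"


# Hand-written example triples (assumption: NOT real model output).
examples = [
    ("Petit filet", "Petit filet", "Small net"),
    ("Dracontomelon vitiense", "Dracontomelon vitiense", "Dracontomelon vitiense"),
    ("Dragonfly", "Libellule", "Dragonfly"),
]
labels = [classify_by_translation(*triple) for triple in examples]
print(labels)
```

The rule is only as good as the translations it is given: if the model rewrites a French term when asked to translate it into French, the term ends up as a false negative.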