Question about word frequencies in Lexique 3.83


Benjamin Storme

Mar 23, 2023, 10:46:02 AM
to Lexique
Hello,
I would like to model the distribution of sounds across the French lexicon, using Lexique 3.83 as the database. To do that, I was thinking of using a Poisson regression (inspired by a recent paper by Winter and Bürkner: https://doi.org/10.1111/lnc3.12439). But this would require raw word counts rather than word frequencies, and Lexique 3.83 only lists frequencies per million words (fpm). I guess I should be able to recover word counts by multiplying Lexique 3.83's frequencies by the size of the corpus (in millions). I am looking specifically at freqfilm2 (based on the corpus of subtitles). In the documentation (p. 8), you write that the subtitle corpus you used contains 50 million word occurrences. But in freqfilm2, I see words with a frequency of 0.01 fpm. Multiplying this by 50 gives 0.5, which is less than one occurrence in the corpus. How is it possible that a word occurred less than once in the corpus? There is probably something obvious that I am missing here. Also, would it be possible to have the exact number of word occurrences in the subtitle corpus? (I guess 50 million is an approximation.)
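To make the arithmetic concrete, here is a minimal sketch of the conversion I have in mind. It assumes the corpus size is exactly 50 million, which is only the approximate figure from the documentation, and the function name fpm_to_count is just a hypothetical helper, not something from Lexique.

```python
# Approximate corpus size in millions of word occurrences, taken from
# the documentation (p. 8). This is an assumption, not the exact total.
CORPUS_SIZE_MILLIONS = 50.0


def fpm_to_count(fpm, corpus_size_millions=CORPUS_SIZE_MILLIONS):
    """Estimate a raw count from a frequency per million words (fpm)."""
    return fpm * corpus_size_millions


# A word listed at 0.01 fpm comes out at about half an occurrence,
# which is the puzzling case described above.
for fpm in (0.01, 0.02, 1.0):
    print(f"{fpm} fpm -> about {fpm_to_count(fpm):g} occurrences")
```

A Poisson regression needs integer counts, so an estimate of this kind would only be usable if it comes out as (or very close to) a whole number for every word.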
Thank you very much in advance,
Best wishes,
Benjamin Storme
PS: thanks a lot for developing an amazing resource like Lexique 3.83!