Hi Kristopher,
Thanks for your reply. I have another question to ask.
In some published papers, authors first look up bigrams in a large national corpus to check their frequency, and then decide whether to calculate the MI score and t-score.
For example, authors search for the target bigrams in the BNC. If a bigram occurs fewer than 5 times, it is placed in a below-threshold group and is not assigned association scores.
If a word sequence reaches the threshold of 5 occurrences in the BNC, it is assigned an MI score and a t-score. The below-threshold group therefore contains sequences with 4, 3, 2, 1 or even 0 occurrences, where 0 means that the sequence is absent from the BNC.
So even a word sequence with a frequency of 4 is not assigned association scores.
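To make the procedure concrete, here is a minimal sketch of how I understand it. This is my own illustration, not code from those papers or from TAALES. It assumes the common textbook definitions MI = log2(O / E) and t = (O - E) / sqrt(O), with expected frequency E = f(w1) * f(w2) / N. The function name, parameter names and all counts are hypothetical.

```python
import math

MIN_FREQ = 5  # frequency threshold used in the papers described above


def association_scores(bigram_freq, w1_freq, w2_freq, corpus_size, min_freq=MIN_FREQ):
    """Return (MI, t-score), or None if the bigram is below the threshold.

    Assumed formulas (common textbook definitions, not confirmed for TAALES):
      expected = f(w1) * f(w2) / N
      MI       = log2(observed / expected)
      t-score  = (observed - expected) / sqrt(observed)
    """
    if bigram_freq < min_freq:
        # Below-threshold group: frequencies 0 to 4 get no association scores.
        return None
    expected = w1_freq * w2_freq / corpus_size
    mi = math.log2(bigram_freq / expected)
    t_score = (bigram_freq - expected) / math.sqrt(bigram_freq)
    return mi, t_score


# Made-up counts, for illustration only.
print(association_scores(4, 1000, 2000, 1_000_000))   # below threshold, no scores
print(association_scores(50, 1000, 2000, 1_000_000))  # at or above threshold
```

With a threshold like this, a bigram that occurs 4 times is treated the same way as one that never occurs. That is the behaviour I am asking about for TAALES.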
I have a similar question about frequencies in the association strength calculation in TAALES 2.0. I have copied some TAALES results below as an example.
Bigram | COCA_academic_bi_MI
how is | -0.436766314163
is going | 2.4478375438
you lately | N/A
you want | 4.46709863611
about with | -2.59387173594
was helped | N/A
It is clear that the COCA_academic_bi_MI index gives different bigrams different MI results. N/A means that two of the bigrams ("you lately" and "was helped") cannot be found in the COCA academic corpus.
My question is about the other 4 bigrams, which were assigned MI scores. Is a word sequence assigned association scores automatically as long as it can be found in the COCA academic corpus?
Does that hold even if the sequence occurs only once in COCA academic, which would mean that it is rare in that corpus?