Inconsistent Phraser scoring with different score thresholds

Dale Jacques

Sep 12, 2018, 11:56:53 AM

to Gensim

Hello all,

We have integrated Gensim's `Phraser` into our NLP workflow, but have recently identified unpredictable behaviour in its output. Specifically, the score of a handful of phrases varies by several orders of magnitude as I raise or lower the score threshold for inclusion.

I used a `min_count` of 10 and a score threshold of 100 with the 'Original Scorer' on 70,000 survey responses, which returned 236 phrases. We knew one specific phrase ("role model") occurred thousands of times and should have been included, yet it was conspicuously missing from our output. I computed its score manually and confirmed that it should have been included.

What is especially odd is that our phrase, "role model", was included when we dropped our score threshold to 20, but it was included with a score of 1,300 (well over our original threshold). This definitely should have been included in the output with a threshold of 100. I repeated this experiment with several higher thresholds and our phrase, "role model", was again not included.

I am aware that the phrase detection is based on collocation statistics, and have read the underlying manuscript about the algorithm.

After several hours of testing, I found that phraser scores are deterministic with a specific threshold input. However, there is some variability when the threshold is changed. Usually this variability is small, but sometimes it is as high as several orders of magnitude.

My specific question: Why are we sometimes seeing large variability in phrase scores when we change our threshold?

Gordon Mohr

Sep 12, 2018, 1:57:03 PM

to Gensim

Can you show code & output where a `Phrases` model creates a bigram with a low threshold & a high threshold, but not a threshold in-between? (Note that a `Phraser` instance can't have its parameters adjusted after being created: it becomes a static set of bigrams that were detected with a given `Phrases`.)

- Gordon

Dale Jacques

Sep 12, 2018, 2:23:48 PM

to Gensim

Thanks for the response Gordon!

I've attached two Excel documents with the top 19 phrases by score for two thresholds: 20 and 100. These are not exhaustive lists of detected phrases (due to confidentiality concerns).

Note that most phrases have similar scores in each run, but "role model" does not appear when the threshold is increased to 100, even though it has a score of 13,212 when the threshold is set to 20.

Also of note, we see three variations of "problem solver" in the top 10 phrases when the threshold is low, but these values are not included when the threshold is higher.

Here is the code I'm using:

def extract_phrase_score(phrases_model, documents):
    """
    Function to extract the phrases identified by GenSim model
    :param phrases_model: the model GenSim phrases model to extract phrases
    from
    :param documents: the corpus of documents the GenSim phrases model was
    trained on
    :return: returns an iterator of all 'phrases' and the associated scores
    : note that this return iterator requires a list() call to execute
    : this iterator also includes duplicates that will be handled later
    """
    for phrase, score in phrases_model.export_phrases(documents):
        yield phrase, score


def search_for_three_word_phrases(tokenized_document_list,
                                  min_count=10,
                                  threshold=100,
                                  save_to_disk=True,
                                  stopwords=stopwords.words('english')):
    """
    Function that iteratively applies phraser model to search for bigram and
    trigram phrases
    :param tokenized_document_list: the output of preprocess_string_for
    phraser().  Like the GenSim Phraser() function,
    this function takes a list of tokenized documents (list of lists).
    :param min_count: the minimum number of times a phrase must appear to be
    considered a "phrase"
    :param threshold: the minimum statistic computed by the GenSim phraser
    package
    :param save_to_disk: Will model be saved to disk in the model/phrases
    folder.  Named with timeDate stamp.
    :return: returns a sorted pandas dataframe of unique phrases with the
    associated phraser score.
    """
    bigram_phrases = Phrases(tokenized_document_list,
                              common_terms=stopwords,
                              min_count=min_count,
                              threshold=threshold)

    bigram_phraser = Phraser(bigram_phrases)

    trigram_phrases = Phrases(bigram_phraser[tokenized_document_list],
                              common_terms=stopwords,
                              min_count=min_count,
                              threshold=threshold)

    phrase_list = list(extract_phrase_score(trigram_phrases, tokenized_document_list))

    out_df = pd.DataFrame(phrase_list, columns=['phrase', 'score']).drop_duplicates()
    out_df['phrase'] = [i.decode('UTF-8') for i in out_df['phrase']]
    out_df_sorted = out_df.sort_values('score',ascending=False).reset_index(drop=True)

    if save_to_disk:
        create_module_directory(join(ROOT_DIRECTORY, 'model'))
        create_module_directory(join(ROOT_DIRECTORY, 'model/phrases'))
        phrase_filename = 'model/phrases/' +\
                          datetime.datetime.now().strftime('%Y%m%d_%H%M%S') +\
                          ' - phrases_export.csv'
        out_df_sorted.to_csv(join(ROOT_DIRECTORY, phrase_filename))

    return out_df_sorted

 

def phrases_pipeline(document_list,
                     min_count=10,
                     threshold=100,
                     save_to_disk=True):
    """
    High level function that creates a pipeline of all phrases functions.
    Takes all arguments required to train a trigram phrases model, trains the
    model, optionally saves it to disk, and returns the unique phrases and
    scores as a pandas or R dataframe.
    :param document_list: The list of text documents to search for phrases.
    :param min_count: The minimum number of appearances for a phrase to be
    eligible.
    :param threshold: The minimum threshold of the GenSim phrase statistic to
    be considered a phrase.
    :param save_to_disk: Should the model be saved to disk?
    :return: A data frame containing the strings to be considered phrases with
    their associated score.
    """
    sentence_stream = [tokenize_string(doc) for doc in document_list]

    phrase_df = search_for_three_word_phrases(sentence_stream,
                                              min_count=min_count,
                                              threshold=threshold,
                                              save_to_disk=save_to_disk)
    return phrase_df

phrases_export - threshold_020.xlsx

phrases_export - threshold_100.xlsx

Gordon Mohr

Sep 12, 2018, 3:01:23 PM

to Gensim

Even if you're interested in multi-pass phrase-promotion (trigrams & beyond), if you think there's an anomaly, it'd be best to try to demonstrate it in a more simple bigrams-only configuration.

In particular, your code in `search_for_three_word_phrases()` is creating its `phrase_list` by running the trigram-phraser on the raw, original `tokenized_document_list` – not the already-bigram-promoted corpus on which it was trained. So I don't think it has any chance of reporting the trigrams it was trained to discover. (That may not be directly associated with your concern, but is indicative of the extra confusion possible with nested phrase-promotion-steps.)

The two ranked-lists-of-scores don't yet necessarily look like any sort of unexpected behavior to me: changing the `threshold` will change which potential-bigrams are combined. The case where a particular bigram is combined at a low threshold, then not at a higher threshold, is normal.

What would be odd is if raising the threshold further caused the same bigram to be combined again. (It's hard to imagine a case where that could happen – maybe where another adjacent potential-combination was ruled out? Or where multiple-levels of phrase-combination are involved, perhaps that it's not really the 1st phraser doing the combination, but the 2nd after base frequencies have changed?)

- Gordon

Dale Jacques

Sep 12, 2018, 3:39:54 PM

to Gensim

Thanks again for your help Gordon.

I agree that lower thresholds will create more 'phrases'. The crux of my question is: why does the score itself change when we change the threshold?

According to the Gensim documentation, the score is calculated using:

$\frac{(bigram\_count - min\_count) * len\_vocab }{ (worda\_count * wordb\_count)}$

The scores for each phrase should not change by changing the threshold. Correct?

Yet I am seeing the score for the phrase 'role model' drop by over three orders of magnitude when I change the threshold from 100 to 20. Also notice that the scores of most other phrases are not significantly affected.
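
To make the expectation concrete, here is a minimal sketch of the quoted formula in plain Python. It is a re-implementation for illustration, not Gensim's own code; the function names and all counts are hypothetical, and the "score greater than threshold" comparison is an assumption based on the documented behaviour.

```python
def original_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Score from the formula quoted above. Note: no threshold argument."""
    return (bigram_count - min_count) * len_vocab / (worda_count * wordb_count)


def passes(score, threshold):
    """The threshold only filters an already-computed score (assumed: score > threshold)."""
    return score > threshold


# Hypothetical counts, chosen only so the arithmetic is easy to follow.
score = original_score(worda_count=200, wordb_count=100, bigram_count=60,
                       len_vocab=1000, min_count=10)

print(score)             # (60 - 10) * 1000 / (200 * 100) = 2.5
print(passes(score, 2))  # kept at the lower threshold
print(passes(score, 3))  # dropped at the higher threshold, score unchanged
```

With fixed counts the score is the same whatever threshold is used, so a changing score has to come from changing counts (or a changing `min_count`), not from the threshold comparison itself.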

Gordon Mohr

Sep 12, 2018, 7:48:14 PM

to Gensim

Is the same score-change exhibited in a simpler, single-Phrases-pass example – instead of stacked `bigrams` and `trigrams`?

(I'm still a bit surprised there's any phrases-of-interest promoted, applying the trigrams-phrases-model to the non-bigram-processed corpus, so would prefer to confirm/explain any surprising results without that complication, first.)

- Gordon

Radim Řehůřek

Sep 14, 2018, 12:07:53 PM

to Gensim

Hi Dale,

are you sure you're modifying `threshold` and not `min_count`? Threshold should indeed not affect the scores, it's only used for filtering. Min_count, on the other hand, would produce exactly the effect you're describing: big score jumps for phrases with frequency around 100, while not much effect on very frequent phrases.

Can you double-check that you're not passing incorrect positional arguments somewhere by accident? threshold <=> min_count

Cheers,
Radim
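
The `min_count` effect described above can be checked with plain arithmetic. The sketch below re-implements the documented formula (it is not Gensim's code) and compares two hypothetical bigrams, one occurring 110 times and one occurring 5,000 times, under two hypothetical `min_count` values, 20 and 100. All counts are made up for illustration.

```python
def original_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Documented 'original' score; re-implemented here for illustration only."""
    return (bigram_count - min_count) * len_vocab / (worda_count * wordb_count)


LEN_VOCAB = 10000  # hypothetical vocabulary size

# Bigram whose frequency (110) sits just above the larger min_count (100).
rare_low = original_score(500, 400, 110, LEN_VOCAB, min_count=20)
rare_high = original_score(500, 400, 110, LEN_VOCAB, min_count=100)
rare_ratio = rare_low / rare_high

# Very frequent bigram (5000 occurrences).
frequent_low = original_score(8000, 6000, 5000, LEN_VOCAB, min_count=20)
frequent_high = original_score(8000, 6000, 5000, LEN_VOCAB, min_count=100)
frequent_ratio = frequent_low / frequent_high

print(rare_ratio)      # (110 - 20) / (110 - 100): a ninefold change
print(frequent_ratio)  # (5000 - 20) / (5000 - 100): barely changes
```

The closer a bigram's count is to `min_count`, the more the numerator `bigram_count - min_count` swings, which is why only some phrases would show large jumps.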

Dale Jacques

Sep 14, 2018, 4:11:26 PM

to Gensim

Thanks all for your replies. I really appreciate your responsiveness.

@Gordon, it's been a busy week, but I'm trying to find time to strip this down to the basics and retest. We are seeing trigram output from the attached code (e.g. "upwards and downwards").

@Radim, you'll see from my code posted above that I'm not using any positional arguments. Everything is named. I output the results as a .csv with one column for the phrases and the other for the score. I can verify that the threshold is being applied to the score from the minimum score in the .csv output. I can assure you that I am only changing the threshold.

Thanks again, and I hope you all have a great weekend!

Dale Jacques

Oct 3, 2018, 12:33:13 PM

to Gensim

@Gordon and @Radim

Thanks again for your time and patience. I have enjoyed this community immensely in my short time lurking.

I finally had some time to break this apart and test each component. I have attached my test code below.

The final answer is that bigram phrase scores do not depend on the threshold, but trigram phrase scores definitely DO. This leads to some very surprising results. Notably, our top bigram phrase in the example corpus, "Du Pont", is not captured at high trigram thresholds.

You are correct, there is not a bug in the phraser scoring, but iteratively applying it to identify trigrams produces surprising and unintended consequences (sometimes undesirable).
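
One plausible way to see why the second pass depends on the threshold: the first pass decides which pairs get merged into single tokens, and the second `Phrases` model counts words in that merged corpus. The toy sketch below uses plain Python with a hypothetical `merge` helper and a made-up three-sentence corpus (it does not call Gensim) to show how the counts that feed the score change depending on what the first pass accepted.

```python
from collections import Counter


def merge(sentences, accepted):
    """Join each accepted adjacent pair into one token, scanning left to right."""
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in accepted:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged


def counts(sentences):
    """Unigram and adjacent-pair counts, the raw inputs to the score."""
    unigrams = Counter(tok for s in sentences for tok in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    return unigrams, bigrams


corpus = [["good", "role", "model"],
          ["role", "model", "citizen"],
          ["model", "citizen"]]

# Assumption: a low first-pass threshold accepts ("role", "model"),
# while a high one accepts nothing.
uni_low, bi_low = counts(merge(corpus, {("role", "model")}))
uni_high, bi_high = counts(merge(corpus, set()))

print(uni_low["model"], uni_high["model"])
print(bi_low[("role", "model")], bi_high[("role", "model")])
print(bi_low[("model", "citizen")], bi_high[("model", "citizen")])
```

Because the word and pair counts differ between the two merged corpora, the same pair can receive very different scores from the second model, even though the scoring formula itself never looks at the threshold.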

Thanks again for your help.

import pandas as pd

from nltk.corpus import stopwords
from gensim.models.phrases import (Phrases, Phraser)




def extract_phrase_score(phrases_model, documents):
    """
    Function to extract the phrases identified by GenSim model
    :param phrases_model: the model GenSim phrases model to extract phrases
    from
    :param documents: the corpus of documents the GenSim phrases model was
    trained on
    :return: returns an iterator of all 'phrases' and the associated scores
    : note that this return iterator requires a list() call to execute
    : this iterator also includes duplicates that will be handled later
    """
    for phrase, score in phrases_model.export_phrases(documents):
        yield phrase, score

if __name__ == '__main__':
    
    ## import and install the brown corpus from nltk.  This only needs to be 
    ## done once.
    # import nltk
    # nltk.download('brown')
    from nltk.corpus import brown

    sentence_stream = brown.sents()

    # Loop over 7 thresholds.  Prints the top 25 phrases and associated score
    for thresh in [100, 200, 400, 800, 1600, 3200, 6400]:
        # identify bigram phrases
        bigram_phrases = Phrases(sentence_stream,
                                 common_terms=stopwords.words("english"),
                                 min_count=10,
                                 threshold=thresh)

        # Block of code to extract and print phrases and score
        bi_phrase_list = list(extract_phrase_score(bigram_phrases,sentence_stream))
        out_df = pd.DataFrame(bi_phrase_list, columns=['phrase', 'score']).drop_duplicates()


        out_df['phrase'] = [i.decode('UTF-8') for i in out_df['phrase']]
        out_df_sorted = out_df.sort_values('score', ascending=False).reset_index(drop=True)

        # print(out_df_sorted.head(25))


        # Create a bigram phraser object to iteratively create trigram phrases
        bigram_phraser = Phraser(bigram_phrases)

        trigram_phrases = Phrases(bigram_phraser[sentence_stream],
                                  common_terms=stopwords.words("english"),
                                  min_count=10,
                                  threshold=thresh)

        # Block of code to extract and print phrases and score
        tri_phrase_list = list(extract_phrase_score(trigram_phrases,sentence_stream))
        out_df = pd.DataFrame(tri_phrase_list, columns=['phrase', 'score']).drop_duplicates()


        out_df['phrase'] = [i.decode('UTF-8') for i in out_df['phrase']]
        out_df_sorted = out_df.sort_values('score',ascending=False).reset_index(drop=True)

        print(out_df_sorted.head(25))
