The `Phrases` mechanism is purely based on statistical cooccurrences, which means, among other things:
* It will be highly sensitive to the corpus on which it is trained, and to the effective (but tunable) `threshold` & `min_count` values.
* The results will usually not be aesthetically conformant to what a meaning-aware reader (like a human) might prefer, and even extensive tuning may only improve the promotion of some desirable bigrams at the cost of others. So, presenting its results to end-users may often be unappealing. Still, its combinations will often improve the raw text, internally, for info-retrieval or classification purposes - via the addition of bigram tokens that have more useful 'signal' than the original.
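
To make the "purely statistical" point concrete, here's a rough pure-Python sketch loosely modeled on the default scoring formula described in the Gensim docs, `(bigram_count - min_count) / worda_count / wordb_count * len_vocab`. It is not Gensim's actual code: the `bigram_scores` helper, the toy corpus, and counting the vocabulary as unigrams plus bigrams are all assumptions made for illustration.

```python
from collections import Counter

def bigram_scores(sentences, min_count=5):
    """Score every adjacent token pair using co-occurrence counts only."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    # Assumption of this sketch: vocabulary size = distinct unigrams + distinct bigrams.
    vocab_size = len(unigrams) + len(bigrams)
    return {
        (a, b): (n - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        for (a, b), n in bigrams.items()
    }

# Hypothetical toy corpus; real training data should be far larger.
corpus = [
    ["i", "love", "new", "york"],
    ["new", "york", "is", "big"],
    ["she", "moved", "to", "new", "york"],
    ["a", "new", "car"],
]

scores = bigram_scores(corpus, min_count=1)
# ("new", "york") scores (3 - 1) / (4 * 3) * 21, about 3.5; every pair seen once scores 0.
promoted_loose = {pair for pair, s in scores.items() if s > 1.0}
promoted_strict = {pair for pair, s in scores.items() if s > 10.0}
print(promoted_loose, promoted_strict)
```

Nothing in that score knows what a "good" phrase is: changing the corpus, `min_count`, or the threshold changes which pairs get promoted.
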
The `text8` corpus is probably a pretty bad set of training data for this purpose. It's not very large (only 100MB), and is just a tiny subset of some old raw Wikipedia text.
Also, all of its text is case-flattened, so no matter how many times `new york` might appear in its training data, it could never possibly learn to promote your `['New', 'York']` to `['New_York']`. It might work on `['new', 'york']` depending on the `text8` frequencies & tuning – I haven't tried.
So: you probably want to apply it to your own domain data, as large as possible, whenever you can. If using outside training text, you'd want something larger & more applicable to your data than `text8`. You'll want to remain sensitive to applying the same case-handling, & other preprocessing, to both training and later application data.
`Phraser` (aka `FrozenPhrases` in recent releases) is an optimized alternative that discards some state/flexibility for smaller/faster operation – so while trying to tinker to get acceptable results, you probably want to work with only `Phrases` for experimentation. For example, you could tamper with its `threshold` to try to get more or less of the bigrams of interest.
(And if you switch to `FrozenPhrases` for later steps, it'll mainly deliver its benefit if you're sure to discard the `Phrases` instance/variable once you're using `FrozenPhrases` instead.)
- Gordon