Hey Ben,
The standard setup for classifying text with a linear model goes
roughly like this:
- Every word in your vocabulary maps onto a "feature" for your
classifier. When you consider your input text, you find the words that
are in your vocabulary and count them up (or maybe just check whether
they're present).
- Then the learned model for the classifier has a weight for each of
those features. Say, for example, you want to consider 10K different
features (which is to say, your vocabulary is limited to 10K words).
Now you've got 10K different weights, and maybe your input text
contains 10 of those words.
- The linear model then just adds up the weights for the words that
you found. This is called the "bag of words" classification style --
it disregards word order completely. (There's a little code sketch of
this right below.)
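Here's a tiny sketch of that idea in Python -- the vocabulary, weights,
and example sentences are all made up, it's just to show the "count
words, sum their weights" mechanics:

    # Toy bag-of-words scoring: invented vocabulary and weights.
    from collections import Counter

    vocabulary = {"great": 0, "terrible": 1, "movie": 2, "boring": 3}

    # One weight per feature (a binary positive/negative classifier here).
    weights = [2.0, -2.5, 0.1, -1.5]
    bias = 0.0

    def score(text):
        counts = Counter(text.lower().split())
        total = bias
        for word, idx in vocabulary.items():
            # counts[word] is 0 if the word isn't in the input
            total += weights[idx] * counts[word]
        return total

    print(score("what a great movie"))       # 2.0 + 0.1 = 2.1  -> positive
    print(score("boring , terrible movie"))  # -1.5 - 2.5 + 0.1 = -3.9 -> negative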
For the more mathematically inclined, you might want to think of the
weights as a big matrix (assuming a multiclass classifier) and the
input document as a vector (of length |V|, for vocabulary size). Then
the whole process is mostly just a matrix-vector multiplication.
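In code, that matrix view looks something like this (numpy, with
invented numbers for a pretend 3-class problem over the same toy
vocabulary):

    import numpy as np

    # |V| = 4 vocabulary items, 3 classes: W is a 3 x |V| weight matrix.
    W = np.array([[ 2.0, -2.5,  0.1, -1.5],   # class 0 ("positive")
                  [-2.0,  2.5,  0.1,  1.5],   # class 1 ("negative")
                  [ 0.0,  0.0,  0.5,  0.0]])  # class 2 ("neutral")

    # The document as a count vector over the vocabulary (length |V|).
    x = np.array([1, 0, 1, 0])   # "great movie": one "great", one "movie"

    scores = W @ x               # matrix-vector product: one score per class
    print(scores, scores.argmax())  # highest score wins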
Most NLP/ML software you'll use nowadays abstracts over this stuff for
you! You can use classifiers in NLTK without thinking about vocabulary
size, for example.
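For instance, you can hand NLTK's classifiers plain feature dicts and
never touch vocabulary indices yourself -- a minimal sketch (the
training sentences are made up, and this uses NLTK's Naive Bayes
classifier rather than a linear model, since it's the easiest one to
demo):

    import nltk

    def bow_features(text):
        # bag-of-words presence features: word -> True
        return {word: True for word in text.lower().split()}

    train = [
        (bow_features("what a great movie"), "pos"),
        (bow_features("truly terrible and boring"), "neg"),
        (bow_features("great fun , great cast"), "pos"),
        (bow_features("boring , fell asleep"), "neg"),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train)
    print(classifier.classify(bow_features("a great cast")))  # most likely 'pos'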
To see what's happening in current NLP research, check out the ACL
Anthology! NLP conferences are basically all open access nowadays.
http://aclweb.org/anthology/
If you want to drink from the firehose, check out arXiv's cs.CL --
this is the latest up-to-the-minute stuff, not for the faint of heart:
https://arxiv.org/list/cs.CL/recent
Hope this helps!
--
-- alexr