Naive Bayes Classifier for big datasets

Viewed 1,228 times

Karsten

Nov 12, 2012, 05:07:04
to gen...@googlegroups.com
Hi,

this is a little off topic, but I think people here will most likely understand.

I am looking for a Naive Bayes classifier in Python that can learn from very big datasets, the way gensim does. I found the one in scikit-learn to be very good (http://scikit-learn.org/0.11/modules/naive_bayes.html); however, it loads all the training data into memory and I end up with memory errors. Their implementation is not too hard, and I guess I could change it myself, but I wanted to ask here first.

Thanks a lot,
Karsten

Radim Řehůřek

Nov 13, 2012, 07:05:35
to gensim
Hello Karsten,

yes, scikit-learn works in-memory only, so the dataset size is
limited.

But a major point in using Bayesian models is incorporating prior
knowledge. With huge datasets, the prior (NB smoothing) will be pretty
irrelevant, as there's plenty of direct evidence from data.

So perhaps sub-sampling with sklearn might serve you well enough here,
although scaling the vanilla NB classifier is pretty trivial.
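
For example, sub-sampling a streamed corpus before handing it to sklearn could look roughly like this (the iter_samples() generator and the 5% keep rate are purely illustrative):

    import random

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def subsample(stream, keep_prob, seed=42):
        # keep each (feature_vector, label) pair with probability keep_prob
        rng = random.Random(seed)
        for features, label in stream:
            if rng.random() < keep_prob:
                yield features, label

    # iter_samples() is a hypothetical generator of (feature_vector, label) pairs
    kept = list(subsample(iter_samples(), keep_prob=0.05))
    X = np.array([features for features, _ in kept])
    y = np.array([label for _, label in kept])
    clf = GaussianNB().fit(X, y)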

Just an idea,
Radim

Denzil Correa

Nov 14, 2012, 02:10:11
to gen...@googlegroups.com
Algorithms in sklearn also accept sparse matrix representations as input. You may want to take a look at that.
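
For example, a gensim bag-of-words corpus can be converted into a scipy sparse matrix and fed to sklearn roughly like this (corpus, dictionary and labels are assumed to exist already):

    from gensim import matutils
    from sklearn.naive_bayes import MultinomialNB

    # corpus: gensim bag-of-words corpus, dictionary: gensim Dictionary,
    # labels: one class label per document -- all assumed to exist already
    X = matutils.corpus2csc(corpus, num_terms=len(dictionary)).T  # documents as rows
    clf = MultinomialNB()
    clf.fit(X, labels)

Sparse storage keeps only the non-zero entries, so a much larger corpus fits in RAM, although the matrix itself is still built in memory.
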
--
Regards,
Denzil

Karsten

Nov 14, 2012, 14:59:46
to gen...@googlegroups.com
Hm, the problem is actually not scikit-learn but numpy. At least it crashes on numpy.mean(...).

I thought about subsampling, but somehow I have the feeling that most algorithms would need to go through the whole dataset at least once. Obviously I could just pick samples randomly and hope that most of the time the subset is not biased.

Radim Řehůřek

Nov 14, 2012, 15:16:38
to gensim
On Nov 14, 8:59 pm, Karsten <thesim...@googlemail.com> wrote:
> Hm, the problem is actually not scikit-learn but numpy. At least it
> crashes on numpy.mean(...).

It's pretty unlikely to be a bug in numpy.mean... perhaps some strange
values in your data?

>
> I thought about subsampling, but somehow I have the feeling that most
> algorithms would need to go through the whole dataset at least once.
> Obviously I could just pick samples randomly and hope that most of the time
> the subset is not biased.

Yes, that's pretty much what subsampling is :)

Are you worried about performance, Karsten? A NB classifier is just a
simple loop through the training data, incrementing word counts.
Unless you have some extra demanding setup, writing the few lines of
code to fit *exactly* your needs is probably the fastest way.

And you can treat the training input as a generator (like gensim does
it), so you don't have to worry about RAM.
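
A rough sketch of such a counting loop for a bag-of-words setup might look like this (the stream of (bow, label) pairs is assumed to be a generator, e.g. over a gensim corpus; smoothing and the actual scoring are left out):

    from collections import defaultdict

    def train_nb(stream):
        # stream yields (bow, label) pairs, where bow is a list of
        # (word_id, count) tuples, e.g. one gensim bag-of-words document
        doc_count = defaultdict(int)                          # documents per class
        word_count = defaultdict(lambda: defaultdict(float))  # word counts per class
        total_words = defaultdict(float)                      # total word mass per class
        n_docs = 0
        for bow, label in stream:
            n_docs += 1
            doc_count[label] += 1
            for word_id, count in bow:
                word_count[label][word_id] += count
                total_words[label] += count
        # class priors; a word probability would be word_count[c][w] / total_words[c]
        priors = dict((c, doc_count[c] / float(n_docs)) for c in doc_count)
        return priors, word_count, total_words

Only the counts live in memory, never the documents themselves.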

HTH,
Radim



Denzil Correa

Nov 15, 2012, 05:01:00
to gen...@googlegroups.com
Interesting - is the mean a feature in your classification? If so, you could do a row-by-row pass, compute the means, and store them. That means lower memory requirements and fewer calculations.
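
For example, a single-pass running mean over rows could be as simple as this rough sketch (the rows iterable, yielding one feature vector at a time, is just illustrative):

    import numpy as np

    def running_mean(rows, n_features):
        # one pass over the data, constant memory
        mean = np.zeros(n_features)
        n = 0
        for row in rows:
            n += 1
            mean += (row - mean) / n  # incremental mean update
        return mean
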
--
Regards,
Denzil

Karsten

Nov 21, 2012, 05:41:26
to gen...@googlegroups.com
Hi,

I think I forgot to mention that I am using the Gaussian Naive Bayes classifier, so I have to calculate the mean and the variance.

I modified the scikit implementation (Source: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py) to this:

    def fit(self, X, y):
        """Fit Gaussian Naive Bayes according to X, y
        
        Mean and variance are calculated in a single pass using the online
        algorithm described at
        http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

        Parameters
        ----------
        X : iterable of array-like, shape = [n_features]
            Training vectors, where n_features is the number of features.

        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object
            Returns self.
        """
        
        self.classes_ = unique_y = np.unique(y)
        n_classes = unique_y.shape[0]

        #n_samples, n_features = X.shape

        #if n_samples != y.shape[0]:
        #    raise ValueError("X and y have incompatible shapes")

        # peek at the first sample to get the number of features; note that
        # this assumes X can be iterated more than once (e.g. a gensim-style
        # corpus), since the training loop below iterates it again
        first_sample = None
        for one in X:
            first_sample = one
            break
        n_features = first_sample.shape[0]


        self.theta_ = np.zeros((n_classes, n_features))
        self.sigma_ = np.zeros((n_classes, n_features))
        self.class_prior_ = np.zeros(n_classes)
        
        # map each class label to its row index in theta_ / sigma_
        mapping = dict((y_i, i) for i, y_i in enumerate(unique_y))
        
        epsilon = 1e-9
        
        # calculate mean (theta_) and variance (sigma_) online in one pass
        # (Welford's algorithm); each class keeps its own running count, so its
        # mean and M2 are updated with that class's count rather than the
        # global sample count (izip is itertools.izip)
        n_samples = 0
        class_count = np.zeros(n_classes)
        M2 = np.zeros((n_classes, n_features))
        for sample, y_i in izip(X, y):
            n_samples += 1
            i = mapping[y_i]
            class_count[i] += 1

            delta = sample - self.theta_[i, :]
            self.theta_[i, :] += delta / class_count[i]
            M2[i, :] += delta * (sample - self.theta_[i, :])

        # unbiased per-class variance; the maximum() guards against classes
        # that have seen only a single sample
        self.sigma_[:, :] = M2 / np.maximum(class_count[:, np.newaxis] - 1, 1) + epsilon

        #calculate prior
        for i, y_i in enumerate(unique_y):
            self.class_prior_[i] = np.float(np.sum(y == y_i)) / n_samples
        return self   

Now, it is very simple and not yet really elegant. So far y has to be a numpy array; it would be great if it could be an iterable too. I would also need a way to get the number of classes without iterating too much. The ideal way would be to just pass a single argument and have a (sample, class label) tuple returned on each iteration (see the sketch below).
Also, as you can see, I don't check whether the number of samples and the length of y match. I would not have to if I got both in one iteration.
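
Roughly, I imagine something like this hypothetical wrapper (the file names and the parsing are just illustrative):

    from itertools import izip
    import numpy as np

    class LabeledCorpus(object):
        """Yield (sample, class_label) pairs, one per document."""
        def __init__(self, vector_file, label_file):
            self.vector_file = vector_file
            self.label_file = label_file

        def __iter__(self):
            with open(self.vector_file) as vecs, open(self.label_file) as labels:
                for vec_line, label in izip(vecs, labels):
                    yield np.fromstring(vec_line, sep=' '), label.strip()

fit() could then take this single argument and iterate over it twice: once to collect the class labels, once for the online mean/variance pass.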

What do you think?

Radim Řehůřek

Nov 21, 2012, 17:45:50
to gensim
Hello Karsten,

the iterator yielding (x_i, y_i) directly, instead of doing izip(X,
Y), makes sense in general. But if that goes against the spirit of the
package you are using, it's probably not a good idea -- you'll be
fighting it.

As for the number of passes: once you start optimizing for the number
of passes, you start writing loops and code differently. Again, if the
original uses fast in-memory numpy arrays, a pass-optimized version of
the same algo will likely look quite different. Minimizing #passes is
not always trivial, although it may seem so in gensim :)

Best,
Radim


