I think I forgot to mention that I am using the Gaussian Bayes classifier, so I have to calculate the mean and the variance of every feature for each class.
import numpy as np
from itertools import chain

def fit(self, X, y):
    """Fit Gaussian Naive Bayes according to X, y

    Mean and variance are calculated using an online algorithm described here:

    Parameters
    ----------
    X : iterable of array-like, shape = [n_features]
        Training vectors, where n_features is the number of features.
    y : array-like, shape = [n_samples]
        Target values.

    Returns
    -------
    self : object
        Returns self.
    """
    self.classes_ = unique_y = np.unique(y)
    n_classes = unique_y.shape[0]
    #n_samples, n_features = X.shape
    #if n_samples != y.shape[0]:
    #    raise ValueError("X and y have incompatible shapes")

    # Peek at the first sample to get n_features, then put it back so that a
    # one-shot iterator (e.g. a generator) is not left one sample short.
    X = iter(X)
    first_sample = np.asarray(next(X))
    n_features = first_sample.shape[0]
    X = chain([first_sample], X)

    self.theta_ = np.zeros((n_classes, n_features))
    self.sigma_ = np.zeros((n_classes, n_features))
    self.class_prior_ = np.zeros(n_classes)
    # map each class label to its row index
    mapping = dict((y_i, i) for i, y_i in enumerate(unique_y))
    epsilon = 1e-9

    # calculate mean (theta_) and variance (sigma_) online in one pass;
    # the running mean of class i is divided by the number of samples seen
    # for class i so far, not by the total number of samples
    n_samples = 0
    class_count = np.zeros(n_classes)
    M2 = np.zeros((n_classes, n_features))
    for sample, y_i in zip(X, y):  # itertools.izip on Python 2
        n_samples += 1
        i = mapping[y_i]
        class_count[i] += 1
        delta = sample - self.theta_[i]
        self.theta_[i] += delta / class_count[i]
        M2[i] += delta * (sample - self.theta_[i])
    # sample variance per class; a class with a single sample gets epsilon
    self.sigma_[:] = M2 / np.maximum(class_count - 1, 1)[:, np.newaxis] + epsilon

    # calculate prior
    for i, y_i in enumerate(unique_y):
        self.class_prior_[i] = float(np.sum(y == y_i)) / n_samples
    return self
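The one-pass mean/variance update can also be checked in isolation against NumPy's ordinary two-pass results. This is only a minimal sketch for a single class; `online_mean_var` is a hypothetical helper name, not something that exists in the library, and it assumes every sample has the same number of features.

```python
import numpy as np

def online_mean_var(samples):
    """One-pass mean and sample variance (ddof=1) per feature.

    Hypothetical helper for illustration only.
    """
    count = 0
    mean = None
    m2 = None
    for sample in samples:
        sample = np.asarray(sample, dtype=float)
        if mean is None:
            # allocate the accumulators once the feature count is known
            mean = np.zeros_like(sample)
            m2 = np.zeros_like(sample)
        count += 1
        delta = sample - mean            # distance to the old mean
        mean += delta / count            # updated running mean
        m2 += delta * (sample - mean)    # uses old and new mean
    if count < 2:
        raise ValueError("need at least two samples for a sample variance")
    return mean, m2 / (count - 1)

data = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 7.0]])
mean, var = online_mean_var(iter(data))  # iter() makes it a one-shot stream
```

The results should agree with `data.mean(axis=0)` and `data.var(axis=0, ddof=1)` up to floating point error.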
Right now fit is very simple and not really elegant yet. So far y has to be a numpy array; it would be great if it could be an iterable too. For that I would need a way to get the number of classes without iterating over the data an extra time. Ideally, fit would take a single argument that yields a (sample, class label) tuple on each iteration.
Also, as you can see, I don't check whether the number of samples and the length of y match. I would not have to if I got both from a single iteration.
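To make the single-argument idea concrete, here is a rough sketch of what I have in mind. It is a standalone function rather than the real fit, and everything about it is an assumption: the name `fit_pairs`, the tuple it returns, and that labels are hashable and sortable. Classes are discovered while iterating, so neither the number of classes nor the number of samples has to be known up front, and a length mismatch between X and y cannot happen.

```python
from collections import defaultdict
import numpy as np

def fit_pairs(pairs, epsilon=1e-9):
    """Hypothetical helper: one pass over (sample, label) pairs.

    Returns (classes, theta, sigma, class_prior); rows of theta and sigma
    are ordered like `classes`.
    """
    count = defaultdict(int)   # samples seen per class
    mean = {}                  # running mean per class
    m2 = {}                    # running sum of squared deviations per class
    for sample, label in pairs:
        sample = np.asarray(sample, dtype=float)
        if label not in mean:
            # first time this class shows up: allocate its accumulators
            mean[label] = np.zeros_like(sample)
            m2[label] = np.zeros_like(sample)
        count[label] += 1
        delta = sample - mean[label]
        mean[label] += delta / count[label]
        m2[label] += delta * (sample - mean[label])
    classes = sorted(mean)
    n_samples = sum(count.values())
    theta = np.array([mean[c] for c in classes])
    # sample variance; a class seen only once falls back to epsilon
    sigma = np.array([m2[c] / max(count[c] - 1, 1) for c in classes]) + epsilon
    prior = np.array([count[c] / float(n_samples) for c in classes])
    return np.array(classes), theta, sigma, prior

# usage with a generator that yields (sample, label) tuples
def stream():
    yield [1.0, 2.0], 0
    yield [10.0, 20.0], 1
    yield [3.0, 4.0], 0
    yield [14.0, 24.0], 1

classes, theta, sigma, prior = fit_pairs(stream())
```

The price is one dictionary lookup per sample instead of a preallocated array, which seems acceptable as long as the number of classes stays small.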