K nearest Classifier: ValueError: Found array with 0 sample(s) (shape=(0, 165)) while a minimum of 1


Vinay Babu

Nov 28, 2017, 12:09:15 AM
to open source deduplication
Hello,

I'm trying to use a different classifier instead of the default logistic regression. Before training the model, I changed the classifier using the following code:


import dedupe
from sklearn.neighbors import KNeighborsClassifier

# `fields` is the field definition list, defined earlier
deduper = dedupe.Dedupe(fields)
deduper.classifier = KNeighborsClassifier(n_neighbors=3)

However it throws me this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-43-e6d175ec5827> in <module>()
----> 1 deduper.train(recall=0.90)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\dedupe\api.py in train(self, recall, index_predicates)
    667         """
    668         examples, y = flatten_training(self.training_pairs)
--> 669         self.classifier.fit(self.data_model.distances(examples), y)
    670 
    671         self._trainBlocker(recall, index_predicates)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    788             sample_weight=sample_weight,
    789             check_input=check_input,
--> 790             X_idx_sorted=X_idx_sorted)
    791         return self
    792 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    114         random_state = check_random_state(self.random_state)
    115         if check_input:
--> 116             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    117             y = check_array(y, ensure_2d=False, dtype=None)
    118             if issparse(X):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    460                              " minimum of %d is required%s."
    461                              % (n_samples, shape_repr, ensure_min_samples,
--> 462                                 context))
    463 
    464     if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 165)) while a minimum of 1 is required.
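
The shape (0, 165) in the message means that zero labeled examples reached `classifier.fit`, so one quick check before calling `train` is whether any labeled pairs exist at all. Below is a minimal sketch of such a check. The helper names are hypothetical (they are not part of dedupe), and it assumes the training pairs are held in a dict with 'match' and 'distinct' lists, which is how `self.training_pairs` appears to be used in the traceback above.

```python
# Sketch only: these helpers are hypothetical, not part of dedupe.
# Assumption: training pairs are a dict with 'match' and 'distinct' lists.

def count_labeled_pairs(training_pairs):
    """Return the number of labeled record pairs under each label."""
    return {label: len(pairs) for label, pairs in training_pairs.items()}


def has_training_data(training_pairs):
    """True only if at least one labeled pair exists under any label."""
    return sum(count_labeled_pairs(training_pairs).values()) > 0


# No pairs under either label: fitting on this would give an empty array.
empty = {"match": [], "distinct": []}

# Made-up example records, one pair per label.
labeled = {
    "match": [({"cmp_name": "Acme Inc"}, {"cmp_name": "ACME Incorporated"})],
    "distinct": [({"cmp_name": "Acme Inc"}, {"cmp_name": "Zenith Ltd"})],
}

print(has_training_data(empty))    # False
print(has_training_data(labeled))  # True
```

With a real deduper, the same check could presumably be run on `deduper.training_pairs` right before `deduper.train(...)`.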

Vinay Babu

Nov 30, 2017, 5:15:03 AM
to open source deduplication
Hi,

Found the solution: the classifier needs to be initialized just after the field definition.

Regards,
Vinay Babu

ilan....@credifi.com

Mar 27, 2018, 5:31:42 AM
to open source deduplication
Vinay,

How did using K-nearest neighbors work out?

What do you mean by, "the classifier needs to be initialized just after the field definition."?

Thank you

Vinay Babu

Mar 27, 2018, 6:28:07 AM
to open-source-...@googlegroups.com
Hi,

After your field definition, you need to create an instance of the Dedupe class from the full list of fields.

Then use that object to override the default classifier (i.e. logistic regression) with the classifier of your choice. Here is the code snippet:

import dedupe
from sklearn.tree import DecisionTreeClassifier

# Field definition
fields = [
    {'field': 'cmp_name', 'type': 'Name'},
    {'field': 'lat_lon', 'type': 'LatLong'},
]

# Create the Dedupe object from the field definition
deduper = dedupe.Dedupe(fields, num_cores=4)

# Override the default classifier (logistic regression)
deduper.classifier = DecisionTreeClassifier(criterion="gini", random_state=100,
                                            max_depth=5, min_samples_leaf=4)


Hope this helps.

Thanks
Vinay Babu
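
When swapping in a classifier this way, it can help to confirm up front that the replacement exposes the methods dedupe will call. The traceback earlier in the thread shows `fit` being called during training. The sketch below additionally assumes that scoring relies on `predict_proba`, which is an assumption rather than something confirmed in this thread. `missing_methods` and the two stub classes are hypothetical and exist only for illustration.

```python
# Sketch only: `missing_methods` is a hypothetical helper, not a dedupe function.
# Assumption: dedupe needs fit() for training and predict_proba() for scoring.
REQUIRED_METHODS = ("fit", "predict_proba")


def missing_methods(classifier, required=REQUIRED_METHODS):
    """Return the names of required methods the classifier does not provide."""
    return [
        name for name in required
        if not callable(getattr(classifier, name, None))
    ]


class StubClassifier:
    """Stand-in with both methods, used here in place of a real estimator."""

    def fit(self, X, y):
        return self

    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


class FitOnly:
    """Stand-in that can be trained but cannot return probabilities."""

    def fit(self, X, y):
        return self


print(missing_methods(StubClassifier()))  # []
print(missing_methods(FitOnly()))         # ['predict_proba']
```

An empty list suggests the object is at least shaped like a usable replacement; it says nothing about how well the classifier will perform on the data.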
