Test of linearity of data

Rupali

Jul 31, 2014, 8:04:10 AM
to wekamooc...@googlegroups.com
How can I find out whether a given dataset is linearly separable? The dataset I have to test this on is multiclass, with nearly 20 classes and 40 attributes.

Thanks in advance.

Gabriel Santos

Jul 31, 2014, 8:22:48 AM
to wekamooc...@googlegroups.com
Hi Rupali,

The easiest way is to use the Visualization > BoundaryVisualizer. If the data is linearly separable then you will see it on the graph.

BR,
Gabriel Santos
Macau
Community TA

Rupali

Jul 31, 2014, 9:02:53 AM
to wekamooc...@googlegroups.com
Dear Gabriel,

When I try to load the dataset into the Weka BoundaryVisualizer, it gives me this error:
Can't load at this time, currently busy with some other i/o.

What could be the problem behind it? 

Gabriel Santos

Jul 31, 2014, 10:43:21 AM
to wekamooc...@googlegroups.com
Hi Rupali,

It may not actually work with the BoundaryVisualizer, because it only accepts 2 attributes as input. You have 40 attributes; I mistakenly thought you had 40 instances.

Anyway, you can use the "Explorer" to load the data and view it in the "Visualize" tab.

Santos

Rupali

Jul 31, 2014, 2:04:52 PM
to wekamooc...@googlegroups.com

By looking at the data in the Visualize tab of the Explorer, how can I tell whether my data is linearly separable?
Do you mean that I have to run some algorithm and then check the different plots in the Visualize tab?

Thanks,
Rupali

Gabriel Santos

Jul 31, 2014, 9:35:44 PM
to wekamooc...@googlegroups.com
You can ignore the first row and the last column, because they plot the class against each attribute.
You can also ignore the diagonal from the top-right to the bottom-left.
Now, if you see boxes where the colours form distinct clusters, and you can draw an imaginary line separating those clusters, then the classes are linearly separable.

Gabriel Santos

Jul 31, 2014, 10:29:43 PM
to wekamooc...@googlegroups.com
I'm assuming the attributes are related in such a way that you can plot the data on a 2D graph. If your data live in a hyperspace, then forget the graphs, because no graphical tool can show you a hyperplane beyond 3D.

In this case you can use SVM (e.g. SMO) to test the linear separability of the data.

I have attached a file for you to play around with and test the different parameters of SMO. It is always easier to start with a smaller sample size. What you can do is the following:

- Load the dataset
- Use SMO as your classifier
- Change the filterType to "No normalization/standardization" (this keeps the coefficients in the original scale so you can plot them in a different graphing application)
- Click the name of the kernel, "PolyKernel", and change the exponent value from 1 to 2 (this forces the use of support vectors)
- Click Start to run it
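Outside of Weka, you can also test binary linear separability programmatically: the classical perceptron converges in a finite number of updates exactly when the two classes are linearly separable. Below is a minimal, self-contained sketch (illustrative only — this is not the SMO procedure above, and the function and data names are made up; for a multiclass dataset you would run it per pair of classes or one-vs-rest):

```python
def perceptron_separable(points, labels, max_epochs=1000):
    """Return True if a separating hyperplane is found.

    points: sequence of feature tuples; labels: +1 / -1.
    The perceptron converges in finitely many updates iff the binary
    data is linearly separable; max_epochs caps the search, so False
    means "no separator found within the budget".
    """
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:          # misclassified -> update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                updated = True
        if not updated:                      # a full pass with no errors
            return True
    return False

# AND-like data is linearly separable; XOR is the classic counterexample.
print(perceptron_separable([(0, 0), (0, 1), (1, 0), (1, 1)], [-1, -1, -1, 1]))  # True
print(perceptron_separable([(0, 0), (0, 1), (1, 0), (1, 1)], [-1, 1, 1, -1]))   # False
```

Note that a False result is only conclusive up to the epoch budget; for noisy real-world data, the SMO route above is the more practical test.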

------------------------

Look for this output:

BinarySMO

 -       0.0004 * <9 8 > * X]
 -       0.0002 * <6 10 > * X]
 +       0.0006 * <7 6 > * X]
 +       3.7699

------------------------

Translate this output into something you can plot in a different graphing application:

-0.0004*(9x+8y)^2 - 0.0002*(6x+10y)^2 + 0.0006*(7x+6y)^2 + 3.7699

------------------------

Copy the above equation and add the string "plot z=" in front, like the following

plot z = -0.0004*(9x+8y)^2 - 0.0002*(6x+10y)^2 + 0.0006*(7x+6y)^2 + 3.7699

Now copy that line and paste it into the Google search bar.

You should now see a 3D graph.

The curve where the surface intersects the z=0 plane is the decision boundary; above it you have one class and below it you have the other.
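You can also sanity-check the translated equation without any plotting tool by evaluating the decision function numerically — its sign tells you which side of the decision boundary a point falls on. A small sketch using the coefficients from the sample output above (the test points are arbitrary):

```python
def f(x, y):
    # Decision function translated from the sample BinarySMO output above
    # (terms are squared because the PolyKernel exponent was set to 2).
    return (-0.0004 * (9 * x + 8 * y) ** 2
            - 0.0002 * (6 * x + 10 * y) ** 2
            + 0.0006 * (7 * x + 6 * y) ** 2
            + 3.7699)

print(f(0, 0))     # 3.7699 -> positive side: one class
print(f(100, 0))   # negative side: the other class
```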

Have fun!
AB.arff

Gabriel Santos

Jul 31, 2014, 11:05:38 PM
to wekamooc...@googlegroups.com
Moreover, if your data is really linearly separable and you have a label for each instance, you should get no errors in the confusion matrix.
In that case you have a perfect dataset in which the agglomerations are truly separable. If the agglomerations have some overlapping areas, then you may still be able to separate them linearly, but you cannot rely on the confusion matrix for an immediate conclusion.

Rupali

Jul 31, 2014, 11:15:47 PM
to wekamooc...@googlegroups.com
Thanks Gabriel for your suggestions, I will surely try SMO as you suggested.
One thing I want to know: how do I know that my data exists in a hyperspace? The plotting happens between every pair of attributes, so there are multiple plots, i.e. n*n plots for n attributes.

Rupali

Jul 31, 2014, 11:27:56 PM
to wekamooc...@googlegroups.com
Does this conclusion about linear separability from the confusion matrix apply to SMO only, or to some other algorithms as well?

Moreover, this means that the accuracy of the algorithm should be 100%, which is rarely possible. If the agglomerations have overlapping areas, should I conclude that the data is not linearly separable, or is there still a chance of applying the same algorithm (or something else) in the hope of separating the data linearly at some point?

Gabriel Santos

Aug 1, 2014, 12:40:13 AM
to wekamooc...@googlegroups.com
Most likely your data lives in a hyperspace. Say you have 4 attributes; then your dataset can live in a 4-dimensional space. If your 40 attributes are all independent and cannot be merged or transformed into fewer, then your dataset lives in a 40-dimensional space. You cannot use lines to separate the different classes, because a line only separates data in 2 dimensions; you must use hyperplanes. The problem is that a hyperplane cannot be drawn on a graph.
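Concretely, a separating hyperplane in n dimensions is just a weight vector and a bias, and classification is the sign of a dot product — the same formula works in 2, 3, or 40 dimensions, even though only the low-dimensional cases can be drawn. A tiny sketch (the weights and points here are made up for illustration):

```python
def classify(w, b, x):
    """Sign of the hyperplane decision function w.x + b.

    Works in any number of dimensions: a line is the 2-D case,
    a plane the 3-D case, a hyperplane everything above that.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

# A hypothetical hyperplane in 4-D (weights chosen arbitrarily):
w, b = [1.0, -2.0, 0.5, 3.0], -1.0
print(classify(w, b, [1, 0, 0, 0]))   # score = 1 - 1 = 0 -> -1
print(classify(w, b, [1, 0, 0, 1]))   # score = 1 + 3 - 1 = 3 -> 1
```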

About the 100% accuracy, your understanding is correct. But in the real world this kind of clean separation is hard to find. If your data is 100% separable, then all you need to do is pick a classifier suitable for your dataset's class type (nominal vs. numeric) and apply it.

The problem is, if your data is real-world data, then you have much more work to do.

About your question on how to separate data that is not 100% separable: don't worry, because the tools we have today are not limited to the perceptron. Even SMO can be configured to use a kernel so that it can separate data that is not 100% linearly separable.

One more thing for your consideration: even if your data can be plotted on a 2D graph and the different classes visually form different agglomerations, it doesn't mean you can always separate them with a line. Take an egg as an example: the yolk is one class and the white is another class. Now ask yourself, can you use a chopstick to separate the two classes, i.e. put the yolk in one place and the white in another? The answer is no, which means this kind of data is distinguishable but not linearly separable. But don't worry, because scientists invented logistic regression to cope with these kinds of problems. If the problem is more complex, say multiclass, then you can use One-vs-All logistic regression classification.
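For the multiclass case, the one-vs-all scheme trains one binary logistic-regression model per class and predicts the class whose model gives the highest score. Here is a self-contained sketch (plain stochastic gradient descent on a made-up toy dataset — an illustration of the idea, not Weka's implementation):

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic(X, y, lr=0.5, epochs=500):
    """Stochastic-gradient logistic regression (binary, labels 0/1)."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - t                      # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def one_vs_all(X, labels):
    """One binary model per class; predict the class with the highest score."""
    classes = sorted(set(labels))
    models = {c: train_logistic(X, [1 if l == c else 0 for l in labels])
              for c in classes}
    def predict(x):
        def score(c):
            w, b = models[c]
            return sum(wi * xi for wi, xi in zip(w, x)) + b
        return max(classes, key=score)
    return predict

# Toy 3-class data: clusters near (0,0), (5,0) and (0,5).
X = [(0, 0), (1, 0), (0, 1), (5, 0), (6, 0), (5, 1), (0, 5), (0, 6), (1, 5)]
y = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']
predict = one_vs_all(X, y)
print(predict((0.5, 0.5)), predict((5.5, 0.5)), predict((0.5, 5.5)))
```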

So we are very fortunate, because scientists have invented many tools to help us cope with real-world data separation problems.
One thing is for sure: you must know the data you have, you must clean them (n.b. "data" is a plural noun; its singular form, "datum", is rarely used), and make them slimmer.

BR,