
How to tackle different sample size in the training set in SVM


ciaco...@gmail.com

Apr 26, 2017, 10:07:08 AM
I have to train an SVM for a classification problem. I have some strings that are paths in a deterministic finite automaton (DFA). If the alphabet is {0, 1}, then possible strings are, for example, 011101110 or 0110. The purpose of the classifier (SVM) is to correctly predict the label of unseen strings as accepted or rejected (label 1 or label -1, binary classification). The problem is that the strings have different lengths. How can I tackle this problem?

jlad...@itu.edu

Apr 26, 2017, 4:35:19 PM
I just looked up deterministic finite automata and I don't think that, in general, they are a good fit for support vector machines.

I am also not sure that you have described your problem that well. It appears that your input data is variable in length. Is there a limit? DFAs can operate on sequences of arbitrary length. The number of elements in the input vector of an SVM is finite and fixed.

*IF* there is a maximum limit to the input string size in your problem, then you could define a third state for each element which means "no data". Let's say that you never wanted to look at input strings longer than ten binary digits. Give your SVM a ten-element input of floating-point numbers. If your string is "0110", train and test on "0110xxxxxx", where x is distinct from both 0 and 1. If your string is "011101110", encode it as "011101110x".
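As a rough illustration of that padding idea, here is a minimal Python sketch. The function name `encode`, the pad value 2, and the limit of ten are all assumptions made for the example, not anything fixed by the problem; any value distinct from 0 and 1 would serve as the "no data" marker.

```python
# Hypothetical helper: pads a binary string to a fixed length so that
# every sample has the same number of features.
PAD = 2  # assumed "no data" value; anything other than 0 or 1 would do


def encode(s, max_len, pad=PAD):
    """Turn a string over the alphabet {0, 1} into a list of max_len integers."""
    if len(s) > max_len:
        raise ValueError(f"string longer than {max_len} symbols: {s!r}")
    if set(s) - {"0", "1"}:
        raise ValueError(f"string contains symbols outside the alphabet: {s!r}")
    # Real symbols first, then the pad value for the unused positions.
    return [int(ch) for ch in s] + [pad] * (max_len - len(s))


max_len = 10  # assumed upper bound on string length
strings = ["0110", "011101110"]
X = [encode(s, max_len) for s in strings]

for s, row in zip(strings, X):
    print(s, "->", row)
```

Each row of `X` then has exactly ten elements, so the rows could be handed, together with the 1 / -1 labels, to whatever SVM implementation is in use.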

You would use this same strategy if you were training an artificial neural network (ANN) architecture.

You may be interested in recurrent neural networks (RNNs), a machine learning approach designed for sequential data streams. I am just starting to study them myself. The RNN architecture that I am finding most interesting is the "long short-term memory" (LSTM) architecture.

jlad...@itu.edu

Apr 26, 2017, 8:12:38 PM
On Wednesday, April 26, 2017 at 1:35:19 PM UTC-7, I wrote:
> Give your SVM a ten-element input of floating-point numbers.

Eh, I re-read that, and I want to revise it. A ten-element input of any numerical type that isn't a simple Boolean will do. Integers will do. You don't need floats. You just need to be able to define that third "no data" state.

ciaco...@gmail.com

Apr 26, 2017, 8:52:38 PM
This was my idea too, or something similar. I agree that SVMs aren't the best classifier for this problem, but I am required to use them. The data aren't sequential but given in advance, so I can take the longest string as the limit. However, I am skeptical about the results I will get. If you have another idea for tackling this problem, it would be much appreciated. I know that the better approaches are convolutional and recurrent neural networks (deep learning), but I can't use them.
Thanks for your answer