Active learning, choosing next c% of training data


dhaaraa.darji

Mar 4, 2013, 12:15:23 PM
to cs6...@googlegroups.com
We start the same boosting, but on 5% of the data, and each round
we need to choose new points from the remaining 95% (the untouched
pool) that have minimum abs(confidence).

What I don't understand: for the next round at 10%, do we pick the
5% of pool points with lowest abs(confidence) and merge them with
the original training set, or build a completely new training set
from scratch? Building from scratch doesn't make sense to me.
I'm thinking of just sorting abs(confidence) over the pool, picking
the first 5%, and merging with the previous training set each time.

Joseph Burley

Mar 4, 2013, 1:53:09 PM
to cs6...@googlegroups.com
I was thinking that on each iteration of active learning we would pick one data point from the untouched data set to add to the training set. But how do you update the distribution over your training set? Do you start training all over again (throwing away the results of previous iterations), or do you assign the new data point a weight (maybe 1/m) and renormalize to get a new distribution?

Also, how do you determine how close the data points are to the separation surface without running them through the current F(x) (where F(x) = SUM_t[ alpha_t * h_t(x) ])? Can we do this over the untouched data set to find the point closest to the separation surface? Is there another way to measure closeness? If the separation surface were a line it would seem trivial, but since F(x) is the result of a bunch of decision tree splits, it is not intuitive to me how you would determine "closeness" without computing the alpha_t terms.

Finally, is it true that alpha_t approaches 0 as the uncertainty increases?

Thanks.

dhaaraa.darji

Mar 4, 2013, 2:02:57 PM
to cs6...@googlegroups.com
For the 1st:
Yeah, I guess you should start all over again, with a fresh
distribution over the new m points. Not sure though!
----
About confidence: it is F(x) = SUM_t[ alpha_t * h_t(x) ]
evaluated on the test/pool set.
What I'm doing is:

confidence = [abs(SUM_t[alpha_t * h_t(x)]) for x in test_data]

then sort it in increasing order, so you get the points
closest to the separation surface.

SUM_t[alpha_t * h_t(x)] is the margin, and we predict according
to its sign, so its magnitude works as a confidence.
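
In code, that selection step might look like the sketch below. This is just an illustration: `stumps` and `alphas` are assumed to come from an earlier boosting run (each stump maps a point to +1 or -1), and all names are hypothetical.

```python
def margin(x, stumps, alphas):
    """F(x) = sum_t alpha_t * h_t(x): the sign is the prediction,
    the magnitude is the confidence."""
    return sum(a * h(x) for a, h in zip(alphas, stumps))

def pick_least_confident(pool, stumps, alphas, frac=0.05):
    """Return the indices of the frac of pool points with the smallest
    |F(x)|, i.e. the points closest to the separation surface."""
    conf = [(abs(margin(x, stumps, alphas)), i) for i, x in enumerate(pool)]
    conf.sort()  # increasing |F(x)|: least confident first
    k = max(1, int(frac * len(pool)))
    return [i for _, i in conf[:k]]
```

The selected indices would then be moved from the pool into the training set before the next boosting run.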


Joseph Burley

Mar 4, 2013, 5:27:20 PM
to cs6...@googlegroups.com
Thanks, this is very helpful!

Joseph Burley

Mar 6, 2013, 9:42:45 PM
to cs6...@googlegroups.com
Does anybody else have their epsilon_t's converging to 1/2 on the required data sets? I know the prof said this could happen, but I didn't expect it to happen so quickly.
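
A minimal sketch of why epsilon_t near 1/2 stalls things, assuming the standard AdaBoost round weight alpha_t = 0.5 * ln((1 - eps_t) / eps_t) (which also relates to the earlier question about alpha_t going to 0):

```python
# Standard AdaBoost round weight as a function of the weighted error
# eps_t. As eps_t -> 1/2, alpha_t -> 0, so a round whose weak learner
# is barely better than chance adds almost nothing to F(x).
import math

def alpha(eps):
    """alpha_t = 0.5 * ln((1 - eps) / eps) for weighted error 0 < eps < 1."""
    return 0.5 * math.log((1 - eps) / eps)

# alpha shrinks toward 0 as the error approaches 1/2:
# alpha(0.10) ~ 1.10, alpha(0.30) ~ 0.42, alpha(0.49) ~ 0.02
```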

I think this is directly related to how the predictors are chosen. Should we choose the predictor for each split once, initially, so that there is always a + and a - group, or should we update the predictors each iteration using the weights of the data points on each side of the split?

For example (weight and label of each point, per side of the split):

         left        right
    1/3   +     1/12  -
    1/12  -     1/3   +
    1/12  -     1/12  +

If we used the weights to label this split, both left and right would come out +; but if we used an unweighted majority of the labels, left would be - and right would be +.
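
The two labeling rules being compared can be sketched like this (hypothetical helper names; the (weight, label) pairs are the ones from the example split):

```python
def weighted_label(points):
    """Label a leaf by the sign of the total weighted vote."""
    vote = sum(w * y for w, y in points)
    return 1 if vote >= 0 else -1

def majority_label(points):
    """Label a leaf by a simple unweighted majority of the labels."""
    vote = sum(y for _, y in points)
    return 1 if vote >= 0 else -1

# (weight, label) pairs from the split above
left  = [(1/3, +1), (1/12, -1), (1/12, -1)]
right = [(1/12, -1), (1/3, +1), (1/12, +1)]
```

On this split, `weighted_label` gives + on both sides (1/3 outweighs 1/12 + 1/12 on the left), while `majority_label` gives - on the left and + on the right, matching the disagreement described above.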

Any ideas?