I had a question about Problem 2.3. Sorry if I'm being really dense, and please correct me at any point if I'm wrong. I'm almost positive I must be reading something incorrectly, but I just can't figure out what it is.
At the start of the question it says that we are dropping the assumption that the samples are linearly separable. So, we are to assume that our input datapoints can be non-separable.
We are asked to show a bound on the number of updates the perceptron makes. However, if the data isn't linearly separable, the perceptron will never converge, since no w correctly classifies all of the datapoints. So if we drop the separability assumption, the bound we must prove has to be a lower bound: the number of updates cannot be bounded from above, because the updates never stop.
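To illustrate what I mean, here is a tiny sketch (the dataset and update rule below are my own toy example, not from the problem set): two identical points with opposite labels are trivially non-separable, so the standard perceptron keeps making updates in every pass over the data.

```python
# Toy non-separable dataset: the same x with opposite labels.
# No w can classify both correctly, so updates never stop.
data = [(1.0, +1), (1.0, -1)]  # (x, y) pairs

w = 0.0
updates = 0
for epoch in range(100):       # "forever", capped for the demo
    for x, y in data:
        if y * w * x <= 0:     # misclassified (or on the boundary)
            w += y * x         # standard perceptron update
            updates += 1

print(updates)  # grows linearly with the number of epochs -- no convergence
```

Every epoch produces exactly two mistakes here (w oscillates between 0 and 1), so the update count grows without bound, which is why I think no finite upper bound can exist in general.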
However, my intuition suggests that this cannot be right, and that we must prove an upper bound (even the form of the bound suggests this). Plugging in the separable setup from the top (i.e., u = w, \rho = \rho) appears to result in a division by zero: d_i = max(0, \rho - y_i <w, x_i>) = 0, since \rho \le y_i <w, x_i> for every i, and so \delta = [\sum_{i=1}^n 0]^{-1/2} = 1/sqrt(0). I assume, though, that you can increase \rho ever so slightly by some \epsilon so that one point has a very small deviation; this changes \rho only infinitesimally, but it makes \delta non-zero. In that case the lower bound would effectively coincide with the upper bound we proved before, which seems problematic: the number of updates is an integer, but the bound has no such restriction, so the update count would be squeezed into this tiny interval. A modified upper bound makes more sense to me: if we don't have the separating w, we could take some arbitrary u and ask how long the perceptron takes to find a separating w. But that doesn't make sense either if the data isn't separable.
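To make the division-by-zero concrete, here is a small sketch with made-up numbers (the dataset, w, and choice of \rho are my own toy example, not from the problem set): with a separating w and \rho set to the attained margin, every d_i = 0, and nudging \rho up by \epsilon gives the tiny non-zero deviations I mean.

```python
# Hypothetical separable toy data and a separating w (my own numbers,
# just to illustrate the d_i = 0 issue described above).
data = [((1.0, 0.0), +1), ((-1.0, 0.0), -1), ((2.0, 1.0), +1)]
w = (1.0, 0.0)

# Attained margins y_i <w, x_i> for each point.
margins = [y * (w[0] * x[0] + w[1] * x[1]) for (x, y) in data]
rho = min(margins)  # the largest rho for which all d_i = 0

# d_i = max(0, rho - y_i <w, x_i>): all zero, so 1/sqrt(sum d_i) blows up.
d = [max(0.0, rho - m) for m in margins]
print(d)  # all zeros

# Nudge rho up by a tiny epsilon: the minimum-margin points pick up
# a very small but non-zero deviation, so delta becomes finite.
eps = 1e-9
d_eps = [max(0.0, (rho + eps) - m) for m in margins]
print(d_eps)  # tiny positive deviations on the minimum-margin points
```

This is what makes me suspect the resulting bound degenerates to (essentially) the separable one as \epsilon shrinks, which is the tension I can't resolve.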
Would someone be able to clarify this for me?
Again, my apologies if I'm bad at reading.
Thanks,
David