A few problems understanding Adaboost

Question

I am studying the following Adaboost algorithm:

$for (t=1,2,...,T):$

$h_t = argmin_{h\in H} \Sigma_{i=1}^m(D_i^t\cdot1_{[h(x_i)\neq y_i]}) $

$\epsilon_t = \Sigma_{i=1}^m(D_i^t\cdot1_{[h_t(x_i)\neq y_i]})$

$w_t = \frac{1}{2}ln(\frac{1-\epsilon_t}{\epsilon_t})$

$\forall i \in [m]: \hat D^{t+1}_i=D_i^t\cdot exp(-y_i\cdot w_t\cdot h_t(x_i))$

$D_i^{t+1}=\frac{\hat D_i^{t+1}}{\Sigma_{i=1}^m \hat D_i^{t+1}}$

and afterwards our prediction will be:

$h_s(x)=sign(\Sigma_{i=1}^T(w_i\cdot h_i(x)))$

To my understanding, $w_t$'s role is to determine how good $h_t$ is.

If the loss function is low, $w_t$ will be high and $h_s$ will care more about that certain $h_t$.

I got 2 problems with that method. The first regards $D^t$ while calculating $w_t$. It is good for calculating $D^{t+1}$ later, yet that is not what I am talking about. When I want to predict the loss of that function, I want $\epsilon$ to estimate the real loss of the hypothesis.

In my opinion you should calculate the loss of the hypothesis with uniform distribution, which represents the real world as best as possible.

The second problem, which is related to the first one, is: why do we calculate $\epsilon_t$ on the training set??

How come we don't use a test set for estimating the loss function the best we can? Off course a loss function on the training set does not give us as much information as it will on a new set of data.

I know we need those $\epsilon_t$ and $w_t$ for calculating $\hat D^{t+1}_i$, but can we not calculate another set for a better hypothesis of $h_s$?

Edit: I want to clarify the method I suggest:

We take the data and seperate it in 3 parts: train, test1, test2. To make it easier to understand, the training size is m, and the test1 size is n.

$for (t=1,2,...,T):$

$h_t = argmin_{h\in H} \Sigma_{i=1}^m(D_i^t\cdot1_{[h(x_i)\neq y_i]}) $

$\epsilon_t = \Sigma_{i=1}^m(D_i^t\cdot1_{[h_t(x_i)\neq y_i]})$

$w_t = \frac{1}{2}ln(\frac{1-\epsilon_t}{\epsilon_t})$

$\epsilon'_t = \Sigma_{i=1}^n(\frac{1}{n}\cdot1_{[h_t(x_i)\neq y_i]})$

$w'_t = \frac{1}{2}ln(\frac{1-\epsilon'_t}{\epsilon'_t})$

$\forall i \in [m]: \hat D^{t+1}_i=D_i^t\cdot exp(-y_i\cdot w_t\cdot h_t(x_i))$

$D_i^{t+1}=\frac{\hat D_i^{t+1}}{\Sigma_{i=1}^m \hat D_i^{t+1}}$

afterwards our prediction will be:

$h_s(x)=sign(\Sigma_{i=1}^T(w'_i\cdot h_i(x)))$

Thanks a lot

score 0 · Answer 1 · answered May 23 '19 at 20:28

You're training $h_t$ according to $D^t$ which is determined (for some $i$) by the query: $$h_t(x_i)=y_i\Rightarrow D^t_i=(\frac {1-\epsilon_t}{\epsilon_t})^{-\frac 1 2}\\ h_t(x_i)\neq y_i\Rightarrow D^t_i=(\frac {1-\epsilon_t}{\epsilon_t})^{\frac 1 2} $$ Where obviously the lower one is bigger.

If you were to choose $\epsilon_t$, thus choosing $w_t$, according to a given 'test' set, you would most likely get a worse training+test error than if you were to treat your test as another part of the training set.

Remember you're trying to minimise your error for samples outside your given data, so the best thing would be to use as much (not corrupt) data as you have for the training set, as this would give you a better prediction of true distributions and labels.

I do not suggest to change the way you define $D^t$, but I am looking at $w_t$ at the end, on the output. So I think it better to make $w'_t$ with some other data then the training data. Let me edit my question and clarify it — Shaq, May 23 '19 at 20:34

Siong Thye Goh · Accepted Answer · 2019-05-25T18:25:41.363

Great job identifying that Adaboost tend to overfit and in fact, the use of validation set has been proposed in the past for example here.

I inspect your algorithm and it seems to have more potential to generalize better. However, one thing to be careful. Is it possible that you might be overfitting to the validation set instead since the weight on each classifier is determined entirely by the validation set with no sampling involved.

Casting the overfitting concern aside for now. Here is an idea for you. Currently, you chose some $w'_i$ and decided to use them but are you opened to the idea of using alternative weights instead? What you are trying to do seems to be constructing a linear classifier using the $h(x_i)=(h_1(x_i), h_2(x_i), \ldots, h_T(x_i))$ to predict $y_i$ where $x_i$ and $y_i$ are from the validation set.

You could have built other classifier model such as SVM model with regularization using $h$ as the features or logistic regression to predict $y_i$. You can even built a classifier that is more sparse by dropping insignificant predictors. You might want to prepare another validation set or do a $k$-fold validation to choose parameter for this stage.

Thanks for your comment. This article looks interesting, I will keep looking more through it. — Shaq, May 25 '19 at 21:11

A few problems understanding Adaboost

2 Answers2