I am testing Bayesian optimization for hyperparameter tuning. To do this, I split my data into three separate sets: one for training, a second (validation) for hyperparameter tuning, and a third (test) for computing the final metrics.
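For concreteness, the split I mean looks roughly like the sketch below. The function name `three_way_split`, the 60/20/20 proportions, and the fixed seed are illustrative assumptions, not my exact setup; it only shows that the three sets are disjoint and that the third one is never used during tuning.

```python
import random


def three_way_split(n_samples, frac_train=0.6, frac_val=0.2, seed=0):
    """Return disjoint (train, val, test) index lists covering range(n_samples).

    The fractions are hypothetical defaults; the test set gets whatever
    remains after the training and validation sets are carved out.
    """
    indices = list(range(n_samples))
    # Shuffle with a local, seeded generator so the split is reproducible.
    random.Random(seed).shuffle(indices)

    n_train = round(frac_train * n_samples)
    n_val = round(frac_val * n_samples)

    train = indices[:n_train]                 # used to fit the model
    val = indices[n_train:n_train + n_val]    # used by the hyperparameter search
    test = indices[n_train + n_val:]          # used once, for the final metrics
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The hyperparameter search only ever sees scores computed on `val_idx`; `test_idx` is evaluated once, after the search has finished.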
I noticed that on some benchmark datasets the accuracy on the test set is very low while it is very high on the first two sets, as if the method had badly overfitted the validation set. Should I also monitor test-set accuracy while optimizing the hyperparameters and stop early when it starts to drop? Wouldn't that break the rule of not touching the test set? Or should I create an additional set just for monitoring validation-set overfitting, and then report the final accuracy on a fourth set?