Comments

Lyscanthrope t1_j9uz4bb wrote

Simple answer: yes, of course! Middle ground: if you have any hyperparameters to choose, you need a validation set! More detailed answer: it most likely depends on the assumptions you have about your data. How you do model selection determines how you estimate model performance (i.e. how you estimate the generalisation error)... A lot of work can go into this! Edit: this is my humble opinion, but one should always think about how to validate performance before modeling... It saves a lot of time. And please, always know your basics (statistics-wise).
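
For illustration, a minimal sketch of the "hyperparameters need a validation set" point, assuming scikit-learn, synthetic data, and an arbitrary candidate grid (the `X`, `y`, and `C` values are purely illustrative, not from the thread):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative synthetic data; replace with your own X, y.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Three-way split: train for fitting, validation for choosing
# hyperparameters, test for the final generalisation estimate.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:  # hyperparameter candidates
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)  # selection happens on validation data only
    if score > best_score:
        best_C, best_score = C, score

final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("test accuracy:", final_model.score(X_test, y_test))  # reported once, at the end
```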

9

Maximum-Ruin-9590 t1_j9uzp6m wrote

My coworker has just one dataset and does cross validation, tuning and comparing on train. He gets pretty good metrics that way.

4

cthorrez t1_j9xahu6 wrote

I just have one dataset too. I train, pick hyperparameters, and test on the same data. Nobody can get better metrics than me. :D

4

Maximum-Ruin-9590 t1_j9v03zg wrote

As mentioned, you need validation sets, i.e. some kind of folds, for most things in ML: cross-validation and tuning, just to name a couple. It is also smart to use folds to compare different models with each other.

2

osedao OP t1_j9v4ip5 wrote

Yeah, it makes sense to test models on folds they've never seen. But I have a small dataset, so I'm trying to find the best practice.

1

Additional-Escape498 t1_j9vqmlh wrote

For a small dataset, still use cross-validation, but use k-fold cross-validation so you don't have to divide the dataset into 3 parts, just into 2; the k-fold procedure then subdivides the training set for you. Sklearn has a class built for this already that makes it simple. Since you have a small dataset and are using fairly simple models, I'd suggest setting k >= 10.
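
A rough sketch of that setup with scikit-learn's `KFold` and `cross_val_score` (the estimator and the synthetic `X`, `y` are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)  # placeholder data

# Two-way split: the held-out test set is only touched at the very end.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# k >= 10 folds on the training portion; KFold subdivides it for us.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=cv)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Fit on the full training split and report the test score once.
final_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", final_model.score(X_test, y_test))
```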

3

osedao OP t1_j9wa0a1 wrote

Thanks for the recommendations! I’ll try this

2

BrohammerOK t1_j9wvrl7 wrote

You can work with 2 splits, which is common practice. For a small dataset, you can use 5- or 10-fold cross-validation with shuffling on 75-80% of the dataset (train) for hyperparameter tuning / model selection, fit the best model on the entirety of that set, and then evaluate/test on the remaining 25-20% that you held out. You can repeat the process multiple times with different seeds to get a better estimate of the expected performance, assuming that the input data at inference time comes from the same distribution as your dataset.
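
A hedged sketch of that workflow using scikit-learn's `GridSearchCV` (the SVC estimator, the parameter grid, and the number of seed repeats are assumptions for illustration, not values from the comment):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)  # placeholder data
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}                # illustrative grid

test_scores = []
for seed in range(5):  # repeat with different seeds for a more stable estimate
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)

    # 5-fold CV with shuffling on the 80% train split for model selection.
    search = GridSearchCV(
        SVC(), param_grid, cv=KFold(n_splits=5, shuffle=True, random_state=seed))
    search.fit(X_train, y_train)

    # GridSearchCV refits the best model on the whole train split by default,
    # so we can evaluate it directly on the held-out 20%.
    test_scores.append(search.score(X_test, y_test))

print("expected accuracy: %.3f +/- %.3f" % (np.mean(test_scores), np.std(test_scores)))
```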

1

BrohammerOK t1_j9ww6yx wrote

If you want to use something like early stopping, though, you'll have no choice but to use 3 splits.
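
A minimal sketch of why: with early stopping, the validation split decides when to stop, so a third, untouched split is needed for the final estimate. The manual loop below uses `SGDClassifier.partial_fit` and a made-up patience value; everything here is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)  # placeholder data

# Three splits: train for the updates, validation for the stopping decision,
# test for the final (untouched) performance estimate.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

clf = SGDClassifier(random_state=0)
best_val, patience, stale = -1.0, 5, 0
for epoch in range(100):
    clf.partial_fit(X_train, y_train, classes=[0, 1])
    val_score = clf.score(X_val, y_val)
    if val_score > best_val:
        best_val, stale = val_score, 0
    else:
        stale += 1
    if stale >= patience:  # validation score stopped improving: stop early
        break             # (a full version would also checkpoint the best model)

print("test accuracy:", clf.score(X_test, y_test))
```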

1

Kroutoner t1_j9vzb0b wrote

There are scenarios where you would be totally fine not using a validation set, or even any sort of sample splitting whatsoever, but you definitely need to know what you’re doing and know why it’s okay that you’re not using them. If you can’t provide an explicit justification for why it’s okay you’re probably best off using a validation set.

1

osedao OP t1_j9waf9t wrote

Could this approach be enough to justify not using validation: I have 8 features, and if each of these features has the same distribution in both the training and test sets, would that be enough?

1

Kroutoner t1_j9wfyz1 wrote

This does not seem like suitable justification.

2

28Smiles t1_j9y2fw4 wrote

At least use an ensemble and cross-validate; that way you get at least some meaningful results, but you are still in danger of overfitting.

1

qalis t1_j9y4c1m wrote

Yes, absolutely, for any size of dataset and model this is strictly necessary. You can use cross-validation, Leave-One-Out CV, or bootstrap techniques (e.g. 0.632+ bootstrap). You don't need to validate if you don't have any hyperparameters, but that is very rarely the case; the only examples I can think of are Random Forest and Extremely Randomized Trees, where a sufficiently large number of trees is typically enough.
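
For instance, Leave-One-Out CV is a one-liner in scikit-learn (a minimal sketch on synthetic data; the estimator is an arbitrary choice, and the 0.632+ bootstrap is not shown since sklearn has no built-in implementation for it):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, n_features=8, random_state=0)  # placeholder data

# Leave-One-Out CV: n_samples folds, each holding out a single example.
# Affordable here only because the dataset and model are both small/simple.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("LOO accuracy:", scores.mean())
```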

1

hellrail t1_j9wnk66 wrote

Wtf, of course man, you also need one even if you fit y = ax + b, dude.

−2