
Best Practices in Machine Learning

I was lucky to read two wonderful articles (one of them shared by @Boke89707488 Dr. Bo Ke) on best practices for using machine learning.

The first one is “Establishment of Best Practices for Evidence for Prediction” by Russell A. Poldrack, PhD; Grace Huckins, MSc; Gael Varoquaux, PhD.

Best Practices for Predictive Modeling:

“In-sample model fit indices should not be reported as evidence.”

Always validate your model on held-out test data it has never seen, and report that performance.
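As a quick illustration, here is a minimal sketch (assuming scikit-learn and a synthetic dataset) of the difference between the in-sample fit and the held-out performance that should actually be reported:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# In-sample fit: optimistic, should not be reported as evidence
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
# Held-out performance: this is the number to report
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```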

“The cross-validation procedure should encompass all operations applied to the data. In particular, predictive analyses should not be performed on data after variable selection if the variable selection was informed to any degree by the data themselves (ie, post hoc cross-validation). Otherwise, estimated predictive accuracy will be inflated owing to circularity.”

This point is explained further at the end of this post.
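In the meantime, here is a minimal sketch (assuming scikit-learn; the dataset is synthetic) of how to keep variable selection inside the cross-validation loop by wrapping it in a Pipeline, so the selection is refit on the training folds only:

```python
# Selecting features on the full data first and then cross-validating
# inflates the estimated accuracy (circularity). Putting the selection
# inside a Pipeline avoids this.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=1000,
                           n_informative=10, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),   # fit only on training folds
    ("clf", LogisticRegression(max_iter=1000)),
])

# The whole pipeline, selection included, is refit inside each fold
scores = cross_val_score(pipe, X, y, cv=5)
print("unbiased CV accuracy:", scores.mean())
```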

“Prediction analyses should not be performed with samples smaller than several hundred observations.”

This point is explained further at the end of this post.
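One way to see why small samples are a problem is to look at how wide the uncertainty of an accuracy estimate is at different sample sizes. A rough sketch (the 0.75 accuracy value is just an assumed illustration) using a normal approximation to the binomial confidence interval:

```python
import math

true_acc = 0.75
for n in (20, 50, 100, 300, 1000):
    # Approximate 95% binomial confidence half-width for the accuracy estimate
    half_width = 1.96 * math.sqrt(true_acc * (1 - true_acc) / n)
    print(f"n={n:5d}  accuracy = {true_acc:.2f} ± {half_width:.2f}")
```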

“Multiple measures of prediction accuracy should be examined and reported. For regression analyses, measures of variance, such as R2, should be accompanied by measures of unsigned error, such as mean squared error or mean absolute error. For classification analyses, accuracy should be reported separately for each class, and a measure of accuracy that is insensitive to relative class frequencies, such as area under the receiver operating characteristic curve, should be reported.”

Definitely.
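A minimal sketch (assuming scikit-learn and a synthetic, imbalanced dataset) of reporting per-class accuracy together with a frequency-insensitive metric such as ROC AUC; for regression, r2_score would be paired with mean_absolute_error in the same spirit:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, recall_score

# Imbalanced two-class problem (80% / 20%)
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
prob = clf.predict_proba(X_te)[:, 1]

# Per-class accuracy (recall of each class) plus a class-frequency-insensitive metric
print("class 0 accuracy:", recall_score(y_te, pred, pos_label=0))
print("class 1 accuracy:", recall_score(y_te, pred, pos_label=1))
print("ROC AUC:", roc_auc_score(y_te, prob))
```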

“The coefficient of determination should be computed by using the sums-of-squares formulation rather than by squaring the correlation coefficient.”

Sure.
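A small sketch (assuming numpy and scikit-learn) showing why the two formulations are not the same: with biased or rescaled predictions, the squared correlation stays high while the sums-of-squares R2 can even go negative:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_true = rng.normal(size=200)
# Predictions that track the target but are biased and rescaled
y_pred = 2 * y_true + 1 + rng.normal(scale=0.5, size=200)

# Sums-of-squares R^2: 1 - SS_res / SS_tot (penalizes the bias and scale)
print("sums-of-squares R^2:", r2_score(y_true, y_pred))

# Squared Pearson correlation: misleadingly high for the same predictions
print("squared correlation:", np.corrcoef(y_true, y_pred)[0, 1] ** 2)
```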

“k-fold cross-validation, with k in the range of 5 to 10, should be used rather than leave-one-out cross-validation because the testing set in leave-one-out cross-validation is not representative of the whole data and is often anti-correlated with the training set.”

More considerations apply here; see the sketch below, and the discussion further down, for a practical way of running cross-validation.
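For reference, a minimal sketch (assuming scikit-learn and a synthetic dataset) contrasting 5-fold cross-validation with leave-one-out on the same data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Recommended: k-fold CV with k between 5 and 10
kfold_scores = cross_val_score(
    clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
# Leave-one-out: each test "set" is a single observation
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

print("5-fold CV accuracy:", kfold_scores.mean())
print("LOOCV accuracy:", loo_scores.mean())
```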

The second article is “Machine learning algorithm validation with a limited sample size” by Andrius Vabalas, Emma Gowen, Ellen Poliakoff, and Alexander J. Casson.

I do love the figure in this article, which shows exactly how to do cross-validation.

Figure: from work by Andrius Vabalas, Emma Gowen, Ellen Poliakoff, and Alexander J. Casson.

More importantly, we should remember this:

“Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size.”
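A minimal sketch (assuming scikit-learn; the data and parameter grid are only placeholders) of nested cross-validation, where an inner loop tunes hyperparameters and an outer loop estimates performance, so the reported score is not inflated by the tuning:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)

# Inner loop: hyperparameter search on the training portion of each outer fold
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: performance estimate of the whole tuning-plus-fitting procedure
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print("nested CV accuracy:", nested_scores.mean())
```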

References:

Poldrack RA, Huckins G, Varoquaux G. Establishment of Best Practices for Evidence for Prediction: A Review. JAMA Psychiatry. 2020;77(5):534–540. doi:10.1001/jamapsychiatry.2019.3671

Vabalas A, Gowen E, Poliakoff E, Casson AJ (2019) Machine learning algorithm validation with a limited sample size. PLoS ONE 14(11): e0224365. https://doi.org/10.1371/journal.pone.0224365