Completed on 28 Jun 2017 by Palle Villesen. Sourced from http://biorxiv.org/content/early/2017/06/18/146340.
Author response is in blue.
Dear authors - interesting work!
What about overfitting/data dredging in your work? "The reported result of assessment is based on the average f-measure for the 10-folds for testing dataset."
When you go from genes to isoforms you also increase the number of predictor variables, which makes overfitting more possible (though not necessarily more likely).
I couldn't see the variance of these f-measures across the CV folds; a very high variance is normally a signature of overfitting.
For a full analysis I would suggest you split your datasets into a training set (MCC or F estimated by CV on this set) and a validation set (MCC or F estimated by fitting the final model to the full training set and evaluating on this set). This is very close to what is done in Kaggle competitions etc., where you measure your performance yourself (internal performance) but also need to predict on new data (external performance). If these two measures are very different, the chosen model is not good.
Check "Comparison of RNA-seq and microarray-based models for clinical endpoint prediction". The problem is that when you use CV to compare and select the best model, you may end up with the model that accidentally fits your dataset best under CV (data dredging). So basically you would like to see a good correlation between training performance (internal performance) and validation performance (external performance), and only use internal performance to rank models/parameters.
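The internal- vs external-performance check suggested above can be sketched roughly as follows. This is a toy illustration with a made-up one-feature threshold classifier and plain accuracy; a real analysis would use the study's own classifiers and F-measure/MCC instead.

```python
# Sketch of the internal- vs external-performance check described above.
# All data and the "model" here are toy placeholders, not the paper's setup.
import random

random.seed(0)

def make_sample():
    """Toy 2-class sample: the label shifts the feature's mean to +2 or -2."""
    label = random.randint(0, 1)
    feature = random.gauss(2.0 if label else -2.0, 1.0)
    return feature, label

def fit_threshold(data):
    """'Model' = midpoint between the two class means."""
    mean0 = sum(x for x, y in data if y == 0) / max(1, sum(1 for _, y in data if y == 0))
    mean1 = sum(x for x, y in data if y == 1) / max(1, sum(1 for _, y in data if y == 1))
    return (mean0 + mean1) / 2

def accuracy(threshold, data):
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)

dataset = [make_sample() for _ in range(200)]
train, validation = dataset[:150], dataset[150:]

# Internal performance: 10-fold CV on the training split only.
k = 10
fold_size = len(train) // k
fold_scores = []
for i in range(k):
    held_out = train[i * fold_size:(i + 1) * fold_size]
    rest = train[:i * fold_size] + train[(i + 1) * fold_size:]
    fold_scores.append(accuracy(fit_threshold(rest), held_out))
internal = sum(fold_scores) / k

# External performance: final model fit on the full training split,
# evaluated once on the untouched validation split.
external = accuracy(fit_threshold(train), validation)

print(f"internal (CV) = {internal:.3f}, external (validation) = {external:.3f}")
```

If the two printed numbers diverge strongly, the model selection has likely overfitted to the training split.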
(In response to Palle's comment)
Thank you very much for your interest in our paper. I apologize for my very delayed response, but I was working at an internship over the summer, so my focus was elsewhere. Concerning your question of whether cross-validation is a valid assessment protocol for our analysis, I will answer from two different perspectives.
Overfitting is of course a concern anyone should have when critiquing machine-learning-based models. If a model is overfitted to a given dataset, it cannot be used with any dataset other than the one it was developed on.
The first perspective comes from the general machine-learning protocol: what is the difference between cross-validation, training, and testing? The Kaggle competitions and the paper you mention use training and testing to validate the models generated through training. The idea is to split the dataset into three sections: training, testing, and validation. First build the model on the training set, then test it on the testing set, fine-tuning the model through optimization to achieve the best result. Finally, use the validation set (which can be a separate split of the same dataset or, in the best scenario, a completely different dataset) to check whether the developed model is appropriate. The model never sees either the testing or the validation data during fitting; the key difference between the two is whether optimization was performed against them, or whether they come from a different dataset.
10-fold cross-validation is a stricter protocol than a single testing/validation split because it splits the data 10 times to assess how much the generated model varies. If overfitting were occurring, some folds would score drastically differently from the others. From one perspective, a training/testing/validation protocol could be considered 1- or 2-fold cross-validation. Bringing in a new dataset would be an additional step to validate any models generated.
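The variance-across-folds check described above can be sketched as follows. The per-fold confusion counts are hypothetical placeholders, not values from the paper; the point is only how the fold-to-fold spread is computed.

```python
# Minimal sketch of the variance-across-folds check. The per-fold
# confusion counts below are made-up placeholders, not paper results.
from statistics import mean, stdev

def f_measure(tp, fp, fn):
    """Standard F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical (tp, fp, fn) counts from each of 10 folds.
folds = [(97, 2, 3), (96, 3, 4), (98, 1, 2), (95, 4, 5), (97, 3, 3),
         (96, 2, 4), (98, 2, 2), (97, 1, 3), (96, 3, 3), (97, 2, 4)]

scores = [f_measure(tp, fp, fn) for tp, fp, fn in folds]
print(f"mean F = {mean(scores):.4f}, sd = {stdev(scores):.4f}")
# A large sd relative to the difference between competing models would
# be the overfitting signature the reviewer asks about.
```

A fold whose score sits far from the others would show up immediately in this spread.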
The second perspective is based on the purpose of this paper, which is to give evidence that isoform-based features are better than gene-based ones. We are not suggesting that the models generated should be used as general-purpose models for the tested classifiers, because there are hundreds of reasons unrelated to the class why expression values can change in RNA-Seq data. General use of these models would require adding many "validation" datasets. The only difference between the models we generated was whether they were built from gene-based or isoform-based features. No optimization was done, so a testing/validation protocol would essentially be 2 folds of testing. For gene- and isoform-based models that performed the same, the average f-measures were extremely similar (e.g. .98789 vs .98799). For the pairs where the isoform-based model was better, the standard deviation across the folds was less than the difference between the gene-based and isoform-based results. Not including a supplementary table with these assessments of the variance across folds was an oversight, which will be addressed when the paper is peer reviewed and published.
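The comparison criterion described above (the isoform advantage counts only when it exceeds the fold-to-fold variation) can be sketched like this. All numbers are hypothetical illustrations, not results from the paper.

```python
# Sketch of the decision rule above: trust the isoform improvement only
# when the gene-vs-isoform gap exceeds the spread across CV folds.
# All per-fold f-measures here are hypothetical.
from statistics import mean, stdev

gene_folds    = [0.9851, 0.9847, 0.9859, 0.9842, 0.9850,
                 0.9855, 0.9848, 0.9844, 0.9853, 0.9849]
isoform_folds = [0.9922, 0.9918, 0.9925, 0.9915, 0.9921,
                 0.9924, 0.9917, 0.9920, 0.9923, 0.9919]

gap = mean(isoform_folds) - mean(gene_folds)
spread = max(stdev(gene_folds), stdev(isoform_folds))

print(f"gap = {gap:.4f}, fold sd = {spread:.4f}, credible = {gap > spread}")
```

With these illustrative numbers the gap dwarfs the fold-to-fold spread, which is the pattern claimed in the response.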