TY - JOUR
T1 - Bayesian clinical classification from high-dimensional data
T2 - Signatures versus variability
AU - Shalabi, Akram
AU - Inoue, Masato
AU - Watkins, Johnathan
AU - De Rinaldis, Emanuele
AU - Coolen, Anthony C.C.
N1 - Funding Information:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors gratefully acknowledge support from the Engineering and Physical Sciences Research Council (UK), IDBS, and the Ana Leaf Foundation.
Publisher Copyright:
© The Author(s) 2016.
PY - 2018/2/1
Y1 - 2018/2/1
AB - When data exhibit imbalance between a large number d of covariates and a small number n of samples, clinical outcome prediction is impaired by overfitting and prohibitive computation demands. Here we study two simple Bayesian prediction protocols that can be applied to data of any dimension and any number of outcome classes. Calculating Bayesian integrals and optimal hyperparameters analytically leaves only a small number of numerical integrations, and CPU demands scale as O(nd). We compare their performance on synthetic and genomic data to the mclustDA method of Fraley and Raftery. For small d they perform as well as mclustDA or better. For d = 10,000 or more mclustDA breaks down computationally, while the Bayesian methods remain efficient. This allows us to explore phenomena typical of classification in high-dimensional spaces, such as overfitting and the reduced discriminative effectiveness of signatures compared to intra-class variability.
KW - Bayesian classification
KW - Discriminant analysis
KW - curse of dimensionality
KW - outcome prediction
KW - overfitting
UR - http://www.scopus.com/inward/record.url?scp=85041947217&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85041947217&partnerID=8YFLogxK
U2 - 10.1177/0962280216628901
DO - 10.1177/0962280216628901
M3 - Article
C2 - 26984907
AN - SCOPUS:85041947217
SN - 0962-2802
VL - 27
SP - 336
EP - 351
JO - Statistical Methods in Medical Research
JF - Statistical Methods in Medical Research
IS - 2
ER -