TY - JOUR
T1 - Parallelized latent Dirichlet allocation provides a novel interpretability of mutation signatures in cancer genomes
AU - Matsutani, Taro
AU - Hamada, Michiaki
N1 - Funding Information:
Publication costs are funded by Waseda University [basic research budget]. This study was also supported by the Ministry of Education, Culture, Sports, Science and Technology (KAKENHI) [grant numbers JP17K20032, JP16H05879, JP16H01318, JP16H02484, JP18KT0016, JP16H06279 and JP20H00624 to MH]. Computation for this study was partially performed on the NIG supercomputer at ROIS National Institute of Genetics. We thank Tsukasa Fukunaga and members at Hamada Laboratory for valuable discussions about this study.
Publisher Copyright:
© 2020 by the authors. Licensee MDPI, Basel, Switzerland.
PY - 2020/10
Y1 - 2020/10
AB - Mutation signatures are defined as the distribution of specific mutations such as activity of AID/APOBEC family proteins. Previous studies have reported numerous signatures, using matrix factorization methods for mutation catalogs. Different mutation signatures are active in different tumor types; hence, signature activity varies greatly among tumor types and becomes sparse. Because of this, many previous methods require dividing mutation catalogs for each tumor type. Here, we propose parallelized latent Dirichlet allocation (PLDA), a novel Bayesian model to simultaneously predict mutation signatures with all mutation catalogs. PLDA is an extended model of latent Dirichlet allocation (LDA), which is one of the methods used for signature prediction. It has parallelized hyperparameters of Dirichlet distributions for LDA, and they represent the sparsity of signature activities for each tumor type, thus facilitating simultaneous analyses. First, we conducted a simulation experiment to compare PLDA with previous methods (including SigProfiler and SignatureAnalyzer) using artificial data and confirmed that PLDA could predict signature structures as accurately as previous methods without searching for the optimal hyperparameters. Next, we applied PLDA to PCAWG (Pan-Cancer Analysis of Whole Genomes) mutation catalogs and obtained a signature set different from the one predicted by SigProfiler. Further, we have shown that the mutation spectrum represented by the predicted signature with PLDA provides a novel interpretability through post-analyses.
KW - Bayes modeling
KW - Cancer genome
KW - Latent Dirichlet allocation
KW - Mutation signature
UR - http://www.scopus.com/inward/record.url?scp=85091558842&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85091558842&partnerID=8YFLogxK
U2 - 10.3390/genes11101127
DO - 10.3390/genes11101127
M3 - Article
C2 - 32992754
AN - SCOPUS:85091558842
SN - 2073-4425
VL - 11
SP - 20
JO - Genes
JF - Genes
IS - 10
M1 - 1127
ER -