TY - JOUR
T1 - Estimating copy numbers of alleles from population-scale high-throughput sequencing data
AU - Mimori, Takahiro
AU - Nariai, Naoki
AU - Kojima, Kaname
AU - Sato, Yukuto
AU - Kawai, Yosuke
AU - Yamaguchi-Kabata, Yumi
AU - Nagasaki, Masao
N1 - Funding Information:
This work was supported in part by the MEXT Tohoku Medical Megabank Project. All computational resources were provided by the Supercomputing services, Tohoku Medical Megabank Organization, Tohoku University.
Publisher Copyright:
© 2015 Mimori et al.; licensee BioMed Central Ltd.
PY - 2015/1/21
Y1 - 2015/1/21
AB - Background: With the recent development of microarray and high-throughput sequencing (HTS) technologies, a number of studies have revealed catalogs of copy number variants (CNVs) and their association with phenotypes and complex traits. In parallel, a number of approaches to predict CNV regions and genotypes have been proposed for both microarray and HTS data. However, only a few approaches focus on haplotyping of CNV loci. Results: We propose a novel approach to infer copy unit alleles and their numbers in each sample simultaneously from population-scale HTS data by variational Bayesian inference on a generative probabilistic model inspired by latent Dirichlet allocation, which is a well-studied model for document classification problems. In simulation studies, we evaluated concordance between inferred and true copy unit alleles for lower-, middle-, and higher-copy number datasets, in which precision and recall were ≥ 0.9 for data with mean coverage ≥ 10× per copy unit. We also applied the approach to HTS data of 1123 samples at the highly variable salivary amylase gene locus and a pseudogene locus, and confirmed consistency of the estimated alleles within samples belonging to a trio of the CEPH/Utah pedigree 1463 with 11 offspring. Conclusions: Our proposed approach enables detailed analysis of copy number variations, such as association studies between copy unit alleles and phenotypes or biological features including human diseases.
KW - Copy number variation
KW - High-throughput sequencing data
KW - Latent Dirichlet allocation
UR - http://www.scopus.com/inward/record.url?scp=84961665308&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84961665308&partnerID=8YFLogxK
U2 - 10.1186/1471-2105-16-S1-S4
DO - 10.1186/1471-2105-16-S1-S4
M3 - Article
C2 - 25707811
AN - SCOPUS:84961665308
SN - 1471-2105
VL - 16
JO - BMC Bioinformatics
JF - BMC Bioinformatics
IS - 1
M1 - S4
ER -