Estimating copy numbers of alleles from population-scale high-throughput sequencing data

Takahiro Mimori, Naoki Nariai, Kaname Kojima, Yukuto Sato, Yosuke Kawai, Yumi Yamaguchi-Kabata, Masao Nagasaki*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Background: With the recent development of microarray and high-throughput sequencing (HTS) technologies, a number of studies have revealed catalogs of copy number variants (CNVs) and their association with phenotypes and complex traits. In parallel, a number of approaches to predict CNV regions and genotypes have been proposed for both microarray and HTS data. However, only a few approaches focus on haplotyping of CNV loci.

Results: We propose a novel approach to infer copy unit alleles and their numbers in each sample simultaneously from population-scale HTS data by variational Bayesian inference on a generative probabilistic model inspired by latent Dirichlet allocation, which is a well-studied model for document classification problems. In simulation studies, we evaluated concordance between inferred and true copy unit alleles for lower-, middle-, and higher-copy number datasets, in which precision and recall were ≥ 0.9 for data with mean coverage ≥ 10× per copy unit. We also applied the approach to HTS data of 1123 samples at the highly variable salivary amylase gene locus and a pseudogene locus, and confirmed consistency of the estimated alleles within samples belonging to a trio of CEPH/Utah pedigree 1463 with 11 offspring.

Conclusions: Our proposed approach enables detailed analysis of copy number variations, such as association studies between copy unit alleles and phenotypes or biological features including human diseases.
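The precision and recall figures quoted in the Results can be illustrated with a minimal sketch. It assumes that concordance is scored by exact sequence match between inferred and true copy unit alleles in one sample, with each allele counted up to its copy number; the abstract does not state the exact scoring rule, so this definition, the function name `precision_recall`, and the toy sequences are all hypothetical.

```python
from collections import Counter


def precision_recall(inferred, true):
    """Multiset precision and recall between inferred and true allele lists.

    Assumption: an inferred allele counts as correct only on exact match,
    and repeated alleles are matched up to the smaller of the two copy numbers.
    """
    inferred_counts = Counter(inferred)
    true_counts = Counter(true)
    # Counter intersection keeps the minimum count of each shared allele.
    matched = sum((inferred_counts & true_counts).values())
    n_inferred = sum(inferred_counts.values())
    n_true = sum(true_counts.values())
    precision = matched / n_inferred if n_inferred else 0.0
    recall = matched / n_true if n_true else 0.0
    return precision, recall


# Toy sample with four copy units (made-up sequences, not data from the study).
true_alleles = ["ACGT", "ACGT", "ACTT", "GCGT"]
inferred_alleles = ["ACGT", "ACGT", "ACTT", "ACTT"]
precision, recall = precision_recall(inferred_alleles, true_alleles)
print(precision, recall)
```

In this toy case three of the four inferred copies match a true copy, so both values are 0.75; a sample would meet the reported threshold only when both reach 0.9 or more.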

Original language: English
Article number: S4
Journal: BMC Bioinformatics
Issue number: 1
Publication status: Published - 2015 Jan 21
Externally published: Yes


Keywords

  • Copy number variation
  • High-throughput sequencing data
  • Latent Dirichlet allocation

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

