TY - JOUR
T1 - Parameters for accurate genome alignment
AU - Frith, Martin C.
AU - Hamada, Michiaki
AU - Horton, Paul
N1 - Funding Information:
We are especially grateful to Gary Benson and Yevgeniy Gelfand for modifying Tandem Repeats Finder in response to our feedback. We thank Kiyoshi Asai for discussions and material support, Hajime Harada for IT support, Yi-Kuo Yu and Stephen Altschul for providing software to calculate λ, and the CBRC RNA team for assistance with using Rfam. Computation time was provided by the AIST Super Cluster, and also by the Super Computer System, Human Genome Center, Institute of Medical Science, University of Tokyo. PH was supported by a Japanese Ministry of Education, Culture, Sports, Science and Technology Grant-in-Aid for Scientific Research (B).
PY - 2010/2/9
Y1 - 2010/2/9
N2 - Background: Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed. Results: We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that γ-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases. Conclusions: These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours (http://last.cbrc.jp/).
AB - Background: Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed. Results: We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that γ-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases. Conclusions: These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours (http://last.cbrc.jp/).
UR - http://www.scopus.com/inward/record.url?scp=77649140807&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77649140807&partnerID=8YFLogxK
U2 - 10.1186/1471-2105-11-80
DO - 10.1186/1471-2105-11-80
M3 - Article
C2 - 20144198
AN - SCOPUS:77649140807
SN - 1471-2105
VL - 11
JO - BMC Bioinformatics
JF - BMC Bioinformatics
M1 - 80
ER -