TY - JOUR
T1 - Representation learning applications in biological sequence analysis
AU - Iuchi, Hitoshi
AU - Matsutani, Taro
AU - Yamada, Keisuke
AU - Iwano, Natsuki
AU - Sumi, Shunsuke
AU - Hosoda, Shion
AU - Zhao, Shitao
AU - Fukunaga, Tsukasa
AU - Hamada, Michiaki
N1 - Funding Information:
The illustrations in Fig. 1 were kindly provided by Kae Namie. This work was supported by the Ministry of Education, Culture, Sports, Science, and Technology (KAKENHI) [Grant Nos.: JP17K20032, JP16H05879, JP16H06279, JP19H01152 and JP20H00624 to MH, JP19K20395 to TF, JP19J20117 to SH, JP20J20016 to TM and JP21K15078 to HI] and JST CREST [Grant Nos.: JPMJCR1881 and JPMJCR21F1 to MH].
Publisher Copyright:
© 2021 The Author(s)
PY - 2021/1
Y1 - 2021/1
AB - Although remarkable advances have been reported in high-throughput sequencing, the ability to aptly analyze a substantial amount of rapidly generated biological (DNA/RNA/protein) sequencing data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention. In this method, biological sequences are regarded as sentences while the single nucleic acids/amino acids or k-mers in these sequences represent the words. Embedding is an essential step in NLP, which performs the conversion of these words into vectors. Specifically, representation learning is an approach used for this transformation process, which can be applied to biological sequences. Vectorized biological sequences can then be applied for function and structure estimation, or as input for other probabilistic models. Considering the importance and growing trend for the application of representation learning to biological research, in the present study, we have reviewed the existing knowledge in representation learning for biological sequence analysis.
KW - BERT
KW - Natural language processing
KW - Representation learning
KW - Sequence analysis
KW - Word2vec
UR - http://www.scopus.com/inward/record.url?scp=85108702944&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85108702944&partnerID=8YFLogxK
U2 - 10.1016/j.csbj.2021.05.039
DO - 10.1016/j.csbj.2021.05.039
M3 - Review article
AN - SCOPUS:85108702944
SN - 2001-0370
VL - 19
SP - 3198
EP - 3208
JO - Computational and Structural Biotechnology Journal
JF - Computational and Structural Biotechnology Journal
ER -