TY - JOUR
T1 - SCTB-V2
T2 - the 2nd version of the Chinese treebank in the scientific domain
AU - Chu, Chenhui
AU - Mao, Zhuoyuan
AU - Nakazawa, Toshiaki
AU - Kawahara, Daisuke
AU - Kurohashi, Sadao
N1 - Funding Information:
This work was supported by “Project on Practical Implementation of Japanese to Chinese-Chinese to Japanese Machine Translation,” JST. We sincerely thank Ms. Fumio Hirao and Mr. Teruyasu Ueki, who annotated SCTB-V2. We are grateful to Mr. Frederic Bergeron for developing the SynTree toolkit to speed up the annotation process. Finally, we thank Dr. Mo Shen for valuable discussions regarding annotation standards.
Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer Nature B.V.
PY - 2023/9
Y1 - 2023/9
N2 - Word segmentation, part-of-speech (POS) tagging, and syntactic parsing are three fundamental analysis tasks in Chinese language processing, and they are crucial for downstream tasks such as machine translation and information extraction. To achieve high accuracy on these tasks, treebanks that contain sentences manually annotated with word segmentation, POS tags, and phrase structures are essential. Although large-scale Chinese treebanks exist in the news domain, such treebanks are unavailable in the scientific domain. This significantly limits the performance of Chinese language processing for scientific text. To address this problem, we annotate the 2nd version of the Chinese treebank in the scientific domain (SCTB-V2). SCTB-V2 contains 12,175 sentences annotated with word segmentation, POS tags, and phrase structures. We conducted Chinese analysis and machine translation experiments on SCTB-V2, and the results show its effectiveness. We release this treebank to promote scientific Chinese language processing research: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?A%20Chinese%20Treebank%20in%20Scientific%20Domain%20%28SCTB%29.
AB - Word segmentation, part-of-speech (POS) tagging, and syntactic parsing are three fundamental analysis tasks in Chinese language processing, and they are crucial for downstream tasks such as machine translation and information extraction. To achieve high accuracy on these tasks, treebanks that contain sentences manually annotated with word segmentation, POS tags, and phrase structures are essential. Although large-scale Chinese treebanks exist in the news domain, such treebanks are unavailable in the scientific domain. This significantly limits the performance of Chinese language processing for scientific text. To address this problem, we annotate the 2nd version of the Chinese treebank in the scientific domain (SCTB-V2). SCTB-V2 contains 12,175 sentences annotated with word segmentation, POS tags, and phrase structures. We conducted Chinese analysis and machine translation experiments on SCTB-V2, and the results show its effectiveness. We release this treebank to promote scientific Chinese language processing research: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?A%20Chinese%20Treebank%20in%20Scientific%20Domain%20%28SCTB%29.
KW - Chinese
KW - Scientific domain
KW - Treebank
UR - http://www.scopus.com/inward/record.url?scp=85139860512&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85139860512&partnerID=8YFLogxK
U2 - 10.1007/s10579-022-09615-2
DO - 10.1007/s10579-022-09615-2
M3 - Article
AN - SCOPUS:85139860512
SN - 1574-020X
VL - 57
SP - 1389
EP - 1403
JO - Language Resources and Evaluation
JF - Language Resources and Evaluation
IS - 3
ER -