SCTB-V2: the 2nd version of the Chinese treebank in the scientific domain

Chenhui Chu*, Zhuoyuan Mao, Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi

*Corresponding author for this work

Research output: Article (peer-reviewed)

Abstract

Word segmentation, part-of-speech (POS) tagging, and syntactic parsing are three fundamental analysis tasks in Chinese language processing, and they are crucial for downstream tasks such as machine translation and information extraction. To achieve high accuracy on these tasks, treebanks that contain sentences manually annotated with word segmentation, POS tags, and phrase structures are essential. Although large-scale Chinese treebanks exist in the news domain, no such treebanks are available in the scientific domain, which significantly limits the performance of Chinese language processing on scientific text. To address this problem, we annotate the 2nd version of the Chinese treebank in the scientific domain (SCTB-V2). SCTB-V2 contains 12,175 sentences annotated with word segmentation, POS tags, and phrase structures. We conducted Chinese analysis and machine translation experiments on SCTB-V2, and the results show its effectiveness. We release this treebank to promote research on Chinese language processing for scientific text: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?A%20Chinese%20Treebank%20in%20Scientific%20Domain%20%28SCTB%29

Original language: English
Pages (from-to): 1389-1403
Number of pages: 15
Journal: Language Resources and Evaluation
Volume: 57
Issue number: 3
Publication status: Published - Sep 2023

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Linguistics and Language
  • Library and Information Sciences
