TY - GEN
T1 - Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese?
AU - Sun, Jing
AU - Lepage, Yves
PY - 2012
Y1 - 2012
N2 - Unlike most Western languages, written Japanese and Chinese have no typographic boundaries between words. Word segmentation is thus normally adopted as an initial step in most natural language processing tasks for these Asian languages. Although word segmentation techniques have improved greatly both theoretically and practically, some problems remain to be tackled. In this paper, we present an effective approach to extracting Chinese and Japanese phrases without conducting word segmentation beforehand, using a sampling-based multilingual alignment method. According to our experiments, it is also feasible to train a statistical machine translation system on a small Japanese-Chinese training corpus without performing word segmentation beforehand.
UR - http://www.scopus.com/inward/record.url?scp=84883365383&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84883365383&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84883365383
SN - 9789791421171
T3 - Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012
SP - 351
EP - 360
BT - Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012
T2 - 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012
Y2 - 7 November 2012 through 7 November 2012
ER -