TY - JOUR
T1 - Compilation of an idiom example database for supervised idiom identification
AU - Hashimoto, Chikara
AU - Kawahara, Daisuke
N1 - Funding Information:
Acknowledgments: This work was conducted as part of the collaborative research project of Kyoto University and NTT Communication Science Laboratories. The work was supported by NTT Communication Science Laboratories and JSPS Grants-in-Aid for Young Scientists (B) 19700141. We would like to thank the members of the collaborative research group of Kyoto University and NTT Communication Science Laboratories and Dr. Francis Bond for their stimulating discussion. Thanks are also due to Prof. Satoshi Sato, who kindly provided us with the list of basic Japanese idioms.
PY - 2009/12
Y1 - 2009/12
N2 - Some phrases can be interpreted in their context either idiomatically (figuratively) or literally. The precise identification of idioms is essential in order to achieve full-fledged natural language processing. Because of this, the authors of this paper have created an idiom corpus for Japanese. This paper reports on the corpus itself and the results of an idiom identification experiment conducted using the corpus. The corpus targeted 146 ambiguous idioms, and consists of 102,856 examples, each of which is annotated with a literal/idiomatic label. All sentences were collected from the World Wide Web. For idiom identification, 90 out of the 146 idioms were targeted and a word sense disambiguation (WSD) method was adopted using both common WSD features and idiom-specific features. The corpus and the experiment are both, as far as can be determined, the largest of their kinds. It was discovered that a standard supervised WSD method works well for idiom identification and it achieved accuracy levels of 89.25% and 88.86%, with and without idiom-specific features, respectively. It was also found that the most effective idiom-specific feature is the one that involves the adjacency of idiom constituents.
AB - Some phrases can be interpreted in their context either idiomatically (figuratively) or literally. The precise identification of idioms is essential in order to achieve full-fledged natural language processing. Because of this, the authors of this paper have created an idiom corpus for Japanese. This paper reports on the corpus itself and the results of an idiom identification experiment conducted using the corpus. The corpus targeted 146 ambiguous idioms, and consists of 102,856 examples, each of which is annotated with a literal/idiomatic label. All sentences were collected from the World Wide Web. For idiom identification, 90 out of the 146 idioms were targeted and a word sense disambiguation (WSD) method was adopted using both common WSD features and idiom-specific features. The corpus and the experiment are both, as far as can be determined, the largest of their kinds. It was discovered that a standard supervised WSD method works well for idiom identification and it achieved accuracy levels of 89.25% and 88.86%, with and without idiom-specific features, respectively. It was also found that the most effective idiom-specific feature is the one that involves the adjacency of idiom constituents.
KW - Corpus
KW - Idiom identification
KW - Japanese idiom
KW - Language resources
UR - http://www.scopus.com/inward/record.url?scp=77950756086&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77950756086&partnerID=8YFLogxK
U2 - 10.1007/s10579-009-9104-1
DO - 10.1007/s10579-009-9104-1
M3 - Article
AN - SCOPUS:77950756086
SN - 1574-020X
VL - 43
SP - 355
EP - 384
JO - Language Resources and Evaluation
JF - Language Resources and Evaluation
IS - 4
ER -