SemSeq: A Regime for Training Widely-Applicable Word-Sequence Encoders

Hiroaki Tsuyuki, Tetsuji Ogawa, Tetsunori Kobayashi, Yoshihiko Hayashi*

*Corresponding author for this work

Research output: Conference contribution

Abstract

A sentence encoder that can be readily employed in many applications or effectively fine-tuned to a specific task/domain is highly demanded. Such a sentence encoding technique would achieve a broader range of applications if it can deal with almost arbitrary word-sequences. This paper proposes a training regime for enabling encoders that can effectively deal with word-sequences of various kinds, including complete sentences, as well as incomplete sentences and phrases. The proposed training regime can be distinguished from existing methods in that it first extracts word-sequences of an arbitrary length from an unlabeled corpus of ordered or unordered sentences. An encoding model is then trained to predict the adjacency between these word-sequences. Herein an unordered sentence indicates an individual sentence without neighboring contextual sentences. In some NLP tasks, such as sentence classification, the semantic contents of an isolated sentence have to be properly encoded. Further, by employing rather unconstrained word-sequences extracted from a large corpus, without heavily relying on complete sentences, it is expected that linguistic expressions of various kinds are employed in the training. This property contributes to enhancing the applicability of the resulting word-sequence/sentence encoders. The experimental results obtained from supervised evaluation tasks demonstrated that the trained encoder achieved performance comparable to existing encoders while exhibiting superior performance in unsupervised evaluation tasks that involve incomplete sentences and phrases.
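The data-preparation step described in the abstract (extracting word-sequences of arbitrary length and labelling pairs by adjacency) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how spans are sampled, how long they may be, or how non-adjacent pairs are chosen. The function names, the `min_len`/`max_len` limits, whitespace tokenization, and the use of a span from a different sentence as the negative example are all hypothetical.

```python
import random
from typing import List, Tuple

Span = List[str]


def sample_adjacent_pair(tokens: Span, rng: random.Random,
                         min_len: int = 1, max_len: int = 5) -> Tuple[Span, Span]:
    """Pick two contiguous word-sequences that touch at a random boundary.

    Assumption: "adjacent" means the right span starts where the left one ends.
    """
    if len(tokens) < 2 * min_len:
        raise ValueError("sentence too short to split into two spans")
    boundary = rng.randint(min_len, len(tokens) - min_len)
    left_len = rng.randint(min_len, min(max_len, boundary))
    right_len = rng.randint(min_len, min(max_len, len(tokens) - boundary))
    return tokens[boundary - left_len:boundary], tokens[boundary:boundary + right_len]


def sample_span(tokens: Span, rng: random.Random,
                min_len: int = 1, max_len: int = 5) -> Span:
    """Pick one contiguous word-sequence of random length and position."""
    length = rng.randint(min_len, min(max_len, len(tokens)))
    start = rng.randint(0, len(tokens) - length)
    return tokens[start:start + length]


def build_examples(sentences: List[str], seed: int = 0) -> List[Tuple[Span, Span, int]]:
    """Return (span_a, span_b, label) triples; label 1 = adjacent, 0 = not.

    Assumption: negatives pair the left span with a span drawn from another
    sentence. Sentences are treated as unordered, so no neighbouring-sentence
    context is needed.
    """
    rng = random.Random(seed)
    tokenized = [s.split() for s in sentences if len(s.split()) >= 2]
    examples = []
    for i, tokens in enumerate(tokenized):
        left, right = sample_adjacent_pair(tokens, rng)
        examples.append((left, right, 1))
        if len(tokenized) > 1:
            j = rng.choice([k for k in range(len(tokenized)) if k != i])
            examples.append((left, sample_span(tokenized[j], rng), 0))
    return examples
```

Triples of this kind could then be fed to any pairwise classifier built on top of the encoder; the encoder architecture and training objective are not specified in this record and are left out of the sketch.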

Original language: English
Title of host publication: Computational Linguistics - 16th International Conference of the Pacific Association for Computational Linguistics, PACLING 2019, Revised Selected Papers
Editors: Le-Minh Nguyen, Satoshi Tojo, Xuan-Hieu Phan, Kôiti Hasida
Publisher: Springer
Pages: 43-55
Number of pages: 13
ISBN (Print): 9789811561672
DOI
Publication status: Published - 2020
Event: 16th International Conference of the Pacific Association for Computational Linguistics, PACLING 2019 - Hanoi, Viet Nam
Duration: 11 Oct 2019 to 13 Oct 2019

Publication series

Name: Communications in Computer and Information Science
Volume: 1215 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 16th International Conference of the Pacific Association for Computational Linguistics, PACLING 2019
Country/Territory: Viet Nam
City: Hanoi
Period: 2019/10/11 to 2019/10/13

ASJC Scopus subject areas

  • General Computer Science
  • General Mathematics
