Word segmentation for the sequences emitted from a word-valued source

Takashi Ishida*, Toshiyasu Matsushima, Shigeichi Hirasawa

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

Word segmentation is the most fundamental and important process in Japanese and Chinese language processing. Because these languages do not mark boundaries between words, a character sequence must first be separated into words. For this problem, approaches based on probabilistic language models are known to be highly effective, and this has been demonstrated in practice. Separately, the word-valued source (WVS) has recently been proposed as a new class of source model for the source coding problem. This model is expected to reflect more of the probabilistic structure of natural languages. A Japanese or Chinese sentence may be regarded as a sequence emitted from a non-prefix-free WVS. In this paper, as a first step toward applying the WVS to natural language processing, we formulate a word segmentation problem for sequences emitted from a non-prefix-free WVS. We then examine the word segmentation performance for these models through numerical computation.
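
To make the segmentation ambiguity concrete, the following is a minimal sketch and not the formulation used in the paper. It assumes the simplest case, an i.i.d. word-valued source whose word set is not prefix-free ("a" is a prefix of "ab"), so one emitted sequence can have several valid segmentations. The function name `segment`, the word set, and the probabilities are all hypothetical and chosen only for illustration. The sketch picks the segmentation with the highest probability by dynamic programming.

```python
import math


def segment(sequence, word_probs):
    """Return the most probable split of `sequence` into words from
    `word_probs`, or None if no split exists.

    Assumption (not from the paper): words are emitted i.i.d., so the
    probability of a segmentation is the product of its word probabilities.
    """
    n = len(sequence)
    max_len = max(len(w) for w in word_probs)
    # best[i] is the best log-probability of any segmentation of sequence[:i].
    best = [-math.inf] * (n + 1)
    # back[i] is the start index of the last word in that best segmentation.
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sequence[start:end]
            if word in word_probs and best[start] > -math.inf:
                score = best[start] + math.log(word_probs[word])
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sequence[start:end])
        end = start
    return words[::-1]


# Hypothetical non-prefix-free word set with made-up probabilities.
toy_probs = {"a": 0.5, "ab": 0.3, "b": 0.1, "ba": 0.1}

# "aba" can be read as a|b|a (0.025), a|ba (0.05), or ab|a (0.15).
print(segment("aba", toy_probs))  # ['ab', 'a']
```

With a prefix-free word set the split would be unique and no search would be needed. The ambiguity handled here arises only because the word set is not prefix-free.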

Original language: English
Title of host publication: CIT 2007
Subtitle of host publication: 7th IEEE International Conference on Computer and Information Technology
Pages: 662-667
Number of pages: 6
Publication status: Published - 2007
Event: CIT 2007: 7th IEEE International Conference on Computer and Information Technology - Aizu-Wakamatsu, Fukushima, Japan
Duration: 2007 Oct 16 to 2007 Oct 19

Publication series

Name: CIT 2007: 7th IEEE International Conference on Computer and Information Technology

ASJC Scopus subject areas

  • Computer Science Applications
  • Information Systems
  • Software
  • Mathematics (all)
