A large-scale Web data collection as a natural language processing infrastructure

Keiji Shinzato, Daisuke Kawahara, Chikara Hashimoto, Sadao Kurohashi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In recent years, language resources acquired from the Web have been released, and these data have improved the performance of applications in several NLP tasks. Although language resources organized at the level of individual web pages are useful for NLP tasks and applications such as knowledge acquisition, document retrieval, and document summarization, no such resources have been released so far. In this paper, we propose a data format for the results of web page processing and a search engine infrastructure that makes it possible to share approximately 100 million Japanese web pages. By obtaining this web data, NLP researchers can begin their own processing immediately, without having to analyze the web pages themselves.

Original language: English
Title of host publication: Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
Publisher: European Language Resources Association (ELRA)
Pages: 2236-2241
Number of pages: 6
ISBN (Electronic): 2951740840, 9782951740846
Publication status: Published - 2008
Externally published: Yes
Event: 6th International Conference on Language Resources and Evaluation, LREC 2008 - Marrakech, Morocco
Duration: 2008 May 28 to 2008 May 30

Publication series

Name: Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008

Other

Other: 6th International Conference on Language Resources and Evaluation, LREC 2008
Country/Territory: Morocco
City: Marrakech
Period: 08/5/28 to 08/5/30

ASJC Scopus subject areas

  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics
  • Education
