TY - GEN
T1 - A large-scale Web data collection as a natural language processing infrastructure
AU - Shinzato, Keiji
AU - Kawahara, Daisuke
AU - Hashimoto, Chikara
AU - Kurohashi, Sadao
PY - 2008
Y1 - 2008
N2 - In recent years, language resources acquired from the Web have been released, and these data have improved the performance of applications in several NLP tasks. Although language resources organized by web page are useful in NLP tasks and applications such as knowledge acquisition, document retrieval, and document summarization, no such resources have been released so far. In this paper, we propose a data format for the results of web page processing, and a search engine infrastructure that makes it possible to share approximately 100 million Japanese web pages. By obtaining these web data, NLP researchers can begin their own processing immediately, without having to analyze the web pages themselves.
AB - In recent years, language resources acquired from the Web have been released, and these data have improved the performance of applications in several NLP tasks. Although language resources organized by web page are useful in NLP tasks and applications such as knowledge acquisition, document retrieval, and document summarization, no such resources have been released so far. In this paper, we propose a data format for the results of web page processing, and a search engine infrastructure that makes it possible to share approximately 100 million Japanese web pages. By obtaining these web data, NLP researchers can begin their own processing immediately, without having to analyze the web pages themselves.
UR - http://www.scopus.com/inward/record.url?scp=80053421799&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053421799&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:80053421799
T3 - Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
SP - 2236
EP - 2241
BT - Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
PB - European Language Resources Association (ELRA)
T2 - 6th International Conference on Language Resources and Evaluation, LREC 2008
Y2 - 28 May 2008 through 30 May 2008
ER -