Abstract
Measuring semantic similarity between words is vital for various applications in natural language processing, such as language modeling, information retrieval, and document clustering. We propose a method that utilizes the information available on the Web to measure semantic similarity between a pair of words or entities. Using support vector machines, we integrate page counts for each word in the pair with lexico-syntactic patterns that occur among the top-ranking snippets for the AND query of the two words. Experimental results on the Miller-Charles benchmark data set show that the proposed measure outperforms all existing web-based semantic similarity measures by a wide margin, achieving a correlation coefficient of 0.834. Moreover, the proposed semantic similarity measure significantly improves accuracy in a named entity clustering task (F-measure of 0.78), demonstrating the capability of the proposed measure to capture semantic similarity using web content.
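As a rough illustration of how page counts can be turned into a similarity feature, the sketch below computes Jaccard and Dice coefficients from three counts: hits for each word alone and hits for the AND query. The abstract does not state which page-count formulas are used, so the choice of these two coefficients, the function names, and the example counts are all assumptions made for illustration, not the authors' method. In a full pipeline, scores like these would be only part of the feature vector given to the support vector machine, alongside the pattern-based features.

```python
def web_jaccard(count_p: int, count_q: int, count_pq: int) -> float:
    """Jaccard coefficient over page counts (assumed formula, for illustration).

    count_p  -- page count for the first word alone
    count_q  -- page count for the second word alone
    count_pq -- page count for the AND query of both words
    """
    union = count_p + count_q - count_pq
    if count_pq <= 0 or union <= 0:
        return 0.0  # no co-occurrence evidence
    return count_pq / union


def web_dice(count_p: int, count_q: int, count_pq: int) -> float:
    """Dice coefficient over page counts (assumed formula, for illustration)."""
    total = count_p + count_q
    if count_pq <= 0 or total <= 0:
        return 0.0
    return 2 * count_pq / total


if __name__ == "__main__":
    # Hypothetical page counts; not real search-engine results.
    p, q, pq = 100, 50, 25
    print(web_jaccard(p, q, pq))
    print(web_dice(p, q, pq))
```

Both functions return 0.0 when the AND query yields no hits, which avoids division by zero and treats the absence of co-occurrence as no evidence of similarity.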
Original language | English |
---|---|
Title of host publication | NAACL HLT 2007 - Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference |
Pages | 340-347 |
Number of pages | 8 |
Publication status | Published - 2007 |
Externally published | Yes |
Event | Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2007 - Rochester, NY; Duration: 2007 Apr 22 → 2007 Apr 27 |
Other
Other | Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2007 |
---|---|
City | Rochester, NY |
Period | 2007 Apr 22 → 2007 Apr 27 |
ASJC Scopus subject areas
- Language and Linguistics
- Linguistics and Language