TY - JOUR
T1 - Detection of mergeable Wikipedia articles utilizing multiple similarity measures
AU - Wang, Renzhi
AU - Iwaihara, Mizuho
N1 - Funding Information:
This work was in part supported by JSPS KAKENHI Grant Number 19K11983. The authors are grateful for the helpful and constructive comments by the reviewers and editor.
Publisher Copyright:
© 2020 Information Processing Society of Japan.
PY - 2020
Y1 - 2020
AB - Wikipedia is the largest online encyclopedia, in which articles are edited by many volunteers with differing views and styles. Sometimes two or more articles have different titles but cover exactly the same or strongly similar themes. Administrators and editors are supposed to detect such article pairs and determine whether they should be merged. We call an article pair mergeable if it is discussed for a possible merge, and merged if the pair is actually merged. In this paper, we propose a method to automatically determine whether an article pair is mergeable or merged. According to the Wikipedia guidelines for article merging, in the duplicate case the two articles cover exactly the same content. In the overlap case, the two articles cover related subjects that have a significant overlap. The content of the overlapping part is similar, but the wording of the two articles can differ extensively, so methods that exploit semantic relatedness are necessary. We consider various textual similarities and semantic relatedness. To integrate word embeddings on the target dataset and a large global corpus, we propose linear and non-linear combinations of multiple embedding results, as well as rebuilding word vectors, for evaluating semantic relatedness. We clarify the differences between our method and previous research on combining multiple word embeddings. We also deal with overlap cases by computing Jaccard similarity between article pairs. We combine Jaccard similarity, common-link article count, and word embedding-based relatedness to predict whether an article pair should be merged. We explore the relationship between segment-level (paragraph-level) similarity and mergeable/merged article pairs, and then propose Multimodal Similarity-Based Merge Prediction (MSBMP), which combines the proposed features using Random Forest to predict mergeable/merged article pairs. Our evaluations are performed on real mergeable and merged article pairs. MSBMP shows clear improvement over the baselines of WikiSearch, TFIDF, and word embeddings.
KW - Mergeable article
KW - Text mining
KW - Wikipedia
KW - Word embedding
UR - http://www.scopus.com/inward/record.url?scp=85079517659&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85079517659&partnerID=8YFLogxK
U2 - 10.2197/ipsjjip.28.178
DO - 10.2197/ipsjjip.28.178
M3 - Article
AN - SCOPUS:85079517659
SN - 0387-5806
VL - 28
SP - 178
EP - 191
JO - Journal of Information Processing
JF - Journal of Information Processing
ER -