TY - JOUR
T1 - Generating similarity cluster of Indonesian languages with semi-supervised clustering
AU - Nasution, Arbi Haza
AU - Murakami, Yohei
AU - Ishida, Toru
N1 - Funding Information:
This research was partially supported by a Grant-in-Aid for Scientific Research (A) (17H00759, 2017-2020) and a Grant-in-Aid for Young Scientists (A) (17H04706, 2017-2020) from the Japan Society for the Promotion of Science (JSPS). The first author was supported by the Indonesia Endowment Fund for Education (LPDP).
Publisher Copyright:
Copyright © 2019 Institute of Advanced Engineering and Science. All rights reserved.
PY - 2019/2
Y1 - 2019/2
AB - Lexicostatistic and language similarity clusters are useful for computational linguistics research that depends on language similarity or cognate recognition. Nevertheless, no lexicostatistic or language similarity clusters of Indonesian ethnic languages have been published. We formulate an approach to creating language similarity clusters: we use the ASJP database to generate a language similarity matrix, generate hierarchical clusters with complete linkage and mean linkage clustering, and then extract two stable clusters with high language similarity. We introduce an extended k-means clustering method with semi-supervised learning to evaluate how stably the languages in each hierarchical stable cluster remain grouped together as the number of clusters changes. The higher the number of trials, the more likely it is that the two hierarchical stable clusters can be distinctly found in the generated k clusters. However, in all five experiments, the stability level of the two hierarchical stable clusters is highest at 5 clusters. Therefore, we take the 5 clusters as the best clustering of Indonesian ethnic languages. Finally, we plot the generated 5 clusters on a geographical map.
KW - Hierarchical clustering
KW - K-means clustering
KW - Language similarity
KW - Lexicostatistic
KW - Semi-supervised clustering
UR - http://www.scopus.com/inward/record.url?scp=85066303482&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85066303482&partnerID=8YFLogxK
U2 - 10.11591/ijece.v9i1.pp531-538
DO - 10.11591/ijece.v9i1.pp531-538
M3 - Article
AN - SCOPUS:85066303482
SN - 2088-8708
VL - 9
SP - 531
EP - 538
JO - International Journal of Electrical and Computer Engineering
JF - International Journal of Electrical and Computer Engineering
IS - 1
ER -