TY - JOUR
T1 - The use of semantic similarity tools in automated content scoring of fact-based essays written by EFL learners
AU - Wang, Qiao
N1 - Funding Information:
The author would like to thank the three raters in this study for their kind assistance in grading student writing samples.
Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2022/11
Y1 - 2022/11
N2 - This study searched for open-source semantic similarity tools and evaluated their effectiveness in automated content scoring of fact-based essays written by English-as-a-Foreign-Language (EFL) learners. Fifty writing samples from a fact-based writing task in an academic English course at a Japanese university were collected, and a gold-standard sample was produced by a native-speaker expert. A shortlist of six carefully selected tools (InferSent, spaCy, DKPro, ADW, SEMILAR and Latent Semantic Analysis) generated semantic similarity scores between the student writing samples and the expert sample. Three teachers who lectured in the course manually graded the student samples on content. To ensure the validity of the human grades, samples with discrepant ratings were excluded, and inter-rater reliability on the remaining samples was tested with quadratic weighted kappa. Once the remaining grades were shown to be valid, a Pearson correlation analysis between semantic similarity scores and human grades was conducted; the results showed that InferSent was the most effective tool at predicting the human grades. The study further pointed to the limitations of the six tools and suggested three alternatives to traditional methods for converting semantic similarity scores into reported content grades.
AB - This study searched for open-source semantic similarity tools and evaluated their effectiveness in automated content scoring of fact-based essays written by English-as-a-Foreign-Language (EFL) learners. Fifty writing samples from a fact-based writing task in an academic English course at a Japanese university were collected, and a gold-standard sample was produced by a native-speaker expert. A shortlist of six carefully selected tools (InferSent, spaCy, DKPro, ADW, SEMILAR and Latent Semantic Analysis) generated semantic similarity scores between the student writing samples and the expert sample. Three teachers who lectured in the course manually graded the student samples on content. To ensure the validity of the human grades, samples with discrepant ratings were excluded, and inter-rater reliability on the remaining samples was tested with quadratic weighted kappa. Once the remaining grades were shown to be valid, a Pearson correlation analysis between semantic similarity scores and human grades was conducted; the results showed that InferSent was the most effective tool at predicting the human grades. The study further pointed to the limitations of the six tools and suggested three alternatives to traditional methods for converting semantic similarity scores into reported content grades.
KW - Automated content scoring
KW - Automated writing evaluation
KW - EFL learners
KW - Fact-based writing
KW - Open-source semantic similarity tools
KW - Semantic similarity
UR - http://www.scopus.com/inward/record.url?scp=85132140198&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85132140198&partnerID=8YFLogxK
U2 - 10.1007/s10639-022-11179-1
DO - 10.1007/s10639-022-11179-1
M3 - Article
AN - SCOPUS:85132140198
SN - 1360-2357
VL - 27
SP - 13021
EP - 13049
JO - Education and Information Technologies
JF - Education and Information Technologies
IS - 9
ER -