TY - JOUR
T1 - Developing computational infrastructure for the CorCenCC corpus
T2 - The National Corpus of Contemporary Welsh
AU - Knight, Dawn
AU - Loizides, Fernando
AU - Neale, Steven
AU - Anthony, Laurence
AU - Spasić, Irena
N1 - Funding Information:
This work has been funded by the UK Economic and Social Research Council (ESRC) and Arts and Humanities Research Council (AHRC) as part of the Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction project (Grant number: ES/M011348/1).
Publisher Copyright:
© 2020, The Author(s).
PY - 2021/9
Y1 - 2021/9
N2 - CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes, the National Corpus of Contemporary Welsh) is the first comprehensive corpus of Welsh designed to reflect language use across communication types, genres, speakers, language varieties (regional and social) and contexts. This article focuses on the computational infrastructure that we have designed to support data collection for CorCenCC and the subsequent uses of the corpus, which include lexicography, pedagogical research and corpus analysis. We adopted a grass-roots approach to design that adapted and extended previous corpus-building work and introduced new features as required for this specific context and language. The key pillars of the infrastructure are a framework that supports metadata collection, an innovative mobile application designed to collect spoken data (utilising a crowdsourcing approach), a backend database that stores curated data and a web-based interface that allows users to query the data online. A usability study was conducted to evaluate the user-facing tools and to suggest directions for future improvements. Though the infrastructure was developed for Welsh-language collection, its design can be re-used to support corpus development in other minority or major language contexts, broadening the potential utility and impact of this work.
AB - CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes, the National Corpus of Contemporary Welsh) is the first comprehensive corpus of Welsh designed to reflect language use across communication types, genres, speakers, language varieties (regional and social) and contexts. This article focuses on the computational infrastructure that we have designed to support data collection for CorCenCC and the subsequent uses of the corpus, which include lexicography, pedagogical research and corpus analysis. We adopted a grass-roots approach to design that adapted and extended previous corpus-building work and introduced new features as required for this specific context and language. The key pillars of the infrastructure are a framework that supports metadata collection, an innovative mobile application designed to collect spoken data (utilising a crowdsourcing approach), a backend database that stores curated data and a web-based interface that allows users to query the data online. A usability study was conducted to evaluate the user-facing tools and to suggest directions for future improvements. Though the infrastructure was developed for Welsh-language collection, its design can be re-used to support corpus development in other minority or major language contexts, broadening the potential utility and impact of this work.
KW - Data modelling
KW - Information retrieval
KW - Language resources
KW - Natural language processing
KW - Usability testing
KW - Web interfaces
UR - http://www.scopus.com/inward/record.url?scp=85088818484&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85088818484&partnerID=8YFLogxK
U2 - 10.1007/s10579-020-09501-9
DO - 10.1007/s10579-020-09501-9
M3 - Article
AN - SCOPUS:85088818484
SN - 1574-020X
VL - 55
SP - 789
EP - 816
JO - Language Resources and Evaluation
JF - Language Resources and Evaluation
IS - 3
ER -