TY - GEN
T1 - Rethinking end-to-end evaluation of decomposable tasks
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
AU - Arora, Siddhant
AU - Ostapenko, Alissa
AU - Viswanathan, Vijay
AU - Dalmia, Siddharth
AU - Metze, Florian
AU - Watanabe, Shinji
AU - Black, Alan W.
N1 - Funding Information:
This work was supported in part by the National Science Foundation under Grant No. IIS2040926, the NSF SaTC Frontier project (CNS-1914486), Bridges PSC (ACI-1548562, ACI-1445606) and an AWS Machine Learning Research Award.
Publisher Copyright:
© 2021 ISCA
PY - 2021
Y1 - 2021
N2 - Decomposable tasks are complex and comprise a hierarchy of sub-tasks. Spoken intent prediction, for example, combines automatic speech recognition and natural language understanding. Existing benchmarks, however, typically hold out examples for only the surface-level sub-task. As a result, models with similar performance on these benchmarks may have unobserved performance differences on the other sub-tasks. To allow insightful comparisons between competitive end-to-end architectures, we propose a framework to construct robust test sets using coordinate ascent over sub-task-specific utility functions. Given a dataset for a decomposable task, our method optimally creates a test set for each sub-task to individually assess sub-components of the end-to-end model. Using spoken language understanding as a case study, we generate new splits for the Fluent Speech Commands and Snips SmartLights datasets. Each split has two test sets: one with held-out utterances assessing natural language understanding abilities, and one with held-out speakers to test speech processing skills. Our splits identify performance gaps of up to 10% between end-to-end systems that were within 1% of each other on the original test sets. These performance gaps allow more realistic and actionable comparisons between different architectures, driving future model development. We release our splits and tools for the community.
AB - Decomposable tasks are complex and comprise a hierarchy of sub-tasks. Spoken intent prediction, for example, combines automatic speech recognition and natural language understanding. Existing benchmarks, however, typically hold out examples for only the surface-level sub-task. As a result, models with similar performance on these benchmarks may have unobserved performance differences on the other sub-tasks. To allow insightful comparisons between competitive end-to-end architectures, we propose a framework to construct robust test sets using coordinate ascent over sub-task-specific utility functions. Given a dataset for a decomposable task, our method optimally creates a test set for each sub-task to individually assess sub-components of the end-to-end model. Using spoken language understanding as a case study, we generate new splits for the Fluent Speech Commands and Snips SmartLights datasets. Each split has two test sets: one with held-out utterances assessing natural language understanding abilities, and one with held-out speakers to test speech processing skills. Our splits identify performance gaps of up to 10% between end-to-end systems that were within 1% of each other on the original test sets. These performance gaps allow more realistic and actionable comparisons between different architectures, driving future model development. We release our splits and tools for the community.
KW - Challenge set
KW - End-to-end evaluation
KW - Fluent speech commands
KW - Generalization
KW - Snips
KW - Spoken intent prediction
UR - http://www.scopus.com/inward/record.url?scp=85119199124&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119199124&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-1537
DO - 10.21437/Interspeech.2021-1537
M3 - Conference contribution
AN - SCOPUS:85119199124
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 3881
EP - 3885
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
Y2 - 30 August 2021 through 3 September 2021
ER -