TY - GEN
T1 - Unsupervised Keyphrase Generation by Utilizing Masked Words Prediction and Pseudo-label BART Finetuning
AU - Ju, Yingchao
AU - Iwaihara, Mizuho
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
N2 - A keyphrase is a short phrase of one or a few words that summarizes the key idea discussed in a document. Keyphrase generation is the task of predicting both present and absent keyphrases from a given document. Recent models based on the sequence-to-sequence (Seq2Seq) deep learning framework have been widely used for keyphrase generation. However, the excellent performance of these models on the keyphrase generation task comes at the expense of a large quantity of annotated documents. In this paper, we propose an unsupervised method called MLMPBKG, based on masked language model (MLM) prediction and pseudo-label BART finetuning. We mask noun phrases in the document and apply an MLM to predict replaceable words, and we observe that absent keyphrases can be found among these words. Based on this observation, we first propose MLMKPG, which utilizes the MLM to generate keyphrase candidates and uses a sentence embedding model to rank them. Furthermore, we use the top-ranked phrases as pseudo-labels to finetune BART for obtaining more absent keyphrases. Experimental results show that our method achieves remarkable results on both present and absent keyphrase predictions, even surpassing supervised baselines in certain cases.
AB - A keyphrase is a short phrase of one or a few words that summarizes the key idea discussed in a document. Keyphrase generation is the task of predicting both present and absent keyphrases from a given document. Recent models based on the sequence-to-sequence (Seq2Seq) deep learning framework have been widely used for keyphrase generation. However, the excellent performance of these models on the keyphrase generation task comes at the expense of a large quantity of annotated documents. In this paper, we propose an unsupervised method called MLMPBKG, based on masked language model (MLM) prediction and pseudo-label BART finetuning. We mask noun phrases in the document and apply an MLM to predict replaceable words, and we observe that absent keyphrases can be found among these words. Based on this observation, we first propose MLMKPG, which utilizes the MLM to generate keyphrase candidates and uses a sentence embedding model to rank them. Furthermore, we use the top-ranked phrases as pseudo-labels to finetune BART for obtaining more absent keyphrases. Experimental results show that our method achieves remarkable results on both present and absent keyphrase predictions, even surpassing supervised baselines in certain cases.
KW - Finetuning
KW - Keyphrase generation
KW - Masked language model
KW - Sentence embedding
KW - Unsupervised learning
UR - http://www.scopus.com/inward/record.url?scp=85145009869&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85145009869&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-21756-2_2
DO - 10.1007/978-3-031-21756-2_2
M3 - Conference contribution
AN - SCOPUS:85145009869
SN - 9783031217555
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 21
EP - 34
BT - From Born-Physical to Born-Virtual
A2 - Tseng, Yuen-Hsien
A2 - Katsurai, Marie
A2 - Nguyen, Hoa N.
PB - Springer Science and Business Media Deutschland GmbH
T2 - 24th International Conference on Asia-Pacific Digital Libraries, ICADL 2022
Y2 - 30 November 2022 through 2 December 2022
ER -