Unsupervised Keyphrase Generation by Utilizing Masked Words Prediction and Pseudo-label BART Finetuning

Yingchao Ju, Mizuho Iwaihara*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

A keyphrase is a short phrase of one or a few words that summarizes a key idea discussed in a document. Keyphrase generation is the process of predicting both present and absent keyphrases from a given document. Models based on the sequence-to-sequence (Seq2Seq) deep learning framework have been widely used in keyphrase generation. However, the excellent performance of these models on the keyphrase generation task depends on a large quantity of annotated documents. In this paper, we propose an unsupervised method called MLMPBKG, based on a masked language model (MLM) and pseudo-label BART finetuning. We mask noun phrases in the article and apply the MLM to predict replaceable words. We observe that absent keyphrases can be found among these words. Based on this observation, we first propose MLMKPG, which utilizes the MLM to generate keyphrase candidates and uses a sentence embedding model to rank the candidate phrases. Furthermore, we use the top-ranked phrases as pseudo-labels to finetune BART and obtain more absent keyphrases. Experimental results show that our method achieves remarkable results on both present and absent keyphrase prediction, even surpassing supervised baselines in certain cases.
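The masking and ranking steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`mask_phrase`, `toy_embed`, `cosine`, `rank_candidates`) are hypothetical, the `<mask>` token is an assumed placeholder, and a bag-of-words vector stands in for the sentence embedding model. The MLM prediction step and the BART finetuning step are not shown; the candidate list is assumed to come from an MLM.

```python
import math
import re
from collections import Counter


def mask_phrase(text: str, phrase: str, mask_token: str = "<mask>") -> str:
    """Replace each word of the first occurrence of `phrase` with a mask token."""
    masked = " ".join(mask_token for _ in phrase.split())
    return text.replace(phrase, masked, 1)


def toy_embed(text: str) -> Counter:
    """Assumption: a bag-of-words count vector standing in for a sentence embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(document: str, candidates: list[str], embed=toy_embed) -> list[str]:
    """Order candidate phrases by similarity to the whole document, best first."""
    doc_vec = embed(document)
    scored = [(cosine(embed(c), doc_vec), c) for c in candidates]
    # Sort on the score only, so ties keep their input order.
    return [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)]


if __name__ == "__main__":
    doc = "Keyphrase generation predicts present and absent keyphrases from a document."
    # Input an MLM would receive when predicting replacements for a noun phrase.
    print(mask_phrase(doc, "absent keyphrases"))
    # Hypothetical candidates, as if predicted by an MLM for the masked positions.
    print(rank_candidates(doc, ["image segmentation", "document", "keyphrase generation"]))
```

In the method the abstract describes, the top-ranked phrases would then serve as pseudo-labels for finetuning BART. A real sentence embedding model could be passed in through the `embed` parameter, provided `cosine` is adapted to its vector type.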

Original language: English
Title of host publication: From Born-Physical to Born-Virtual
Subtitle of host publication: Augmenting Intelligence in Digital Libraries - 24th International Conference on Asian Digital Libraries, ICADL 2022, Proceedings
Editors: Yuen-Hsien Tseng, Marie Katsurai, Hoa N. Nguyen
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 21-34
Number of pages: 14
ISBN (Print): 9783031217555
Publication status: Published - 2022
Event: 24th International Conference on Asia-Pacific Digital Libraries, ICADL 2022 - Hanoi, Viet Nam
Duration: 2022 Nov 30 to 2022 Dec 2

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13636 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 24th International Conference on Asia-Pacific Digital Libraries, ICADL 2022
Country/Territory: Viet Nam
City: Hanoi
Period: 22/11/30 to 22/12/2

Keywords

  • Finetuning
  • Keyphrase generation
  • Masked language model
  • Sentence embedding
  • Unsupervised learning

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
