Self-training involving semantic-space finetuning for semi-supervised multi-label document classification

Zhewei Xu*, Mizuho Iwaihara


Research output: Article, peer-reviewed


Self-training is an effective approach to semi-supervised learning, in which both labeled and unlabeled data are leveraged for training. However, existing self-training frameworks are mostly confined to single-label classification. Applying self-training in a multi-label setting is difficult because, unlike single-label classification, there is no mutual-exclusion constraint over categories, and the vast number of possible label vectors makes discovering credible predictions harder. To realize effective self-training in the multi-label setting, we propose ML-DST and ML-DST+, which utilize contextualized document representations from pretrained language models. We introduce a BERT-based multi-label classifier and newly designed weighted loss functions for finetuning. Two label propagation-based algorithms, SemLPA and SemLPA+, are also proposed to enhance multi-label prediction; their similarity measure is iteratively improved through semantic-space finetuning, in which the semantic space formed by document representations is finetuned to better reflect learnt label correlations. High-confidence label predictions are identified by examining the prediction score on each category separately, and these predictions are in turn used for both classifier finetuning and semantic-space finetuning. In our experiments, our approach consistently outperforms representative baselines under different label rates, demonstrating its superiority.
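The abstract describes recognizing high-confidence pseudo-labels by examining the prediction score on each category separately, rather than requiring a credible full label vector. A minimal sketch of this per-category selection idea is below; the threshold values and function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def select_pseudo_labels(scores, pos_thresh=0.9, neg_thresh=0.1):
    """Per-category pseudo-label selection (illustrative sketch).

    scores: (n_docs, n_labels) array of sigmoid outputs from a
    multi-label classifier. Each category is decided independently:
    scores near 1 become positive pseudo-labels, scores near 0 become
    negative ones, and everything in between is left undecided (-1).
    Returns the pseudo-label matrix and a mask of documents whose
    every category was decided confidently.
    """
    pseudo = np.full(scores.shape, -1, dtype=int)
    pseudo[scores >= pos_thresh] = 1   # confident positive
    pseudo[scores <= neg_thresh] = 0   # confident negative
    confident_docs = (pseudo != -1).all(axis=1)
    return pseudo, confident_docs

scores = np.array([[0.95, 0.05, 0.50],   # one undecided category
                   [0.99, 0.02, 0.97]])  # all categories confident
pseudo, mask = select_pseudo_labels(scores)
```

Deciding each category separately avoids searching the exponentially large space of full label vectors, which is the difficulty the abstract attributes to the multi-label setting.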

Journal: International Journal on Digital Libraries
Publication status: Published - March 2024

ASJC Scopus subject areas

  • Library and Information Sciences

