TY - JOUR
T1 - RiFeGAN2
T2 - Rich Feature Generation for Text-to-Image Synthesis from Constrained Prior Knowledge
AU - Cheng, Jun
AU - Wu, Fuxiang
AU - Tian, Yanling
AU - Wang, Lei
AU - Tao, Dapeng
N1 - Funding Information:
This work was supported in part by the National Natural Science Foundation of China under Grant U21A20487, Grant U1913202, and Grant U1813205; in part by the Shenzhen Technology Project under Grant JCYJ20200109113416531, Grant JCYJ20180507182610734, and Grant JCYJ20180302145648171; and in part by the CAS Key Technology Talent Program.
Publisher Copyright:
© 1991-2012 IEEE.
PY - 2022/8/1
Y1 - 2022/8/1
N2 - Text-to-image synthesis is a challenging task that generates realistic images from a textual description. The description contains limited information compared with the corresponding image and is ambiguous and abstract, which complicates generation and leads to low-quality images. To address this problem, we propose a novel text-to-image synthesis method, called RiFeGAN2, that enriches the given description. To improve the enrichment quality while accelerating the enrichment process, RiFeGAN2 exploits a domain-specific constrained model to limit the search scope and then uses an attention-based caption matching model to refine the compatible candidate captions based on constrained prior knowledge. To improve the semantic consistency between the given description and the synthesized results, RiFeGAN2 employs improved SAEMs, SAEM2s, to better compact the features of the retrieved captions and to effectively emphasize the given description by incorporating centre-attention layers. Finally, multi-caption attentional GANs are exploited to synthesize images from those features. Experiments on widely used datasets show that the models can generate vivid images from enriched captions and effectively improve semantic consistency.
AB - Text-to-image synthesis is a challenging task that generates realistic images from a textual description. The description contains limited information compared with the corresponding image and is ambiguous and abstract, which complicates generation and leads to low-quality images. To address this problem, we propose a novel text-to-image synthesis method, called RiFeGAN2, that enriches the given description. To improve the enrichment quality while accelerating the enrichment process, RiFeGAN2 exploits a domain-specific constrained model to limit the search scope and then uses an attention-based caption matching model to refine the compatible candidate captions based on constrained prior knowledge. To improve the semantic consistency between the given description and the synthesized results, RiFeGAN2 employs improved SAEMs, SAEM2s, to better compact the features of the retrieved captions and to effectively emphasize the given description by incorporating centre-attention layers. Finally, multi-caption attentional GANs are exploited to synthesize images from those features. Experiments on widely used datasets show that the models can generate vivid images from enriched captions and effectively improve semantic consistency.
KW - Text-to-image synthesis
KW - multiple captions
KW - prior knowledge
UR - http://www.scopus.com/inward/record.url?scp=85122105011&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85122105011&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2021.3136857
DO - 10.1109/TCSVT.2021.3136857
M3 - Article
AN - SCOPUS:85122105011
SN - 1051-8215
VL - 32
SP - 5187
EP - 5200
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 8
ER -