Multi-modal joint embedding for fashion product retrieval

A. Rubio, Longlong Yu, E. Simo-Serra, F. Moreno-Noguer

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Citations (Scopus)


Finding a product in the fashion world can be a daunting task. Every day, e-commerce sites are updated with thousands of images and their associated metadata (textual information), deepening the problem, akin to finding a needle in a haystack. In this paper, we leverage both the images and the textual metadata and propose a joint multi-modal embedding that maps both text and images into a common latent space. Distances in the latent space correspond to similarity between products, allowing us to effectively perform retrieval in this latent space, which is both efficient and accurate. We train this embedding on large-scale real-world e-commerce data by both minimizing the distance between related products and using auxiliary classification networks that encourage the embedding to have semantic meaning. We compare against existing approaches and show significant improvements in retrieval tasks on a large-scale e-commerce dataset. We also provide an analysis of the different metadata.
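The core idea of the abstract can be sketched in a few lines of NumPy. This is a hypothetical illustration, not the authors' code: two linear "encoders" (`W_img`, `W_txt`, invented names) map image and text features into a shared latent space, and a contrastive loss pulls matching image/text pairs together while pushing mismatched pairs apart, so that latent-space distance tracks product similarity; the paper's auxiliary classification networks are omitted here.

```python
import numpy as np

# Hypothetical minimal sketch of a joint image/text embedding with a
# contrastive (distance-based) loss. All names and dimensions below are
# assumptions for illustration only.
rng = np.random.default_rng(0)

DIM_IMG, DIM_TXT, DIM_LATENT = 512, 300, 128
W_img = rng.normal(scale=0.01, size=(DIM_LATENT, DIM_IMG))  # image branch
W_txt = rng.normal(scale=0.01, size=(DIM_LATENT, DIM_TXT))  # text branch

def embed_image(x):
    """Project an image feature vector into the shared latent space."""
    z = W_img @ x
    return z / np.linalg.norm(z)  # L2-normalize so distances are comparable

def embed_text(t):
    """Project a text (metadata) feature vector into the shared latent space."""
    z = W_txt @ t
    return z / np.linalg.norm(z)

def contrastive_loss(z_a, z_b, match, margin=0.2):
    """Pull matching pairs together; push non-matching pairs past a margin."""
    d = np.linalg.norm(z_a - z_b)
    return d**2 if match else max(0.0, margin - d)**2

# Toy "product": training would minimize this loss so the product's image
# and its textual metadata land close together in the latent space.
x = rng.normal(size=DIM_IMG)   # stand-in for CNN image features
t = rng.normal(size=DIM_TXT)   # stand-in for text features
z_i, z_t = embed_image(x), embed_text(t)
loss = contrastive_loss(z_i, z_t, match=True)
```

At retrieval time, a query from either modality is embedded the same way and its nearest neighbors in the latent space are returned, which is what makes the search efficient.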

Original language: English
Title of host publication: 2017 IEEE International Conference on Image Processing, ICIP 2017 - Proceedings
Publisher: IEEE Computer Society
Number of pages: 5
ISBN (Electronic): 9781509021758
Publication status: Published - 2017 Jul 2
Event: 24th IEEE International Conference on Image Processing, ICIP 2017 - Beijing, China
Duration: 2017 Sept 17 to 2017 Sept 20

Publication series

Name: Proceedings - International Conference on Image Processing, ICIP
ISSN (Print): 1522-4880


Other: 24th IEEE International Conference on Image Processing, ICIP 2017


Keywords

  • Multi-modal embedding
  • Neural networks
  • Retrieval

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
  • Signal Processing


