Effective text extraction and recognition for WWW images

Jun Sun*, Zhulong Wang, Hao Yu, Fumihito Nishino, Yukata Katsuyama, Satoshi Naoi

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Citations (Scopus)


Images play an important role in web content delivery. Many WWW images contain text that can be used for web indexing and search. This paper proposes a new text extraction and recognition algorithm. Character strokes are first extracted from the image by color clustering and connected component analysis. A novel stroke verification algorithm then removes non-character strokes. The verified strokes are used to build a binary text line image, which is segmented and recognized by dynamic programming. Because text in a WWW image is usually closely related to the content of the webpage, approximate string matching is used to revise the recognition result by matching the webpage content against the content in the image. This post-processing not only improves recognition performance but can also be used in other applications, such as establishing correspondence between images and webpage paragraphs.
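
The abstract does not say which approximate string matching method the authors use, so the following is only a minimal sketch of the general idea, assuming a standard edit-distance dynamic program with a free starting position in the webpage text. The function names `best_approximate_match` and `revise_with_page_text` and the `max_error_rate` threshold are hypothetical and do not come from the paper.

```python
from __future__ import annotations


def best_approximate_match(ocr_text: str, page_text: str) -> tuple[int, int, int]:
    """Find the substring of page_text with the smallest edit distance to ocr_text.

    Returns (distance, start, end) so that page_text[start:end] is the match.
    Assumption: unit-cost insertions, deletions and substitutions.
    """
    m = len(ocr_text)
    # Column for the empty page prefix: matching i OCR characters costs i edits.
    prev_cost = list(range(m + 1))
    prev_start = [0] * (m + 1)
    best = (m, 0, 0)  # fall back to the empty substring
    for j, page_char in enumerate(page_text, start=1):
        # Row 0 costs nothing: a match may begin at any page position.
        cur_cost = [0] * (m + 1)
        cur_start = [j] * (m + 1)
        for i in range(1, m + 1):
            mismatch = ocr_text[i - 1] != page_char
            candidates = (
                (prev_cost[i - 1] + mismatch, prev_start[i - 1]),  # match or substitute
                (prev_cost[i] + 1, prev_start[i]),                 # extra page character
                (cur_cost[i - 1] + 1, cur_start[i - 1]),           # extra OCR character
            )
            cur_cost[i], cur_start[i] = min(candidates, key=lambda c: c[0])
        if cur_cost[m] < best[0]:
            best = (cur_cost[m], cur_start[m], j)
        prev_cost, prev_start = cur_cost, cur_start
    return best


def revise_with_page_text(ocr_text: str, page_text: str,
                          max_error_rate: float = 0.25) -> str:
    """Replace an OCR result with the closest page substring if it is close enough.

    max_error_rate is a hypothetical threshold chosen for illustration only.
    """
    if not ocr_text:
        return ocr_text
    distance, start, end = best_approximate_match(ocr_text, page_text)
    if distance / len(ocr_text) <= max_error_rate:
        return page_text[start:end]
    return ocr_text


if __name__ == "__main__":
    page = "Proceedings of the ACM Symposium on Document Engineering, Grenoble"
    print(revise_with_page_text("Docurnent Engineerlng", page))
```

In this sketch a recognition result is overwritten only when the webpage contains a sufficiently similar string. Otherwise the original OCR output is kept, so text that appears only in the image is not corrupted by unrelated page content.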

Original language: English
Title of host publication: Proceedings of the 2003 ACM Symposium on Document Engineering
Publisher: Association for Computing Machinery (ACM)
Number of pages: 3
ISBN (Print): 1581137249, 9781581137248
Publication status: Published - 2003
Externally published: Yes
Event: Proceedings of the 2003 ACM Symposium on Document Engineering - Grenoble, France
Duration: 2003 Nov 20 → 2003 Nov 22

Publication series

Name: Proceedings of the 2003 ACM Symposium on Document Engineering


Conference: Proceedings of the 2003 ACM Symposium on Document Engineering

Keywords


  • Approximate matching
  • Text extraction
  • Text recognition

ASJC Scopus subject areas

  • Engineering(all)

