Abstract
We have developed a method that retrieves Japanese document images more accurately by combining OCR character-candidate information with a conventional plain text search engine. In this method, the document image is first recognized by standard OCR to produce text. Keyword areas are then estimated from this OCR text through morphological analysis. A lattice of candidate-character codes is extracted from these areas, and character strings are then extracted from the lattice using a word-matching method in noun areas and a K-th DP-matching method in undefined word areas. Finally, the extracted character strings are added to the OCR text, which improves retrieval accuracy when the text is indexed by a conventional plain text search engine. Experimental searches of 49 overhead projector (OHP) sheet images showed that our method achieves a recall rate of 98.2%, compared to 90.3% for a conventional method that uses only the standard OCR text, while requiring about the same processing time as standard OCR.
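As a rough illustration of the word-matching step described in the abstract, the following Python sketch searches a small candidate-character lattice for lexicon words and appends any that the top-ranked OCR reading missed. The lattice layout, the function names (`top1_text`, `match_words`, `augment_text`), the sample lexicon, and the space-separated output format are all assumptions made for illustration; the paper's actual data structures are not given here, and the K-th DP-matching step for undefined word areas is not covered.

```python
from typing import List, Set, Tuple

# Hypothetical lattice layout: lattice[i] holds the OCR candidates for
# character position i, ordered from most to least likely.
Lattice = List[List[str]]


def top1_text(lattice: Lattice) -> str:
    """Return the plain OCR reading: the best candidate at each position."""
    return "".join(candidates[0] for candidates in lattice)


def match_words(lattice: Lattice, lexicon: Set[str]) -> List[Tuple[int, str]]:
    """Find lexicon words that can be spelled from the lattice.

    A word matches at `start` if each of its characters appears among
    the candidates at the corresponding position.
    """
    hits = []
    for word in sorted(lexicon):
        n = len(word)
        for start in range(len(lattice) - n + 1):
            if all(word[k] in lattice[start + k] for k in range(n)):
                hits.append((start, word))
    return hits


def augment_text(lattice: Lattice, lexicon: Set[str]) -> str:
    """Append matched words that the top-ranked reading does not contain."""
    text = top1_text(lattice)
    extra: List[str] = []
    for _, word in match_words(lattice, lexicon):
        if word not in text and word not in extra:
            extra.append(word)
    return " ".join([text] + extra)


# Illustrative data (assumed): OCR ranks 険 above the correct 検.
sample_lattice: Lattice = [
    ["文", "丈"],
    ["書", "害"],
    ["険", "検", "験"],
    ["索", "素"],
]
sample_lexicon = {"文書", "検索", "画像"}
augmented = augment_text(sample_lattice, sample_lexicon)
```

With this sample data, a plain text search for 検索 would hit the augmented text even though the top-ranked OCR reading contains 険索 instead, which is the effect on recall that the abstract describes.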
Original language | English |
---|---|
Pages (from-to) | 57-67 |
Number of pages | 11 |
Journal | Proceedings of SPIE - The International Society for Optical Engineering |
Volume | 4670 |
Publication status | Published - 2002 |
Externally published | Yes |
Event | Document Recognition and Retrieval IX, San Jose, CA, United States (2002 Jan 21 to 2002 Jan 22) |
Keywords
- Document image
- Document retrieval
- Document-management systems
- Morphological analysis
- OCR
ASJC Scopus subject areas
- Electronic, Optical and Magnetic Materials
- Condensed Matter Physics
- Computer Science Applications
- Applied Mathematics
- Electrical and Electronic Engineering