TY - GEN
T1 - Fast title extraction method for business documents
AU - Katsuyama, Yutaka
AU - Naoi, Satoshi
PY - 1997
Y1 - 1997
AB - Conventional electronic document filing systems are inconvenient because the user must specify keywords for each document to enable later searches. To solve this problem, automatic keyword extraction methods using natural language processing and character recognition have been developed. However, these methods are slow, especially for Japanese documents. To develop a practical electronic document filing system, we focused on extracting keyword areas from a document by image processing. Our fast title extraction method automatically extracts titles as keywords from business documents. Every character string is rated with points that measure its similarity to a title. We classified these points into four categories: character string size, position of character strings, relative position among character strings, and string attribution. The character string with the highest rating is then selected as the title area, and character recognition is carried out on the selected area only. The method is fast because recognition has to handle only the small number of patterns in the selected area rather than the entire document. On average, the method achieved an accuracy of about 91 percent and a processing time of 1.8 seconds in a test on 100 Japanese business documents.
UR - http://www.scopus.com/inward/record.url?scp=0031335348&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0031335348&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:0031335348
SN - 0819424382
T3 - Proceedings of SPIE - The International Society for Optical Engineering
SP - 192
EP - 201
BT - Proceedings of SPIE - The International Society for Optical Engineering
PB - Society of Photo-Optical Instrumentation Engineers
T2 - Document Recognition IV
Y2 - 12 February 1997 through 13 February 1997
ER -