TY - GEN
T1 - Automatic metadata generation and video editing based on speech and image recognition for medical education contents
AU - Tamura, Satoshi
AU - Hashimoto, Koji
AU - Jiong, Zhu
AU - Hayamizu, Satoru
AU - Asai, Hirotsugu
AU - Tanahashi, Hideki
AU - Kanagawa, Makoto
PY - 2006
Y1 - 2006
AB - This paper reports a metadata generation system and an automatic video editing system. Metadata are data that describe other data. In the audio metadata generation system, speech recognition is applied to input speech using a general language model (LM) and a specialized LM to obtain segments (event groups) and audio metadata (event information), respectively. In the video editing system, visual metadata obtained by image recognition are combined with audio metadata into audio-visual metadata. Multiple videos are then edited into a single video using the audio-visual metadata. Experiments were conducted to evaluate the event detection of the systems using medical education content (ACLS and BLS). The audio metadata system achieved an event detection correctness of about 78%. In the editing system, an event correctness of 87% was obtained with audio-visual metadata, and a survey showed that the edited video is appropriate and useful.
KW - Audio-visual integration
KW - Automatic video edit
KW - Metadata
KW - Speech recognition
UR - http://www.scopus.com/inward/record.url?scp=44949119513&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=44949119513&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:44949119513
SN - 9781604234497
T3 - INTERSPEECH 2006 and 9th International Conference on Spoken Language Processing, INTERSPEECH 2006 - ICSLP
SP - 2466
EP - 2469
BT - INTERSPEECH 2006 and 9th International Conference on Spoken Language Processing, INTERSPEECH 2006 - ICSLP
PB - International Speech Communication Association
T2 - INTERSPEECH 2006 and 9th International Conference on Spoken Language Processing, INTERSPEECH 2006 - ICSLP
Y2 - 17 September 2006 through 21 September 2006
ER -