This work describes the design and implementation of a complex system for multimodal segmentation and indexing of video streams. The software presented here is capable of cutting the given videos into logically well-defined parts, mainly shots, using both the visual and the auditory data from the videos. In addition, the system can locate user-given images in the video stream, and thus implements a visual indexing and search method for finding semantic data in videos.
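The abstract does not yet fix a particular segmentation algorithm, but the shot-cutting capability can be illustrated with a common baseline: comparing the color (here, gray-level) histograms of consecutive frames and declaring a hard cut wherever the difference exceeds a threshold. Everything below, including the function names and the threshold value, is an illustrative sketch, not the method of the system described in this work.

```python
import numpy as np

def frame_histogram(frame, bins=16):
    """Normalized gray-level histogram of one frame (H x W uint8 array)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def detect_cuts(frames, threshold=0.5):
    """Return indices i where a hard cut is assumed between frames i-1 and i.

    A cut is flagged when the L1 distance between consecutive frame
    histograms exceeds the (illustrative) threshold.
    """
    cuts = []
    prev = frame_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = frame_histogram(frame)
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# Synthetic demo: 10 dark frames followed by 10 bright frames.
dark = [np.full((32, 32), 20, dtype=np.uint8) for _ in range(10)]
bright = [np.full((32, 32), 220, dtype=np.uint8) for _ in range(10)]
print(detect_cuts(dark + bright))  # → [10]
```

Real systems refine this baseline with adaptive thresholds and additional cues (motion, audio), which is in line with the multimodal approach this work takes.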
In the first part of the document I present the basic principles of image and video processing, with some real-life examples. Then I explain the different layers and units used for organizing videos, and define the three modalities: visual, auditory, and textual. In the last part of the theoretical introduction I survey the different sources of multimodal data and describe methods for extracting and processing them.
In the main part of the work I present the design and implementation of the system, along with the main technical details and the algorithms I used. First, I explain how the visual data is processed and which features I use for this task. Then I present the core mathematical model of the system, together with the methods used for training and evaluating it. After that, I describe how the sound track of the video stream is processed and combined with the visual processing. Finally, I explain the implemented video indexing and visual search methods, as well as the metric I devised for measuring the quality of the search/matching.