H.-G. Kim et al.: Real-Time Highlight Detection in Baseball Video for TVs with Time-shift Function 831

Real-Time Highlight Detection in Baseball Video for TVs with Time-shift Function

Hyoung-Gook Kim, Jinguk Jeong, Jang-Heon Kim and Jin Young Kim

Abstract: Time-shift is a crucial function of interactive televisions such as DVRs and internet TV broadcasting services. Automatic detection of important events lets users take advantage of the time-shift function conveniently. In this paper, we propose a method to extract important events in baseball videos. In the proposed method, we first detect play scenes and audio events separately from the video and audio tracks. For robust play scene extraction, we propose an off-line learning model with local adaptation based on the ongoing analyzed video. The audio event detection is implemented with an SVM-based classifier. The final important events are determined in real time by combining the audio and visual detection results. We evaluated our method with a baseball database of Korean and Major League games. Experimental results show that the implemented system runs in real time and achieves 0.85 recall and 0.97 precision.¹

Index Terms: Event Detection, Video Summarization, Real Time Video Analysis, Video Indexing.

I. INTRODUCTION

Recently, the industries involved with internet TV broadcasting (e.g., TiVo), PVRs, and DVRs with a time-shift function (TSF) have expanded. Such interactive TVs have changed the way people watch TV: viewers can jump to whatever time position in the video they want. However, it is difficult for users to find the intended time position, because they cannot comprehend the whole story of the broadcast content. Therefore, to make the time-shift function more convenient, it is necessary to provide users with automatic detection of important events in a video. The research on automatic event detection in sports videos has a long history.
There have been several research studies on extracting play scenes as a basic unit of a sports video. In [1]-[3], sports videos are segmented into play scenes and break scenes with a static template defined by low-level features such as color information. Zhong et al. and Chang et al. proposed frameworks that combine domain-specific knowledge with supervised machine learning methods [4], [5]. An adaptive model-based method was also proposed by Wu et al. [6]. However, the performance of the template-based methods [1]-[3] and the supervised machine learning-based methods [4], [5] has yet to be improved, because a static template or a static model cannot cover the variety of broadcasting conditions such as different stadiums, changing weather, and so on. The adaptive model-based method [6] cannot be applied to TVs with TSF, because it needs the statistical information of the whole video stream.

Several research studies on automatic extraction of important event scenes have been reported. These studies can be classified into the audio analysis-based approach [7], the video analysis-based approach [8], [9], the caption recognition-based approach [10], and the multi-modal analysis-based approach [11].

¹ This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund) (KRF-2007-331-D00339).
Hyoung-Gook Kim is with the Intelligent Multimedia Signal Processing Lab., Kwangwoon University, South Korea (email: hkim@kw.ac.kr).
Jinguk Jeong is with Samsung Advanced Institute of Technology, South Korea (email: jinguk.jeong@samsung.com).
Jang-Heon Kim is with Technical University Berlin, Germany (email: j.kim@nue.tu-berlin.de).
Jin Young Kim is with Chonnam National University, South Korea (email: beyondi@chonnam.ac.kr).
Contributed Paper. Manuscript received April 3, 2008.
However, the audio analysis-based approach [7] cannot detect the start and end time positions of different events, because it does not segment the play scenes. The other approaches [8]-[11] cannot be applied to TVs with a time-shift function.

Fig. 1. Basic Concept of the Proposed System

In this paper we propose a real time event detection algorithm based on multi-modal analysis. The TV system that adopts the proposed algorithm is illustrated in Fig. 1. In this figure, the play function plays the broadcast stream while the processing module simultaneously analyzes the ongoing video and detects events. This concurrency enables important events to be extracted before they are shown on a monitor. Generally, real time processing means that the processing is finished by the end of the play. In this paper, we define real-time processing as processing that can detect events as soon as they are played.

In the proposed method, we first extract play scenes and audio events separately from the video and audio streams. An audio event refers to a time position that contains the excited speech and cheering of announcers during an event such as a hit or home run. For robust play scene extraction, we use not only an off-line learning model, but also a local adaptation model generated from the ongoing analyzed video. Audio events are extracted with a support vector machine (SVM) based approach. Finally, important events are determined using the information of the play scenes and the audio events extracted in the previous stage.

0098 3063/08/$20.00 2008 IEEE
IEEE Transactions on Consumer Electronics, Vol. 54, No. 2, MAY 2008

The rest of the paper is organized as follows. In section II, we give a problem definition for event detection in baseball videos. In section III, we present our proposed event detection algorithm based on multi-modal processing.
In section IV, we present our experimental results on a baseball database of Korean and Major League games, and finally we give our conclusions in section V.

II. PROBLEM DEFINITION

Generally, the important events that viewers want to watch are diverse, because viewers' intentions and tastes vary. However, important events such as a goal in soccer or a home run in baseball can be defined more definitely in sports video than in other video genres. To define important events in our implementation of time-shift, we analyzed nine sports news broadcasts as a pre-study. In these broadcasts, we identified the following dominant event types: 34 score events, 3 hit events, 1 strike-out event, and 2 good defense events. From these observations, we regard the scoring event as the most important event that must be detected, because most of the events (34 of 40) are scoring events. The three other event types are considered secondary elements of the important event set. Thus the main problem is to detect scoring events in baseball videos. Of course, the detection should be very accurate for users' satisfaction. Also, unlike previous research studies, we aim to detect important events in real time. Thus we define three important issues in sports highlight detection as follows.

- Immediacy (Real Time Processing): Important events should be detected as they are being played. The playback positions of important events can then be proposed in real time for viewers' convenience.
- Accuracy: For viewers to understand the content by watching only the detected events, it is very important to detect the important events accurately.
- Locality: Important events should be detected using local information of the current content in order to improve the accuracy of event detection.

On the other hand, the automatic detection of important events in baseball begins with extracting the play scenes that include important events, because all events are contained within play scenes. So, the problem is to segment the video into play scenes and to determine the importance of the current scene. Some notations needed for our problem definition are presented in Fig. 2. In this figure, let tc be the current time and tm be the analyzed time. The set of n play scenes played before tc is represented by Ps = {p1, ..., pn}, and Is = {I1, ..., Im} is the set of m important events played before the current time. The purpose of the proposed scheme is that the detection of Ps and Is has to be finished by the current time.

Fig. 2. Notations for problem definition

III. SYSTEM FRAMEWORK FOR DETECTING HIGHLIGHTS

In this section, we propose an event detection algorithm that automatically extracts the important events from the input stream. The algorithm is largely based on audio-visual analysis of the input stream. The overall approach is shown in Fig. 3.

Fig. 3. Flow diagram of the proposed algorithm

First, the input stream is demultiplexed into video and audio streams. Second, the play scenes and the audio events are detected from the video stream and the audio stream respectively. Third, the results of the play scene detection and the audio event detection are combined for the final decision on highlight events. In the following subsections we describe the play scene detection method and the important event detection algorithm, including the audio event detection.

A. Play Scene Detection Algorithm

Baseball video is well structured; in particular, it is a sequence of plays and breaks. So, it is important to extract play scene units, because most important events arise within a play scene. By observing many baseball games, we found that every play scene starts with a pitching shot and ends with close-up shots.
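The observation that a play scene opens with a pitching shot and closes with a close-up shot suggests a simple segmentation rule over a stream of already-classified shots. The following is only a minimal sketch of that rule, not the authors' implementation: the `Shot` type, the label strings, and the choice to include the closing close-up shot in the scene are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Shot:
    start: float  # shot start time in seconds
    end: float    # shot end time in seconds
    label: str    # hypothetical label: "pitch", "closeup", or "other"


def segment_play_scenes(shots: Iterable[Shot]) -> List[Tuple[float, float]]:
    """Return (start, end) times of play scenes found in a shot sequence."""
    scenes: List[Tuple[float, float]] = []
    scene_start = None  # start time of the currently open play scene, if any
    for shot in shots:
        if shot.label == "pitch" and scene_start is None:
            # A pitching shot opens a new play scene.
            scene_start = shot.start
        elif shot.label == "closeup" and scene_start is not None:
            # The first close-up after a pitch closes the play scene
            # (assumption: the close-up itself belongs to the scene).
            scenes.append((scene_start, shot.end))
            scene_start = None
    return scenes


shots = [
    Shot(0.0, 5.0, "other"),
    Shot(5.0, 9.0, "pitch"),
    Shot(9.0, 14.0, "other"),
    Shot(14.0, 17.0, "closeup"),
    Shot(17.0, 30.0, "other"),
    Shot(30.0, 34.0, "pitch"),
    Shot(34.0, 36.0, "closeup"),
]
print(segment_play_scenes(shots))  # [(5.0, 17.0), (30.0, 36.0)]
```

Because the rule only looks at the current shot and one piece of state, it can run as shots arrive, which is in keeping with the immediacy requirement of section II.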
Thus we propose a play scene detection method composed of pitching shot detection (PSD) and close-up shot detection (CSD), as shown in Fig. 4. In Fig. 4, the input to the play scene detection module is the detected shots. The shot boundary detection is performed as follows. From the MPEG-2 video encoder, the scene change time along with key-frame extraction is estimated based on HSV color histograms. The distances between color histograms in the HSV color space are used for the detection of scene boundaries. In detecting the scene boundaries, only hard cuts are considered, as they are the predominant scene transition type in the broadcast input video streams. PSD and CSD are presented in detail in the following subsections.

Fig. 4. Play scene detection algorithm

1) Pitching Shot Detection

Since the pitching shots are always taken by a fixed camera behind the pitcher, as shown in Fig. 5, all of the pitching shots have very consistent structures that do not vary much within the same game. However, the color features of pitching shots differ greatly from game to game because of varying broadcasting conditions such as the weather and the layout of the stadium.

Fig. 5. Examples of the pitching shots

Considering these dynamic characteristics, we propose a real time play scene detection algorithm. In the algorithm, we utilize not only an off-line learning model, but also a local model adapted to the ongoing analyzed video. This is necessary because the variety of broadcasting styles cannot be covered with only the off-line learning model. Thus we need to analyze the statistical information of the current stream in order to apply a local adaptive model. The proposed pitching shot detection algorithm is presented in Fig. 6.

Fig. 6. Flowchart of the pitching shot detection algorithm

First, a pitching shot in the input video stream is detected by using the offline pitching shot model. For learning the offline model, a linear SVM is applied, because its decision can be computed quickly compared with a kernel-based SVM and it is suited to a binary decision. The linear SVM model uses geometric properties to calculate the optimal separating hyperplane, expressed by a linear function of the inputs such as (1).

$$f(X) = W^{T} X + b, \qquad X = \{x_1, \ldots, x_n\} \qquad (1)$$

where W is the weight vector and b is the bias. We classify each pitching shot using just the return value of f(X). Therefore, only a few linear calculations are needed for the pitching shot detection. Since the pitching shots have a specific pattern, as shown in Fig. 5, the edge component histogram (ECH) in MPEG-7 [12] is adopted as the visual feature vector.

Second, using the results of the PSD based on the offline model, we perform the learning local adaptation (LLA) module. We use only the shots that have a good confidence value (CV). That is, input shots with a CV lower than a given model threshold are discarded. The LLA model is trained on only a half inning or one inning of a game. After the adaptation model training is finished, only the LLA model is used for pitching shot detection.

Fig. 7 shows the proposed learning algorithm for the local adaptive model, which is based on a simple clustering scheme. In the early training stage, every pitching shot detected by the offline model is considered a sample shot for clustering. The adaptation procedure is as follows.

Fig. 7. Flowchart of learning algorithm for the local adaptive model

- (step 1: clustering) The difference values between each cluster and an input key frame of the detected shot are calculated. The input key frame is assigned to the cluster with the minimum difference value, provided that value is less than the cluster threshold. When the minimum value is larger than the cluster threshold, a new cluster is generated from the input key frame. The difference value is the Euclidean distance between the hue saturation value (HSV) histogram [13] of the input key frame and the averaged HSV histogram of the sample vectors belonging to each cluster. There are some common clustering algorithms such as SOM clustering, ISODATA and k-means clustering, but they are not adequate for real time processing, because they require more calculation than the proposed clustering algorithm shown in Fig. 7.

- (step 2: local model decision) After the clustering process, the cluster with the most key frames is selected as the representative pitching shot cluster, which is used to generate the actual local adaptive model for pitching shot detection. Since all of the pitching shots have similar characteristics, as explained above, it is likely that many key frames of pitching shots are clustered together. Thus, we select the cluster that includes the most key frames as the pitching shot cluster. After that, the CV shown in Fig. 6 and Fig. 7 is used to confirm that the selected pitching shot cluster is suitable for making the final local adaptive model. This CV represents the degree of cluster confidence and is calculated with the cluster scatter value of the Davies-Bouldin index [14]. The local adaptive model Mp is obtained by averaging the features of the key frames in the selected cluster. In this paper we used ECHs and HSV color histograms as the features for a given shot. The ECH and the HSV color histogram complement each other, because the ECH characterizes the composition of the frame and the HSV color histogram characterizes the color information of the frame. These features are also suitable for real time processing because they are fast to extract. Mp is constructed as Mp = {Me, Mc}, where Me and Mc represent the 80-bin ECH model and the 36-bin HSV color histogram model respectively. Me and Mc are defined as Me = {Me^0, ..., Me^79} and Mc = {Mc^0, ..., Mc^35}, where Me^i is the value of the ith bin of the ECH model and Mc^i is the value of the ith bin of the HSV color histogram model. Me^i and Mc^i can be computed as follows:

$$M_e^i = \frac{\sum_{t \in S_p} e_t(i)}{N(S_p)}, \qquad M_c^i = \frac{\sum_{t \in S_p} c_t(i)}{N(S_p)} \qquad (2)$$

where e_t(i) and c_t(i) represent the ith ECH value and the ith HSV color histogram value extracted from the tth frame, S_p is the pitching shot cluster, and N(S_p) is the number of frames in S_p.

- (step 3: adaptation) As shown in Fig. 6, a pitching shot is detected by the local adaptive model when the local adaptive model is reliable. For the detection of a pitching shot using the local adaptive model, the weighted Euclidean distance is applied. The input frame is judged to be a pitching shot when the weighted Euclidean distance is lower than the threshold. The pitching shots detected in this way are also used for updating the local adaptive model, because a detected pitching shot may be a positive sample. The local adaptive model is updated by applying the simple gradient descent method as shown in (3).

$$M_e^i = M_e^i + F(M_e^i, I_e^i), \qquad M_c^i = M_c^i + F(M_c^i, I_c^i) \qquad (3)$$

where F(a, b) is an update term computed from the difference between a and b, and I_e^i and I_c^i represent the ith ECH value and the ith HSV color histogram value extracted from the current input frame.

2) Close-up Shot Detection

The end of a play scene should be detected together with the start of the next play scene, and it is determined by the CSD algorithm, because each play scene ends with a close-up shot. In a close-up shot, the key frame is dominated by a densely aggregated non-field region, because the players are shown up close. So, the field colors should be obtained in order to identify the non-field region. Fig. 8 represents the proposed field color detection algorithm.

Fig. 8. Field color detection algorithm

The field colors can be extracted from the key frame of a pitching shot.
In order to extract those colors, the key frame of a pitching shot is clipped in half and an HSV color histogram is extracted from the lower half of the frame image. The most dominant bins among the 12th to 23rd bins are selected, and we take the average HSV values of the pixels contained in those bins as the field colors. Since there are two types of field surface, grass and ground, we select two dominant bins. For efficiency, only the 12th to 23rd bins are checked, because several experiments showed that only this range is perceived as field area. The sliding window based approach [15] is applied to the close-up shot detection. The spatial window slides through the bottom half of the frame. If at least one of the spatial windows contains more non-field region than a threshold, the shot is judged to be a close-up shot.

B. Audio Events Detection

Important events such as a hit, a home run, or scoring are usually accompanied by cheering sounds (background crowd noise or an announcer's excited speech) or long play scenes. So, if we combine the detected audio event with the duration information of the corresponding play scene, the detection performance should be higher. In the real world, important events never appear in a short audio segment. Instead, such events appear in a much longer time frame, because most of the excitement in baseball occurs right after an event like a hit or home run. The duration of an important event is longer than that of other play scenes. The duration of a play scene can be easily obtained from the play scene detection algorithm explained in the previous section. Under these circumstances we propose an audio event detection (AED) algorithm as shown in Fig. 9. We describe the details of AED in this section.

Fig. 9. Audio event detection algorithm

1) Audio features

Several audio features are extracted based on MDCT coefficients, which are readily available from the AC-3 audio encoder as shown in Fig. 10. The MDCT coefficients are smoothed by a log-scale filter-bank {H(p,l) | 0 ≤ p < 21, 0 ≤ l < 576}, where the p-th filter is given by

$$
H(p,l) =
\begin{cases}
0, & l < f(p-1) \\
\dfrac{2\{l - f(p-1)\}}{\{f(p+1) - f(p-1)\}\{f(p) - f(p-1)\}}, & f(p-1) \le l \le f(p) \\
\dfrac{2\{f(p+1) - l\}}{\{f(p+1) - f(p-1)\}\{f(p+1) - f(p)\}}, & f(p) \le l \le f(p+1) \\
0, & l > f(p+1)
\end{cases}
\qquad (4)
$$

{f(p) | 0 ≤ p < 21} is the center frequency of each filter, which increases logarithmically. Then, the 21-order Logarithmic MDCT (LMDCT) feature is calculated as the log-energies of the output of the filter bank:

$$\mathrm{LMDCT}(p) = \ln\left(\sum_{l=0}^{576} \mathrm{MDCT}(l)\, H(p,l)\right) \qquad (5)$$

The resulting LMDCT feature is converted to the decibel scale:

$$\mathrm{LMDCT}_{dB}(n,p) = 10 \log_{10}\left(\mathrm{LMDCT}(n,p)\right) \qquad (6)$$

where p is the index of a logarithmic frequency range, and n is the frame index.

Fig. 10. Audio feature extraction and classification

Each decibel scale feature is normalized with the RMS energy envelope, yielding a Normalized LMDCT (NLMDCT). The full-rank features for each frame consist of both the RMS-norm gain value R_n and the NLMDCT vector NLMDCT(n,p):

$$R_n = \sqrt{\sum_{p=1}^{P} \left(\mathrm{LMDCT}_{dB}(n,p)\right)^2} \qquad (7)$$

and

$$\mathrm{NLMDCT}(n,p) = \frac{\mathrm{LMDCT}_{dB}(n,p)}{R_n} \qquad (8)$$

where P is the number of NLMDCT coefficients and L is the total number of frames.

The derivative of the NLMDCT features is added to the initial vector in order to take into account the temporal changes in the spectra. One way to capture this information is to use delta coefficients that measure the change in coefficients over time. These coefficients result from a linear regression over a few adjacent frames. The two previous and the two following frames are used to generate the delta NLMDCT Δc as follows:

$$\Delta c(l) = -c(l-2) - \frac{1}{2}\,c(l-1) + \frac{1}{2}\,c(l+1) + c(l+2) \qquad (9)$$

Considering the perceptual property of human ears, the LMDCT coefficients are divided into four sub-bands, each of which consists of critical bands that represent cochlear filters in the human auditory model. These four sub-bands are 0-630 Hz, 630-1720 Hz, 1720-4400 Hz, and 4400 Hz and above. Because the energy of human speech resides mostly in the middle two sub-bands, we extract the energy SE23 of these middle two sub-bands from the LMDCT to conduct the excited speech recognition.

2) Audio event detection algorithm

The NLMDCT features, delta NLMDCT, RMS-norm gain values, and excited speech energies SE23 are used to generate audio event models by SVM. The remarkable characteristic of the SVM is that it can automatically find the capacity required to learn the training samples without being over-trained. Once the event models are trained, an audio classification module using the SVM is run to find the best time-aligned audio event segments, namely those for which the likelihood of the observed features under the audio event models is the maximum. The segments classified as events are used to build groups of continuous segments, and all other sub-segments are ignored. Then the audio signals, except the silence regions, are classified into event regions or non-event regions.

Based on observation and prior knowledge, the announcer and crowd become excited at the end of an event. In addition, the event lasts for a much longer duration. For a candidate event, the last three seconds of its audio track and the first three seconds of the following shot should both contain at least one exciting point. Therefore, a filtering process is needed to select the candidate event regions with long duration among the classified events. Experimentally, we find that a candidate event segment has a minimum length of 10 seconds.
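The grouping and duration pre-filter just described can be illustrated with a short sketch. It assumes, purely for illustration, that the SVM has already produced one boolean decision per fixed-length audio segment; the function name, the one-second segment length, and the inclusive comparison against the 10-second minimum are assumptions rather than details given in the text, and silence handling is omitted.

```python
from typing import List, Sequence, Tuple


def candidate_events(is_event: Sequence[bool],
                     segment_sec: float = 1.0,
                     min_len_sec: float = 10.0) -> List[Tuple[float, float]]:
    """Group consecutive event segments and keep only the long groups.

    is_event[i] is the (hypothetical) per-segment classifier decision.
    Returns (start, end) times in seconds of the candidate event regions.
    """
    candidates: List[Tuple[float, float]] = []
    run_start = None  # index of the first segment of the current run
    # A trailing False closes a run that reaches the end of the input.
    for i, flag in enumerate(list(is_event) + [False]):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            start, end = run_start * segment_sec, i * segment_sec
            if end - start >= min_len_sec:  # duration pre-filter
                candidates.append((start, end))
            run_start = None
    return candidates


# 12 consecutive event seconds survive the filter; a 4-second burst does not.
labels = [False] * 3 + [True] * 12 + [False] * 2 + [True] * 4
print(candidate_events(labels))  # [(3.0, 15.0)]
```

The filter needs only the running start index of the current group, so it can be applied incrementally as classifier decisions arrive.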
The pre-filtered candidate event segments (>10 seconds) do not by themselves provide enough confidence for extracting the important events. To detect the real important events among the candidate events, we use a post-filtering process that measures the excited speech energies. At a baseball stadium, however, human speech is almost always mixed with other background noise. In this case, the distinguishing power of SE23 drops significantly, because the microphone's AGC amplifies the background noise level when the announcers are not talking. The energy level of a non-speech signal can therefore be as strong as that of normal speech. When an event arises, the distinguishing power of SE23 goes up significantly compared with the non-speech signal: the SE23 energy helps to filter out low-energy but high-variance background interference, and the delta NLMDCT Δc helps to filter out low-variance but high-energy noise.

C. AV Highlight Detection

In the previous subsections we explained the PSD and AED. Lastly, the important events are detected by combining the results of PSD and AED. The joint processing is implemented very simply by applying PSD and AED sequentially. First, PSD is performed to segment the video stream into play scene units. Because we trained the pitching shot model without commercial regions, we can easily reject commercial shots, in which audio events could otherwise be detected because of cheering sounds.

Second, we select the audio events contained in each play scene as the candidates of important events. Then, it is checked whether the current play scene contains the last pitch to a batter, because important events should occur on the last pitch. A play scene is taken to contain the last pitch when its length is longer than the threshold.
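The sequential combination described above reduces to two checks per play scene: is the scene long enough to count as a last pitch, and does an audio event fall within it? The sketch below shows one way to express this; the function names and the `min_scene_sec` threshold are hypothetical, since no threshold value is given in the text. Whether an audio event must lie entirely inside the play scene or merely overlap it is also not specified, so the sketch assumes that any overlap counts.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two time intervals share any duration."""
    return a[0] < b[1] and b[0] < a[1]


def detect_highlights(play_scenes: Sequence[Interval],
                      audio_events: Sequence[Interval],
                      min_scene_sec: float) -> List[Interval]:
    """Keep play scenes that are long and coincide with an audio event."""
    highlights: List[Interval] = []
    for scene in play_scenes:
        # Last-pitch test: the scene must be longer than the threshold.
        long_enough = (scene[1] - scene[0]) > min_scene_sec
        # Audio test: at least one detected audio event during the scene.
        has_audio = any(overlaps(scene, ev) for ev in audio_events)
        if long_enough and has_audio:
            highlights.append(scene)
    return highlights


scenes = [(5.0, 17.0), (30.0, 36.0), (50.0, 75.0)]
events = [(32.0, 44.0), (60.0, 72.0)]
print(detect_highlights(scenes, events, min_scene_sec=15.0))  # [(50.0, 75.0)]
```

Audio events that fall outside every play scene, such as cheering in a commercial, are never examined, which mirrors the commercial rejection described above.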
In short, an important event is detected when an audio event occurs during a play scene and the play scene is relatively long. This process can also run in real time, because we use only the features contained in the AC-3 and MPEG codecs, and the SVM needs little calculation in the decision stage.

IV. EXPERIMENTAL RESULTS

We evaluated our algorithm with a database of 26 baseball videos recorded from TV programs. The database consists of two US baseball videos and twenty-four Korean baseball videos. We deliberately included videos from different countries and stadiums to verify the robustness of our algorithm against a variety of environments. The videos are all encoded by MPEG-1 with 352*240 frame resolution or MPEG-2 with 720*480 frame resolution. As explained in Section II, we defined three requirements: immediacy, accuracy, and locality. We evaluate the proposed algorithm against two of them, immediacy and accuracy, because our algorithm satisfies locality as explained in the previous section.

To satisfy the immediacy requirement, it is important that the proposed algorithm can quickly process a unit of video stream (for example, a one second video sample). Therefore, the processing time for a one second video stream was measured. The experiments showed that it took an average of 0.024 sec to process a one second video stream on our experimental platform, a 2.8 GHz CPU with 768 MB of memory. This result shows that the proposed algorithm satisfies the immediacy requirement.

To measure the performance of the pitching shot detection algorithm, two commonly used metrics, recall and precision, are defined, and one more metric, accuracy, is defined to measure the close-up shot detection algorithm, as follows:

$$\mathrm{Recall} = \frac{N_c}{N_c + N_m}, \qquad \mathrm{Precision} = \frac{N_c}{N_c + N_f}, \qquad \mathrm{Accuracy} = \frac{P_c}{P_t} \qquad (10)$$

where N_c represents the number of correctly detected pitching shots, N_m is the number of missed pitching shots, and N_f denotes the number of falsely detected pitching shots. P_t represents the total number of play scenes in which the pitching shot is correctly detected, and P_c is the number of correctly detected close-up shots. Using the above measures, we tested our play scene segmentation algorithm on several baseball videos to evaluate the accuracy. The close-up shot detection algorithm is applied to play scenes after the pitching shot has been detected. Therefore, we only test the close-up shot detection algorithm on play scenes for which the pitching shot is correctly detected.

First, we evaluated the usefulness of the LLA model. For that, we performed the play scene detection with only the offline model. The detection ratios were between 51 and 55%. Adopting the LLA model improved the performance to 88~96%. Second, the PSD and CSD performances are shown in Table I.

TABLE I
EXPERIMENTAL RESULTS (PLAY SCENE)

                   PITCHING SHOT          CLOSE UP SHOT
VIDEO NAME         RECALL    PRECISION    ACCURACY
SS VS. SK          1.000     0.993        0.984
STL VS. FLO        0.992     0.993        0.947
HH VS. LG          0.986     0.945        0.930
KT VS. DS          0.971     0.990        0.956
SK VS. HD          0.979     0.963        0.969
SS VS. DS          1.000     0.952        0.959
SS VS. HH          0.992     1.000        1.000
SS VS. KT          0.998     0.983        0.967
SS VS. LG          0.957     0.984        0.963
SS VS. SK          0.994     0.981        0.967
KOREANSERIES5      0.990     0.990        0.965
KOREANSERIES6      0.993     0.986        0.949
KOREANSERIES7      0.969     1.000        1.000
BOS VS. NY         0.971     1.000        0.932
SS VS. DS 1        0.989     0.994        0.963
DS VS. HH          0.997     0.986        0.979
SS VS. DS          1.000     0.947        0.969
SS VS. KT          1.000     0.985        0.967
SS VS. KT          0.989     0.989        0.974
SS VS. LG          0.992     0.969        0.963
SS VS. DS 3        0.989     0.993        0.974
HD VS. HH          1.000     0.944        0.978
LG VS. SK          0.997     0.991        0.969
SK VS. LT          1.000     0.982        0.977
KOREANSERIES1      1.000     1.000        0.984
KOREANSERIES2      0.989     0.992        0.985
TOTAL              0.989     0.982        0.968
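The three metrics in (10) translate directly into code. The sketch below follows the symbol definitions given with the equation; the function names and the counts in the usage example are invented for illustration and are not taken from the experiments.

```python
def recall(n_correct: int, n_missed: int) -> float:
    """Recall = Nc / (Nc + Nm): share of true pitching shots that were found."""
    return n_correct / (n_correct + n_missed)


def precision(n_correct: int, n_false: int) -> float:
    """Precision = Nc / (Nc + Nf): share of detections that were correct."""
    return n_correct / (n_correct + n_false)


def accuracy(p_correct: int, p_total: int) -> float:
    """Accuracy = Pc / Pt: correctly detected close-up shots over the play
    scenes whose pitching shot was correctly detected."""
    return p_correct / p_total


# Hypothetical counts, for illustration only.
print(recall(n_correct=90, n_missed=10))     # 0.9
print(precision(n_correct=90, n_false=30))   # 0.75
print(accuracy(p_correct=45, p_total=50))    # 0.9
```

Note that accuracy is computed over a different denominator than recall and precision: it counts only play scenes whose pitching shot was found, so close-up errors are not charged for upstream pitching-shot misses.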
As shown in the table, the precision and recall of the proposed algorithm are high regardless of the broadcasting style, because the local model can be adaptively learned in a short time after the beginning of the game. It is significant that the important events can be correctly detected using our algorithm.

To measure the performance of the highlight detection, recall and precision metrics similar to (10) are used. However, the dominant event types explained in section II are used for calculating these metrics. That is, since the scoring events must be detected, N_c represents the number of correctly detected scoring events, N_m is the number of missed scoring events, and N_f denotes the number of falsely detected events that are not among the dominant event types. Our proposed method achieves 0.85 recall (141 scores/165 scores) and 0.97 precision (797/822). These results show that our algorithm satisfies the accuracy requirement for high quality highlight detection.

V. CONCLUSION

It is necessary to automatically detect important events in sports video for a more convenient time-shift function, because users cannot comprehend the whole story of the broadcast content at one time. Although research on automatic event detection in sports video has a long history, previous approaches could not detect the start and the end of an event, because they did not segment play scenes. In addition, the previous methods could not be applied to the time-shift function, because they require the statistical information of the whole video stream. In this paper we proposed a new real time event detection algorithm based on audio-visual joint processing. In the proposed method we first extracted the play scenes and the audio events from the video and audio streams.
For the robust play scenes extraction, we used not only the offline learning model, but also the local adaptive model generated from the ongoing analyzed video, and the audio events were extracted with SVM-based approach. Experimental results showed that our proposed method worked in real time and achieved remarkable performance. The proposed algorithm can be effectively applied to any TV with a time shift function. REFERENCES [1] L. Baoxin, and M. I. Sezan, Event Detection and Summarization in Sports Video, IEEE Workshop on Content Based Access of Image and Video Libraries, pp. 132-138, 2001. [2] G. Sudhir, J.C.M. Lee, and A.K. Jain, Automatic Classification of Tennis Video for High-Level Content-Based Retrieval , Proceedings of International Workshop on Content-Based Access of Image and Video Database, pp. 81-90, 1998. [3] W. Hua, M. Han, and Y. Gong, Baseball Scene Classification Using Multimedia Features, Proceedings of International Conference on Multimedia and Expo, pp. 821-824, 2002. [4] D. Zhong, and S. F. Chang, Structure Analysis of Sports Video Using Domain Models, Proceedings of International Conference on Multimedia and Expo, pp. 713-716, 2001. [5] P. Chang, M. Han, and Y. Gong, Extract Highlights from Baseball Game Video with Hidden Markov Models, Proceedings of International Conference on Image Processing, pp. 609-612, 2002. [6] J. Wu, X. Hua, J. Li, B. Zhang, and H. J. Zhang, An Online Learning Framework for Sports Video View Classification, Proceedings of Pacific Rim Conference on Multimedia, pp. 289-297, 2004. [7] I. Otsuka, R. Radharkishnan, M. Siracusa, A. Divakaran, and H. Mishima, An Enhanced Video Summarization System Using Audio Features for a Personal Video Recorder, IEEE Transactions on Consumer Electronics, pp. 168-172, 2006. [8] A. Ekin, A. M. Tekalp, and R. Mehrotra, Automatic Soccer Video Analysis and Summarization, IEEE Transactions on Image Analysis, pp. 796-807, 2003. [9] P. Chang, M. Hang, and Y. 
Gond, Extract Highlights from Baseball Game Video with Hidden Markov Models, Proceedings of International Conference on Image Processing, pp. 609-612, 2002. [10] D. Zhang, and S. F. Chang, Event Detection in Baseball Video Using Superimposed Caption Recognition, Proceedings of ACM international conference on Multimedia, pp. 315-318, 2002. [11] D. A. Sadlier, and N. E. O Connor, Event detection in field sports video using audio-visual features and a support vector Machine, IEEE Transactions on Circuits and Systems for Video Technology, pp. 12251233, 2005. [12] A. Yamada, M. Pickering, S. Jeannin, L. Cieplinski, and Jens, MPEG-7 Visual Part of eXperimentation Model Version 9.0, ISO/IEC JTC1/SC29/WG11/N3914, 2001. [13] L. Zhang, F. Lin, and B. Zhang, A CBIR Method Based on ColorSpatial Feature, Proceedings of IEEE Region 10 Conference, pp. 16616, 1999. [14] D. L. Davies, and D. W. Bouldin, A Cluster Separation Measure, IEEE Transactions on Pattern Recognition and Machine, pp. 224-227, 1979. [15] S. K. Kim, J. G. Jeong, E. H. Hwang, Personal Video-Casting System for Intelligent TV Browsing, Accepted by the Proceedings of International Workshop on Multimedia Content Analysis and Mining, 2007. Hyoung-Gook Kim received the diploma degree in electronic engineering and the Ph. D. degree in computer science from the Technical University of Berlin, Berlin, Germany. From 1998 to 2002, he worked on mobile service robots at Daimler Benz AG, speech recognition at Siemens AG, and speech signal processing at Cortologic AG, Germany. From 2002 to 2005, he served as adjunct professor of the Communication Systems Dept., Technical University of Berlin. From 2005 to 2007 he was a project leader in Samsung Advanced Institute of Technology, Korea. Since 2007 he has been a professor in the Wireless Communications Engineering Dept., Kwangwoon University, Korea. 
His research interests include audio signal processing, music information retrieval, audiovisual content indexing and retrieval, automatic segmentation, speech enhancement, and robust speech recognition. Jinguk Jeong Jinguk Jeong received his B.S., M.S. and Ph.D. in computer science from Sogang University, Korea, in 1998, 2000, and 2004, respectively. He is currently a researcher in Samsung Advanced Institute of Technology. His research interest includes multimedia computing system, content-based multimedia indexing and retrieval algorithm, and MPEG compression standard. Jang-Heon Kim received the M.S. in Electric and Electronic Engineering from Yonsei University, Seoul, Korea. From 2004 to 2007, he worked on 3D video analysis at Communication Systems Department of Technical University Berlin, Germany. He is an active research partner in EU 3DTV project Network of Excellence who aims at solving low-level vision problems for 3D applications. Since 2008 he is a researcher in Philips Medical Systems, Netherland. Jin Young Kim received the Ph.D degree in electronic engineering from the Seoul National University. He worked on speech synthesis at Korea Telecom from 1993 to 1994. Since 1995 he has been a professor in the Dept. of Electronics and Computer Eng., Chonnam National University. His research interests are speech synthesis, speech and speaker recognition, and audio-visual speech processing.
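
Note on the reported figures: the recall and precision values given in the experimental results can be checked from the stated counts. The sketch below assumes the usual definitions, Recall = Nc / (Nc + Nm) and Precision = Nc / (Nc + Nf), which is how metrics "similar to (10)" are read here. It also assumes that 165 and 822 are the totals Nc + Nm and Nc + Nf respectively, so the missed and false counts are derived rather than quoted. The function names are illustrative only.

```python
def recall(n_correct: int, n_missed: int) -> float:
    """Assumed definition: Nc / (Nc + Nm)."""
    return n_correct / (n_correct + n_missed)


def precision(n_correct: int, n_false: int) -> float:
    """Assumed definition: Nc / (Nc + Nf)."""
    return n_correct / (n_correct + n_false)


# Counts from the results paragraph: 141 of 165 scoring events detected,
# and 797 of 822 detected events counted as correct.
# Nm and Nf are derived here under the assumption stated above.
scoring_recall = recall(n_correct=141, n_missed=165 - 141)
event_precision = precision(n_correct=797, n_false=822 - 797)

print(f"recall    = {scoring_recall:.2f}")
print(f"precision = {event_precision:.2f}")
```

Under these assumptions the two ratios round to the reported 0.85 and 0.97. The correct-detection counts differ between the two metrics (141 versus 797), presumably because recall is computed over scoring events only while precision is computed over all detections in the dominant event types.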