Audio Signal Classification

M. D. P. Bamunuarachchi
Faculty of Information Technology, University of Moratuwa
mailtodinithi@gmail.com

ABSTRACT - Audio signal classification plays a major role in audio content analysis. This review paper is based on audio signal classification technologies and critically reviews existing approaches to audio signal classification. In areas such as speech-to-text conversion, speech data were originally segmented manually, but with the need for and availability of increasing amounts of speech data, manual speech segmentation became a time-consuming and expensive activity. In the field of music, techniques for automatic genre classification would likewise be valuable for the development of audio information retrieval systems. Efficient and accurate audio signal classification methods can therefore add great value to the audio content analysis area. The major topics covered are current audio signal classification methodologies, the types of features to be extracted, and their positive and negative aspects with respect to feature extraction efficiency and classification reliability.


1.0 Introduction

Classification also plays a major role in intelligent signal processing. A possible example of intelligent signal processing is automatic equalization [19]. Equalizers are a well-known feature in virtually all types of audio equipment such as Hi-Fi systems, iPods, computer sound cards, vehicle audio systems, etc. If these systems were capable of detecting the kind of audio they are playing, the corresponding equalization presets could be automatically loaded and applied. Another application is content-dependent automatic control of dynamics processing, for example in the adaptation of loudness in broadcasting.

Over the past few years, interest in the automatic segmentation of audio streams has increased. Audio signal classification finds its utility in many research fields such as speaker recognition, language recognition, sound effects retrieval, context awareness, video segmentation using audio, and musical instrument and genre classification.

In the field of music, humans are remarkably good at genre classification, but with the advancement of technology, genre classification of digitally available music calls for more automatic methods. Techniques for automatic genre classification would therefore be valuable for the development of audio information retrieval systems for music [4].

Figure 1 – Audio signal classification system (signal → feature extraction → feature selection → classification model → classifier → class)

In the field of speech recognition, large amounts of reliably segmented speech data are needed, for instance for improving recognition and synthesis performance. Furthermore, automatic speech segmentation methods are used for the automatic phonetic analysis of large amounts of speech data. Manual labeling and segmentation are subjective, resulting in significant differences between the transcriptions created by different expert listeners [2]. Automatic speech recognition has made great strides with the development of digital signal processing hardware and software. Despite all these advances, machines cannot match the performance of their human counterparts in terms of accuracy and speed, especially in the case of speaker-independent speech recognition [3]. A significant portion of speech recognition research today is therefore focused on the speaker-independent speech recognition problem, because of its wide range of applications and the limitations of available speech recognition techniques.

A telephone network with audio classification capabilities could dynamically allocate bandwidth for the signal being transmitted, so that bandwidth is used efficiently for different types of data transmission; this would help multiplexing systems work more efficiently. The reasons above have made automatic segmentation of audio streams a major area of current research. The problem areas described above are all associated with efficient and accurate audio signal feature extraction and classification. Section 2 provides a broad overview of audio signal classification, including key research areas and its importance. Section 3 critically reviews major research projects on audio signal classification. Section 4 covers neural network based approaches. Section 5 describes the major categories of classification techniques, and Section 6 discusses audio signals and their features. Section 7 outlines audio signal classification models. Future trends and directions in audio signal classification are addressed in Section 8, and Section 9 discusses the findings.

2.0 Overview of Audio Signal Classification

Currently, browsing and management of audio data rely mostly on textual information attached manually, which is an extremely time-consuming task. Furthermore, this information is often incomplete or unavailable. The field of audio signal content analysis has therefore become a major area of importance [7]. It is a broad area of study, and one of its key research areas is audio signal classification, which is studied, analyzed and reviewed in this paper. The most common acoustic segmentation into which audio signals are classified is speech/music classification. In the spectra of typical speech and music signals, the difference between these two acoustic classes is clearly visible. Different kinds of speech typically share certain common features; for example, most of the speech energy lies in the lower part of the frequency spectrum (below 1 kHz). Depending on the type of music, the music frequency spectrum can be quite different. Audio signal classification finds its utility in many research fields such as audio content analysis, broadcast browsing, and information retrieval [5]. Therefore, techniques for automatic genre classification would be valuable for the development of audio information retrieval systems. An audio signal classification system needs to be able to categorize many different audio input formats. A classification system employs the extraction of a set of features from the input signal. Each of these features represents an element of the feature vector in the feature space, and the dimension of the feature space is equal to the number of extracted features. New domains for classification are constantly emerging, and speech/music discrimination and segmentation is an active field of research. A great amount of research effort has been put into speech and music classification, and many different classification systems based on many models have been introduced. The major classification models used in current research are identified as follows:

- Gaussian Mixture Models (GMMs)
- Hidden Markov Models (HMMs)
- k-Nearest Neighbour (kNN)

There are many existing works that discriminate speech from music, and most of them explore robust audio features to train pattern classifiers in a supervised manner; classifiers such as the GMM, the artificial neural network (ANN) and the k-nearest neighbour (kNN) classifier have been widely employed. The GMM can be viewed as a hybrid between a parametric and a non-parametric density model. GMM classifiers have been used in many fields, from image pattern recognition [10] to text-independent speaker recognition, as well as in the field of music information retrieval (MIR). Major real-world applications have also been identified. Detecting the audio type of a signal, such as speech, background noise, or musical genre, enables new applications such as automatic organization of audio databases, segmentation of audio streams, intelligent audio coding, automatic equalization, automatic control of sound dynamics and intelligent signal analysis [9]. Other applications include pitch detection, automatic music transcription, speech and language applications, and multimedia databases.

3.0 Audio Signal Classification Research Projects

Research based on Hidden Markov Model (HMM), k-Nearest Neighbour (kNN) and Gaussian Mixture Model (GMM) approaches to audio signal classification is reviewed here.

3.1 Audio Signal Classification using Gaussian Mixture Models (GMMs)

Many research projects have used trained Gaussian Mixture Models to classify various kinds of data. In the field of audio signal classification, systems also use trained GMMs to recognize and classify audio signals. These systems have achieved a satisfactory level of success in areas such as tracking speech through audio documents containing speech, noise and music [14]. The GMM has the ability to represent an underlying set of acoustic classes by individual Gaussian components. Furthermore, the GMM can form a smooth approximation to arbitrarily shaped observation densities in the absence of other information. The GMM can be regarded as a hybrid between a parametric and a non-parametric density model. Like a non-parametric model, it has many degrees of freedom, which allow arbitrary density modeling without undue computation and storage demands. At the same time, it has structure and parameters that control the behavior of the density in known ways, but without the constraint that the data must follow a specific distribution type such as the Gaussian, which relates it to parametric models. It can also be thought of as a single-state HMM with a Gaussian mixture observation density, or as an ergodic Gaussian-observation HMM with fixed, equal transition probabilities [11].
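As a rough illustration of the GMM-based approach described above, the following sketch (not taken from any of the reviewed systems) trains one scikit-learn GaussianMixture per audio class on frame-level feature vectors and assigns a new clip to the class whose model gives the highest average log-likelihood. The feature arrays, class names and mixture sizes are hypothetical placeholders; real systems would use MFCC-like features.

```python
# Minimal per-class GMM classification sketch (assumed layout: one row per
# frame, one column per feature). Illustrative only, not a cited system.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_models(features_by_class, n_components=8, seed=0):
    """Fit one GMM per audio class (e.g. 'speech', 'music')."""
    models = {}
    for label, frames in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              random_state=seed)
        gmm.fit(frames)                      # frames: (n_frames, n_features)
        models[label] = gmm
    return models

def classify(models, frames):
    """Pick the class whose GMM gives the highest mean log-likelihood."""
    scores = {label: gmm.score(frames) for label, gmm in models.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder feature matrices standing in for real MFCC frames.
    train = {"speech": rng.normal(0.0, 1.0, (500, 13)),
             "music":  rng.normal(1.5, 1.2, (500, 13))}
    models = train_class_models(train)
    test_clip = rng.normal(1.4, 1.2, (120, 13))
    print(classify(models, test_clip))       # expected: 'music'
```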

3.1.1 Gaussian Mixture Model for Speech/Music Classification

Alan P. Schmidt and Trevor K. M. Stone of the Department of Computer Science, University of Colorado, Boulder, have proposed a music classification and identification system [15]. The proposed system has the capability to classify a signal according to the genre, artist, album, and title of the given song. The approach used is to compare the input song's features with existing models to determine the identity of the song, along with its genre, album, etc. Signals input in WAV format are classified according to the above criteria. They used several classification methods in their study, described below.

- Brute Force: Since this method checks a song against each model, it is highly accurate, but it is also extremely slow. Only the first 30 seconds of a song are analyzed, which causes a problem if that segment contains silence or inserted noise. Another issue is that this method relies on the general assumption that most songs have an internal structure that includes repetition of common themes, so the system takes only one pass over these themes to obtain an accurate classification; there can be exceptions, such as songs with extended intro segments or more advanced musical features.
- Decision Tree - Top Down: Decision trees are considered one of the most widely used approaches for representing classifiers, and researchers from areas such as machine learning, pattern recognition, and data mining have considered the issue of growing a decision tree from available data. In this work a weighted sum vector is used to build the decision trees; the weighted vectors are computed by sorting each GMM's mixtures by their weights, to normalize the vector element order between GMMs.
- Decision Tree - Bottom Up: In this technique a model is calculated for an artist by combining the models for the artist's albums, and this is carried further for artist and genre. The GMMs are combined by computing their average mean and variance vectors. For bottom-up decision trees, the best way to construct models that are representative of a collection of other models must be identified; this problem would be easier to solve if training speed could be improved. One approach would be to build the hierarchy using unsupervised clustering rather than hand-labeled collections.
- Clustering: Weighted sum vectors, computed as in the decision-tree approach, are used to organize the GMMs into clusters, considering each cluster a class. Two variations of this approach were applied, one based on human-labeled data and one on an unsupervised learning technique. Using all 640 elements of the GMM vector offered no accuracy gain, while using the full vectors incurred a significant performance cost. Unsupervised clustering proved very time consuming but significantly outperformed the decision trees, while supervised clusters and decision trees showed nearly equal performance. Limiting the test set is the primary way to enhance and scale system performance.

P. Dhanalakshmi, S. Palanivel and V. Ramalingam from Annamalai University, India, have proposed a Gaussian Mixture Model based technique for pattern classification in audio signals [8]. An Auto-Associative Neural Network (AANN) model was used to capture the distribution of the acoustic features of each class, and a technique was used to minimize the mean square error for each feature vector. The proposed audio classification scheme proved efficient, with an accuracy rate of 93.1%. Experimental results also show that when the GMM is used with Mel-Frequency Cepstral Coefficients (MFCC), it achieves the highest overall accuracy, of 93%, compared with linear prediction based features (LPC, LPCC).

Jeroen Breebaart and Martin McKinney of Philips Research Laboratories, Eindhoven, Netherlands, proposed a system to differentiate signals into five audio classes: classical music, popular music, speech, crowd noise and background noise [16]. The feature sets include low-level signal properties, MFCC, and two new sets based on perceptual models of hearing. Crowd noise and classical music identification showed an accuracy level of over 95%. Detection of background noise and speech, however, failed: no features showed strong discrimination between background noise and speech, which can be attributed to the low discrimination power of the features. Classification of background noise showed a large increase in performance, at 75%, for the MFCC feature set over the SSL feature set. It was shown that temporal variations in features are important for audio class discrimination.

George Tzanetakis, Georg Essl and Perry Cook from the Computer Science Department at Princeton have designed an automatic music genre classification system for audio signals [2]. The performance of the feature set was evaluated by training statistical pattern recognition classifiers on real-world audio collections from radio, compact disks and the Web.

They developed two graphical user interfaces for browsing and interacting with large audio collections, based on the automatic hierarchical genre classification. A Gaussian classifier was trained using a dataset of 50 samples (each of 30 s duration); with this classifier, each class is represented as a single multidimensional Gaussian distribution with parameters estimated from the training dataset. For the musical genres classical and country, the combined feature set was used; for the classical sub-genres orchestra and piano and for the speech classes male and female voice, mel-frequency cepstral coefficients (MFCC) were used. Because jazz and rock are very broad categories with boundaries fuzzier than those of classical or hip hop, the best-predicted genres were classical and hip hop while the worst-predicted were jazz and rock.

3.1.2 Gaussian Mixture Model for Speaker Verification

Douglas A. Reynolds, Thomas F. Quatieri and Robert B. Dunn from M.I.T. Lincoln Laboratory, Massachusetts, proposed a speaker verification system using adapted GMMs [11]. The GMM-UBM (Universal Background Model) system has proven effective, but it faces several mismatch conditions. The system relies on low-level acoustic information, and speaker and channel information are bound together in an unknown way in the current spectral-based features; performance therefore degrades when the microphone or acoustic environment changes between training and recognition data. The GMM is computationally inexpensive, is based on a well-understood statistical model and, for text-independent tasks, is insensitive to the temporal aspects of speech. This is also a disadvantage, as the higher levels of information about the speaker conveyed in the temporal speech signal are not used.

3.2 Audio Signal Classification using Hidden Markov Models (HMMs)

In many audio classification problems, it is important to perform some form of template matching. A set of templates is stored in the computer, the audio signal is compared to each template, and a distance measure is used to determine which template is closest to the input signal. For sounds such as bird calls and impact and percussive sounds, this technique works well because individual instances of the sound are very similar. More formally, a Markov model is a system with a set of states and a set of transitions between states (or back to the same state). Each transition has an associated probability, and the system proceeds from state to state based on the current state and the probability of transition to a new state. What makes a Markov model hidden is that the states themselves are not directly observable [7].

3.2.1 Hidden Markov Model for Music Classification

Wei Chai and Barry Vercoe from the MIT Media Laboratory, Cambridge, have proposed a system to classify folk music from different countries based on monophonic melodies using hidden Markov models [6]. A corpus of Irish, German and Austrian folk music in various symbolic formats was used as the dataset. In two-way classification (Irish-German, Irish-Austrian, Austrian-German), the more discriminable pair, Irish-Austrian, achieved the more accurate result. HMMs with different numbers of hidden states and different structures were used for classification and their performances compared. This work demonstrated that hidden Markov models can be used to build classifiers based on the melody information of folk music.

3.2.2 Hidden Markov Model for TV Broadcast Classification

Zhu Liu, Jincheng Huang and Yao Wang from Polytechnic University, Brooklyn, proposed a technique for classifying TV broadcast video using HMMs [13]. Signals are classified into five types of TV programs, namely commercials, basketball games, football games, news reports, and weather forecasts. They applied HMMs to video content classification using audio information. The temporal structure of the video sequences required them to use an ergodic HMM, in which each state can be reached from every other state and can be revisited after leaving. In this approach, classification between news and weather forecasts was less successful because both consist mostly of pure speech. As the HMM framework can be extended to use visual information, classification would likely be more successful if both visual and auditory features were considered.
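A hedged sketch of the per-class HMM scheme discussed above (train one model per audio class, classify by likelihood) is shown below using the hmmlearn package; the feature matrices, class labels and model sizes are illustrative assumptions, not the configurations used in the cited studies.

```python
# Per-class HMM classification sketch; assumes hmmlearn is installed and that
# each training clip is a (n_frames, n_features) array of frame-level features.
import numpy as np
from hmmlearn import hmm

def train_hmms(clips_by_class, n_states=4, seed=0):
    """Fit one Gaussian-emission HMM per class by concatenating its clips."""
    models = {}
    for label, clips in clips_by_class.items():
        X = np.vstack(clips)
        lengths = [len(c) for c in clips]     # clip boundaries for hmmlearn
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag",
                                n_iter=50,
                                random_state=seed)
        model.fit(X, lengths)
        models[label] = model
    return models

def classify(models, clip):
    """Assign the class whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(clip))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = {"news":       [rng.normal(0, 1, (200, 12)) for _ in range(5)],
            "commercial": [rng.normal(2, 1, (200, 12)) for _ in range(5)]}
    models = train_hmms(data)
    print(classify(models, rng.normal(2, 1, (150, 12))))  # expected: 'commercial'
```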

3.3 k-Nearest Neighbour for Classification

When classification is based directly on the training data, without prior density or parameter estimation, it is called non-parametric classification. One of its most important algorithms is the k-Nearest Neighbour (kNN) classifier. kNN is based on the principle that instances within a dataset will generally exist in close proximity to other instances that have similar properties. The nearest neighbour method consists of assigning to an unlabelled feature vector the label of the training vector that is nearest to it in the feature space [21]. In kNN, a training set T is used to determine the class of a previously unseen sample X. A suitable distance measure in the feature space is used to determine the k elements of T closest to X, and X is then assigned the class that occurs most often among these k nearest neighbours. Choosing moderate values for k improves performance in comparison with the 1-NN rule because it provides more probabilistic information. Very large values of k can, however, be detrimental, as they increase the computational complexity and destroy the locality of the estimation by considering samples that are too far away [19]. In contrast to density-estimation-based classifiers, a kNN classifier must store all feature vectors of the training database so that the input vector can be compared with each of them; thus kNN does not require any training time. kNN does not provide an abstraction of the classes and so uses the whole database during classification, which is an area in which the kNN model has to be improved. Bertrand Delezoide and Xavier Rodet of IRCAM - Centre Pompidou, Paris, France, have proposed a system to discriminate audio signals into the specific and simple classes of speech, music and a mixture of both [17].
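A minimal sketch of the kNN procedure described above, using scikit-learn on hypothetical clip-level feature vectors (the dataset and the value of k are illustrative only):

```python
# kNN classification sketch: every training vector is stored, and an unseen
# sample takes the majority label of its k closest training vectors.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
# Placeholder clip-level feature vectors (e.g. averaged MFCCs) and labels.
X_train = np.vstack([rng.normal(0, 1, (50, 13)),    # class 0: "speech"
                     rng.normal(2, 1, (50, 13))])   # class 1: "music"
y_train = np.array([0] * 50 + [1] * 50)

knn = KNeighborsClassifier(n_neighbors=5)            # moderate k, as discussed
knn.fit(X_train, y_train)                            # "training" just stores T

X_test = rng.normal(2, 1, (3, 13))
print(knn.predict(X_test))                           # expected: [1 1 1]
```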


4.0 Neural Network Based Approaches

Neural networks are a classification approach commonly applied to music classification. In neural networks, weights are applied to threshold functions based on a pre-classified training set, so that the network learns the relevant patterns in the data [24]. The human ear processes the auditory signal by performing a Fourier analysis, and the frequency components are transferred to the brain through independent channels. It is assumed that the speech and music representation is based on known physiological facts of the human ear, followed by a non-linear function that connects the representation to the decision about the sound type. A multi-layer neural network, as a classification tool, is therefore an appropriate approach for representing a non-linear decision-making system. In order to implement the decision process with a neural network, it is essential to choose a reliable representation of the input signal; a time-domain representation of a musical tone requires a huge number of input nodes, considering that a decision is obtained from a time interval several seconds long [26].

4.1 Artificial Neural Network (ANN) for Audio Signal Classification

The human brain has on the order of a hundred billion neurons that communicate with one another through synaptic connections. As the brain matures, the synaptic connections also mature, becoming more functional and numerous, and the mammal develops cognitive skills. The neural network is a multipurpose technique that can be used to implement many algorithms. An Artificial Neural Network (ANN) is a collection of simple processing elements, called units or nodes, which are connected to each other and organized in layers; its functionality is loosely based on the biological neuron. The processing ability of the network is stored in the inter-unit connections, or weights, which are tuned during the learning process. Even though the ANN is a parametric model, in contrast to the HMM and GMM models no assumptions about the underlying data distribution have to be made [20]. Some ANN architectures are capable of approximating any function and are therefore a good choice when the function to be learned is not known in advance [22].
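To make the multi-layer network idea concrete, here is a small hedged sketch using scikit-learn's MLPClassifier on pre-extracted, hypothetical feature vectors; it is not the network used in any of the cited works.

```python
# Multi-layer perceptron sketch for audio-class decisions on feature vectors.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (200, 20)),          # class 0 features
               rng.normal(1.5, 1, (200, 20))])       # class 1 features
y = np.array([0] * 200 + [1] * 200)

mlp = MLPClassifier(hidden_layer_sizes=(32, 16),      # two hidden layers
                    activation="relu",
                    max_iter=500,
                    random_state=0)
mlp.fit(X, y)                                          # backpropagation training
print(mlp.score(X, y))                                 # training accuracy
```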

Xi Shao, Changsheng Xu and Mohan S. Kankanhalli from the Institute for Infocomm Research and the National University of Singapore have presented an approach for content-based audio classification [25]. The basic learning procedure is to start with an untrained network, present the training set to the input layer, pass the signals through the network and determine the output at the output layer. The outputs are compared with the target values, and the difference corresponds to an error. The error or criterion function is a scalar function of the weights and is minimized when the network outputs match the desired outputs. Here, the training dataset consists of audio data belonging to certain audio classes. For each audio item, the feature vector is calculated and associated with the desired output; the trained network can then be used to classify unlabeled audio data.

4.2 Convolutional Neural Networks (CNN) for Audio Signal Classification

There are few applications of convolutional neural networks (CNNs) in audio analysis despite their successes in vision research. Tom LH. Li, Antoni B. Chan and Andy HW. Chun have proposed a novel approach for extracting musical pattern features from audio music using a CNN, a model widely adopted in image information retrieval tasks [27]. Their experiments have shown that the CNN has a strong capacity to capture informative features from the variations of musical patterns with minimal prior knowledge provided. They found that training on a 3-genre dataset converges much faster than training on a 6-genre dataset, showing that the difficulty of training a CNN increases drastically as the number of genres involved increases; the CNN may become confused by the complexity of the training data. This experiment reveals that their current model is not robust enough to generalize the training results to unseen musical data, which could be overcome with an enlarged dataset.
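A hedged PyTorch sketch of the kind of convolutional model described above, operating on hypothetical mel-spectrogram patches; the layer sizes, input shape and genre count are illustrative assumptions, not the architecture of the cited work.

```python
# Small CNN for genre classification on spectrogram patches (illustrative only).
import torch
import torch.nn as nn

class GenreCNN(nn.Module):
    def __init__(self, n_genres=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 1 channel: spectrogram
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64),   # assumes 64x64 input patches
            nn.ReLU(),
            nn.Linear(64, n_genres),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

if __name__ == "__main__":
    model = GenreCNN(n_genres=3)
    batch = torch.randn(8, 1, 64, 64)       # 8 fake mel-spectrogram patches
    logits = model(batch)
    print(logits.shape)                      # torch.Size([8, 3])
```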

5.0 Classification Type Categories

Clustering, supervised and unsupervised learning techniques are discussed.

5.1 Supervised and Unsupervised Learning Techniques

Classification techniques fall into the following major categories:

- Supervised Learning
- Unsupervised Learning

5.1.1 Supervised Learning

Supervised machine learning is the search for algorithms that reason from externally supplied instances to produce general hypotheses, which then make predictions about future instances. In other words, the goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features. The resulting classifier is then used to assign class labels to testing instances where the values of the predictor features are known but the value of the class label is unknown [21].

5.1.2 Unsupervised Learning

This technique is used when the training set is not labeled. In unsupervised classification, the separation into classes is up to the algorithm, which groups the samples according to some measure of similarity; this process is called clustering.

5.2 Clustering

In a vector-based sequence, it is impossible to detect repeating patterns based on exact matching, unlike in a symbolic sequence; therefore researchers try to convert the vector-based sequence into a symbolic sequence. Clustering methods can be divided into two broad categories at the top level:

- Hierarchical methods
- Partitional methods

5.2.1 Hierarchical method

The hierarchical clustering algorithm initially treats each pattern vector as an individual cluster and then merges the most similar pair of clusters; eventually, a hierarchical structure is obtained that shows the relationship among all the pattern vectors. To measure the distance between two clusters, a linkage criterion is used to select the points that represent the clusters. Three linkage types are commonly used:

a. Single linkage defines the distance between two clusters as the minimum distance over all possible pairs of pattern vectors in the two clusters.
b. Complete linkage is the opposite of single linkage and uses the maximum distance.
c. Average linkage uses the mean distance.
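The three linkage criteria above can be tried directly with SciPy; a brief hedged sketch on synthetic feature vectors (the data and cluster count are placeholders):

```python
# Hierarchical (agglomerative) clustering with single/complete/average linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),       # two loose groups of vectors
               rng.normal(3, 0.3, (10, 2))])

for method in ("single", "complete", "average"):  # the three linkage types
    Z = linkage(X, method=method)                 # merge history (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, labels)
```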

5.2.2 Partitional method

In the other category, partitional algorithms aim to obtain a single partition of the data instead of constructing a hierarchical structure. There are several strategies for partitioning data. First, the squared-error criterion is the most frequently used because of its simplicity of implementation; a typical example using squared error is k-means clustering, which iteratively reassigns the data to clusters based on the distance between the data points and the cluster centres until convergence. Second, graph-theoretic clustering represents the relationships among all the data so that the partitions can be clearly obtained; the well-known graph-theoretic divisive clustering algorithm obtains the partitions by removing the long edges from the minimal spanning tree constructed over the data. Third, mixture-resolving methods assume the data are drawn from different distributions, so their goal is to determine the parameters of the models; for example, the EM algorithm has been applied to parameter estimation. Finally, single-pass methods aim at reducing the computation effort for large datasets and partitioning the data in linear time; the single-pass algorithm is based on a first-come-first-served discipline and assigns data to the currently nearest cluster.

5.3 Decision Trees

A decision tree is a classifier expressed as a recursive partition of the instance space. A decision tree consists of nodes that form a rooted tree, meaning it is a directed tree with a node called the root that has no incoming edges. In a decision tree, each internal node splits the instance space into two or more subspaces according to a certain discrete function of the input attribute values [23]. A decision tree classifier can perform different transformations at each level of the tree and is not limited to a fixed number of components.
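As a brief illustration of the recursive partitioning just described, a hedged sketch using scikit-learn's DecisionTreeClassifier on hypothetical audio feature vectors:

```python
# Decision tree sketch: each internal node splits the feature space on one attribute.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (100, 6)),        # class 0 feature vectors
               rng.normal(2, 1, (100, 6))])       # class 1 feature vectors
y = np.array([0] * 100 + [1] * 100)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X, y)                                     # grows the rooted tree top-down
print(tree.predict(rng.normal(2, 1, (3, 6))))      # expected: [1 1 1]
```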

6.0 Audio Signals and Audio Signal Features

The audio signal is the main input for classification; it is classified according to its features.

6.1 Audio signals

Sound can be defined as mechanical vibrations that can be interpreted as sound and are able to travel through all forms of matter: gases, liquids, solids and plasmas. The matter that supports the sound is called the medium; sound cannot travel through a vacuum. Sound is transmitted through gases, plasmas, and liquids as longitudinal waves, also called compression waves. Through solids, however, it can be transmitted as both longitudinal and transverse waves. Longitudinal sound waves are waves of alternating pressure deviations from the equilibrium pressure, causing local regions of compression and rarefaction, while transverse waves are waves of alternating shear stress at right angles to the direction of propagation [18]. The energy carried by the sound wave converts back and forth between the potential energy of the extra compression (for longitudinal waves) or lateral displacement strain (for transverse waves) of the matter and the kinetic energy of the oscillations of the medium.

6.2 Audio Signal Features

The audio signal classification process often consists of several inner processes such as audio signal feature extraction and audio signal segmentation. Audio features play a major role in the accuracy of classification methods. The first step in a classification problem is typically data reduction: most audio signals contain much redundancy, and the data reduction stage is called feature extraction. Feature extraction consists of discovering a few important facts about each data item, and it is an essential step in classifying audio signals unless the data in its original form already consists of features. Signal features are often classified into two categories, physical features and perceptual features [7].

6.2.1 Physical Features

Features that are directly related to the physical properties of the signal are classified as physical features. These features are easy to measure, recognize and define. Some physical features are energy (used to detect silence in a signal), zero-crossing rate (which provides data on the spectral content of the signal), spectral features (the distribution of frequencies in the signal), formant locations (used in speech recognition), etc.

6.2.2 Perceptual Features

Perceptual features are identified based on human perception, and it is instructive to investigate the physical counterparts of these perceptual features. Perceptual features include pitch (a relatively fluid quantity that contains much information about the sound signal), voiced/unvoiced frames (a first step in speech recognition), timbre (the quality of sound that allows different instruments to be distinguished), rhythm, etc.
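For concreteness, a hedged sketch of computing a few of the physical features listed above with librosa; the file path is a placeholder and the exact feature set is only illustrative.

```python
# Extracting simple physical features from an audio file (illustrative sketch).
import numpy as np
import librosa

def physical_features(path):
    y, sr = librosa.load(path, sr=None, mono=True)               # waveform + sample rate
    energy = librosa.feature.rms(y=y)[0]                          # short-time energy
    zcr = librosa.feature.zero_crossing_rate(y)[0]                # zero-crossing rate
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]   # spectral feature
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)            # perceptually motivated
    # One clip-level feature vector: mean of each frame-level feature.
    return np.concatenate([[energy.mean(), zcr.mean(), centroid.mean()],
                           mfcc.mean(axis=1)])

if __name__ == "__main__":
    vec = physical_features("example_clip.wav")                   # placeholder path
    print(vec.shape)                                              # (16,)
```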

7.0 Audio Signal Classification Models

Real-world processes generally produce observable outputs which can be characterized as signals. Signals can be either discrete or continuous, and a signal source can be stationary (its statistical properties do not vary with time) or non-stationary (its statistical properties vary with time). A problem of fundamental interest is characterizing such signals in terms of signal models. A signal model can provide the basis for a theoretical description of a signal processing system, which can then be used to process the signal. For example, if we are interested in enhancing a speech signal corrupted by noise, we can use the signal model to design a system which optimally removes the noise. At the same time, signal models are important because they can tell us about the signal source without the source being available. There are many possible choices for the signal model used to characterize a signal. One categorization is:

- Deterministic models
- Statistical models

In deterministic models, the specification of the signal model is generally straightforward and exploits known, specific properties of the signal. In contrast, statistical models try to characterize only the statistical properties of the signal; examples of such models are the GMM and the HMM. The underlying assumption of a statistical model is that the signal can be well characterized as a parametric random process [12].

8.0 Future Directions of Audio Signal Classification

People are surrounded by sound, and each space we occupy has characteristic acoustic properties. Whether for music, telecommunications or a host of other applications, capturing, rendering and processing sound is crucial to human existence. Audio and acoustic signal processing targets single- and multi-channel processing techniques and leverages classification technology, adaptive signal processing, system identification, wave-field techniques and machine learning, to name but a few of the exciting challenges within the scope of this vibrant and growing topic. Many new trends in audio signal classification have been identified which help determine the future direction of the area. Demand for classification has recently been increasing in the information retrieval field with the introduction of query-by-humming, in which the user hums a tune and the song that corresponds to that tune is returned. Most of the frameworks used in research, including HMM-based models, can also be extended to utilize visual information; it has been shown that classification can be more accurate if both audio and visual information are considered in the classification models. In areas such as speech recognition and speech-to-text conversion, 3D and 2D modeling mechanisms are now being used, and lip-reading methods have also been introduced; most ongoing research tries to take a combined audio-visual approach to classification. In the past, the output classes of music genre classification were genre, author, etc. Current research focuses on identifying other user preferences and matching songs in order to give recommendations for newly arrived albums, based on the acoustic similarity of the input track with other tracks. Similarly, classifying tracks by era and composer is another area of interest.

9.0 Discussion

The intention of this paper was to perform a critical review of the audio signal classification research area. The importance of the subject area, the major classification classes, the features, and the classification methodologies have been identified by reviewing the research work that has been done. The process of audio signal classification involves the extraction of features from a sound signal and the use of these features to identify the class it belongs to. Audio signals are classified into two major fields, music and speech; apart from these there are various other classes such as noise, silence, background noise, etc. Audio signal classification plays an important role in analyzing and characterizing audio content. Auditory scene analysis, content-based retrieval, indexing, and fingerprinting of audio are a few of the applications that require efficient feature extraction. However, due to the discontinuities that exist in these signals, their quantification and classification remain a formidable challenge. While the relevant features have been well studied and identified for speech signals, they are relatively less studied for other types of audio signals. Considering that different classes of audio signals have their own unique characteristics, the idea of class-dependent feature selection and classification is examined in this paper.

The major classification methods were identified as the Gaussian Mixture Model, the Hidden Markov Model, the k-Nearest Neighbour model and neural network based models. All of the above models have been used for the classification of audio signals separately as well as in combination with other models, such as AANN. Underlying the Gaussian distribution is the assumption that the class model is truly a model of one basic class; in cases where the actual probability density function is multimodal, it does not succeed. The Gaussian mixture model (GMM) is a mixture of several Gaussian distributions and can therefore represent different subclasses inside one class; its probability density function is defined as a weighted sum of Gaussians. Under the GMM, speech, noise, music and music genre classification and speaker verification systems were discussed in detail. The nearest neighbour method consists of assigning to an unlabelled feature vector the label of the training vector that is nearest to it in the feature space. kNN does not provide an abstraction of the classes and thus uses the whole database during classification, an area in which the kNN model has to be improved. The Hidden Markov Model is another model examined in this study. The HMM has a good capability to grasp the temporal statistical properties of a stochastic process and is used widely in the pattern recognition field; for HMMs, in addition to music, speech and noise classification, TV broadcast classification was studied. Neural network based approaches such as auto-associative neural networks and convolutional neural networks are also covered within the scope of this study, and the applications of neural network models in audio content analysis and automatic audio pattern recognition have been studied in detail.

Audio signal features are correlated with the performance of the audio signal classification systems implemented using models such as HMM, GMM, CNN, ANN, kNN, etc. Therefore a high-level analysis of audio signals and audio signal features was done by dividing the features into two main categories, physical features and perceptual features. The areas of supervised and unsupervised learning have also been studied. In unsupervised learning it is possible to learn larger and more complex models than in supervised learning. Supervised learning always tries to find the connection between two sets of observations, so the difficulty of the learning task grows exponentially in the number of steps between the two sets; models with deep hierarchies cannot easily be learnt using supervised learning for this reason. The key areas included in the scope of this independent study are audio signals and their features and five classification techniques: GMM, HMM, kNN, ANN and CNN. Some key areas have been identified as interesting for further study and research. With a better understanding of the strengths and limitations of each method, the possibility of integrating two or more algorithms to solve a problem should be investigated; the objective is to utilize the strengths of one method to complement the weaknesses of another. Using video data in audio content analysis is also identified as an area in which to continue the research work.

10.0 Contribution to Audio Signal Classification

Automatic audio classification is very useful for applications such as speech recognition, audio indexing, content-based audio retrieval, online audio distribution, audio database creation and information retrieval, health condition monitoring, audio scene analysis, etc. The challenge is to extract the most common and salient themes from unstructured raw audio data. My contribution through this review paper lies mostly in identifying the existing classification methodologies, analyzing and reviewing current methods of audio signal classification and related work, identifying the advantages and disadvantages of the classification models, and identifying future directions in audio signal classification.

There are several requirements for a classifier: it must be computationally efficient, with low algorithmic complexity, which reduces its cost. Between the direct approach and hierarchical classifiers, the latter has been identified as having the advantage of structural flexibility when future expansion is considered, but with the drawback of being complicated and expensive. In supervised learning, the basic kNN algorithm uses a great deal of storage space for the training phase, and its execution space is at least as big as its training space, whereas for all non-lazy learners execution space is usually much smaller than training space, since the resulting classifier is usually a highly condensed summary of the data. The kNN method consists of assigning to an unlabelled feature vector the label of the training vector that is nearest to it in the feature space. kNN and neural networks require complete records to do their work. Moreover, kNN is generally considered intolerant of noise: its similarity measures can easily be distorted by errors in attribute values, leading it to misclassify a new instance on the basis of the wrong nearest neighbours. A further disadvantage of this method is its requirement for large computing power, since classifying an object requires computing its distance to all the objects in the learning set. When it comes to neural network techniques in supervised learning, the speed of learning with respect to the number of attributes and the number of instances is relatively low compared to kNN-like techniques, but the speed of classification is relatively faster than kNN. Both kNN and neural networks in supervised learning are not very tolerant of missing values. The tolerance to irrelevant attributes is slightly higher in kNN than in neural networks, whereas neural networks have a slightly higher tolerance to noise than kNN, and their accuracy is somewhat higher.

Another model examined in this study is the Gaussian Mixture Model (GMM). It can be concluded that the GMM is computationally inexpensive, is based on a well-understood statistical model and, for text-independent tasks, is insensitive to the temporal aspects of the audio signal. Comparing it with the kNN classifier, the GMM classifier only has to store the set of estimated parameters for each class, whereas a kNN classifier needs to store all the training vectors in order to compute the distances to the input feature vector. Also, the number of features required to attain the same level of accuracy is higher for the kNN classifier than for the GMM classifier. These properties make the GMM more computationally efficient, although the kNN classifier is still an effective classifier that is very simple in its methodology. The HMM is a system with a set of states and a set of transitions between states (or back to the same state); each transition has an associated probability, and the system proceeds from state to state based on the current state and the probability of transition to a new state. Compared with other clustering methods such as k-means, the HMM clusters data at the temporal level rather than the frame level. There are several limitations of this model. One major limitation, especially for audio signal classification, is the assumption that the probability of being in a given state at time t depends only on the state at time t-1, whereas dependencies in audio tend to extend over several states. Neural network models also play a major role in audio signal classification. Experiments have shown that the CNN is a viable alternative for automatic feature extraction, that it can be used to build highly scalable models, and that it has a strong capacity to capture informative features from the variations of musical patterns with minimal prior knowledge provided. It is also noted that the HMM/ANN has many parameters that can be adjusted for an optimal solution. Each classification model has both advantages and disadvantages, so in order to achieve the highest accuracy and performance they have to be combined appropriately and used together.

In the areas of supervised and unsupervised learning, several conclusions can be drawn from this study. In practice, supervised learning is inefficient when used to learn models with deep hierarchies. This is because supervised learning always tries to find the connection between two sets of observations, and the difficulty of the learning task grows exponentially in the number of steps between the two sets; models with deep hierarchies are therefore not well suited to supervised learning. Selecting a model is not the only factor that affects the accuracy and performance of an audio signal classification system; several other factors also play a role. Care must be taken to optimize the number of features selected, because each feature represents a dimension in the feature space; reducing the number of features reduces the computational cost while maintaining the accuracy level.

I hope that this review paper will be a good contribution to the audio signal processing field, under audio signal classification, and that the parties involved in the field will receive some benefit from it.

Acknowledgement

My sincere gratitude goes to Mr. Saminda Premaratne, my supervisor, who guided me through all the stages of this independent study. At the same time, all my colleagues at the Faculty of Information Technology and my parents deserve to be appreciated for contributing to the success of this project. The authors of all the references I have used throughout the study are also highly appreciated. Finally, my special thanks go to all the parties who have not been mentioned above but who have helped me throughout the project.

REFERENCES

[1] G. Tzanetakis, G. Essl and P. Cook, "Automatic Musical Genre Classification of Audio Signals," Computer Science and Music Dept., 35 Olden Street, Princeton, NJ 08544.
[2] G. Tzanetakis, G. Essl and P. Cook, "Automatic Musical Genre Classification of Audio Signals."
[3] J. S. Seo, "An informative feature selection method for music genre classification," Department of Electrical Engineering, Gangneung-Wonju National University.
[4] H. Zhou, A. Sadka and R. M. Jiang, "Feature Extraction for Speech and Music Discrimination."
[5] H. Subramanian, "Audio Signal Classification" (Roll No. 04307909), supervisors: Prof. Preeti Rao and Dr. Sumantra D. Ro.
[6] W. Chai and B. Vercoe, "Folk Music Classification Using Hidden Markov Models," Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA, U.S.A.
[7] D. Gerhard, "Audio Signal Classification: History and Current Techniques," Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada, November 2003.
[8] P. Dhanalakshmi, S. Palanivel and V. Ramalingam, "Pattern classification models for classifying and indexing audio signals," Department of CSE, Annamalai University, Engineering Applications of Artificial Intelligence.
[9] Y. Feng, H. Dou and Y. Qian, CINA 2010, 2010 International Conference on Information, Networking and Automation, Proceedings Volume 1, 2010, Article number 5636385, pages 1298-V1302.
[10] D. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, 1995.
[11] D. A. Reynolds, T. F. Quatieri and R. B. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," M.I.T. Lincoln Laboratory, 244 Wood St., Lexington, Massachusetts.
[12] L. R. Rabiner, "Hidden Markov Models and selected applications in speech recognition," Fellow, IEEE.
[13] Z. Liu, J. Huang and Y. Wang, "Classification of TV Programs Based on Audio Information Using Hidden Markov Model," Department of Electrical Engineering, Polytechnic University, Brooklyn, NY 11201.
[14] M. Seck, I. Magrin-Chagnolleau and F. Bimbot, "Experiments on Speech Tracking in Audio Documents using Gaussian Mixture Modeling," IRISA, France.
[15] A. P. Schmidt and T. K. M. Stone, "Music Classification and Identification System," Department of Computer Science, University of Colorado, Boulder.
[16] J. Breebaart and M. McKinney, "Features for Audio Classification," Philips Research Laboratories, Prof. Holstlaan 4 (WY82), 5656 AA Eindhoven, The Netherlands.
[17] B. Delezoide and X. Rodet, "Learning optimal descriptors for audio class discrimination," IRCAM - Centre Pompidou, 1 Place Igor.
[18] C. V. Raman, "The diffraction of light by high frequency sound waves: Part I," Proceedings Mathematical Sciences, Volume 2, Number 4, pp. 406-412, DOI: 10.1007/BF03035840.
[19] "An Objective Approach to Content-Based Audio Signal Classification," Diplomarbeit, eingereicht am 19. Mai 2003, Prof. Dr.-Ing. Thomas Sikora, Fachgebiet Nachrichtenübertragung, Institut für Telekommunikationssysteme.
[20] J. Olsson, "Text Dependent Speaker Verification with a Hybrid HMM/ANN System" (Textberoende talarverifiering med ett hybrid HMM/ANN-system).
[21] S. B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques," Department of Computer Science and Technology, University of Peloponnese, Greece.
[22] P. Herrera, X. Amatriain, E. Batlle and X. Serra, "Towards instrument segmentation for music content description: a critical review of instrument classification techniques," Audiovisual Institute, Pompeu Fabra University.
[23] L. Rokach and O. Maimon, "Top-Down Induction of Decision Trees Classifiers - A Survey."
[24] Z. W. Bond, "Unsupervised Classification of Music Signals: Strategies Using Timbre and Rhythm."
[25] X. Shao, C. Xu and M. S. Kankanhalli, "Applying Neural Network on the Content-Based Audio Classification," Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613; School of Computing, National University of Singapore.
[26] A. I. Al-Shoshan, "Speech and Music Classification and Separation: A Review," Department of Computer Science, College of Computer, Qassim University, Saudi Arabia.
[27] T. LH. Li, A. B. Chan and A. HW. Chun, "Automatic Musical Pattern Feature Extraction Using Convolutional Neural Network."