


Spoken Document Retrieval Using Multilevel Knowledge and Semantic Verification

Chien-Lin Huang, Student Member, IEEE, and Chung-Hsien Wu, Senior Member, IEEE

Abstract—This study presents a novel approach to spoken document retrieval based on multilevel knowledge indexing and semantic verification. Multilevel knowledge indexing considers three information sources, namely transcription data, keywords extracted from spoken documents, and hypernyms of the extracted keywords. A semantic network with forward–backward propagation is presented for semantic verification of the retrieved documents. In the forward step for semantic verification, a bag of keywords is chosen based on word significance measures. Semantic relations are estimated and adopted for verification in the backward procedure. The verification score is then utilized to weight and rerank the retrieved documents to obtain the final results. Experiments are performed on 40 h of anchor speech extracted from 198 h of collected broadcast news. Experimental results indicate that multilevel knowledge indexing and semantic verification achieve better retrieval results than other indexing schemes. Index Terms—Multilevel knowledge, semantic verification, spoken document retrieval (SDR), spoken keyword extraction.



I. INTRODUCTION

SPEECH is the most natural and effective medium for human-to-human communication and human-to-machine interaction. Applications of speech recognition include spoken document retrieval (SDR) [1], [2], spoken document summarization [3], multilingual oral history archives access [4], meeting record browsing [5], etc. These applications focus on retrieving the information required by users, and all attempt to maximize the relevance of the returned information to the user requirements. The demand for spoken document retrieval is growing rapidly in various applications such as education, business, and entertainment. Spoken document retrieval using traditional text-based information retrieval methods is not always appropriate for these applications. A large-vocabulary continuous-speech recognition (LVCSR) system is usually applied to transcribe the spoken documents into text for indexing and retrieval. Because dictation is imperfect, accurate transcriptions are difficult to obtain for spoken document retrieval. Consequently, many methods have been developed recently to solve retrieval problems in the speech content [6]–[13].

Manuscript received April 10, 2006; revised August 3, 2007. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Helen Meng. The authors are with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C. (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2007.907429

A. Related Works in SDR

Research on spoken document retrieval [6] has concentrated on spoken language processing, while only limited progress has been achieved in indexing spoken database applications. To mitigate the effect of deletion and insertion errors in speech recognition, overlapping multigram phone sequences have been estimated for variable-length syllable indexing [7]. A reliable method for fusing syllable-, character-, and word-based features is needed to compensate for possible speech recognition errors [8]. A large-vocabulary speech recognition system has an inherently restricted vocabulary size because of acoustic and linguistic constraints, which causes an out-of-vocabulary (OOV) problem. To address this problem, syllable-like units called particles can be employed for spoken document indexing [9]. Particles are defined as within-word sequences of characters obtained from orthographic or phonetic transcriptions of words. Other subword-based schemes, including phone and syllable lattice techniques [10]–[13], also address the OOV problem. Moreover, many techniques have recently been proposed for advanced retrieval. Crestani [14] exploited prosodic information for spoken document retrieval. An adaptive signal propagation network has been developed for automatic relevance feedback [15].

As the above discussion shows, previous SDR systems are based on a speech recognition system with various indexing transcriptions. While speech content can be recognized from signal to word transcriptions, significant words and semantic knowledge of the transcribed words are not well exploited for spoken document retrieval. Because of the redundancy of spontaneous speech and the recognition errors of LVCSR, transcriptions contaminated by redundant or noisy data are used in spoken document retrieval, which degrades retrieval performance. Additionally, spoken document retrieval systems generally use one-pass retrieval strategies.
Two-pass retrieval with backward verification based on semantic information may enhance performance.

B. Proposed Framework

This study presents a novel semantic verification approach to spoken document retrieval using a forward–backward algorithm. First, prosodic information, speech recognition confidence, and the term frequency and inverse document frequency (TF-IDF) scores [18] are utilized to extract important keywords. Second, multilevel knowledge indexing hierarchically considers speech transcriptions, extracted keywords, and hypernyms of the extracted keywords. Third, the verification of spoken document retrieval based on the semantic relevance between keywords reranks the retrieved spoken documents by

1558-7916/$25.00 © 2007 IEEE



Fig. 1. Flowchart of spoken document retrieval using multilevel knowledge and semantic verification.

a forward–backward algorithm [19]. Fig. 1 shows the block diagram of the proposed method.

C. Outline of the Paper

The rest of this paper is organized as follows. Section II presents the multilevel knowledge indexing method for spoken document retrieval. Section III proposes a semantic verification network model for speech retrieval. Section IV summarizes the results of applying the methods to a Mandarin news spoken database. Conclusions are finally drawn in Section V.

II. MULTILEVEL KNOWLEDGE FOR SDR

Semantic information for spoken document retrieval, including transcriptions, keywords, and hypernyms of the keywords, is used as the multilevel knowledge.

A. Spoken Document Database Characterization

A spoken document database comprises an accumulation of speech audio files. A single speech audio stream is characterized by the following definitions.

Definition 1: A speech audio stream can be transcribed into a syllable sequence, a character sequence, and a word sequence, each with its own length. A spoken document for retrieval includes these three types of transcriptions. These transcriptions are traditionally used for information retrieval without high-level knowledge.

Definition 2: A spoken keyword extraction method is used to extract keywords from transcriptions. In spoken keyword extraction, a sequence of transcribed words along with their extracted keywords represents the spoken document content, and the keyword sequence has its own length. Keywords carry richer semantic information than the raw transcription data.

Definition 3: In the context of knowledge, an ontology denotes a specification of a conceptualization. A mapping function is used to convert the sequence of keywords into a sequence of hypernyms.

Hence, a spoken document can be characterized by three levels of descriptors: a set of transcriptions containing the syllable and character contents (each sequence with its own length), the extracted keywords, and the hypernyms of those keywords.
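As a reading aid, the three-level description in Definitions 1 to 3 can be pictured as a small record type. The sketch below is only illustrative: the class name `SpokenDocument`, the field names, and the toy keyword-to-hypernym dictionary are assumptions introduced here, and the dictionary merely stands in for a lookup in an ontology such as HowNet.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SpokenDocument:
    """Three-level description of one spoken document (illustrative names)."""
    syllables: List[str]            # syllable-level transcription
    characters: List[str]           # character-level transcription
    words: List[str]                # word-level transcription
    keywords: List[str] = field(default_factory=list)   # extracted keywords
    hypernyms: List[str] = field(default_factory=list)  # hypernyms of the keywords


def attach_hypernyms(doc: SpokenDocument,
                     keyword_to_hypernym: Dict[str, str]) -> SpokenDocument:
    """Map each keyword to a hypernym using a hypothetical lookup table.

    Keywords without an entry are skipped, so the hypernym sequence can be
    shorter than the keyword sequence.
    """
    doc.hypernyms = [keyword_to_hypernym[k] for k in doc.keywords
                     if k in keyword_to_hypernym]
    return doc
```

Keeping the three levels as separate fields mirrors the way each level is indexed on its own before the similarities are combined.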

B. Spoken Keyword Extraction

Spoken documents and speech queries contain complex descriptions such as events, places, times, and people. A set of keywords that completely characterizes a spoken document has to be extracted. Consider the following sentence as an example.

Find the restaurant selling hand-pulled noodles near NCKU

The keywords in the above sentence are extracted as

Restaurant, Hand-pulled noodles, NCKU

As an informative signal, speech conveys not only its language content but also the speaker’s emotions. This study presents a method for extracting spoken keywords from spoken documents based on acoustic, prosodic, and linguistic information. The keywords are obtained by considering these three information types. First, the speech recognition confidence $C(w)$ is estimated in order to choose reliable words from the speech transcription. Second, the prosodic significance $P(w)$, including duration, pitch, and energy, is used to select stressed words. Finally, the linguistic information $T(w)$ is used to detect significant words in the sentence. The above key information of each word $w$ is combined as follows:

$$S(w) = \alpha\, C(w) + \beta\, P(w) + \gamma\, T(w) \tag{1}$$

where $\alpha$, $\beta$, and $\gamma$ denote the weighting factors for balancing among $C(w)$, $P(w)$, and $T(w)$. The weights were determined empirically. Several keyword sets corresponding to different weights were extracted, and eighteen graduate students were then asked to choose the best keyword set subjectively; the final values of $\alpha$, $\beta$, and $\gamma$ were chosen on the basis of these judgments. The score $S(w)$ was then compared with a predefined verification threshold $\theta$. If $S(w)$ exceeded $\theta$, the word $w$ was extracted as a significant keyword.

1) Speech Recognition Confidence: A spoken document is different from a text document. An LVCSR system automatically converts speech into transcriptions. A misrecognized transcription is considered as noise in spoken document retrieval. The speech recognition confidence $C(w)$ is used to measure the reliability of each transcribed word.
Specifically, the likelihood score of each transcribed word obtained from the speech recognizer serves as the confidence measure. Given the observed signal, the speech recognition confidence of a word hypothesis is estimated as a weighted combination of its acoustic log-likelihood and its linguistic log-likelihood,




where the linguistic log-likelihood is estimated using the n-gram language model, and two combination weights balance the acoustic and linguistic scores. The best weights were determined experimentally in this study by minimizing the speech recognition error rate.

2) Prosodic Significance: Another difference between speech and text documents is the prosodic information in speech signals. The prosodic information can reflect a speaker’s intent, mood, and emotion. This study extracted stressed words in spoken documents according to the prosodic information used by Wu et al. [20].

3) TF-IDF Score: The TF-IDF score favors words with high term frequency (TF) and high inverse document frequency (IDF), i.e., words that occur often in a document but in few documents overall, and is useful linguistic information for information retrieval. The TF-IDF score $T(w)$ is used herein to evaluate the importance of the words in an utterance, and is calculated as follows:

$$T(w) = \mathrm{freq}(w, d) \cdot \log\frac{N}{n_w} \tag{3}$$

where $\mathrm{freq}(w, d)$ denotes the number of occurrences of word $w$ in the transcribed spoken document $d$; $\log(N/n_w)$ represents the inverse document frequency; $n_w$ is the number of documents that contain at least one occurrence of the word $w$ in the training spoken documents; and $N$ denotes the total number of spoken documents for training. Moreover, a stop word list consisting of 388 words was used to filter out the noninformative words.

C. Knowledge Representation

Every knowledge-based system is explicitly or implicitly committed to some conceptualization [21]. A conceptualization is an abstract, simplified view of a word representing a concept, given by a keyword. A hypernym denotes a set of concepts belonging to the same semantic category. In this paper, a hypernym denotes the group of keywords pertaining to a certain concept, as depicted in Fig. 2. HowNet [22], [23] is an explicit specification of a conceptualization, and is used to derive hypernyms. In this paper, the word sense (or concept) defined in HowNet [22] with the highest frequency is selected as the hypernym. Each word token is simply mapped to a corresponding hypernym, and the resulting hypernym stream is indexed.

Fig. 2. Example of ontology knowledge representation.

D. Multilevel Spoken Document Indexing and Retrieval

To address multilevel knowledge, information fusion is applied to provide various viewpoints of the spoken documents. This study integrated three levels of indexes, namely the transcription, keyword, and hypernym indexes. The vector space model (VSM)-based approach indexed by the multilevel knowledge is adopted for SDR. The query and spoken documents are represented as three levels of feature vectors for highly efficient retrieval. The cosine measure was used to estimate the similarity between query $q$ and spoken document $d$ as

$$\mathrm{Sim}(q, d) = \frac{\sum_{i=1}^{D} q_i\, d_i}{\sqrt{\sum_{i=1}^{D} q_i^{2}}\ \sqrt{\sum_{i=1}^{D} d_i^{2}}} \tag{4}$$

where $\mathbf{q} = (q_1, \ldots, q_D)$ and $\mathbf{d} = (d_1, \ldots, d_D)$ denote the feature vectors of query $q$ and spoken document $d$, and $D$ denotes the dimension of the feature vector. The values of $D$ were 30 121 for words, 402 for syllables, and 402 × 402 for syllable pairs. For the Chinese characters, 6395 unigrams and 6395 × 6395 overlapped bigrams were used for indexing. Retrieval results are ranked according to the similarities obtained in the retrieval process. The multilevel description of a spoken document is characterized by its transcriptions, keywords, and hypernyms. The values resulting from the mapping of the entire syllable and character transcriptions are concatenated into a single weight vector. The OOV problem can be alleviated by using syllable and character indexing [8]. Each component in the feature vector was estimated using TF-IDF from the query and the spoken documents. The similarity for retrieval is thus defined as a linear combination of indexes from the transcription vector, the extracted keyword vector, and the hypernym vector

$$\mathrm{Sim}_{\mathrm{ML}}(q, d) = w_T\, \mathrm{Sim}_T(q, d) + w_K\, \mathrm{Sim}_K(q, d) + w_H\, \mathrm{Sim}_H(q, d) \tag{5}$$

where $\mathrm{Sim}_T$, $\mathrm{Sim}_K$, and $\mathrm{Sim}_H$ represent the similarities between the query vector and each spoken document vector for syllables and characters, keywords, and hypernyms, respectively. The speech transcription indexing based on syllable- or character-level features is performed by combining a unigram with an overlapped bigram. The weighting factors $w_T$, $w_K$, and $w_H$ are empirically determined. Because these feature vectors are sparse, only nonzero values have to be stored. Therefore, the indexing storage size is small, and the cosine measure can efficiently estimate the similarity.

III. VERIFICATION USING SEMANTIC RELEVANCE

Due to imperfect speech recognition, some data included for spoken document indexing and retrieval are mistranscribed. This study applied concepts used in typical text information retrieval (IR), such as query expansion [24] and relevance feedback [25], to semantic spoken document retrieval [16]–[18]. Clustering techniques were adopted for query expansion to construct association matrices that quantify the term correlations. Relevance feedback, developed for IR systems, is a supervised learning technique for improving retrieval effectiveness. It uses positive and negative examples provided by the user to reweight the vector space model [25]. An adaptive signal propagation network has been presented for automatic relevance feedback [15]. Soo et al. [26] proposed concept-based retrieval based on automated semantic annotation in image retrieval. In contrast with the concept-based retrieval method [26], this study proposes semantic verification considering the semantic dependency and co-occurrence relations between



Fig. 3. Network architecture for semantic verification. (a) Two documents were retrieved in the first pass. (b) Three related keywords were collected from two extracted documents and mapped to their related hypernyms. (c) Using the forward–backward procedure in semantic verification, the first-pass retrieved score was modified by the verified score.

words and documents. The goal of semantic verification is to verify and rerank the retrieved spoken documents by modeling the semantic relevance between the keywords in the retrieved spoken documents. Knowledge relation and propagation are formulated as a semantic network [15], as illustrated in Fig. 3. The network comprises three layers, one each for the spoken documents, the keywords, and the hypernyms of the keywords. A bidirectional connection exists between a document node and a keyword node when the keyword occurs in the document. Moreover, each keyword node corresponds to a hypernym node.

A. General Model

In the first pass, forward propagation is adopted to reorganize the keywords and infer the user’s requirement, as shown in Fig. 3(a). A bag of keywords is thereby collected. Based on the forward inferred keywords, this study derives the related hypernyms corresponding to the keywords, as shown in Fig. 3(b). In the second pass, the retrieved results are verified by the semantic relevance between the keywords. The conditional co-occurrence probability [27] is estimated to discover the co-occurrence relationship between keywords. The semantic dependency grammar (SDG) [20], derived from the probabilistic context-free grammar (PCFG) [27], is adopted to obtain the semantic relevance between hypernyms. The verification score is estimated from the relevance between keywords in the retrieved spoken documents. The backward propagation uses the estimated verification score to weight the retrieved results, as depicted in Fig. 3(c), and to obtain the final output.

B. Forward Keyword Relation Model

The forward relation aims to find the important keywords from the ranked results in the first pass. In this study, the top-ranked retrieval results are selected according to the retrieval score. These candidate spoken documents initiate the inference process by activating the document-related keyword nodes.
Each spoken document has a weighted connection to the

Fig. 4. Illustration of the forward relation. (a) The forward condition is one document to one keyword correspondence. (b) The forward condition is many documents to one keyword correspondence.

keyword. The keyword nodes then create the corresponding hypernym nodes.

Given retrieval results consisting of the top-ranked spoken documents, the forward relation algorithm is performed to extract a bag of keywords. The algorithm is described as follows.

1) Initialization

$$F_0(d_i) = \mathrm{Sim}(q, d_i) \tag{6}$$

This step initializes the forward score $F_0(d_i)$ of each retrieved spoken document $d_i$ using the cosine measure. The confidence of the retrieved spoken documents can thus be estimated by the similarity measurement.

2) Forward keyword relation

$$F(k_j) = \sum_{i} F_0(d_i)\, a_{ij} \tag{7}$$

Fig. 4 illustrates the forward keyword relation, and indicates how keyword $k_j$ can be reached from the related spoken documents. Here $F_0(d_i)$ denotes the initial score of spoken document $d_i$. The connection weights $a_{ij}$ between documents and keywords are estimated as the keyword extraction score, which is computed from the confidence and prosody scores of keyword $k_j$ and from a keyword significance score $KS(k_j, d_i)$, estimated as

$$KS(k_j, d_i) = \mathrm{tf}(k_j, d_i) \cdot \mathrm{df}(k_j) \tag{9}$$

where $\mathrm{tf}(k_j, d_i)$ is the frequency of keyword $k_j$ in the transcribed spoken document $d_i$. The keyword significance is the product of the term frequency score and the document frequency $\mathrm{df}(k_j)$, which is the number of documents that contain at least one occurrence of the keyword $k_j$. The difference between the keyword significance in (9) and the TF-IDF score in (3) is that the former considers the keyword frequency and the frequency of the documents containing the keyword, while the latter measures the word frequency and the inverse document frequency. In semantic verification, a keyword that occurs frequently in different documents is highly significant to the word under verification. Based on this idea, the keyword significance utilizes the document frequency (DF) rather than the inverse document frequency (IDF) used in (3). In semantic verification, the forward process is adopted to discover the strongly co-occurring keywords. The parameter $F(k_j)$ is defined as the forward score of the keyword $k_j$, obtained by summing the product $F_0(d_i)\, a_{ij}$ over all the related spoken documents $d_i$. Fig. 4 depicts two possible conditions, namely the one-to-one correspondence shown in Fig. 4(a) and the many-to-one correspondence shown in Fig. 4(b). With respect to the user’s query, the keyword carries a more meaningful intention in Fig. 4(b) than in Fig. 4(a).

C. Backward Semantic Verification

Verification is used to measure the relevance of a spoken document given a query. A bag of keywords is extracted in forward propagation. As in the forward algorithm, the weight of each keyword is estimated according to the significance of the keyword to the spoken document. The backward semantic verification models the keyword and document relations with the conditional co-occurrence probability and the semantic dependency grammar. The backward semantic verification algorithm is described as follows.

1) Measurement of the semantic relationship

The conditional co-occurrence probability $P(k_j \mid k_l)$ is applied to estimate the statistical co-occurrence of keyword $k_j$ given the observed keyword $k_l$. The problem of data sparseness is handled by a simple back-off smoothing method. The co-occurrence relation score of the keyword $k_j$ is derived as

$$R_{co}(k_j, d_i) = \max_{k_l \in d_i,\ l \neq j} P(k_j \mid k_l) \tag{10}$$

where $k_j$ and $k_l$ denote the keywords in the spoken document $d_i$. The maximum conditional co-occurrence probability in the spoken document $d_i$ is considered as the co-occurrence relation score of the keyword $k_j$. Smoothing is performed by interpolating the bigram and unigram relative frequencies [28]. In this study, the maximum co-occurrence probability was chosen so that the score reflects the maximum contribution from a single backward verification keyword to the document, instead of the combined contribution from multiple keywords. The reason is that, owing to the redundancy of spontaneous speech and the recognition errors of LVCSR, transcriptions are generally contaminated by redundant or noisy data, which degrades retrieval performance. Moreover, the low-level data representation is difficult to map to the high-level semantic representation.

The semantic relationship between two keywords is analyzed with the semantic dependency grammar (SDG) [20], which is derived from a modified probabilistic context-free grammar and is applied to calculate the semantic dependency score. The probabilistic context-free grammar is adopted to parse the hierarchical tree structure. The SDG score is defined in terms of the total number of sentences, the sentences containing the two hypernyms under consideration, the parse tree of each sentence, the index of each existing relation, and the possible dependency relations in the parse tree for a given word length. Each word token is simply mapped to its corresponding hypernym by HowNet. Moreover, the score of a dependency relation is estimated with the following equation:

$$P_{DR}(h_j, h_l) = \frac{\mathrm{freq}\bigl(DR(h_j, h_l)\bigr)}{\mathrm{freq}(h_j, h_l)} \tag{12}$$

where $\mathrm{freq}(DR(h_j, h_l))$ denotes the frequency at which the dependency relation $DR$ between hypernyms $h_j$ and $h_l$ occurs in the training corpus, and $\mathrm{freq}(h_j, h_l)$ denotes the co-occurrence frequency of hypernyms $h_j$ and $h_l$ in the training corpus. Fig. 5 shows an example of a semantic dependency graph, constructed from a Chinese sentence.

Fig. 5. Example of the semantic dependency graph.
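To make the propagation of this section concrete, the following sketch runs a forward pass from first-pass document scores to keyword scores and a backward pass from keywords to a verification score per document. It is a simplified illustration under stated assumptions, not the authors' implementation: the interpolation weight `lam`, the unit connection weights, the omission of the SDG-based hypernym score, and the additive reranking with weight `mu` are all choices made here for brevity, and every function name is hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Dict, List


def cooccurrence_model(training_docs: List[List[str]],
                       lam: float = 0.7) -> Callable[[str, str], float]:
    """Return P(k_j | k_l) from document-level keyword co-occurrence counts.

    The conditional relative frequency is interpolated with the unigram
    relative frequency; lam is an assumed interpolation weight.
    """
    unigram: Counter = Counter()
    pair: Counter = Counter()
    for keywords in training_docs:
        unique = sorted(set(keywords))
        unigram.update(unique)
        pair.update(permutations(unique, 2))  # ordered pairs (k_l, k_j)
    total = sum(unigram.values())

    def prob(k_j: str, k_l: str) -> float:
        conditional = pair[(k_l, k_j)] / unigram[k_l] if unigram[k_l] else 0.0
        prior = unigram[k_j] / total if total else 0.0
        return lam * conditional + (1.0 - lam) * prior

    return prob


def cooccurrence_relation(k_j: str, doc_keywords: List[str],
                          prob: Callable[[str, str], float]) -> float:
    """Maximum conditional co-occurrence probability of k_j within one document."""
    others = [k for k in set(doc_keywords) if k != k_j]
    return max((prob(k_j, k_l) for k_l in others), default=0.0)


def forward_scores(first_pass: Dict[str, float],
                   doc_keywords: Dict[str, List[str]],
                   weight: Callable[[str, str], float] = lambda d, k: 1.0
                   ) -> Dict[str, float]:
    """Forward score of a keyword: sum over documents of score times weight."""
    scores: Dict[str, float] = {}
    for d, sim in first_pass.items():
        for k in set(doc_keywords[d]):
            scores[k] = scores.get(k, 0.0) + sim * weight(d, k)
    return scores


def backward_scores(first_pass: Dict[str, float],
                    doc_keywords: Dict[str, List[str]],
                    fwd: Dict[str, float],
                    prob: Callable[[str, str], float]) -> Dict[str, float]:
    """Verification score of a document: sum over its keywords of the forward
    score times the co-occurrence relation score (SDG score omitted here)."""
    return {
        d: sum(fwd[k] * cooccurrence_relation(k, doc_keywords[d], prob)
               for k in set(doc_keywords[d]))
        for d in first_pass
    }


def rerank(first_pass: Dict[str, float], bwd: Dict[str, float],
           mu: float = 0.5) -> List[str]:
    """Assumed additive update: first-pass score plus mu times verification score."""
    combined = {d: first_pass[d] + mu * bwd[d] for d in first_pass}
    return sorted(combined, key=combined.get, reverse=True)
```

In this toy form, a document whose keywords are shared with, and co-occur in, other highly ranked documents gains score, whereas a document with an isolated keyword receives no verification support.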



Semantic relations between keywords are determined by SDG using HowNet [22] and Sinica Treebank [29]. To avoid the sparse data problem in estimating dependency relations, PCFG is adopted to parse the sentence, and each keyword is replaced with its own hypernym based on HowNet. The semantic relation score of hypernym $h_j$ is derived as

$$R_{sd}(h_j, d_i) = \max_{h_l \in d_i,\ l \neq j} \mathrm{SDG}(h_j, h_l) \tag{13}$$

The spoken document $d_i$ includes a semantic relation between hypernyms $h_j$ and $h_l$. Moreover, the semantic dependency score is estimated from the pretrained semantic dependency grammar. The maximum semantic dependency score in the spoken document is used as the semantic relation score of hypernym $h_j$.

2) Backward semantic verification

Similarly, the backward semantic verification score for each spoken document can be estimated. The forward score of each document-related keyword is combined with the co-occurrence relation score and the semantic relation score, which estimate the semantic scores for the keyword in the spoken document. The backward semantic verification score of a spoken document is calculated by summing the products over all document-related keywords. Fig. 6 illustrates the four possible backward conditions, namely one-to-one correspondence, many-to-one correspondence, many-to-one correspondence with co-occurrence probability, and many-to-one correspondence with co-occurrence probability and semantic dependency.

Fig. 6. Illustration of the backward semantic verification. The backward verification is (a) one keyword to one document correspondence, (b) many keywords to one document correspondence, (c) many keywords to one document correspondence with the probability of conditional co-occurrence, and (d) many keywords to one document correspondence with the probability of conditional co-occurrence and semantic dependency.

Finally, the backward semantic verification score is applied to update the multilevel knowledge retrieval score through a weighted combination, in which a weighting factor controls the contribution of the verification score. The linear weighting factors for the confidence and similarity score combination are empirically derived. The retrieved documents are then reranked according to the combined similarity scores.

IV. EXPERIMENTS AND RESULTS

This section presents the performance of the proposed methods for spoken document retrieval. A brief introduction of the in-house speech recognizer and the experimental setting is first presented, followed by a subjective evaluation of spoken keyword extraction. A comparison of the proposed multilevel indexing and retrieval approach with other indexing methods is then presented. Finally, the evaluation of the semantic verification is presented.

A. Structure of the Continuous Speech Recognition System

An in-house speech recognizer [30] was used for the experiments. This study defined 150 subsyllables as the acoustic units for automatic speech recognition, comprising 112 right-context-dependent INITIALs and 38 context-independent FINALs, based on the phonetic structure of Mandarin speech. Each subsyllable unit was modeled by a hidden Markov model (HMM), with three states for the INITIALs and four states for the FINALs. The number of Gaussian mixtures per state of the acoustic HMM ranged from 2 to 32, depending on the quantity of the training data. The silence model was a one-state HMM with 64 Gaussian mixtures trained with nonspeech segments.

B. Experimental Setup

The spoken document corpus was obtained from the Mandarin Chinese broadcast news corpus (MATBN) collected by Academia Sinica [31], Taiwan. The corpus contained a total of 198 h of broadcast news with the corresponding transcripts in Big5-encoded form from the Public Television Service Foundation (Taiwan, R.O.C.) [32]. The MATBN corpus was annotated with acoustic conditions, background conditions, story boundaries, speaker turn boundaries, and audible acoustic events, including hesitations, repetitions, vocal nonspeech events, and external noises. A total of 3603 anchor news stories spanning three years, from August 2001 to July 2004, were extracted. The average news story length was 36.83 s, with an average of 89.59 characters. News manuscripts collected from the Internet from 2001 to 2006, comprising a total of 11.5 million Chinese characters, were adopted to construct the language models. The vocabulary size for the speech recognizer was 30 121 words.

Buckley [33] recommended adopting at least 25 and preferably 50 queries to ensure a reliable evaluation. Therefore, 30 speech sentence queries were collected for testing. “Real-world” queries, similar in content to the spoken document database and with an average of 16.2 hits for each query, were applied. The average length of a query sentence was





11.75 characters. Each sentence contained only one keyword. The lexicon of the speech recognition system was adopted to access HowNet and Sinica Treebank. Moreover, the semantic dependency grammar was constructed from the Sinica Treebank [29] with 36 953 sentences and the HowNet knowledge base [22]. A total of 22 025 rules were extracted according to the tree structure of parts-of-speech (POSs), and the probability of each rule was estimated from the Treebank. The experiments conducted in this study concerned the recognition of Mandarin speech. Three classes of speech recognition errors, namely insertion errors (Ins), deletion errors (Del), and substitution errors (Sub), were considered. The accuracy (Acc) of syllables and characters [34] was estimated as

$$\mathrm{Acc} = \frac{\mathrm{Num} - \mathrm{Ins} - \mathrm{Del} - \mathrm{Sub}}{\mathrm{Num}}$$

where Num denotes the number of recognition units in the spoken documents. Table I shows the syllable, character, and word recognition performances for the spoken document database [31] and query sentences. The syllable accuracies for the spoken documents and speech queries were 0.728 and 0.881, respectively. Because of the integration of the language model, the character accuracies were 0.775 and 0.904, which were greater than the syllable accuracies. Since word segmentation in a sentence is a challenging problem, and the out-of-vocabulary rate in the Chinese spoken documents was fairly high, the word accuracies of the spoken documents and speech queries were only 0.583 and 0.702, respectively.

C. Evaluation of Spoken Keyword Extraction

The spoken keyword extraction performance was evaluated by the mean opinion score (MOS) measure. Eighteen graduate students were invited to evaluate the spoken keyword extraction results subjectively according to two criteria, keyword-document relation (KDR) and favorite (FAV). The KDR was scored according to the proportion of related keywords among the extracted keywords with respect to the number of keywords in the spoken document. The FAV was scored according to each evaluator's personal preference for the spoken keyword extraction results. Thirty randomly selected documents were provided to every graduate student. Each student rated each document with a score of 0–10. Table II shows the evaluation results. The results were tested for statistical significance [27], which confirmed the significance of the evaluation.
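For reference, the keyword extraction procedure evaluated here (Section II-B) can be sketched in a few lines. The sketch assumes that the three information sources are combined by a weighted sum and compared with a threshold; the equal weights, the threshold value, the natural logarithm in the IDF term, and the per-word `confidence` and `prosody` dictionaries are placeholders chosen for illustration, since the values used in the study are not reproduced in this text.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def tf_idf(word: str, doc_words: Sequence[str],
           training_docs: Sequence[Sequence[str]]) -> float:
    """Term frequency in the document times log(N / n_w) over training docs."""
    freq = list(doc_words).count(word)
    n_w = sum(1 for doc in training_docs if word in doc)
    if freq == 0 or n_w == 0:
        return 0.0  # unseen words get no linguistic score in this sketch
    return freq * math.log(len(training_docs) / n_w)


def extract_keywords(doc_words: Sequence[str],
                     confidence: Dict[str, float],
                     prosody: Dict[str, float],
                     training_docs: Sequence[Sequence[str]],
                     stop_words: Set[str],
                     weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
                     threshold: float = 0.8) -> List[str]:
    """Keep words whose weighted score exceeds the threshold.

    weights and threshold are illustrative values, not those of the study.
    """
    alpha, beta, gamma = weights
    keywords: List[str] = []
    for w in doc_words:
        if w in stop_words or w in keywords:
            continue  # filter stop words and avoid duplicates
        score = (alpha * confidence.get(w, 0.0)
                 + beta * prosody.get(w, 0.0)
                 + gamma * tf_idf(w, doc_words, training_docs))
        if score > threshold:
            keywords.append(w)
    return keywords
```

Because the three scores live on different scales, a practical system would probably normalize them before weighting; that step is left out here to keep the sketch close to the description in the text.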

Fig. 7. Recall and precision plots for various types of indexing, including the extracted keywords (K, K(L), K(C), K(S)), character, word, syllable, and hypernym (H).

D. Performance on Multilevel Knowledge Indexing

Spoken keyword retrieval can be measured according to precision (PRE) and recall (REC). The precision is the percentage of the retrieved data that is correct, and is also called the raw average precision (rAP), estimated as

$$\mathrm{PRE} = \frac{1}{Q} \sum_{i=1}^{Q} \frac{R_i}{N_i}$$

where $Q$ denotes the number of queries, and $R_i$ represents the number of relevant documents contained in the $N_i$ documents retrieved for query $q_i$. Information retrieval can also be evaluated in terms of recall, which is the proportion of the desired output that is accurately retrieved

$$\mathrm{REC} = \frac{1}{Q} \sum_{i=1}^{Q} \frac{R_i}{M_i}$$

where $M_i$ denotes the number of relevant spoken documents in the corpus for the $i$th query $q_i$.

The multilevel knowledge indexing and retrieval approach was compared with other indexing approaches. Fig. 7 shows the precision versus recall plot for 30 speech sentence queries, comparing indexing approaches based on character, word, syllable, and hypernym, and on extracted keywords with various information types. Four information types were assessed: the extracted keywords with speech recognition confidence, prosodic information, and TF-IDF score, $K$; the extracted keywords with only speech recognition confidence, $K(C)$; the extracted keywords with only prosody, $K(S)$; and the extracted keywords with only TF-IDF score, $K(L)$. The approach using $K$ achieved a small improvement over the approaches using $K(L)$, $K(C)$, and $K(S)$. Additionally, the keyword indexing produced by the spoken keyword extraction method was highly accurate for spoken document retrieval. Because characters and words convey the precise meaning of the spoken content, the character- and word-based indexing methods were found to be better than syllable- and




Fig. 8. Recall and precision plots for the manual transcription, keyword+recognized transcription+hypernym, keyword+recognized transcription, keyword+hypernym, extracted keyword, and recognized transcription indexing methods.





hypernym-based indexing methods. The semantic concept indexing, i.e., the hypernym, did not perform as well as the others, because the meaning of the hypernym is obscure and less distinctive than the original word. To compare the various hybrid methods with manual transcriptions, Fig. 8 shows the comparison results of the indexing based on manual transcription, multilevel knowledge, fusion of extracted keyword and speech transcription, fusion of extracted keyword and keyword-related hypernym, extracted keyword, and speech transcription indexing. The weights of character and syllable indexing applied for the combination were and . The speech recognition confidence, prosodic information, and TF-IDF score were considered in extracted keyword indexing. The speech transcription indexing used the character and syllable information. The upper bound of speech retrieval performance was generally obtained by retrieving the manually transcribed textual content. Manual word-based results were utilized for indexing and retrieval. Due to the imperfect recognition of speech, a gap was observed between indexing using speech recognition results and indexing using hand-transcribed results. Multilevel knowledge performed better than fusion of extracted keywords and speech transcription, and fusion of extracted keywords and keyword-related hypernyms. There is almost no gain for keyword+hypernym and only a very marginal gain for keyword+recognized transcription+hypernym over keyword+recognized transcription. The three indexing criteria, i.e., extracted keywords, speech transcriptions, and keyword-related hypernyms, are complementary. Hence, multilevel knowledge performed slightly better than two-way fusion of the various streams, i.e., keyword+recognized transcription and keyword+hypernym. The multilevel knowledge indexing clearly achieved the best performance among the surveyed methods. The optimal weighting factors were determined from experiments on a training set.
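The fusion of the three index streams can be illustrated with a short Python sketch. It assumes a simple linear combination of per-document scores, which is one common way to apply stream weights and is not confirmed here as the exact scheme used in this study; the function names and the weight values in the example are hypothetical placeholders.

```python
def fuse_scores(keyword, transcription, hypernym, weights):
    """Linearly combine per-document scores from three index streams.

    Each stream is a dict {doc_id: score}; a document missing from a
    stream contributes 0 for that stream. `weights` is (w_k, w_t, w_h).
    """
    w_k, w_t, w_h = weights
    docs = set(keyword) | set(transcription) | set(hypernym)
    return {
        d: w_k * keyword.get(d, 0.0)
        + w_t * transcription.get(d, 0.0)
        + w_h * hypernym.get(d, 0.0)
        for d in docs
    }

def rank(fused):
    """Return document ids sorted by fused score, best first."""
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical scores and weights, for illustration only.
fused = fuse_scores(
    keyword={"d1": 0.9, "d2": 0.2},
    transcription={"d1": 0.5, "d2": 0.7, "d3": 0.4},
    hypernym={"d3": 0.8},
    weights=(0.5, 0.4, 0.1),
)
ranking = rank(fused)
```

In practice the weights would be tuned on held-out data, for example by a grid search that maximizes a retrieval metric on the training set.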
The training set, containing 3000 files with an average length of 14.1 characters, was manually segmented into short sentences. The weights applied for the combination of multilevel knowledge were

Fig. 9. F-measure plots of the top 5, 10, 15, and 20 retrieved documents for the methods of user feedback, semantic verification, semantic verification without semantic relation score (G_h) of hypernym, local context analysis, and one-pass retrieval.

, and . The fusion of hypernym, denoted as keyword+recognized transcription+hypernym, achieved a slight improvement compared to the keyword+recognized transcription approach. We can see an incremental increase between the indexing of the individual approaches in Fig. 7 and the fusion approaches in Fig. 8. Moreover, multilevel knowledge achieved the closest performance to manual transcriptions among the approaches tested. Figs. 7 and 8 indicate that the average precision fell significantly over the average recall range of 50% to 100%. The results were tested using the t-test [27].

E. Evaluation of Semantic Verification

A tradeoff exists between the precision and recall rates in information retrieval. Retrieving more candidates raises the recall rate, but reduces the precision rate. Therefore, the F-measure was incorporated into the experiment to reflect the tradeoff between precision and recall, and is defined as

$$\mathrm{FM} = \frac{2 \times \mathrm{PRE} \times \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}}$$

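The three evaluation measures can be computed together, as in the following minimal Python sketch. It assumes precision and recall are averaged over queries, as the prose definitions above describe, and it assumes the common balanced form of the F-measure, FM = 2 × PRE × REC / (PRE + REC). The function name, data layout, and toy data are illustrative only.

```python
def precision_recall_f(retrieved, relevant):
    """Query-averaged precision and recall, plus the balanced F-measure.

    retrieved: {query: list of retrieved doc ids} (non-empty lists)
    relevant:  {query: set of relevant doc ids}   (non-empty sets)
    """
    queries = list(retrieved)
    pre = rec = 0.0
    for q in queries:
        hits = len(set(retrieved[q]) & relevant[q])  # relevant docs actually retrieved
        pre += hits / len(retrieved[q])
        rec += hits / len(relevant[q])
    pre /= len(queries)
    rec /= len(queries)
    fm = 2 * pre * rec / (pre + rec) if pre + rec > 0 else 0.0
    return pre, rec, fm

# Toy data: two hypothetical queries with four retrieved documents each.
retrieved = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6", "d7", "d8"]}
relevant = {"q1": {"d1", "d3"}, "q2": {"d5", "d9"}}
pre, rec, fm = precision_recall_f(retrieved, relevant)
```

Truncating each retrieved list to the top 5, 10, 15, or 20 documents before calling the function gives the kind of per-cutoff F-measure curve that such evaluations report.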
In order to evaluate the effect of replacing IDF with DF in (9), an experiment on semantic verification was performed. The experimental result shows that using document frequency (DF) achieves a better performance than IDF with a significance value of the 0.23 improvement in F-measure. In the following experiments, DF instead of IDF is adopted for semantic verification. In the retrieval experiment, an objective evaluation was employed to assess retrieval with semantic verification for the top 5, 10, 15, and 20 ranked documents. Several methods were evaluated. Fig. 9 shows the analytical results. First, user feedback [18] provided on each retrieved document from the top-15 best matches over ten iterations achieved better performance than semantic verification, with an improvement of




2.1%. The score of semantic verification with semantic relation was slightly better than that of semantic verification without semantic relation. As shown before, the hypernym stream yielded only a marginal improvement for IR. In order to validate the value of using hypernyms, an experiment was performed to evaluate the impact of the semantic relation score. It was found that including the semantic relation actually improved performance, with an absolute increase of 1.28% in F-measure. Second, query expansion based on local context analysis [18] achieved a modest improvement compared to the one-pass retrieval method (without verification). There are three steps in local context analysis [18]: 1) the top 2000 ranked passages are retrieved; 2) the similarity between query and concept is computed using the TF-IDF ranking; and 3) the top 20 ranked concepts according to similarity are added to the original query. Third, this study achieved a relative improvement of around 7% compared to the one-pass retrieval method. Finally, semantic verification without adopting the semantic relation score slightly degraded the performance. The highest F-measure achieved was 0.625 when the top 15 documents were retrieved.

The evaluation was conducted on a Pentium IV 3.0-GHz PC with 2-GB RAM using a Windows XP system and a C++ implementation. The semantic dependency grammar and the conditional co-occurrence probability were trained in the offline process. In the multilevel knowledge indexing, the document-to-keyword relation was constructed as an indexing table. Table III lists the retrieval times (in milliseconds) for different sets of retrieved documents. The proposed method performed better with semantic verification than one-pass retrieval, but took a longer time to retrieve information. In the experiment, this study achieved a good performance when retaining 100 documents. The F-measure was calculated on the top 15 retrieval results. The system took less than 1 s of CPU time to complete a retrieval process. The forward–backward procedure improved the F-measure from 0.583 for one-pass retrieval to 0.625 for semantic verification, representing an absolute improvement of about 4% (around 7% relative).

V. CONCLUSION

This study has presented a novel approach to spoken document retrieval by multilevel knowledge indexing and semantic verification. The proposed method uses a three-level indexing method, based on transcription data, extracted keywords, and hypernyms of the keywords. Since the multilevel knowledge considers various representations of spoken documents, it enables the semantic analysis of spoken documents, rather than traditional spoken document retrieval using various transcriptions. The spoken keyword extraction method is applied to extract a bag of keywords from document contents. Based on the multilevel knowledge of spoken documents, this study incorporates the forward–backward semantic network for semantic verification, which is adopted to organize lexical-semantic relations in order to verify the retrieved results. The experimental results indicate that the proposed scheme, combining multilevel knowledge indexing and semantic verification, achieves better spoken document retrieval performance than previous methods.

REFERENCES

[1] J. Makhoul, F. Kubala, T. Leek, D. Liu, L. Nguyen, R. Schwartz, and A. Srivastava, “Speech and language technologies for audio indexing and retrieval,” Proc. IEEE, vol. 88, no. 8, pp. 1338–1353, Aug. 2000.
[2] A. G. Hauptmann and H. D. Wactlar, “Indexing and search of multimodal information,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1997, vol. 1, pp. 195–198.
[3] C. Hori and S. Furui, “A new approach to automatic speech summarization,” IEEE Trans. Multimedia, vol. 5, no. 3, pp. 368–378, Sep. 2003.
[4] W. Byrne, D. Doermann, M. Franz, S. Gustman, J. Hajic, D. Oard, M. Picheny, J. Psutka, B. Ramabhadran, D. Soergel, T. Ward, and W.-J. Zhu, “Automatic recognition of spontaneous speech for access to multilingual oral history archives,” IEEE Trans. Speech Audio Process., vol. 12, no. 4, pp. 420–435, Jul. 2004.
[5] R. Cutler, Y. Rui, A. Gupta, J. J. Cadiz, I. Tashev, L. He, A. Colburn, Z. Zhang, Z. Liu, and S. Silverberg, “Distributed meetings: A meeting capture and broadcasting system,” ACM Multimedia, pp. 503–512, 2002.
[6] J. H. L. Hansen, R. Huang, B. Zhou, M. Seadle, J. R. Deller, Jr., A. R. Gurijala, M. Kurimo, and P. Angkititrakul, “SpeechFind: Advances in spoken document retrieval for a national gallery of the spoken word,” IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 712–730, Sep. 2005.
[7] K. Ng, “Subword-based approaches for spoken document retrieval,” Ph.D. dissertation, Mass. Inst. Technol., Cambridge, MA, 2000.
[8] B. Chen, H. Wang, and L. Lee, “Discriminating capabilities of syllable-based features and approaches of utilizing them for voice retrieval of speech information in Mandarin Chinese,” IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 303–314, Jul. 2002.
[9] B. Logan, J.-M. Van Thong, and P. J. Moreno, “Approaches to reduce the effects of OOV queries on indexed spoken audio,” IEEE Trans. Multimedia, vol. 7, no. 5, pp. 899–906, Oct. 2005.
[10] D. A. James and S. J. Young, “A fast lattice-based approach to vocabulary independent word spotting,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1994, vol. 1, pp. 19–22.
[11] P. Yu, K. Chen, C. Ma, and F. Seide, “Vocabulary-independent indexing of spontaneous speech,” IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 635–643, Sep. 2005.
[12] S. Dharanipragada and S. Roukos, “A multistage algorithm for spotting new words in speech,” IEEE Trans. Speech Audio Process., vol. 10, no. 8, pp. 542–550, Nov. 2002.
[13] K. Thambiratnam and S. Sridharan, “Dynamic match phone-lattice searches for very fast and accurate unrestricted vocabulary keyword spotting,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2005, vol. 1, pp. 465–468.
[14] F. Crestani, “Towards the use of prosodic information for spoken document retrieval,” in Proc. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2001, pp. 420–421.
[15] R. Wilkinson and P. Hingston, “Using the cosine measure in a neural network for document retrieval,” in Proc. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 1991, pp. 202–210.
[16] L. Lee and B. Chen, “Spoken document understanding and organization,” IEEE Signal Process. Mag., vol. 22, no. 5, pp. 42–60, Sep. 2005.
[17] T. Hofmann, “Probabilistic latent semantic analysis,” in Proc. 15th Conf. Uncertainty Artif. Intell., 1999, pp. 289–296.
[18] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983.
[19] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing. Upper Saddle River, NJ: Prentice-Hall, 2001.
[20] C.-H. Wu, C.-H. Hsieh, and C.-L. Huang, “Speech sentence compression based on speech segment extraction and concatenation,” IEEE Trans. Multimedia, vol. 9, no. 2, pp. 434–437, Feb. 2007.
[21] T. R. Gruber, “A translation approach to portable ontologies,” Knowledge Acquisition, vol. 5, no. 2, pp. 199–220, 1993.
[22] HowNet. [Online]. Available: http://www.keenage.com/
[23] C.-H. Wu, J.-F. Yeh, and Y.-S. Lai, “Semantic segment extraction and matching for internet FAQ retrieval,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 7, pp. 930–940, Jul. 2006.
[24] H. Cui, R. Sun, K. Li, M.-Y. Kan, and T.-S. Chua, “Question answering passage retrieval using dependency relations,” in Proc. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2005, pp. 400–407.
[25] J. J. Rocchio, Jr., “Relevance feedback in information retrieval,” in The SMART Retrieval System: Experiments in Automatic Document Processing, G. Salton, Ed. Englewood Cliffs, NJ: Prentice-Hall, 1971, pp. 313–323.
[26] V.-W. Soo, C.-Y. Lee, C.-C. Li, S. L. Chen, and C. Chen, “Automated semantic annotation and retrieval based on sharable ontology and case-based learning techniques,” in Proc. 3rd ACM/IEEE-CS Joint Conf. Digital Libraries, 2003, pp. 61–72.
[27] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
[28] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[29] K.-J. Chen, C.-R. Huang, F.-Y. Chen, C.-C. Luo, M.-C. Chang, C.-J. Chen, and Z.-M. Gao, “Sinica treebank: Design criteria, representational issues and implementation,” in Building and Using Parsed Corpora, A. Abeillé, Ed. Dordrecht, The Netherlands: Kluwer, 2003, pp. 231–248.
[30] C.-H. Wu and Y.-J. Chen, “Recovery of false rejection using statistical partial pattern trees for sentence verification,” Speech Commun., vol. 43, pp. 71–88, 2004.
[31] H. Wang, B. Chen, J.-W. Kuo, and S.-S. Cheng, “MATBN: A Mandarin Chinese broadcast news corpus,” Int. J. Comput. Ling. Chinese Lang. Process., vol. 10, no. 2, pp. 219–236, 2005.
[32] Public Television Service Foundation. [Online]. Available: http://www.pts.org.tw/
[33] C. Buckley and E. Voorhees, “Evaluating evaluation measure stability,” in Proc. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2000, pp. 33–40.
[34] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.3). Cambridge, U.K.: Cambridge Univ., 2005.

Chien-Lin Huang (S’04) received the M.S. degree in computer science from National Cheng Kung University (NCKU), Tainan, Taiwan, R.O.C., in 2004. He is currently pursuing the Ph.D. degree in the Department of Computer Science and Information Engineering at NCKU. His research interests include speech processing, spoken language processing, and multimedia information retrieval.

Chung-Hsien Wu (S’88–M’88–SM’03) received the B.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 1981, and the M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University (NCKU), Tainan, Taiwan, in 1987 and 1991, respectively. Since August 1991, he has been with the Department of Computer Science and Information Engineering, NCKU. He became a Professor in August 1997. From 1999 to 2002, he served as the Chairman of the department. He was also with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, in summer 2003 as a Visiting Scientist. He is the Editor-in-Chief of the International Journal of Computational Linguistics and Chinese Language Processing. His research interests include speech recognition, text-to-speech, multimedia information retrieval, spoken language processing, and sign language processing for the hearing-impaired. Dr. Wu is a member of the International Speech Communication Association (ISCA) and ROCLING.
