SYNTHESIS AND CODING OF CONTINUOUS SPEECH WITH THE NONLINEAR OSCILLATOR MODEL

Gernot Kubin*

Institute of Communications and High-Frequency Engineering
Vienna University of Technology
Gusshausstrasse 25/389, A-1040 Vienna, Austria
Email: kubin@email.tuwien.ac.at

To appear in Proc. ICASSP-96. (c) IEEE 1996.

* A significant part of this work has been carried out during a sabbatical leave at the Information Principles Research Laboratory of AT&T Bell Laboratories, Murray Hill, New Jersey. Further support through grant P08779-TEC from the Austrian Science Foundation (FWF) is gratefully acknowledged.

ABSTRACT

The Nonlinear Oscillator Model for speech signals has been previously applied to time-scale modification of speech. In the present contribution, we report on the application of the model to speech synthesis and waveform coding. For speech synthesis, a new radial basis function representation of the oscillator feedback nonlinearity is introduced. The local approximation properties of this representation have been investigated and it is shown how it can be adapted to continuous speech on a frame-by-frame basis. For speech coding, synchronization of the nonlinear oscillator model to the original waveform is demonstrated and applied in the design of a new backward-adaptive speech coding system.

1. NONLINEAR OSCILLATOR MODEL

The nonlinear oscillator model [1, 2] represents the speech waveform x(n) as the output of a nonlinear feedback system that is capable of self-sustained oscillations without external input. The model is characterized by its order N (i.e., the number of delayed signal samples x(n-k) in its feedback path), the delay M between samples, and the feedback nonlinearity a(·) which maps the vector of delayed samples ~x(n-1) onto the next output x(n), cf. also figure 1:

    x(n) = a(~x(n-1)).                                          (1)

Note that, in general, and contrary to the state vector of standard autoregressive models, the state vector ~x(n) has a non-trivial delay M > 1,

    ~x(n) = [x(n), x(n-M), ..., x(n-(N-1)M)]^T.                 (2)

Figure 1. Nonlinear feedback oscillator structure. M is the delay parameter, a(·) the feedback nonlinearity.

In [1], we have demonstrated that it is possible to estimate the nonlinear map a(·) from given speech data and to use the estimated system to regenerate speech without resorting to an excitation signal driving the system. The gain from this modeling paradigm is much more dramatic than from any prediction approach, be it linear or nonlinear, as we simply eliminate the prediction error from our model. (A minimal sketch of this free-running recursion is given at the end of this section.) In the application of this model, two choices have to be made:

- Synchronous versus asynchronous synthesis. Nonlinear dynamical systems can be synchronized even in the chaotic regime [3] if some partial information of the original system is transmitted to the synthesizer system. This can be exploited for waveform-coding purposes. The other alternative is asynchronous synthesis, where the synthesizer operates rather like a vocoder, i.e., without any information on waveform details of the original signal. This avenue is pursued for speech synthesis applications.

- Time-invariant versus adaptive operation. Time-invariant models are only useful for sustained speech sounds and can serve as an initial test of a model structure. Synthesis of continuous speech requires the online update of the system such that its self-sustained oscillations follow the transitions between attractors corresponding to the sequence of acoustic segments in continuous speech. This continuous adaptation process can be organized in a frame-based approach. To illustrate this point, fig. 2 displays several snapshots obtained by time-domain windowing of continuous speech. The first snapshot shows the extremely fast onset of a limit cycle for the word-initial nasal [n], the second an almost periodic vowel attractor, and the third fricative noise. Note that the figure shows both the time-domain representation of the waveform segments and the corresponding reconstructed phase portraits of the oscillator model, projected onto two dimensions. This frame-adaptive display suggests that local phase portraits are meaningful for transient speech segments, where they capture the dynamics of onsets and decays. Even if, in general, it is difficult to extend chaos-theoretic tools to time-varying systems, time-varying phase-space reconstructions give an accurate account of the transitions between attractors (e.g., bifurcations from voiced to unvoiced) just as they occur naturally in the given speech waveform.

Figure 2. Local phase-space portraits: (a) original waveform with window placement, time scaled in ms, window length 35 ms; (b) enlarged time-domain snapshots and reconstructed phase portraits, delay τ = 0.75 ms, projection on N = 2 dimensions.
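To make the recursion concrete, the following minimal Python sketch free-runs an oscillator given some estimated map a(·). It follows eqs. (1)-(2) directly, but it is an illustration only: the function names and default values (free_run, a_map, N = 6, M = 24) are ours, not the paper's implementation.

    import numpy as np

    def free_run(a_map, x0, n_samples, N=6, M=24):
        # Free-running synthesis by eq. (1): each new sample is a_map applied
        # to the delay-embedded state vector ~x(n-1) of eq. (2).
        #   a_map : callable mapping an N-vector to the next scalar sample
        #   x0    : initial history, at least (N-1)*M + 1 samples long
        x = list(x0)
        for _ in range(n_samples):
            # ~x(n-1): most recent sample first, then samples spaced M apart
            state = np.array([x[-1 - k * M] for k in range(N)])
            x.append(float(a_map(state)))  # eq. (1): next output, fed back
        return np.array(x[len(x0):])

Because the loop feeds its own output back, any bounded map keeps the synthesis running without an external excitation signal, which is the property exploited throughout the paper.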

2. SYNTHESIS WITH RADIAL BASIS FUNCTION (RBF) APPROXIMATION

Tishby [4] reports only limited success when applying multilayer perceptron networks with feedback as nonlinear oscillators for speech resynthesis. Observing that reconstructed speech trajectories are highly localized in phase space, one is led to try other neural network structures that allow for better control of the locality of approximation. Radial basis function (RBF) networks are a class of such specialized structures [5] where the global approximation problem is decomposed into several local approximation problems around centers ~c_i distributed over phase space. For an arbitrary point ~x, the global nonlinearity a(~x) is approximated by the sum of RBF kernel functions k(r) which depend only on the distance r = ||~x - ~c_i|| from the centers. A typical choice for the RBF kernel is the Gaussian

    k(r) = exp( -r^2 / (2 N γ^2 σ_x^2) ),                       (3)

where N is the embedding dimension, σ_x^2 is an estimate of the signal power in the given speech frame, and γ is the normalized spread of the kernel function. The spread parameter can be used to trade the locality for the smoothness of the approximation function: small γ means high locality, and large γ means more global smoothing.

An efficient learning algorithm has been presented in [6] which allows the simultaneous optimization of all parameters of an RBF network, i.e., RBF centers, kernel spreads, and output weights. This algorithm is also used in [6] for nonlinear prediction and resynthesis of sustained voiced speech sounds. While prediction works satisfactorily, difficulties are reported for the resynthesis experiments: extensive manual fine-tuning of various parameters (such as the model order N, the delay M, the number and initial position of the RBF centers, and several a priori variance estimates needed in the training algorithm) is needed to achieve stable self-sustained oscillations of the model. Difficulties apparently increase for speech sounds with cluttered phase portraits, as is the case for sounds with a high first formant frequency such as [a].

We propose a new RBF approximation technique which overcomes this problem and, at the same time, allows on-line adaptation of the oscillator model to continuous speech. Rather than using long signal records to learn the underlying nonlinearity incrementally, we process only blocks or frames of L samples at a time. The RBF kernels are used to interpolate between the realization pairs [~x(0), x(1)], ..., [~x(L-1), x(L)] obtained from the reconstructed trajectory. The normalized estimate of the nonlinear map a(~x) is then defined as

    â(~x) = [ Σ_{l=1}^{L} k(||~x - ~x(l-1)||) x(l) ] / [ Σ_{l=1}^{L} k(||~x - ~x(l-1)||) ].   (4)

This means that we use each point on the measured trajectory as a center for the RBF network, ~c_l = ~x(l-1), l = 1, ..., L, which is a costly approach but assures good approximation results even for short frame lengths L. The normalization in eq. (4) enforces that, for all ~x, min(x(l)) < â(~x) < max(x(l)), which proves the stability of the oscillator model for bounded training data. Note that the degree of locality and smoothing can still be varied by modifying the normalized spread γ (a sketch of this estimator follows the list below):

- For γ → 0, the approximation function becomes piecewise constant on Voronoi cells around the centers ~x(l-1), such that we effectively turn off the capability to average over multiple realization pairs. This is equivalent to the nearest-neighbor based adaptive-codebook technique presented in [1].

- For γ → ∞, the approximation function degenerates to a constant over the whole phase space; in this case we lose all local approximation capabilities.
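The frame-adaptive estimator of eqs. (3)-(4) amounts to normalized kernel regression over one frame. The Python sketch below is a direct transcription under our reconstruction of eq. (3); all identifiers (make_rbf_map, gamma, the small numerical guards) are illustrative and not taken from the paper.

    import numpy as np

    def make_rbf_map(frame, N=6, M=24, gamma=0.1):
        # Build â(·) from one frame holding L + (N-1)*M + 1 samples: every
        # trajectory point ~x(l-1) becomes an RBF center, as in eq. (4).
        offset = (N - 1) * M
        L = len(frame) - offset - 1
        centers = np.array([[frame[offset + l - 1 - k * M] for k in range(N)]
                            for l in range(1, L + 1)])                # ~x(l-1)
        targets = np.array([frame[offset + l] for l in range(1, L + 1)])  # x(l)
        sigma2 = np.var(frame) + 1e-12   # signal-power estimate of eq. (3)

        def a_hat(x):
            r2 = np.sum((centers - x) ** 2, axis=1)
            w = np.exp(-r2 / (2.0 * N * gamma ** 2 * sigma2))         # eq. (3)
            # Normalized estimate of eq. (4): a weighted average of the
            # targets, hence bounded by min(x(l)) and max(x(l)).
            return float(np.dot(w, targets) / (np.sum(w) + 1e-30))
        return a_hat

Shrinking gamma toward 0 makes a_hat follow the single nearest realization pair (the adaptive-codebook limit of [1]); growing it without bound collapses a_hat toward the frame mean, matching the two limiting cases above.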

Experimental results are displayed in fig. 3. The original utterance is spoken by a female speaker and sampled at 8 kHz. It is first interpolated to 32 kHz to achieve a sufficiently dense sampling of the trajectories in phase space. Next, it is processed in overlapping frames of 10 ms duration with an RBF-based oscillator model of order N = 6 and delay M = 24 (i.e., 0.75 ms). A useful choice for the normalized kernel spread is γ = 0.1, for which typically about 10 to 20 points ~x(l-1) of the given trajectory contribute a weight k(||~x - ~x(l-1)||) > 0.1 in the estimator of eq. (4). These 10 to 20 points can be seen as the local neighborhood (cf. the center plot in figure 3) which effectively controls the approximation function. Whenever a frame extends over an onset (i.e., first silence, then speech), the number of close neighbors shows a peak. This is an artifact due to the particular power normalization in eq. (3) but does not affect system performance. The synthesizer state vector is initialized with the zero vector, from which fast initial convergence to the first speech attractor is observed. From there on, the free-running oscillator with frame-adaptive nonlinearity synthesizes continuous speech which is perceptually very close to the original but which does not exhibit waveform synchrony. The oscillator model runs smoothly over voiced-unvoiced transitions, too. While unvoiced segments do not exhibit a deterministic signal structure, the details of their noise-like waveforms do not matter [2]. (A hypothetical end-to-end sketch with these parameter choices is given below.)

Figure 3. Speech synthesis based on a nonlinear oscillator model using radial basis function approximation: original waveform (top), number of close neighbors (center), and synthetic waveform (bottom).
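Combining the two sketches above, a hypothetical end-to-end loop with the parameter choices quoted in this section (10 ms frames at 32 kHz, i.e., 320 samples, N = 6, M = 24, gamma = 0.1) could look as follows. The paper processes overlapping frames; for brevity this sketch uses non-overlapping frames, and speech32 is an assumed float array already interpolated to 32 kHz.

    import numpy as np

    def synthesize(speech32, frame_len=320, N=6, M=24, gamma=0.1):
        # Frame-adaptive resynthesis: re-estimate the nonlinearity on each
        # analysis frame, then free-run the oscillator for one frame length.
        context = (N - 1) * M + 1        # history needed for one state vector
        out = list(np.zeros(context))    # zero initial state, as in the text
        for start in range(0, len(speech32) - frame_len - context, frame_len):
            a_map = make_rbf_map(speech32[start:start + frame_len + context],
                                 N=N, M=M, gamma=gamma)
            out.extend(free_run(a_map, out, frame_len, N=N, M=M))
        return np.array(out[context:])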

3. BACKWARD-ADAPTIVE SPEECH CODING

The previous synthesis application is characterized by the lack of synchronism between the original and the synthetic signal. In speech coding, we distinguish between parametric-coding and waveform-coding approaches. The latter maintains synchronism of the original and decoded speech waveforms; therefore, waveform coding provides an interesting application example for synchronized nonlinear dynamical systems. Synchronization is a pervasive issue in digital communications and has received much interest in the context of chaotic systems [3, 7]. Only recently, synchronization between an estimated nonlinear model and the original data has been demonstrated experimentally [8]. We have used related ideas to develop a waveform coder based on a synchronized nonlinear oscillator model for speech, eq. (1) and fig. 1. A block diagram is given in fig. 4. This coder can be seen as a special case of the general class of recursive source coding systems introduced by [9]. On the other hand, relationships to the so-called 'pitch-synchronous' speech processing techniques [10] are rather limited, as the latter perform static synchronization by splicing signals at pitch-mark locations, whereas the new technique relies on a dynamical synchronization principle, as explained below.

Figure 4. Waveform coding with synchronized nonlinear oscillators.

In the block diagram, the signal source produces the original speech waveform which is to be transmitted over the channel. We assume that the signal source can be modeled as a nonlinear dynamical system exhibiting self-sustained oscillations. The speech coder is a model system which is designed such that it can replicate the original waveform with its synthetic speech output. To this end, its state-transition map has to be adapted to match the signal source. As we want to move a step beyond speech synthesis to waveform coding, we also have to provide a control mechanism which drives the synthetic output waveform towards the original waveform.

In the first demonstrations of synchronization for chaotic circuits [3, 7], this control mechanism was obtained by injecting one component of the state vector of the 'master oscillator' into the corresponding state component of the 'slave oscillator'. For several known chaotic systems, it was shown that the two coupled oscillators converge onto synchronous trajectories. More recently, ideas from control theory have been applied to the synchronization problem [8, 11]. In this approach, feedback control is based on the instantaneous error between a state vector component of the master oscillator and that of the slave oscillator. This control mechanism can work even if only partial information on the error is available, such as a quantized error signal. The explanation for the viability of this approach is very much like that for the signed-error feedback used to simplify certain adaptive filtering algorithms; signed-error algorithms have been applied successfully, e.g., in the ITU-T G.726 speech coding standard based on adaptive differential pulse code modulation (ADPCM).

In our case, the signal waveform is the only observed state vector component of the source oscillator. At the encoder, this component can be compared to the waveform output of the synthesis oscillator. This waveform error is quantized with an adaptive quantizer Q to provide the partial information for feedback control. The quantized signal controls the synthesis oscillator, which is able to lock onto the original waveform and maintain good synchrony even with the limited information contained in the quantized waveform error. Note that in the synthesis oscillator, all state vector components will quickly be synchronized, as they are obtained by simple delay operations from the oscillator output. As any copy of the oscillator (with the same state-transition map) will produce an identical output if presented with the same control input, it suffices to transmit the control signal over the channel. At the receiver side, speech is decoded with another oscillator driven by this control. Thus, the overall setup consists of three synchronized oscillators: the signal source, the coder, and the decoder. A concise summary of this system is given by the following equations, which can be regarded as the discrete-time counterparts of [8, eq. (4)]:

    s(n) = a(~s(n-1))                 (signal source)            (5)
    x̂(n) = â(~x(n-1))                 (oscillator model)         (6)
    x(n) = x̂(n) + Q(s(n) - x̂(n))      (synchronization)          (7)


In eq. (6), the nonlinearity â(·) can be obtained from the adaptive-codebook approach [1] by constructing state-transition codebooks from a frame of the most recent decoded speech samples. Implicitly, this provides the backward adaptation of the coder needed to follow the changing local attractors of continuous speech. Thus the transmitted control signal carries both the information on the temporal changes of the attractors and on the synchronization of states. Note that when we reduce the resolution of the quantizer Q to zero bits, we lose synchronization between s(n) and x(n) completely, whereas, with infinite resolution of Q, perfect synchronization x(n) = s(n) is achieved.

Equations (5)-(7) are similar to those used in conventional waveform coders such as the ADPCM standard ITU-T G.726. The distinctive feature is that the new coder replaces the linear ARMA predictor by a nonlinear oscillator, reflecting the different assumptions about the signal-generating mechanism. To allow for direct comparison, the quantizer Q is designed after the adaptive 4-bit quantizer used in the 32 kbit/s ADPCM standard. The new adaptive-codebook pulse code modulation (ACPCM) system has been tested with speech material from 3 female and 3 male speakers, totaling one minute of speech. The oscillator-based ACPCM coder achieves a segmental signal-to-noise ratio of 26.3 dB (averaged over all data), whereas the ADPCM standard reaches 25.3 dB. However, the 1 dB difference should not be regarded as an important issue here; the remarkable fact is that the oscillator-based approach can be integrated in an existing coding standard such that overall speech quality is maintained.

Note that the present approach is different from any adaptive prediction approach, including the nonlinear attempts reported in [12, 13], since the receiver oscillator will continue to oscillate on the current speech attractor if the channel is interrupted temporarily. Contrary to that, predictive systems always require some input to maintain an output signal. This feature may be considered an advantage of the oscillator model, as it allows interpolation across severe data losses (such as lost packets [14]), or it may be considered a disadvantage, as it can also imply error-propagation problems if synchrony between the transmitter and the receiver is lost. Further study is needed to resolve this issue.

Computationally, the adaptive-codebook approach is more involved than the simple sign-based gradient algorithm used for predictor adaptation in ADPCM. However, stability of parameter adaptation in recursive systems is always difficult to guarantee [12]; the nonparametric table lookup employed in ACPCM offers much simpler stability control, as it just recycles past values for its output. A toy sketch of the encoder loop is given below.
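For illustration, the synchronization loop of eqs. (5)-(7) can be written out as follows. The paper adapts both the quantizer (after ITU-T G.726) and the codebook backward from decoded samples; this toy sketch instead keeps a fixed uniform 4-bit quantizer and a fixed map a_hat, so it shows the control flow only, and all names (quantize, encode, step) are ours.

    import numpy as np

    def quantize(e, step=0.01, bits=4):
        # Fixed uniform quantizer standing in for the adaptive quantizer Q;
        # returns the reconstructed (quantized) error value.
        levels = 2 ** (bits - 1)
        return float(np.clip(np.round(e / step), -levels, levels - 1)) * step

    def encode(s, a_hat, N=6, M=24):
        # Encoder of fig. 4: the model oscillator predicts x̂(n) (eq. (6)),
        # the quantized error q(n) is the transmitted control signal, and the
        # decoded sample x(n) = x̂(n) + q(n) (eq. (7)) re-enters the oscillator
        # state. An identical decoder driven by the same q(n) stays in sync.
        context = (N - 1) * M + 1
        x = list(s[:context])            # assume a shared initial state
        codes = []
        for n in range(context, len(s)):
            state = np.array([x[-1 - k * M] for k in range(N)])
            x_hat = float(a_hat(state))  # eq. (6): oscillator output
            q = quantize(s[n] - x_hat)   # quantized waveform error
            codes.append(q)
            x.append(x_hat + q)          # eq. (7): synchronization
        return np.array(x), codes

As the quantizer resolution grows, x(n) approaches s(n), matching the limiting cases discussed above.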

4. CONCLUSION

We have shown how the nonlinear oscillator model for speech signals can be applied in speech synthesis and speech coding applications. The feedback nonlinearity of the oscillator can be parameterized with various local approximations based on radial basis functions which extend previously reported methods such as the adaptive-codebook approach. Online adaptive operation of the oscillator model has been demonstrated and shows that the limited number of signal samples available in one stationary segment or frame is not prohibitive for the estimation of an appropriate nonlinear model of low dimensionality. We have further demonstrated how the nonlinear oscillator can be synchronized to a given speech waveform. This new dynamical synchronization mechanism allows the development of a fully quantized speech coder which is competitive with existing standards. This approach could find future applications in low-delay wide-band coding of speech, where the available bit budgets permit operation in a waveform-coding regime.

REFERENCES

[1] G. Kubin and W. B. Kleijn, "Time-scale modification of speech based on a nonlinear oscillator model," in Proc. Int. Conf. Acoust. Speech Sign. Process., vol. I, (Adelaide, Australia), pp. I-453-I-456, Apr. 1994.
[2] G. Kubin, "Nonlinear processing of speech," in Speech Coding and Synthesis (W. B. Kleijn and K. K. Paliwal, eds.), pp. 557-610, Amsterdam etc.: Elsevier, 1995.
[3] T. L. Carroll and L. M. Pecora, "Synchronizing chaotic circuits," IEEE Trans. Circ. Syst., vol. 38, pp. 453-456, Apr. 1991.
[4] N. Tishby, "A dynamical systems approach to speech processing," in Proc. Int. Conf. Acoust. Speech Sign. Process., (Albuquerque, NM), pp. 365-368, Apr. 1990.
[5] F. M. Aparicio Acosta, "Radial basis function and related models: An overview," Signal Process., vol. 45, July 1995.
[6] M. Birgmeier, "A fully Kalman-trained radial basis function network for nonlinear speech modeling," in Proc. 1995 IEEE Int. Conf. Neural Networks, ICNN'95, (Perth, Western Australia), Nov. 1995.
[7] K. M. Cuomo, A. V. Oppenheim, and S. H. Strogatz, "Synchronization of Lorenz-based chaotic circuits with applications to communications," IEEE Trans. Circ. Syst. II, vol. 40, pp. 626-633, Oct. 1993.
[8] R. Brown, N. F. Rulkov, and E. R. Tracy, "Modeling and synchronizing chaotic systems from experimental data," Physics Letters A, vol. 194, pp. 71-76, 1994.
[9] G. Gabor and Z. Györfi, Recursive Source Coding. New York, Berlin, etc.: Springer, 1986.
[10] E. Moulines and W. Verhelst, "Time-domain and frequency-domain techniques for prosodic modification of speech," in Speech Coding and Synthesis (W. B. Kleijn and K. K. Paliwal, eds.), pp. 519-555, Amsterdam (The Netherlands): Elsevier, 1995.
[11] M. P. Kennedy and H. Dedieu, "Synchronization of dynamical systems: Recent progress, potential applications and limitations," in Proc. IEEE Workshop Nonlin. Signal and Image Process., NSIP'95, (Halkidiki, Greece), pp. 121-124, June 1995.
[12] E. Mumolo, A. Carini, and D. Fracescato, "ADPCM with nonlinear predictors," in Signal Processing VII: Theories and Applications (M. Holt, C. F. Cowan, P. M. Grant, and W. A. Sandham, eds.), vol. 1, pp. 387-390, Amsterdam: Elsevier, Sept. 1994.
[13] S. Haykin and L. Li, "Nonlinear adaptive prediction of nonstationary signals," IEEE Trans. Signal Process., vol. 43, pp. 526-535, Feb. 1995.
[14] R. E. Bogner and T. Li, "Pattern search prediction of speech," in Proc. Int. Conf. Acoust. Speech Sign. Process., vol. 1, (Glasgow, Scotland), pp. 180-183, May 1989.
