Speaker rate variation - Time Warping
What happens when people vary their rate of speech during a
phrase? How can a speaker verification system with a password of
"Project" accept the user when he says "Prrroooject"?
- A simple linear squeezing of this longer password will not
match the key signal, because the user slowed down only the
first syllable and kept a normal speed for the "ject" syllable.
- We need a way to non-linearly time-scale
the input signal to the key signal so that we can line up appropriate
sections of the signals (i.e. so we can compare "Prrrooo" to "Pro"
and "ject" to "ject").
- The solution to this problem is a technique known as
"Dynamic Time Warping" (DTW). This procedure computes a
non-linear mapping of one signal onto another by minimizing the
total distance between the points that the mapping pairs up.
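To make the failure of linear squeezing concrete, here is a minimal sketch. The short number sequences are toy stand-ins for the speech signals (an assumption made for illustration, not data from this page), and `linear_squeeze` is a hypothetical helper name.

```python
def linear_squeeze(signal, target_len):
    """Resample `signal` to `target_len` points with uniform linear interpolation."""
    if target_len == 1:
        return [signal[0]]
    scale = (len(signal) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * scale                      # position in the original signal
        lo = min(int(pos), len(signal) - 1)  # sample at or before pos
        hi = min(lo + 1, len(signal) - 1)    # next sample (clamped at the end)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out

# Toy signals: the key at normal speed, and an input whose first part is drawn out.
key = [1.0, 2.0, 3.0, 4.0]
slow = [1.0, 1.0, 1.0, 2.0, 3.0, 4.0]

squeezed = linear_squeeze(slow, len(key))
# Point-by-point mismatch after the uniform squeeze; it stays above zero
# because the stretch was not uniform across the signal.
mismatch = sum(abs(a - b) for a, b in zip(squeezed, key))
print(squeezed, mismatch)
```

The uniform squeeze compresses the unchanged "ject" part as much as the drawn-out beginning, so the two signals no longer line up.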
|
Dynamic Time Warping Algorithm
Define the two signals and the difference between the input
signal and a set "key" signal (one with no variation in rate)
- In order to get an idea of how to minimize the distances
between two signals, let's go ahead and define two: K(n),
n = 1,2,...,N, and I(m), m = 1,2,...,M.
- We can develop a local distance matrix (LDM) which contains
the differences between each point of one signal and all
the points of the other signal. Basically, if n = 1,2,...,N
defines the columns of the matrix and m = 1,2,...,M defines
the rows, the values in the local distance matrix LDM(m,n)
would be the absolute value of (I(m) - K(n)).
LDM(m,n) = |I(m) - K(n)|
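The definition above can be sketched in a few lines of Python. The function name `local_distance_matrix` and the toy signals are assumptions made for illustration. Note that Python indexes from 0, so `ldm[m][n]` here corresponds to LDM(m+1,n+1) in the notation above.

```python
def local_distance_matrix(I, K):
    """Return the LDM as a list of rows: rows follow I (index m), columns follow K (index n)."""
    return [[abs(i_m - k_n) for k_n in K] for i_m in I]

# Toy signals (illustrative only): K is the key, I is the slowed-down input.
K = [1.0, 2.0, 3.0, 4.0]
I = [1.0, 1.0, 1.0, 2.0, 3.0, 4.0]

ldm = local_distance_matrix(I, K)
for row in ldm:
    print(row)
```

The matrix has M rows and N columns; entries near zero mark points of the two signals that resemble each other.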
|
Goal
Find a warping function (f) that minimizes the total
distance between the respective points of the two signals.
find f : K(i) ---> I(j)
such that the sum of LDM(j,i) over all matched pairs is minimized,
where each step adds LDM(j,i) to the smallest total among the
surrounding values
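One common way to carry out this minimization is dynamic programming over an accumulated distance matrix. The sketch below is not necessarily the exact procedure used here: it assumes the path starts at the first points of both signals, ends at their last points, and that the "surrounding values" are the left, lower, and diagonal neighbors. The name `dtw` and the toy signals are illustrative.

```python
def dtw(I, K):
    """Return (total distance, warping path) aligning input I with key K.

    The path is a list of 0-based (m, n) pairs meaning I[m] is matched to K[n].
    """
    M, N = len(I), len(K)
    ldm = [[abs(I[m] - K[n]) for n in range(N)] for m in range(M)]

    INF = float("inf")
    acc = [[INF] * N for _ in range(M)]  # accumulated distance matrix
    acc[0][0] = ldm[0][0]
    for m in range(M):
        for n in range(N):
            if m == 0 and n == 0:
                continue
            # Assumed step pattern: come from above, from the left, or diagonally.
            best_prev = min(
                acc[m - 1][n] if m > 0 else INF,
                acc[m][n - 1] if n > 0 else INF,
                acc[m - 1][n - 1] if m > 0 and n > 0 else INF,
            )
            acc[m][n] = ldm[m][n] + best_prev

    # Backtrack from the end, always moving to the cheapest neighboring cell.
    m, n = M - 1, N - 1
    path = [(m, n)]
    while (m, n) != (0, 0):
        candidates = []
        if m > 0 and n > 0:
            candidates.append((acc[m - 1][n - 1], m - 1, n - 1))
        if m > 0:
            candidates.append((acc[m - 1][n], m - 1, n))
        if n > 0:
            candidates.append((acc[m][n - 1], m, n - 1))
        _, m, n = min(candidates)
        path.append((m, n))
    path.reverse()
    return acc[M - 1][N - 1], path

# Toy signals (illustrative only): the first value of the key is drawn out in I.
K = [1.0, 2.0, 3.0, 4.0]
I = [1.0, 1.0, 1.0, 2.0, 3.0, 4.0]
total, path = dtw(I, K)
print(total)
print(path)
```

For these toy signals the path pairs the first three input points with the first key point, which is the "Prrrooo" to "Pro" alignment described earlier, and then advances one-to-one through the rest.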
|
Results