A2RT Connected Digit Recognition Experiments
Abstract
This document describes the corpora used for the Connected Digit Recognition
research within the research group A2RT. Three speech databases
containing Dutch connected digit strings are involved: Polyphone, SESP and
Casimir. All these corpora contain telephone speech recorded in a wide variety
of acoustic conditions.
Databases
SESP
SESP is a Dutch connected digit database that was originally meant
for Speaker Verification. The format of the speech files is A-law.
POLYSPDAT
The first part of this name refers to the Dutch Polyphone speech database;
POLYSPDAT contains the digit strings of this corpus. The second part of the
name refers to SpeechDat: we chose to work with the SpeechDat database
format.
All noise tokens in the Polyphone transcription were replaced by a single
noise annotation.
The format of the speech files is A-law.
CASIMIR
-no detailed information available-
The format of the speech files is A-law.
Acoustic Feature Extraction
Acoustic preprocessing is done with the research version of the
PHIlips COntinuous Speech recognizer, PHICOS. The signal
in the sound files is split into 16 ms frames with a 10 ms frame shift. The frames
are subjected to a Hamming window and a filter bank analysis, resulting in 14 Mel-scale
log energy coefficients per frame.
Next, 14 Mel-scale Frequency Cepstrum coefficients (c0 - c13) and their first
order derivatives are computed. These 28 features are the components of the
feature vectors.
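The front end described above can be approximated with the following minimal sketch. It is not the PHICOS implementation: the 8 kHz sampling rate, the triangular filter shape, the DCT-II used for the cepstrum, and the central-difference deltas are all assumptions made for illustration, and every function name is hypothetical.

```python
import numpy as np

SAMPLE_RATE = 8000   # assumption: telephone speech sampled at 8 kHz
FRAME_LEN = 128      # 16 ms at 8 kHz
FRAME_SHIFT = 80     # 10 ms at 8 kHz
NUM_BANDS = 14       # Mel-scale log energy coefficients per frame
NUM_CEPSTRA = 14     # c0 .. c13


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_bands, fft_size, sample_rate):
    """Triangular filters spaced evenly on the Mel scale (assumed shape)."""
    edges_hz = mel_to_hz(
        np.linspace(0.0, hz_to_mel(sample_rate / 2.0), num_bands + 2))
    bin_freqs = np.arange(fft_size // 2 + 1) * sample_rate / fft_size
    bank = np.zeros((num_bands, bin_freqs.size))
    for b in range(num_bands):
        lo, mid, hi = edges_hz[b], edges_hz[b + 1], edges_hz[b + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def extract_features(signal):
    """Return an array of shape (num_frames, 28): 14 cepstra + 14 deltas."""
    signal = np.asarray(signal, dtype=float)
    num_frames = 1 + (len(signal) - FRAME_LEN) // FRAME_SHIFT
    # Row t holds the sample indices of frame t.
    idx = (np.arange(FRAME_LEN)[None, :]
           + FRAME_SHIFT * np.arange(num_frames)[:, None])
    frames = signal[idx] * np.hamming(FRAME_LEN)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bank = mel_filterbank(NUM_BANDS, FRAME_LEN, SAMPLE_RATE)
    # Small floor avoids log(0) on silent frames.
    log_energy = np.log(power @ bank.T + 1e-10)
    # DCT-II of the log energies gives the cepstral coefficients c0..c13.
    n = np.arange(NUM_BANDS)[None, :]
    k = np.arange(NUM_CEPSTRA)[:, None]
    dct = np.cos(np.pi * k * (n + 0.5) / NUM_BANDS)
    cepstra = log_energy @ dct.T
    # First-order derivatives, approximated by differences along time.
    deltas = np.gradient(cepstra, axis=0)
    return np.hstack([cepstra, deltas])
```

Under these assumptions, one second of audio (8000 samples) yields 99 frames of 28 features each.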
Training
Corpus
Acoustic models are trained on either one of these two training corpora:
- trainmatch.corpus
- trainbestmatch.corpus
trainmatch.corpus consists of utterances taken from SESP and POLYSPDAT.
First we made two segmentations: one using a forced alignment with general
purpose phone models and the other using digit word models. The degree of
agreement between these segmentations determines the primary sorting order
of all utterances. The following criteria were also applied:
- signal-to-noise ratio < 10
- clipping rate == 0%
- minimum sample value < -4096 and maximum sample > 4096
- string length != 1 (too short), != 10 (telephone numbers are reserved for
tests) and != 14 (many so-called 'scope numbers', which are all too much alike)
- gender category other than u or x, i.e. utterances for which the gender of
the speaker was unknown were excluded.
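The criteria above can be read as a single predicate per utterance, as in the sketch below. The comparisons are copied literally from the list; the unit of the signal-to-noise value is not stated in the text, and the record layout (`Utterance` and its field names) is hypothetical.

```python
from dataclasses import dataclass

EXCLUDED_LENGTHS = {1, 10, 14}   # string lengths ruled out by the list above
UNKNOWN_GENDER = {"u", "x"}      # gender categories that are rejected


@dataclass
class Utterance:
    snr: float            # value reported by the SNR tool (unit not stated)
    clipping_rate: float  # percentage of clipped samples
    min_sample: int       # smallest linear sample value
    max_sample: int       # largest linear sample value
    digits: str           # the digit string, one character per digit
    gender: str           # gender category of the speaker


def passes_criteria(u):
    """Apply the listed criteria exactly as written in the text."""
    return (
        u.snr < 10
        and u.clipping_rate == 0
        and u.min_sample < -4096
        and u.max_sample > 4096
        and len(u.digits) not in EXCLUDED_LENGTHS
        and u.gender not in UNKNOWN_GENDER
    )
```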
A uniform distribution of the frequency of each kind of digit was obtained
by selection rules concerned with:
- the number of occurrences per digit (< 15500)
- the number of digits per speaker (< 400)
- the number of digits in utterance initial position (< 2000)
- the number of digits in utterance final position (< 2000)
- the number of digits and utterances per gender
A set of 9922 utterances satisfied these criteria.
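One way to apply such selection rules is a greedy pass over the utterances in their sorted order, accepting an utterance only while every counter stays below its limit. The sketch below is an assumption about the procedure, not a description of the tool that was actually used. It treats the initial and final position limits as per-digit limits, assumes non-empty digit strings, and leaves out the gender balancing because the text gives no threshold for it. All names are hypothetical.

```python
from collections import Counter


def select_balanced(utterances, max_per_digit=15500, max_per_speaker=400,
                    max_initial=2000, max_final=2000):
    """Greedy selection over (speaker, digits) pairs, best-ranked first."""
    per_digit, per_speaker = Counter(), Counter()
    initial, final = Counter(), Counter()
    selected = []
    for speaker, digits in utterances:
        counts = Counter(digits)
        fits = (
            all(per_digit[d] + n < max_per_digit for d, n in counts.items())
            and per_speaker[speaker] + len(digits) < max_per_speaker
            and initial[digits[0]] + 1 < max_initial
            and final[digits[-1]] + 1 < max_final
        )
        if not fits:
            continue  # accepting this utterance would reach a limit
        per_digit.update(counts)
        per_speaker[speaker] += len(digits)
        initial[digits[0]] += 1
        final[digits[-1]] += 1
        selected.append((speaker, digits))
    return selected
```

Because the pass is greedy, the result depends on the input order, which is why the utterances are assumed to arrive already ranked by segmentation agreement.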
One of the first generations of the CDR models was tested on the trainmatch
corpus. Error analyses showed that 169 utterances contained
transcription errors or were falsely accepted by the SNR tool. These were
manually removed and the remaining 9753 utterances are used for almost every
experiment.
trainbestmatch.corpus
This corpus contains the 1000 best utterances of trainmatch.corpus, ranked by
agreement between the word and phone segmentations.
Acoustic & Language Modeling
Acoustic modeling
Acoustic modeling is done using PHICOS. The HMMs have a
left-to-right topology (see Figure 1).
Figure 1. Typical phone model topology in the Phicos architecture
The states are clustered in segments. All states within one segment share the same
Gaussian mixture PDF. Transitions between states are always one of the following types:
- SELF: a self loop
- NEXT: to the next state
- SKIP: skip one state
In order to enforce duration modeling, transition penalties are associated
with every type of transition. NEXT is typically the cheapest.
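The effect of transition penalties on the best path can be illustrated with a small Viterbi search over a left-to-right model. This is a sketch under stated assumptions: penalties are treated as additive costs in the negative log domain, the penalty values are invented (the text only says that NEXT is typically the cheapest), the path is forced to start in the first state and end in the last, and the tying of states into segments is not modelled.

```python
import math

# Hypothetical penalty values; only their ordering follows the text.
PENALTY = {"SELF": 2.0, "NEXT": 1.0, "SKIP": 3.0}
# Number of states each transition type advances.
STEP = {"SELF": 0, "NEXT": 1, "SKIP": 2}


def viterbi_cost(emission_cost, penalty=PENALTY):
    """Cost of the cheapest state path through a left-to-right HMM.

    emission_cost[t][s] is the cost (negative log likelihood) of
    observing frame t in state s.
    """
    num_states = len(emission_cost[0])
    best = [math.inf] * num_states
    best[0] = emission_cost[0][0]  # the path starts in the first state
    for frame in emission_cost[1:]:
        new = [math.inf] * num_states
        for s in range(num_states):
            for kind, step in STEP.items():
                prev = s - step
                if prev >= 0:
                    cand = best[prev] + penalty[kind] + frame[s]
                    new[s] = min(new[s], cand)
        best = new
    return best[-1]  # the path ends in the last state
```

With three states and uniform emission costs, three frames are matched most cheaply by two NEXT transitions; a shorter or longer utterance has to pay for a SKIP or a SELF transition, which is how the penalties influence duration.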
We have run experiments with both phone model-based and word model-based
lexicon entries.
Language modeling
The language model is a unigram model, with equal probabilities for each
digit. We did not use a zerogram model because one digit, 'zeven' (English:
'seven'), has two pronunciation variants (/z2:v@n/ and /ze:v@n/). The
probabilities of the pronunciation variants must add up to the total
probability of the canonical form.
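The constraint on the variant probabilities can be made concrete with a short sketch. The ten Dutch digit words and the equal split between the two variants of 'zeven' are assumptions; the text only requires that the variant probabilities sum to the probability of the canonical form. The function name is hypothetical.

```python
# Assumed digit vocabulary (Dutch words for 0..9).
DIGITS = ["nul", "een", "twee", "drie", "vier",
          "vijf", "zes", "zeven", "acht", "negen"]
# Pronunciation variants named in the text.
VARIANTS = {"zeven": ["z2:v@n", "ze:v@n"]}


def variant_probabilities(words, variants):
    """Uniform unigram over words, shared among each word's variants."""
    word_prob = 1.0 / len(words)
    table = {}
    for word in words:
        # Words without listed variants get one entry, keyed by the word.
        prons = variants.get(word, [word])
        for pron in prons:
            table[(word, pron)] = word_prob / len(prons)
    return table
```

Under these assumptions every digit keeps a probability of 0.1, and each variant of 'zeven' receives 0.05, so the presence of a second variant does not make 'zeven' more likely than the other digits.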
Development
Corpus
Two corpora are used for development purposes:
- dev.corpus
- dev.first500.corpus
dev.corpus consists of 1/3 of all material that was not selected for
trainmatch.corpus; however, it does not contain the utterances from CASIMIR.
It contains 9355 utterances.
dev.first500.corpus contains a random selection of dev.corpus.
Hidden tests
Corpora
- test.corpus
- test.first10000.corpus
test.corpus contains the remaining material, including CASIMIR.
Recognition results acquired from this corpus have never been analysed; the
corpus is used for blind tests only.
test.first10000.corpus contains a random selection of 10,000
utterances from test.corpus. Most of our research experiments
are performed on this corpus.