A2RT Connected Digit Recognition Experiments

Abstract

This document describes the corpora used for the Connected Digit Recognition research within the research group A2RT. Three speech databases containing Dutch connected digit strings are involved: Polyphone, SESP and Casimir. All these corpora contain telephone speech recorded in a wide variety of acoustic conditions.

Databases

SESP

SESP is a Dutch connected digit database that was originally meant for Speaker Verification. The format of the speech files is A-law.

POLYSPDAT

The first part of this name refers to the Dutch Polyphone speech database; POLYSPDAT contains the digit strings of this corpus. The second part of the name refers to SpeechDat: we chose the SpeechDat database format to work with.
All noise tokens in the Polyphone transcription were replaced by a single noise annotation.
The format of the speech files is A-law.

CASIMIR

-no detailed information available-
The format of the speech files is A-law.
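
Since all three corpora store speech as A-law, the samples have to be expanded to linear PCM before any signal analysis. The following is a minimal sketch of the standard ITU-T G.711 A-law expansion in Python. The function names are illustrative and not part of PHICOS or the corpus tools, and the input is assumed to be raw, headerless A-law bytes.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a signed 16-bit-scale linear sample (G.711)."""
    code ^= 0x55                      # undo the even-bit inversion used by A-law
    mantissa = (code & 0x0F) << 4     # 4-bit mantissa, shifted into place
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        value = mantissa + 8
    else:
        value = (mantissa + 0x108) << (segment - 1)
    # In A-law, a set sign bit denotes a positive sample.
    return value if (code & 0x80) else -value


def decode_alaw(data: bytes) -> list:
    """Decode a buffer of raw A-law bytes into a list of linear samples."""
    return [alaw_to_linear(b) for b in data]
```

For example, decode_alaw(open(path, "rb").read()) would return the samples of one file, where path is a hypothetical file name.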

Acoustic Feature Extraction

Acoustic preprocessing is done with the research version of the PHIlips COntinuous Speech recognizer, PHICOS. The signal in the sound files is split into 16 ms frames with a 10 ms frame shift. Each frame is multiplied by a Hamming window and passed through a filter bank analysis, resulting in 14 Mel-scale log-energy coefficients per frame.
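
The framing step can be sketched as follows. This is not the PHICOS implementation, only an illustration of 16 ms frames with a 10 ms shift and a Hamming window. The sampling rate of 8 kHz is an assumption (it is the usual rate for A-law telephone speech but is not stated here), and the function names are hypothetical.

```python
import math


def hamming(n: int) -> list:
    """Standard Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(samples, sample_rate=8000, frame_ms=16, shift_ms=10):
    """Cut a signal into overlapping, Hamming-windowed frames.

    sample_rate=8000 is an assumption; frame_ms and shift_ms follow the text.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000   # 128 samples at 8 kHz
    shift = sample_rate * shift_ms // 1000       # 80 samples at 8 kHz
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

Under these assumptions, consecutive frames overlap by 6 ms (48 samples).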

Next, 14 Mel-frequency cepstral coefficients (c0 - c13) and their first-order derivatives are computed. Together, these 28 features form the feature vector of each frame.
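
As an illustration of how 14 log energies become a 28-dimensional feature vector, the sketch below applies an unnormalised DCT-II to obtain c0 - c13 and then appends a simple central-difference derivative. Both choices are assumptions: the exact cepstral transform and derivative formula used by PHICOS are not given here, and the function names are hypothetical.

```python
import math


def dct_cepstra(log_energies):
    """Unnormalised DCT-II of one frame of log filter-bank energies."""
    n = len(log_energies)
    return [
        sum(e * math.cos(math.pi * k * (j + 0.5) / n)
            for j, e in enumerate(log_energies))
        for k in range(n)
    ]


def add_deltas(cepstra_seq):
    """Append first-order derivatives (central difference, edges replicated)."""
    last = len(cepstra_seq) - 1
    features = []
    for t, cep in enumerate(cepstra_seq):
        prev_frame = cepstra_seq[max(t - 1, 0)]
        next_frame = cepstra_seq[min(t + 1, last)]
        delta = [(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)]
        features.append(cep + delta)   # 14 static + 14 delta = 28 values
    return features
```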

Training

Corpus

Acoustic models are trained on one of the following two training corpora:
trainmatch.corpus consists of utterances taken from SESP and POLYSPDAT. First we made two segmentations: one using a forced alignment with general-purpose phone models and the other using digit word models. The degree of agreement between these segmentations determines the primary sorting order of all utterances. Further selection criteria were also applied, and selection rules were used to obtain a uniform distribution of the frequency of each digit. A set of 9922 utterances satisfied these criteria.
One of the first generations of the CDR models was tested on the trainmatch corpus. Error analyses showed that 169 utterances contained transcription errors or were falsely accepted by the SNR tool. These were manually removed and the remaining 9753 utterances are used for almost every experiment.

trainbestmatch.corpus contains the 1000 best-ranked utterances of trainmatch.corpus, where the ranking is by agreement between the word and phone segmentations.
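
The ranking idea can be sketched as below. The agreement measure is not specified here, so this sketch assumes a hypothetical one: the mean absolute difference between the word boundaries implied by the two segmentations, with smaller values meaning better agreement. The field names "id", "phone_seg" and "word_seg" are likewise illustrative.

```python
def boundary_disagreement(bounds_a, bounds_b):
    """Mean absolute difference between two lists of word boundaries (seconds).

    Hypothetical measure: segmentations that cannot be compared
    (different or zero number of boundaries) rank last.
    """
    if len(bounds_a) != len(bounds_b) or not bounds_a:
        return float("inf")
    return sum(abs(a - b) for a, b in zip(bounds_a, bounds_b)) / len(bounds_a)


def select_best(utterances, n):
    """Return the ids of the n utterances whose segmentations agree best."""
    ranked = sorted(
        utterances,
        key=lambda u: boundary_disagreement(u["phone_seg"], u["word_seg"]),
    )
    return [u["id"] for u in ranked[:n]]
```

Calling select_best(utterances, 1000) on the full trainmatch list would then correspond to the selection described above, under the assumed measure.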

Acoustic & Language Modeling

Acoustic modeling
Acoustic modeling is done using PHICOS. The HMMs have a left-to-right topology (see Figure 1).

Figure 1. Typical phone model topology in the Phicos architecture

The states are clustered in segments. All states within one segment share the same Gaussian mixture PDF. Transitions between states are always one of the following types:
SELF: a self loop
NEXT: to the next state
SKIP: skip one state
In order to enforce duration modeling, transition penalties are associated with every type of transition. NEXT is typically the cheapest.
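
To illustrate how the three transition types and their penalties interact, the sketch below runs a minimum-cost (Viterbi-style) search through a left-to-right model. It is a simplified illustration, not the PHICOS decoder: the penalty values are hypothetical, costs are treated as negative log scores, and the path is assumed to start in the first state and end in the last.

```python
INF = float("inf")

# Hypothetical penalties; the text only states that NEXT is typically cheapest.
PENALTY = {"SELF": 2.0, "NEXT": 1.0, "SKIP": 3.0}
# Number of states each transition type advances.
STEP = {"SELF": 0, "NEXT": 1, "SKIP": 2}


def best_path_cost(emission_cost, penalty=PENALTY):
    """Minimum total cost of aligning frames to a left-to-right state chain.

    emission_cost[t][s] is the cost of observing frame t in state s.
    Returns INF if the last state cannot be reached.
    """
    n_states = len(emission_cost[0])
    cost = [INF] * n_states
    cost[0] = emission_cost[0][0]            # paths must start in state 0
    for frame in emission_cost[1:]:
        new_cost = [INF] * n_states
        for s in range(n_states):
            for kind, step in STEP.items():
                p = s - step                 # predecessor state for this type
                if p >= 0 and cost[p] < INF:
                    cand = cost[p] + penalty[kind] + frame[s]
                    if cand < new_cost[s]:
                        new_cost[s] = cand
        cost = new_cost
    return cost[-1]                          # paths must end in the last state
```

With these penalties, a three-state model aligned to three frames prefers two NEXT transitions, while shorter or longer stretches pay the higher SKIP or SELF penalties, which is how the penalties shape duration.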

We have run experiments with both phone-model-based and word-model-based lexicon entries.

Language modeling
The language model is a unigram model with equal probabilities for each digit. We did not use a zerogram, because one digit, 'zeven' (English: 'seven'), has two pronunciation variants (/z2:v@n/ and /ze:v@n/). The probabilities of the pronunciation variants must add up to the total probability of the canonical form.
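
The constraint on pronunciation variants can be made concrete with a small sketch. Two things are assumptions here: the vocabulary is taken to be the ten Dutch digits 'nul' to 'negen', and the probability of a digit is split evenly over its variants. Digits other than 'zeven' are given a single placeholder variant rather than a real transcription.

```python
# Assumed vocabulary: the ten Dutch digits. Only the two variants of
# 'zeven' come from the text; "canonical" is a placeholder label.
LEXICON = {
    "nul": ["canonical"], "een": ["canonical"], "twee": ["canonical"],
    "drie": ["canonical"], "vier": ["canonical"], "vijf": ["canonical"],
    "zes": ["canonical"], "zeven": ["z2:v@n", "ze:v@n"],
    "acht": ["canonical"], "negen": ["canonical"],
}


def variant_probabilities(lexicon):
    """Equal unigram probability per digit, split evenly over its variants."""
    word_prob = 1.0 / len(lexicon)
    probs = {}
    for word, variants in lexicon.items():
        for variant in variants:
            # Even split is an assumption; any split summing to word_prob works.
            probs[(word, variant)] = word_prob / len(variants)
    return probs
```

Under these assumptions each digit has probability 0.1, and each variant of 'zeven' gets 0.05, so that the two variants together carry the same probability as any other digit.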

Development

Corpus

Two corpora are used for development purposes:
dev.corpus consists of 1/3 of all material that was not selected for trainmatch.corpus; it does not, however, contain utterances from CASIMIR. It contains 9355 utterances.
dev.first500.corpus contains a random selection of dev.corpus.

Hidden tests

Corpora

test.corpus contains the remaining material, including CASIMIR. Recognition results obtained on this corpus have never been analysed; the corpus is used only for blind tests.
test.first10000.corpus contains a random selection of 10,000 utterances from test.corpus. Most of our research experiments are performed on this corpus.
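
Random subsets such as dev.first500.corpus and test.first10000.corpus can be drawn reproducibly with a fixed seed. The sketch below shows one way to do this in Python; it is not the procedure actually used for these corpora, and the function name and seed value are hypothetical.

```python
import random


def select_random_subset(utterance_ids, n, seed=0):
    """Draw n distinct utterance ids at random, reproducibly.

    The ids are sorted first so that the result depends only on the
    set of ids and the seed, not on the input order.
    """
    rng = random.Random(seed)          # private generator, fixed seed
    return rng.sample(sorted(utterance_ids), n)
```

A call such as select_random_subset(test_ids, 10000) would yield a fixed subset of test_ids, where test_ids is a hypothetical list of utterance identifiers.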