Call for Papers: Motivation and Background
Despite the wide experience gained in compiling written language corpora, working with spoken language data is far from straightforward, as spoken language involves many novel aspects that need to be addressed. The transience of spoken language is sometimes offered as an explanation for why spoken data are more difficult to collect than written data. However, capturing the data is not the only step that is anything but trivial. Once the (audio) data have been collected and stored, the next step is to produce some kind of transcript (whether orthographic or phonetic). Further annotations such as POS tagging, lemmatisation, syntactic annotation, and prosodic annotation may then build upon this transcription. Among the problems encountered in the processing of spoken language data are the following:

  • There is as yet little experience with the large-scale transcription of spoken language data. Procedures and guidelines must be developed, and tools implemented.

  • Well-established practices that originated in work on written language corpora do not hold up when trying to cope with the idiosyncrasies of spoken language. This is true for all levels of linguistic annotation. Annotation schemes need to be reconsidered and tools must be adapted.

  • In so far as standards have emerged (e.g. CES), they need to be adapted to cater for the needs of spoken language corpora.

  • By their very nature, spoken language corpora bring together speech and language technologists and linguists from various backgrounds. Ideally, such corpora should address the needs of all these different user groups. Often, however, there is a conflict of interest. For example, recordings of spontaneous conversations in noisy environments, although highly interesting and worthwhile from a linguistic perspective, will prove too poor in quality to be of any use to someone doing research into speech recognition.
