Typology and applications of spoken corpora

Typology of spoken corpora

Phonetic and phonological inventories

Segment inventories of the world's languages, extracted from published descriptions.

They may include samples of the audio signal.

Corpora and databases for the phonetic description of a language

Corpora for phonetic description

Corpora prepared ad hoc.

Corpora for comparative phonetic description

Equivalent materials for each language.

Corpora for both the phonetic description of a language and technological applications

Corpora and databases for speech technology applications

Corpora designed specifically for developing applications in the field of speech technology.

Corpora for general technological applications

Development of text-to-speech systems

In general, a single speaker or a small number of speakers.

Professional speakers: voice talents.

Susan Bennett

Studio recording.

Read speech.

Phonetic coverage at both the segmental and the suprasegmental level.

Corpus aligned and phonetically labelled according to the synthesis units.

May include a pronunciation dictionary (pronunciation lexicon).
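
One common way to obtain segmental coverage with few prompts is to choose sentences greedily, each time taking the one that adds the most units not yet covered. The sketch below assumes diphones as the synthesis unit and uses toy, hand-written transcriptions; the names `diphones` and `greedy_select` are illustrative and do not come from any corpus described here.

```python
def diphones(phones):
    """Return the set of adjacent phone pairs in a phone sequence."""
    return {(a, b) for a, b in zip(phones, phones[1:])}


def greedy_select(candidates, max_sentences):
    """Pick sentences that each add the most not-yet-covered diphones.

    candidates: dict mapping a sentence id to its list of phone symbols.
    """
    covered, chosen = set(), []
    remaining = dict(candidates)
    while remaining and len(chosen) < max_sentences:
        # Sentence whose diphones add the largest number of new pairs.
        best = max(remaining, key=lambda s: len(diphones(remaining[s]) - covered))
        gain = diphones(remaining[best]) - covered
        if not gain:
            break  # nothing left to gain from any candidate
        covered |= gain
        chosen.append(best)
        del remaining[best]
    return chosen, covered


# Toy transcriptions, for illustration only (assumption, not real corpus data).
candidates = {
    "s1": ["k", "a", "s", "a"],
    "s2": ["m", "e", "s", "a"],
    "s3": ["k", "a", "s", "a", "s", "e"],
}
chosen, covered = greedy_select(candidates, max_sentences=2)
```

Here `s3` is chosen first because it contributes four distinct diphones, and `s2` second because `s1` adds nothing new after that.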

Spanish TTS Speech Corpus

“The Spanish TTS Speech Corpus contains the recordings of 1 native Spanish speaker (male, 28 years old) recorded in a studio over 1 channel (Shure SM15 unidirectional professional head-worn condenser microphone). The data collection and transcription were performed by Appen (Australia).
Speech samples are stored as sequences of 16-bit 22.05 kHz PCM in uncompressed WAV files.
The speaker read 1,787 prompted sentences covering all legal triphones and diphones.
The database is provided with orthographic transcriptions in SAMPA, including canonical and alternative pronunciation, and syllable, stress and acoustic events markings. All transcriptions were segmented at the utterance (sentence/command word) level, annotated at the word level and checked manually. A pronunciation lexicon including 3,748 headwords (plus variants) is also available.”

ELRA-S0150: Spanish TTS Speech Corpus (Appen). (n.d.). ELRA, European Language Resources Association. Retrieved from http://catalog.elra.info/product_info.php?products_id=3
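
A storage format such as the one quoted in this description (one channel, 16-bit, 22.05 kHz, uncompressed WAV) can be inspected with Python's standard `wave` module. This is a minimal sketch: the helper names `wav_format` and `bytes_per_second` are illustrative, and the file is one second of synthetic silence built in memory, not data from the corpus.

```python
import io
import wave


def wav_format(fileobj):
    """Return (channels, bits per sample, sample rate in Hz) of a PCM WAV file."""
    with wave.open(fileobj, "rb") as w:
        return w.getnchannels(), w.getsampwidth() * 8, w.getframerate()


def bytes_per_second(channels, bits, rate):
    """Raw data rate of uncompressed PCM audio."""
    return channels * (bits // 8) * rate


# Build a one-second silent file in memory with the quoted format.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)       # 1 channel
    w.setsampwidth(2)       # 2 bytes = 16 bits per sample
    w.setframerate(22050)   # 22.05 kHz
    w.writeframes(b"\x00\x00" * 22050)
buf.seek(0)

fmt = wav_format(buf)
rate = bytes_per_second(*fmt)
```

At this format each second of speech takes 44,100 bytes, which is useful when estimating the disk space a recording campaign will need.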

Development of automatic recognition systems

Large number of speakers.

Corpus representative of the varieties of the language.

Corpus representative of the characteristics of the speakers.

Speech styles chosen according to the application.

Recording environment and channel chosen according to the application.

Phonetic coverage at the segmental level, according to the recognition units.

Corpus aligned and phonetically transcribed.

May include a pronunciation dictionary (pronunciation lexicon).

The corpus is divided into one part used for training and another part used for evaluating the system.
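
The division into training and evaluation parts can be sketched as follows. The sketch assumes, as is common practice but not stated above, that the split is made by speaker so that nobody appears in both parts; `split_by_speaker`, the 20% test fraction and the synthetic speaker and utterance ids are all illustrative.

```python
import random


def split_by_speaker(utterances, test_fraction=0.2, seed=0):
    """Split (speaker_id, utterance_id) pairs so that no speaker is in both parts."""
    speakers = sorted({spk for spk, _ in utterances})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [u for u in utterances if u[0] not in test_speakers]
    test = [u for u in utterances if u[0] in test_speakers]
    return train, test


# Synthetic example: 10 speakers with 3 utterances each.
utterances = [(f"spk{s:02d}", f"utt{u}") for s in range(10) for u in range(3)]
train, test = split_by_speaker(utterances)
```

Splitting by speaker rather than by utterance means the evaluation measures performance on voices the system has never heard during training.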

Development of automatic speech recognition systems

Spanish SpeechDat-Car

“The Spanish SpeechDat-Car database contains the recordings of 306 Spanish speakers from 4 different regions (156 males, 150 females), recorded over the Spanish GSM telephone network, and in a car. This database is partitioned into 89 CDs (DVDs are also available).
The speech data files are in two formats. Four of the 5 microphones were recorded on the computer in the boot of the car. The speech data are stored as sequences of 16 kHz, 16 bit and uncompressed. The fifth microphone was connected to the cell phone, and was recorded on a remote machine. The data are stored as sequences of 8 kHz 8 bit A-law. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SpeechDat-Car format and content specifications.
Each speaker uttered the following items:
2 voice activation keywords; 1 sequence of 10 isolated digits; 7 connected digits (1 sheet number -5 digits, 1 spontaneous telephone number, 3 read telephone numbers, 1 credit card number -14/16 digits, 1 PIN code -6 digits); 3 dates (1 spontaneous date e.g. birthday, 1 prompted date, 1 relative or general date expression); 2 word spotting phrases using an embedded application word; 4 isolated digits; 7 spelled words (1 spontaneous e.g. own forename or surname, 1 directory city name, 4 real word/name, 1 artificial name for coverage); 1 money amount; 1 natural number; 7 directory assistance names (1 spontaneous e.g. own forename or surname, 1 city of birth/growing up, 2 most frequent cities, 2 most frequent company/agency, 1 forename/surname); 9 phonetically rich sentences; 2 time phrases (1 spontaneous time of day, 1 word style time phrase); 4 phonetically rich words; 67 application words (13 mobile phone application words, 22 IVR function keywords, 32 car products keywords); 2 additional language dependent keywords; Prompts for spontaneous speech.
The following age distribution has been obtained: 160 speakers are between 18 and 30, 80 speakers are between 31 and 45, 65 speakers are between 46 and 60, and 1 speaker is over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.”

ELRA-S0140: Spanish SpeechDat-Car database. (n.d.). ELRA, European Language Resources Association. Retrieved from http://catalog.elra.info/product_info.php?cPath=37_39&products_id=690

Development of automatic speaker identification and verification systems

Ahumada speech corpus

“The Ahumada I spontaneous telephone speech contains telephone and microphone recordings from 100 male speakers from the “Guardia Civil”.
Speakers were asked to: a) read 24 isolated digits; b) read 10 digit strings consisting of 10 digits each; c) read 10 phonologically and syllabically balanced utterances of 8-12 word length; d) read 1 phonologically and syllabically balanced text, of about 180 words (more than 1 minute of duration) at a normal speaking rate; e) read the previous fixed text twice, the first time fastly and the second time slowly; f) read a specific text, different from speaker to speaker and from session to session; g) describe whatever they wanted (avoiding long pauses and hesitations), which results in about 1 minute (or more) of spontaneous speech.
A subcorpus of the Ahumada I corpus was used in NIST Speaker Recognition Evaluations in 2000 and 2001.”

ELRA-U-S0224: Ahumada speech corpus. (1998). ELRA, European Language Resources Association. Retrieved from http://universal.elra.info/product_info.php?products_id=2168

SIVA, Speaker Identification and Verification Archives

“The Italian speech database SIVA (Speaker Identification and Verification Archives: SIVA), is a database comprising more than two thousands calls, collected over the public switched telephone network.
The SIVA database consists of four speaker categories: male users, female users, male impostors, female impostors. Speakers were contacted via mail before the test, and they were asked to read the information and the instructions provided carefully before making the call. About 500 speakers were recruited using a company specialized in selection of population samples. The others were volunteers contacted by the institute concerned.
Speakers access the recording system by calling a toll free number. An automatic answering system guides them through the three sessions that make up a recording. In the first session, a list of 28 words (including digits and some commands) is recorded using a standard enumerated prompt. The second session is a simple unidirectional dialogue (the caller answers prompted questions) where personal information is asked (name, age, etc.). In the third session, the speaker is asked to read a continuous passage of phonetically balanced text that resembles a short curriculum vitae.
The signal is a standard 8kHz sampled signal, coded using 8 bits mu-law format. The data collected so far consists of: MU: male users 18 speakers, 20 repetitions; FU: female users 16 speakers, 26 repetitions; MI: male impostors: 189 speakers, 2 repetitions, and 128 speakers, 1 repetition; FI: female impostors: 213 speakers, 2 repetitions, and 107 speakers, 1 repetition.”

ELRA-S0028: The “SIVA” Speech Database for Speaker Verification and Identification. (n.d.). ELRA, European Language Resources Association. Retrieved from http://catalog.elra.info/product_info.php?products_id=77

Development of automatic language identification systems

Multilingual Corpus for Language Identification

“This multilingual corpus was designed to enable the development and testing of algorithms for automatic language identification. It contains speech from 250 natives speakers of each language calling a data collection system from their home country via a toll-free number, as well as 50 native speakers of each language calling from within France (or from Germany, Spain, or the United Kingdom). Types of data : general questions concerning the call and the caller, series of items containing pre-defined texts to read and fixed prompts, set of questions aimed at obtaining spontaneous speech. It contains over 300 calls for each language. 70 hours of data.”

ELRA-ST37: Multilingual Corpus for Language Identification. (n.d.). ELRA, European Language Resources Association. Retrieved from http://universal.elra.info/product_info.php?products_id=201

Development of dialogue systems

Corpora of interactions between people.

In general, the content of the corpus matches the purpose of the dialogue system.

Corpora of simulated interactions between people and dialogue systems, collected using the Wizard of Oz protocol.

DIHANA Corpus


“This is a spontaneous-speech dialogue corpus acquired using the Wizard of Oz technique. The task consisted of the retrieval information about Spanish nationwide trains by telephone. 300 different scenarios have been defined. Each scenario contains an objective, a situation and the specific requirements of the travel.
In total 225 speakers recorded 900 dialogues with 6,278 user turns and 48,243 words. The training corpus contains 720 dialogues recorded by 180 speakers and the test corpus consists of 135 dialogues recorded by 45 speakers.
Spontaneous-speech events were labeled from acoustic, lexical and syntactic points of view. 499 lexical and 545 syntactical event were annotated.
The corpus acquisition architecture is composed of an audio server, an automatic speech recognition server, a speech understanding server, a Wizard of Oz server, a dialogue manager server, an oral answer generation server, a speech-to-text conversion server and a communications management client. Finally, each speaker read 16 sentences (8 referred to the task and 8 were phonetically balanced sentences).
The entire corpus consists of 3,600 sentences in total, for 10.8 hours of human voice recorded.”

ELRA-ST79: DIHANA Corpus. (n.d.). ELRA, European Language Resources Association. Retrieved from http://universal.elra.info/product_info.php?products_id=1421

Development of automatic speech translation systems

Three types of corpus:

TC-STAR Spanish Training Corpora for ASR

“This corpus consists of the recordings of around 283 hours from EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European Spanish (a mixture of native and non-native Spanish), 62 hours of which were annotated (transcribed) within the project (the transcriptions are not provided in the present package but will be made available soon). These recordings were obtained from Europe by Satellite (http://europa.eu.it/comm/ebs) from May 2004 until May 2006.”

ELRA-S0252: TC-STAR Spanish Training Corpora for ASR: Recordings of EPPS Speech. (n.d.). ELRA, European Language Resources Association. Retrieved from http://catalog.elra.info/product_info.php?products_id=1036

TC-STAR English-Spanish Training Corpora for Machine Translation

“This corpus consists of respectively 34 million (English) and 38 million (Spanish) running words of bilingual sentence segmented and aligned texts in English and Spanish obtained from the Final Text Editions provided by the European Parliament (http://www.europarl.europa.eu) from April 1996 to Sept. 2004, Dec. 2004 to May 2005, and Dec. 2005 to May 2006. The data is accompanied by tools for further preprocessing.”

ELRA-S0250: TC-STAR English-Spanish Training Corpora for Machine Translation: Aligned final text editions of EPPS. (n.d.). ELRA, European Language Resources Association. Retrieved from http://catalog.elra.info/product_info.php?products_id=1033

TC-STAR Spanish Baseline Male Speech Database

“It contains the recordings of one male Spanish speaker recorded simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal in a noise-reduced room. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems). This database is distributed on 9 DVDs. The database complies with the common specifications created in the TC-STAR project.
The annotation of the database includes manual orthographic transcriptions, the automatic segmentation into phonemes and automatic generation of pitch marks. A certain percentage of phonetic segments and pitch marks has been manually checked. A pronunciation lexicon in SAMPA with POS, lemma and phonetic transcription of all the words prompted and spoken is also provided.
Speech samples are stored as sequences of 24-bit 96 kHz with the least significant byte first (“lohi” or Intel format) as (signed) integers. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.”

ELRA-S0310: TC-STAR Spanish Baseline Male Speech Database. (n.d.). ELRA, European Language Resources Association. Retrieved from http://catalog.elra.info/product_info.php?products_id=1132

LC-STAR Spanish phonetic lexicon

“The lexicon comprises more than 100,000 words, distributed over three categories:
- a set of 55,854 common word entries. This set is extracted from a corpus of more than 37 million words distributed over 6 different domains (sports/games, news, finance, culture/entertainment, consumer information, personal communications). This was done with the aim of reaching a target for each domain of at least 95% self coverage. In addition to extracting word lists from the corpus, a list of closed set (function) word classes are included in the final word list.
- a set of 45,403 proper names (including person names, family names, cities, streets, companies and brand names) divided into 3 domains. Multiple word names such as New_York are kept together in all three domains, and they count as one entry. The 3 domains consist of first and last names (23,114 different entries), place names (15,427 different entries), and organisations (7,777 different entries).
- and a list of 7,498 special application words translated from English terms defined by the LC-STAR consortium. This list contains: numbers, letters, abbreviations and specific vocabulary for applications controlled by voice (information retrieval, controlling of consumer devices, etc.).
The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA. The database is stored on 1 CD.”

ELRA-S0208: LC-STAR Spanish phonetic lexicon. (n.d.). ELRA, European Language Resources Association. Retrieved from http://catalog.elra.info/product_info.php?products_id=833
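
The notion of self coverage mentioned in this description can be illustrated with a short sketch. It assumes that self coverage means the share of running words in a corpus that are covered by a word list extracted from that same corpus; the function names and the toy sentence are illustrative and do not reproduce the LC-STAR procedure.

```python
from collections import Counter


def self_coverage(tokens, lexicon):
    """Fraction of running words in `tokens` that appear in `lexicon`."""
    covered = sum(1 for t in tokens if t in lexicon)
    return covered / len(tokens)


def smallest_lexicon(tokens, target):
    """Most frequent word types needed to reach `target` coverage of `tokens`."""
    counts = Counter(tokens)
    lexicon, covered = [], 0
    for word, n in counts.most_common():
        if covered / len(tokens) >= target:
            break
        lexicon.append(word)
        covered += n
    return lexicon


# Toy corpus of 11 running words (assumption, for illustration only).
tokens = "el tren sale de madrid y el tren llega a sevilla".split()
lexicon = smallest_lexicon(tokens, target=0.5)
coverage = self_coverage(tokens, set(lexicon))
```

Adding word types in order of frequency reaches a given coverage target with the smallest possible list, which is why frequency lists are the usual starting point for a lexicon of this kind.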

Corpora for specific technological applications

The content of the collected corpus focuses on the specific needs of a given application.

The HIWIRE database

“The database contains 8,099 English utterances pronounced by non-native speakers (31 French, 20 Greek, 20 Italian, and 10 Spanish speakers). The collected utterances correspond to human input in a command and control aeronautics application. The data was recorded in studio with a close-talking microphone and real noise recorded in an airplane cockpit was artificially added to the data.”

ELRA-S0293: The HIWIRE database, a noisy and non-native English speech corpus for cockpit communication. (2007). ELRA, European Language Resources Association. Retrieved from http://catalog.elra.info/product_info.php?products_id=1088

Cross Towns Database

“The Cross Towns corpus is a non native corpus that covers many language directions (24). For each of these directions the recording contains twice 45 city names per speaker: names are read from a prompt and names are repeated after listening to them via headphone.”

ELRA-U-S 0160: Cross Towns Database. (2006). ELRA, European Language Resources Association. Retrieved from http://universal.elra.info/product_info.php?cPath=37_39&products_id=1930

Alcohol Language Corpus

“ALC contains recordings of German speakers that are either intoxicated or sober. The type of speech ranges from read single digits to full conversation style. Recordings were done during drinking test where speakers drank beer or wine to reach a self-chosen level of alcoholic intoxication. The actual level of intoxication was measured by breath alcohol and blood samples taken immediately before the speech recording. Recordings were performed in two standing automobiles to ensure a constant acoustic environment across the different recording locations; both, the intoxicated and sober condition recording were done in the same car and supervised by the same investigator (dialogue partner). In the intoxicated state 30 items were sampled from each speaker (set A), while in the sober state 60 items were recorded (set NA; set A being a subset of set NA).
Preliminary version of 25/03/2009: number of recorded speakers: 88 (final: 150); number of recordings: 8586; number of phonetic segments: 709220.”

ELRA-S0299: Alcohol Language Corpus (BAS ALC). (2009). ELRA, European Language Resources Association. Retrieved from http://catalog.elra.info/product_info.php?cPath=37_39&products_id=1097

Corpora of spoken language

Corpora for the pragmatic analysis of discourse and conversation

Linguistic content

Informal styles.

Free conversation.

Spontaneous discourse.

Stylistic variation.

Data collection

Interview in the informant's own environment.

Recording without the informant's prior knowledge.

Corpora for sociolinguistic study

Linguistic content

Variation in speech styles.

Questionnaires on specific pragmatic phenomena (for example, forms of address).

Questionnaires on specific sociolinguistic phenomena (language attitudes).

Selection of informants

Sociolinguistic criteria for selecting informants.

Stratified sampling according to the variables defined in the study design.
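
Stratified sampling of informants can be sketched as follows. The sketch assumes two design variables, sex and age group, and an equal quota per stratum; the variable names, the age bands and the synthetic pool of informants are illustrative choices, not taken from any particular study.

```python
import random
from collections import defaultdict


def stratified_sample(informants, keys, per_stratum, seed=0):
    """Draw up to `per_stratum` informants from every combination of `keys`."""
    strata = defaultdict(list)
    for person in informants:
        strata[tuple(person[k] for k in keys)].append(person)
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Synthetic pool: 5 candidates in each of the 2 x 3 strata.
informants = []
for sex in ("F", "M"):
    for age in ("18-34", "35-54", "55+"):
        for _ in range(5):
            informants.append({"id": len(informants), "sex": sex, "age": age})

sample = stratified_sample(informants, keys=("sex", "age"), per_stratum=2)
```

Drawing a fixed quota from every cell guarantees that each combination of the design variables is represented, which simple random sampling from the whole pool does not.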

Data collection

Interview in the informant's own environment.

Elicitation of words through questions asked by the interviewer.

Explanation of and commentary on texts read aloud.

Corpora for the study of geographical variation

Selection of informants

Dialectological criteria for selecting informants.

Data collection

Questionnaires designed specifically to elicit particular lexical items, in the case of linguistic atlases.

Reference corpora

They aim to reflect spoken language use, taking various factors of variation into account.

Proportional design, according to the importance assigned to each factor.
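
A proportional design can be turned into concrete quotas by sharing the planned corpus size among the categories of a factor according to their weights. The sketch below uses the largest-remainder method so that the quotas add up exactly to the total; the three speech styles and their weights are hypothetical and are not taken from any real reference corpus.

```python
def proportional_allocation(weights, total):
    """Share `total` units among categories in proportion to `weights`.

    Uses the largest-remainder method so the shares add up to `total` exactly.
    """
    weight_sum = sum(weights.values())
    exact = {k: total * w / weight_sum for k, w in weights.items()}
    shares = {k: int(v) for k, v in exact.items()}  # round down first
    leftover = total - sum(shares.values())
    # Hand the remaining units to the categories with the largest fractions.
    by_remainder = sorted(exact, key=lambda k: exact[k] - shares[k], reverse=True)
    for k in by_remainder[:leftover]:
        shares[k] += 1
    return shares


# Hypothetical weights for three speech styles (assumption).
weights = {"conversation": 5, "interview": 3, "broadcast": 2}
shares = proportional_allocation(weights, total=1000)
```

With these weights, a corpus of 1,000 units is divided into 500, 300 and 200 units respectively.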

Applications of spoken corpora

Description of spoken language

Spoken standard.

Geographical, social and stylistic varieties.

Studies of pragmatics in spoken language.

Language change.

Phonetic description

Description of segmental elements.

Description of suprasegmental elements.

Applied phonetics

Contrastive phonetics.

Speech production.

Speech acquisition.

Second language acquisition.

Speech pathologies.

Speech technologies

Text-to-speech conversion

Obtaining phonetic and linguistic knowledge for text-to-speech conversion.

Speech recognition

Training of recognition systems.

Evaluation of the recognition system.

Dialogue systems

Design and evaluation of human-machine dialogue systems, including the automatic translation of telephone conversations.

System design.

System evaluation.

Development of language resources

Pronunciation dictionaries.

Dictionaries of spoken language.

Computer-assisted language teaching.

Development of language resources for the documentation and teaching of minoritized languages.

Language resources and minoritized languages

Spoken corpora and minoritized languages

Joaquim Llisterri, Departament de Filologia Espanyola, Universitat Autònoma de Barcelona
