README for UCAM bilingual database

The corpus contains the speech of four male non-native speakers of English. For the European languages (French, Italian and Dutch), the utterances were selected from the Europarl corpus of parallel text from European Parliament proceedings. For the Mandarin speaker, the utterances are a subset of the NIST 2008 Chinese-English MT evaluation parallel texts. Each speaker recorded the utterances in his native language as well as their parallel English translations.


The speech was recorded using a Sennheiser close-talking microphone in a quiet office at a sampling rate of 44.1 kHz.
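
As a quick sanity check on the stated sampling rate, the header of a recording can be inspected with Python's standard wave module. This is only a sketch: it assumes the recordings are stored as standard PCM WAV files, which this README does not state explicitly, and the function names are hypothetical.

```python
import wave

# 44.1 kHz, the sampling rate stated in this README.
EXPECTED_RATE_HZ = 44100


def sample_rate_hz(path):
    """Return the sampling rate stored in a WAV file's header."""
    # Assumption: the file is a PCM WAV file readable by the wave module.
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def has_expected_rate(path, expected=EXPECTED_RATE_HZ):
    """Return True if the file's header reports the expected sampling rate."""
    return sample_rate_hz(path) == expected
```

Calling has_expected_rate on a file from one of the wav_* directories should return True if the file matches the description above.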


The following directories are contained in the archive:

prompts : files containing the text read by each speaker, split by native language (fr=French, nl=Dutch, ch=Mandarin, it=Italian)

wav_ch : Mandarin speaker's acoustic data (89 parallel utterances in English and Mandarin, plus 40 utterances in English only)

wav_fr : French speaker's acoustic data (130 parallel utterances in English and French, plus 40 utterances in English only)

wav_it : Italian speaker's acoustic data (130 parallel utterances in English and Italian, plus 40 utterances in English only)

wav_nl : Dutch speaker's acoustic data (130 parallel utterances in English and Dutch, plus 40 utterances in Dutch only)
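
After unpacking the archive, the directory layout listed above can be checked with a short script. The following is a minimal sketch, not part of the distribution: the directory names come from this README, but the ".wav" file extension, the possibility of nested subdirectories, and the function name are assumptions.

```python
from pathlib import Path

# Language codes and directory names as listed in this README.
SPEAKER_DIRS = {
    "ch": "wav_ch",  # Mandarin
    "fr": "wav_fr",  # French
    "it": "wav_it",  # Italian
    "nl": "wav_nl",  # Dutch
}


def count_wav_files(archive_root):
    """Count *.wav files under each speaker directory.

    Returns a dict mapping language code to a file count, or to None
    when the expected directory is missing. The ".wav" extension is an
    assumption; adjust the pattern if the files are named differently.
    """
    root = Path(archive_root)
    counts = {}
    for code, dirname in SPEAKER_DIRS.items():
        speaker_dir = root / dirname
        if speaker_dir.is_dir():
            # rglob also finds files in nested subdirectories.
            counts[code] = sum(1 for _ in speaker_dir.rglob("*.wav"))
        else:
            counts[code] = None
    return counts
```

Comparing the result of count_wav_files on the unpacked archive root with the utterance counts listed above gives a rough completeness check, bearing in mind that this README does not say how utterances map to individual files.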




