Evaluation Campaign

IWSLT proposes challenging research tasks and an open experimental infrastructure for the scientific community working on spoken and written language translation. This year, the IWSLT evaluation campaign will offer three spoken language translation tasks:

(1) TALK TASK (public speeches on a variety of topics, from English to French)

The exercise for this year will exploit the TED Talks corpus, a collection of public speeches on a variety of topics for which video, transcripts and translations are available on the Web. Training data for this exercise will be limited to a supplied collection of freely available parallel texts, including a parallel corpus of TED Talks. The translation input conditions of the TALK task consist of (1) automatic speech recognition (ASR) outputs, i.e., word lattices (SLF), N-best lists (NBEST) and 1-best (1BEST) speech recognition results, and (2) correct recognition results (CRR), i.e., text input without speech recognition errors. Participants of the TALK task must submit MT runs for both input conditions. For the ASR input condition, participants may choose any of the available formats. Details on the TALK task can be found here.
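For illustration, the following minimal Python sketch extracts a best hypothesis from a word lattice in a simplified HTK SLF format, assuming that word labels (W=), acoustic scores (a=) and language model scores (l=) are attached to the lattice links and that nodes are numbered in topological order; the lattices actually distributed with the tasks may differ in detail, so treat this only as a sketch of the idea.

    from collections import defaultdict

    def parse_slf(path):
        """Parse a simplified HTK SLF lattice file. Returns the number of
        nodes and a list of links (start, end, word, acoustic, lm)."""
        links, n_nodes = [], 0
        with open(path) as f:
            for line in f:
                fields = dict(t.split("=", 1) for t in line.split() if "=" in t)
                if "N" in fields and "L" in fields:   # header: node/link counts
                    n_nodes = int(fields["N"])
                elif "J" in fields:                   # link line
                    links.append((int(fields["S"]), int(fields["E"]),
                                  fields.get("W", "!NULL"),
                                  float(fields.get("a", 0.0)),
                                  float(fields.get("l", 0.0))))
        return n_nodes, links

    def best_path(n_nodes, links, lm_weight=1.0):
        """Viterbi search over the lattice, assuming nodes are numbered in
        topological order (0 = start node, n_nodes - 1 = end node)."""
        incoming = defaultdict(list)
        for s, e, w, a, l in links:
            incoming[e].append((s, w, a + lm_weight * l))
        score, back = {0: 0.0}, {}
        for node in range(1, n_nodes):
            cands = [(score[s] + sc, s, w)
                     for s, w, sc in incoming[node] if s in score]
            if cands:
                best_score, s, w = max(cands)
                score[node], back[node] = best_score, (s, w)
        words, node = [], n_nodes - 1
        while node in back:                           # backtrace from end node
            node, w = back[node]
            if w != "!NULL":                          # skip epsilon links
                words.append(w)
        return " ".join(reversed(words))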

(2) DIALOG TASK (spoken dialogues in travel situations, between Chinese and English)

This task is carried out using the Spoken Language Databases (SLDB) corpus, a collection of human-mediated cross-lingual dialogs in travel situations. In addition, parts of the BTEC corpus (see below) are provided to the participants of the DIALOG task. The translation input conditions of the DIALOG task consist of (1) automatic speech recognition (ASR) outputs, i.e., word lattices (SLF), N-best lists (NBEST) and 1-best (1BEST) speech recognition results, and (2) correct recognition results (CRR), i.e., text input without speech recognition errors. Participants of the DIALOG task must translate both the English ASR outputs into Chinese and the Chinese ASR outputs into English, choosing whichever ASR output condition (SLF, NBEST, or 1BEST) best suits their MT system. The translation of the CRR text input is mandatory for all participants. The monolingual and bilingual language resources that may be used to train the translation engines for the primary runs of the DIALOG task are limited to the supplied corpus. This includes all supplied development sets, which participants are free to use as they wish, e.g., for tuning model parameters or as additional training bitext. Details on the DIALOG task can be found here.

(3) BTEC TASK (basic travel expressions, from Arabic, Turkish, and French to English)

This task is carried out using the Basic Travel Expression Corpus (BTEC), a multilingual speech corpus containing tourism-related sentences similar to those usually found in phrasebooks for tourists going abroad. The translation input condition of all BTEC tasks consists of correct recognition results (CRR), i.e., text input, for Arabic, Turkish, and French. The target language of all BTEC tasks is English. The monolingual and bilingual language resources that may be used to train the translation engines for the primary runs are limited to the supplied corpus. This includes all supplied development sets, which participants are free to use as they wish, e.g., for tuning model parameters or as additional training bitext. Any language resources other than those supplied for the given translation task are treated as additional language resources. Details on the BTEC task can be found here.

Linguistic tools such as word segmenters and parsers may be used to preprocess the supplied corpus, but we kindly ask participants to declare their usage in the system description paper, so that the impact of these tools on system performance can be measured, and to share the tools (if possible) after the evaluation. No additional parallel or monolingual corpora or word lists may be used for the primary run.
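As an illustration of the kind of preprocessing that must be declared, the following sketch applies a naive lowercasing and punctuation-splitting step to a hypothetical training file; a real system would substitute proper language-specific tools, e.g., a Chinese word segmenter for the DIALOG task or a morphological analyzer for Arabic in the BTEC task.

    import re

    def preprocess(line):
        """Naive tokenizer: lowercase and split punctuation off tokens.
        A stand-in for the language-specific tools a real system would use."""
        line = line.strip().lower()
        line = re.sub(r'([.,!?;:"()])', r' \1 ', line)
        return line.split()

    # "train.src" is a hypothetical file name for one side of the supplied bitext.
    with open("train.src") as f:
        tokenized = [preprocess(line) for line in f]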

To motivate participants to explore the effects of additional language resources, we DO ACCEPT contrastive runs based on additional resources for the DIALOG and BTEC tasks. These will be evaluated automatically within the same framework as the primary runs, so the results will be directly comparable to this year's primary run submissions and may be published by participants in the MT system description paper or in a scientific paper.

Concerning the evaluation of translation results, all submitted MT outputs will be scored with standard automatic evaluation metrics (BLEU, METEOR, etc.), using up to 7 reference translations for the BTEC task and up to 4 reference translations for the DIALOG task. Human assessment (ranking) of the translation results will be carried out for all primary run submissions of the DIALOG and BTEC tasks. In addition, translation quality in terms of adequacy in the context of the given dialogs will be evaluated for the top-ranked primary run submissions of the DIALOG task.
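For illustration, multi-reference scoring of the kind described above can be sketched with NLTK's corpus-level BLEU implementation; the official campaign scoring uses its own tools and tokenization, and all file names below are hypothetical.

    from nltk.translate.bleu_score import corpus_bleu

    # Hypothetical file names: one system output and four reference files
    # (up to 7 references are available for BTEC, up to 4 for DIALOG).
    hypotheses = [line.split() for line in open("system.out")]
    ref_files = ["ref0.txt", "ref1.txt", "ref2.txt", "ref3.txt"]
    ref_streams = [[line.split() for line in open(f)] for f in ref_files]

    # Regroup so that each hypothesis is paired with all of its references.
    references = [list(refs) for refs in zip(*ref_streams)]

    print("BLEU:", corpus_bleu(references, hypotheses))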