Download

 

IWSLT 2010 Corpus Release: (for IWSLT 2010 participants only)

 

End-User Agreement

  • PDF

Download

DIALOG Task
  • Chinese ↔ English

BTEC Task
  • Arabic → English
  • French → English
  • Turkish → English

To get access to the corpus, please follow the procedure below. Access will be enabled only AFTER we have received your original signed end-user agreement.

  1. download the post-workshop end-user agreement (click on the PDF link above), sign it, and send two signed copies to:
    Michael Paul
    National Institute of Information and Communications Technology
    Knowledge Creating Communication Research Center
    MASTAR Project
    Multilingual Translation Laboratory
    3-5 Hikaridai, "Keihanna Science City"
    Kyoto 619-0289, Japan
  2. download the corpus files using the ID and Password you obtained for the download of the training data files for IWSLT 2010.
 

Evaluation Results:

 

You can now access the preliminary automatic evaluation scores (full testset) and MT system rankings (based on the average of all normalized metric scores) for the BTEC and DIALOG tasks using the UserID/PassID you obtained at registration time.

  • PDF
  • XLS
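The ranking criterion mentioned above (the average of all normalized metric scores) could be computed roughly as follows. This is only a sketch under stated assumptions: this page does not say which normalization is applied, so per-metric min-max normalization is assumed here, with error-rate metrics inverted so that higher is always better. The function name `rank_systems`, the system names, and all scores are hypothetical and are not taken from the evaluation results.

```python
from statistics import mean


def rank_systems(scores, lower_is_better=()):
    """Rank systems by the mean of their normalized metric scores.

    scores: {system_name: {metric_name: raw_score}}; every system is
            expected to have a score for every metric.
    lower_is_better: names of metrics where a smaller raw score is
            better (e.g. error rates); these are inverted.
    ASSUMPTION: min-max normalization per metric; the actual procedure
    used by the organizers is not described on this page.
    """
    metrics = sorted({m for per_system in scores.values() for m in per_system})
    normalized = {name: [] for name in scores}
    for metric in metrics:
        values = [scores[name][metric] for name in scores]
        lo, hi = min(values), max(values)
        span = hi - lo
        for name in scores:
            if span == 0:
                # All systems tie on this metric; it does not affect the order.
                norm = 0.0
            else:
                norm = (scores[name][metric] - lo) / span
                if metric in lower_is_better:
                    norm = 1.0 - norm
            normalized[name].append(norm)
    averages = {name: mean(vals) for name, vals in normalized.items()}
    # Best system first.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for three made-up systems.
example_scores = {
    "sysA": {"BLEU": 0.40, "TER": 0.45},
    "sysB": {"BLEU": 0.30, "TER": 0.60},
    "sysC": {"BLEU": 0.35, "TER": 0.40},
}
for rank, (name, avg) in enumerate(rank_systems(example_scores, {"TER"}), start=1):
    print(rank, name, round(avg, 3))
```

Averaging normalized rather than raw scores keeps metrics with different ranges from dominating the ranking; a different normalization (for example z-scores) would fit the same structure.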
 

Corpus Data Files:

 

In order to get access to the supplied resources listed below, you must first register your MT system online.

 

TALK

English → French
  • train: TGZ
  • dev: (without lattices) TGZ, (SLF lattices) TGZ
  • test: (without lattices) TGZ*, (SLF lattices) TGZ
  • tools: (none listed)

DIALOG

Chinese ↔ English
train dev test tools
TGZ TGZ TGZ TGZ

BTEC

Arabic → English
train dev test tools
TGZ TGZ TGZ TGZ
French → English
train dev test tools
TGZ TGZ TGZ TGZ
Turkish → English
train dev test tools
TGZ TGZ TGZ TGZ

* The NBEST lists of the TALK task are currently not included in the testset data release. The organizers will upload the NBEST lists and contact the TALK task participants as soon as the lists are available. We apologize for any inconvenience.

- train:

  • (TALK) 75K sentence pairs (approx. 300 talks) of TED-specific data.
  • (DIALOG) 10K sentence pairs of translation examples with dialog annotations and the BTEC@train resources.
  • (BTEC) 20K sentence pairs of translation examples with case and punctuation information, segmented according to the ASR engine used.

- dev:

  • (TALK) Transcripts and ASR output results of 1-2 hours of TED talks (approx 10-20K words)
  • (DIALOG) 10 dialogs comprising 210 English sentences and 200 Chinese utterances with multiple references and the respective ASR output data files (= development data set of the IWSLT09 CHALLENGE task)
  • (BTEC) Up to 6 evaluation data sets (= testsets of previous IWSLT evaluation campaigns) containing 500 source language sentences with multiple references each

- test:

  • (TALK) 500 source language sentences and ASR output data files (= input of run-submissions of this year's evaluation campaign)
  • (DIALOG) Two dialog data sets (IWSLT10 testset, IWSLT09 testset) comprising 37/27 dialogs with 453/393 English sentences and 532/405 Chinese utterances, and the respective ASR output data files
  • (BTEC) Two data sets (IWSLT10 testset, IWSLT09 testset) consisting of 500 source language sentences each

- tools:

  • Preprocessing scripts (tokenization, NBEST extraction, etc.) used to prepare the data sets

For data set details, click on the translation direction name tag.

Templates for LaTeX/MSWord:

If your paper will be typeset using LaTeX, please download the template package here; it will generate the proper format. To extract the files under UNIX, run: $ tar -xzf latex_template_iwslt10.tar.gz

  • LaTeX style:  iwslt09.sty
  • Example document:  template.tex
  • Example document PS:  template.ps
  • Example document PDF:  template.pdf
  • Bibliography style:  IEEEtran.bst
  • MS-Word template:  template.doc