TALK Task

The TALK task is carried out using the TED Talks collection plus other parallel corpora distributed by the ACL 2010 Workshop on Statistical Machine Translation.

TALK: TED French-English Training Corpus:

  • TED English-French ver. 1.1 (740k words)
  • Copyright: TED Conference LLC
  • License: Creative Commons Attribution-NonCommercial-NoDerivs 3.0
  • Data format:
    • Two parallel files, one for each language
    • XML annotation including
      • url: source of the TED Talk
      • description: free text describing the talk
      • keywords: general topics covered by the talk 
      • id: numeric identifier of the talk
      • title: title of the talk
      • transcript: verbatim transcript or translation of the talk
    • Languages: English, French
    • Coding: UTF-8
    • Text: case-sensitive with punctuation
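
The annotation fields listed above could be read with a few lines of Python. This is only a minimal sketch: the field names (url, description, keywords, id, title, transcript) come from the list above, but the <talk> wrapper element, the single root element, and the helper names read_talks and align_by_id are assumptions made for illustration, not the documented layout of the corpus files.

```python
import xml.etree.ElementTree as ET

# Field names as listed in the data format description.
FIELDS = ("url", "description", "keywords", "id", "title", "transcript")


def read_talks(xml_text):
    """Return one dict per talk, keyed by the annotation fields.

    Assumes (hypothetically) that each talk is wrapped in a <talk> element
    and that the file has a single root element.
    """
    root = ET.fromstring(xml_text)
    talks = []
    for talk in root.iter("talk"):
        # Missing fields become empty strings rather than None.
        record = {name: (talk.findtext(name) or "").strip() for name in FIELDS}
        talks.append(record)
    return talks


def align_by_id(english, french):
    """Pair English and French talks that share the same numeric id."""
    french_by_id = {talk["id"]: talk for talk in french}
    return [
        (talk, french_by_id[talk["id"]])
        for talk in english
        if talk["id"] in french_by_id
    ]
```

Since the corpus comes as two parallel files, one per language, reading each file separately and pairing talks on the id field is one plausible way to recover the talk-level alignment. Whether the transcripts are also aligned line by line is not stated here.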

TALK: Additional Training Corpora:

TALK: DEV Data:

  • Translation Direction: English -> French
  • Source Side
    • ASR 1st-best output from the KIT Quaero Evaluation System: with segmentation as given by the ASR system
    • ASR 20-best output from the KIT Quaero Evaluation System: with segmentation as given by the ASR system
    • Lattices in SLF (HTK Standard Lattice Format)
    • Reference Transcription
  • Target Side
    • One French reference translation, taken from the TED website with minor cleaning. References are true-cased and punctuated.
  • Data Format: NIST XML Format
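
As an illustration of working with the development data, the sketch below collects segments from a NIST XML file and pairs source segments with their reference translations. The element and attribute names (doc, docid, seg, id) follow the layout commonly used in NIST MT evaluation files and are assumptions here, as are the helper names read_segments and pair_with_reference. The distributed files should be checked before relying on this.

```python
import xml.etree.ElementTree as ET


def read_segments(xml_text):
    """Map (docid, segment id) -> segment text for every <seg> in the file.

    Assumes <seg id="..."> elements nested inside <doc docid="..."> elements.
    """
    root = ET.fromstring(xml_text)
    segments = {}
    for doc in root.iter("doc"):
        docid = doc.get("docid", "")
        for seg in doc.iter("seg"):
            # itertext() also captures text inside any inline child elements.
            text = "".join(seg.itertext()).strip()
            segments[(docid, seg.get("id", ""))] = text
    return segments


def pair_with_reference(source, reference):
    """Return (source, reference) text pairs for keys present in both files."""
    return [(text, reference[key]) for key, text in source.items() if key in reference]
```

Keying on the document and segment identifiers rather than on line position keeps the pairing stable when a file contains several talks.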

TALK: Test Data:

  • Input:
    • ASR 1st-best output from the KIT Quaero Evaluation System: with segmentation as given by the ASR system
    • ASR 20-best output from the KIT Quaero Evaluation System: with segmentation as given by the ASR system
    • Lattices in SLF (HTK Standard Lattice Format)
    • Reference Transcription
  • Data Format: NIST XML Format
  • Submission Format: NIST XML format. Translations should be in true case and should contain punctuation. A sample translation file for the development set can be found here.
  • Scoring will be performed under three conditions: with punctuation and true-casing, with true-casing but without punctuation, and without punctuation or casing.
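
The three scoring conditions can be pictured as three normalized views of the same translation. The sketch below is not the official scoring script. It assumes that "punctuation" means any character in a Unicode punctuation category and that such characters are replaced by a space. How the actual scorer treats apostrophes and hyphens is not stated here, and the names strip_punctuation and scoring_views are hypothetical.

```python
import unicodedata


def strip_punctuation(text):
    """Replace Unicode punctuation characters with spaces, then squeeze whitespace."""
    kept = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return " ".join(kept.split())


def scoring_views(text):
    """Return the three views of a translation used under the scoring conditions."""
    no_punct = strip_punctuation(text)
    return {
        "case+punc": " ".join(text.split()),  # punctuation and true-casing kept
        "case": no_punct,                     # true-casing kept, punctuation removed
        "plain": no_punct.lower(),            # neither punctuation nor casing
    }


views = scoring_views("Bonjour, le Monde !")
print(views["case+punc"])  # Bonjour, le Monde !
print(views["case"])       # Bonjour le Monde
print(views["plain"])      # bonjour le monde
```

Applying the same normalization to both hypothesis and reference before computing a metric keeps the three conditions comparable.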