BTEC Task

The BTEC task is carried out using the Basic Travel Expression Corpus (BTEC), a multilingual speech corpus containing tourism-related sentences similar to those usually found in phrasebooks for tourists going abroad. The monolingual and bilingual language resources that should be used to train the translation engines for the primary runs are limited to the supplied corpus for all BTEC translation tasks. This includes all supplied development sets, i.e., you are free to use these data sets as you wish, for example for tuning model parameters or as training bitext. All language resources other than the ones supplied for the given translation task, such as additional dictionaries, word lists, or bitext corpora such as the ones provided by LDC, should be treated as "additional language resources". In addition, if participants take part in multiple BTEC translation tasks, the supplied BTEC resources for other input languages should also be treated as "additional language resources".

BTEC Training Corpus:

  • Data format:
    • each line consists of three fields divided by the character '\'
    • sentences consisting of words divided by single spaces
    • format: <SENTENCE_ID>\01\<MT_TRAINING_SENTENCE>
    • Field_1: sentence ID
    • Field_2: paraphrase ID
    • Field_3: MT training sentence
  • Example:

    TRAIN_00001\01\This is the first training sentence.
    TRAIN_00002\01\This is the second training sentence.
  • Languages:
    • Arabic-English (AE)
    • French-English (FE)
    • Turkish-English (TE)
      • 20K sentences randomly selected from the BTEC corpus
  • Corpus specifications:
    • coding: UTF-8
    • text is case-sensitive and includes punctuation
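
The three-field line format described above can be read with a few lines of Python. This is only a minimal sketch, not part of the supplied tools: the names `BtecLine` and `parse_btec_line` are hypothetical, and the code assumes that the first two '\' characters on a line are always the field separators.

```python
from typing import NamedTuple


class BtecLine(NamedTuple):
    """One corpus line (hypothetical helper type, not part of the supplied data)."""
    sentence_id: str
    paraphrase_id: str
    text: str


def parse_btec_line(line: str) -> BtecLine:
    """Split a corpus line into sentence ID, paraphrase ID and sentence text."""
    # Split on the first two backslashes only, so that a backslash inside
    # the sentence text (an assumption; none appears in the examples) is kept.
    parts = line.rstrip("\r\n").split("\\", 2)
    if len(parts) != 3:
        raise ValueError(f"expected three '\\'-separated fields: {line!r}")
    return BtecLine(*parts)


if __name__ == "__main__":
    example = "TRAIN_00001\\01\\This is the first training sentence."
    print(parse_btec_line(example))
```

When reading a corpus file, open it with `encoding="utf-8"` and leave case and punctuation untouched, in line with the corpus specifications.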

BTEC Develop Corpus:

  • Text input, reference translations of BTEC sentences
  • Data format:
    • each line consists of three fields divided by the character '\'
    • sentences consisting of words divided by single spaces
    • format: <SENTENCE_ID>\<PARAPHRASE_ID>\<TEXT>
    • Field_1: sentence ID
    • Field_2: paraphrase ID
    • Field_3: MT develop sentence / reference translation
  • Text input example:

    DEV_001\01\This is the first develop sentence.
    DEV_002\01\This is the second develop sentence.
  • Reference translation example:

    DEV_001\01\1st reference translation for 1st input
    DEV_001\02\2nd reference translation for 1st input
    ...
    DEV_002\01\1st reference translation for 2nd input
    DEV_002\02\2nd reference translation for 2nd input
    ...
  • Languages:
    • Arabic-English
      • CSTAR03 testset: 506 sentences, 16 reference translations
      • IWSLT04 testset: 500 sentences, 16 reference translations
      • IWSLT05 testset: 506 sentences, 16 reference translations
      • IWSLT07 testset: 489 sentences, 6 reference translations
      • IWSLT08 testset: 507 sentences, 16 reference translations
    • French-English
      • CSTAR03 testset: 506 sentences, 16 reference translations
      • IWSLT04 testset: 500 sentences, 16 reference translations
      • IWSLT05 testset: 506 sentences, 16 reference translations
    • Turkish-English
      • CSTAR03 testset: 506 sentences, 16 reference translations
      • IWSLT04 testset: 500 sentences, 16 reference translations
  • Corpus specifications:
    • coding: UTF-8
    • text is case-sensitive and includes punctuation
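
Since each input sentence of a develop set has several reference translations, it is often convenient to collect them per sentence ID before scoring. The following is a minimal sketch under stated assumptions: the function name `group_references` is hypothetical, and paraphrase IDs are assumed to be zero-padded as in the examples above, so that sorting them as strings gives the numeric order.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_references(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each sentence ID to its reference translations, ordered by paraphrase ID."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        # Fields: sentence ID, paraphrase ID, reference translation.
        sentence_id, paraphrase_id, text = line.split("\\", 2)
        grouped[sentence_id].append((paraphrase_id, text))
    # Assumes zero-padded paraphrase IDs ("01", "02", ...), so string order
    # matches numeric order.
    return {
        sentence_id: [text for _, text in sorted(refs, key=lambda ref: ref[0])]
        for sentence_id, refs in grouped.items()
    }


if __name__ == "__main__":
    sample = [
        "DEV_001\\01\\1st reference translation for 1st input",
        "DEV_001\\02\\2nd reference translation for 1st input",
        "DEV_002\\01\\1st reference translation for 2nd input",
    ]
    for sentence_id, refs in group_references(sample).items():
        print(sentence_id, len(refs))
```

The same function can be applied to a text input file, in which case every sentence ID maps to a single-element list.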

BTEC Test Corpus:

  • Text input
  • Data format: → see BTEC Develop Corpus
  • Languages:
    • Arabic-English
    • French-English
    • Turkish-English
      • progress testset: 469 sentences of the IWSLT 2009 BTEC Task
      • IWSLT10 testset: 464 unseen sentences of the BTEC evaluation corpus
  • Corpus specifications:
    • coding: UTF-8
    • text is case-sensitive and includes punctuation