ASPEC (Asian Scientific Paper Excerpt Corpus)


Notice:

ASPEC data may be used for one fiscal year (April 1 to March 31) at a time, and applications are made on an annual basis. To continue using the data in the following fiscal year, submit an application to the contact email address at least one month before the end of the current fiscal year.

On April 1, 2021, the Terms of Use will be revised to simplify the application process.
Accordingly, those who wish to use the data from FY2021 onward are requested to apply through this new website.
If you are a current user and would like to continue using the data in FY2021, please also apply again through this website.

INTRODUCTION

ASPEC (Asian Scientific Paper Excerpt Corpus) was constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). It consists of a Japanese-English paper abstract corpus of about 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of about 680K parallel sentences (ASPEC-JC). The corpus is one of the achievements of the Japanese-Chinese machine translation project, which ran in Japan from 2006 to 2010 (see this Japanese page). Before using ASPEC, please read and accept the terms of the license agreement (English form or Japanese form).


With the increasing number of scientific papers published worldwide, there is a demand for machine translation of scientific papers. ASPEC is the first parallel corpus to focus on this domain, and it aims to promote machine translation research on scientific papers.

DETAIL

ASPEC includes:

  • ASPEC-JE: Japanese-English paper abstract corpus
  • ASPEC-JC: Japanese-Chinese paper excerpt corpus

The numbers of sentences are as follows:

Parallel Corpus   Data Type   File Name     Number of sentences
ASPEC-JE          TRAIN       train1.txt    1,000,000
ASPEC-JE          TRAIN       train2.txt    1,000,000
ASPEC-JE          TRAIN       train3.txt    1,008,500
ASPEC-JE          DEV         dev.txt       1,790
ASPEC-JE          DEVTEST     devtest.txt   1,784
ASPEC-JE          TEST        test.txt      1,812
ASPEC-JC          TRAIN       train.txt     672,315
ASPEC-JC          DEV         dev.txt       2,090
ASPEC-JC          DEVTEST     devtest.txt   2,148
ASPEC-JC          TEST        test.txt      2,107
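
As a sanity check after downloading, the counts listed above can be compared against the files on disk. The following is only a minimal sketch: it assumes that each file stores one sentence pair per line and that the files sit in a hypothetical layout of the form <root>/ASPEC-JE/train1.txt, neither of which is stated on this page. The function names are illustrative and not part of any official tool.

```python
from pathlib import Path

# Expected sentence counts, copied from the table above.
EXPECTED = {
    "ASPEC-JE": {
        "train1.txt": 1_000_000,
        "train2.txt": 1_000_000,
        "train3.txt": 1_008_500,
        "dev.txt": 1_790,
        "devtest.txt": 1_784,
        "test.txt": 1_812,
    },
    "ASPEC-JC": {
        "train.txt": 672_315,
        "dev.txt": 2_090,
        "devtest.txt": 2_148,
        "test.txt": 2_107,
    },
}


def count_lines(path: Path) -> int:
    """Count lines in a file (assumption: one sentence pair per line)."""
    with path.open("rb") as f:
        return sum(1 for _ in f)


def total_sentences(corpus: str) -> int:
    """Sum the expected counts over all files of one corpus."""
    return sum(EXPECTED[corpus].values())


def check_corpus(root: Path) -> list[str]:
    """Return one message per file that is missing or has an unexpected count.

    Assumes the hypothetical layout <root>/<corpus name>/<file name>.
    """
    problems = []
    for corpus, files in EXPECTED.items():
        for name, expected in files.items():
            path = root / corpus / name
            if not path.is_file():
                problems.append(f"{corpus}/{name}: file not found")
                continue
            actual = count_lines(path)
            if actual != expected:
                problems.append(
                    f"{corpus}/{name}: expected {expected}, found {actual}"
                )
    return problems
```

Calling check_corpus(Path("path/to/ASPEC")) with the directory where the data was unpacked returns an empty list when every file is present and matches the table. If the real files use a different layout or line format, adjust the path construction and the counting accordingly.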

ASPEC-JE was constructed from Japanese-English scientific paper abstracts, which are the property of the Japan Science and Technology Agency (JST). The National Institute of Information and Communications Technology (NICT) created the 1-to-1 sentence alignments using the method of Utiyama and Isahara (MT Summit XI, 2007).

ASPEC-JC was constructed by manually translating Japanese scientific papers into Chinese. The Japanese scientific papers are either the property of the Japan Science and Technology Agency (JST) or stored in Japan's Largest Electronic Journal Platform for Academic Societies (J-STAGE). The unit of manual translation is the paragraph, and the paragraphs are selected so as to maximize the coverage of word types.

CAUTION: This page collects the points that require attention when you use this corpus.

WORKSHOP ON ASIAN TRANSLATION

WAT (Workshop on Asian Translation), an open evaluation campaign for machine translation of scientific papers that uses ASPEC, has been held every year since 2014.

HOW TO OBTAIN

  1. Please read the Terms of Use (English form or Japanese form) and click the “Agree to the Terms of Use and go to the application form” button.
  2. Complete the application form and press the send button.
  3. After receiving your application, JST will send the download link for the ASPEC data to your e-mail address. Please note that the link is not sent automatically, and it may take a few days for us to review your application.

AGREEMENT

Please cite the following paper when you use ASPEC.

 @InProceedings{NAKAZAWA16.621,
 author = {Toshiaki Nakazawa and Manabu Yaguchi and Kiyotaka Uchimoto and Masao Utiyama and Eiichiro Sumita and Sadao Kurohashi and Hitoshi Isahara},
 title = {ASPEC: Asian Scientific Paper Excerpt Corpus},
 booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2016)},
 year = {2016},
 month = {may},
 date = {26-31},
 address = {Portorož, Slovenia},
 editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
 publisher = {European Language Resources Association (ELRA)},
 isbn = {978-2-9517408-9-1},
 language = {english},
 pages = {2204-2208}
}

CONTACT

For questions, comments, etc., please send an email to "aspec -at- jst.go.jp".

CHANGE LOG

Last Modified: 2021-02-01