Use of ASPEC data is granted per fiscal year (April 1 to March 31) on the basis of an annual application. To continue using the data in the following fiscal year, submit an application to the contact email address at least one month before the end of the current fiscal year.
On April 1, 2021, the Terms of Use will be revised to simplify the application process.
Accordingly, those who wish to use the data from FY2021 onward are requested to apply through this new website.
If you already use the data and would like to continue using it in FY2021, please also apply again through this website.
ASPEC, the Asian Scientific Paper Excerpt Corpus, was constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). It consists of a Japanese-English paper abstract corpus of about 3 million parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of about 680,000 parallel sentences (ASPEC-JC). The corpus is one of the achievements of the Japanese-Chinese machine translation project that ran in Japan from 2006 to 2010 (see this Japanese page). Before using ASPEC, please read and accept the terms of the following license agreement (English form or Japanese form).
With the increasing number of scientific papers published worldwide, there is a demand for machine translation of scientific papers. ASPEC is the first parallel corpus to focus on this. ASPEC aims to promote machine translation research in the domain of scientific papers.
ASPEC includes:
The numbers of sentences are as follows:
| Parallel Corpus | Data Type | File Name | Number of sentences |
|---|---|---|---|
| ASPEC-JE | TRAIN | train1.txt | 1,000,000 |
| ASPEC-JE | TRAIN | train2.txt | 1,000,000 |
| ASPEC-JE | TRAIN | train3.txt | 1,008,500 |
| ASPEC-JE | DEV | dev.txt | 1,790 |
| ASPEC-JE | DEVTEST | devtest.txt | 1,784 |
| ASPEC-JE | TEST | test.txt | 1,812 |
| ASPEC-JC | TRAIN | train.txt | 672,315 |
| ASPEC-JC | DEV | dev.txt | 2,090 |
| ASPEC-JC | DEVTEST | devtest.txt | 2,148 |
| ASPEC-JC | TEST | test.txt | 2,107 |
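After downloading the corpus, it can be useful to confirm that each file has the number of sentences listed in the table above. The following is a minimal sketch, not an official tool. It assumes that every file stores one sentence pair per line and that the files sit in a hypothetical layout `<root>/ASPEC-JE/train1.txt`, `<root>/ASPEC-JC/train.txt`, and so on; neither assumption is stated on this page, so adjust the paths to match the actual distribution.

```python
from pathlib import Path

# Sentence counts copied from the table above, keyed by (corpus, file name).
EXPECTED_COUNTS = {
    ("ASPEC-JE", "train1.txt"): 1_000_000,
    ("ASPEC-JE", "train2.txt"): 1_000_000,
    ("ASPEC-JE", "train3.txt"): 1_008_500,
    ("ASPEC-JE", "dev.txt"): 1_790,
    ("ASPEC-JE", "devtest.txt"): 1_784,
    ("ASPEC-JE", "test.txt"): 1_812,
    ("ASPEC-JC", "train.txt"): 672_315,
    ("ASPEC-JC", "dev.txt"): 2_090,
    ("ASPEC-JC", "devtest.txt"): 2_148,
    ("ASPEC-JC", "test.txt"): 2_107,
}


def count_lines(path):
    """Count lines by streaming the file, so large files are not loaded into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_counts(root, expected=EXPECTED_COUNTS):
    """Return (corpus, file, expected, actual) for every file whose line count differs.

    `actual` is None when the file is missing. The directory layout
    <root>/<corpus>/<file> is an assumption made for this sketch.
    """
    mismatches = []
    for (corpus, name), want in expected.items():
        path = Path(root) / corpus / name
        got = count_lines(path) if path.is_file() else None
        if got != want:
            mismatches.append((corpus, name, want, got))
    return mismatches


def total(corpus, expected=EXPECTED_COUNTS):
    """Sum the expected sentence counts over all files of one corpus."""
    return sum(n for (c, _), n in expected.items() if c == corpus)
```

Calling `check_counts("/path/to/aspec")` returns an empty list when every file matches the table. `total("ASPEC-JE")` and `total("ASPEC-JC")` sum the table rows and are consistent with the approximate corpus sizes quoted in the overview.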
ASPEC-JE was constructed from Japanese-English scientific paper abstracts, which are the property of the Japan Science and Technology Agency (JST). The National Institute of Information and Communications Technology (NICT) created the 1-to-1 sentence alignments using the method of Utiyama and Isahara (MT Summit XI, 2007).
ASPEC-JC was constructed by manually translating Japanese scientific papers into Chinese. The Japanese scientific papers are either the property of the Japan Science and Technology Agency (JST) or stored in Japan's Largest Electronic Journal Platform for Academic Societies (J-STAGE). The unit of manual translation is the paragraph, and the paragraphs are selected so as to maximize the coverage of word types.
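This page does not say how the paragraphs were chosen to maximize word-type coverage. As an illustration only, one common way to approach such a goal is a greedy heuristic that repeatedly picks the paragraph contributing the most word types not yet covered. The sketch below makes several assumptions: the function name `select_paragraphs`, the `budget` parameter, and the whitespace tokenizer are all hypothetical, and real Japanese text would need a proper word segmenter passed in as `tokenize`.

```python
def select_paragraphs(paragraphs, budget, tokenize=str.split):
    """Greedily choose up to `budget` paragraphs to cover many distinct word types.

    Illustrative heuristic only; not the procedure used to build ASPEC-JC.
    Returns the chosen paragraph indices (in selection order) and the set of
    word types they cover.
    """
    type_sets = [set(tokenize(p)) for p in paragraphs]
    covered = set()
    chosen = []
    remaining = set(range(len(paragraphs)))

    while remaining and len(chosen) < budget:
        # Pick the paragraph adding the most unseen types; ties go to the lowest index.
        best = max(sorted(remaining), key=lambda i: len(type_sets[i] - covered))
        if not type_sets[best] - covered:
            break  # No remaining paragraph adds a new word type.
        chosen.append(best)
        covered |= type_sets[best]
        remaining.remove(best)

    return chosen, covered
```

For example, with the toy input `["a b c", "a b", "c d e f", "a"]` and a budget of 2, the heuristic first selects the third paragraph (four new types) and then the first (two new types).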
CAUTION: This page lists points that require attention when using this corpus.
The Workshop on Asian Translation (WAT), an open evaluation campaign for machine translation of scientific papers that uses ASPEC, has been held every year since 2014.
Please cite the following paper when you use ASPEC.
@InProceedings{NAKAZAWA16.621,
author = {Toshiaki Nakazawa and Manabu Yaguchi and Kiyotaka Uchimoto and Masao Utiyama and Eiichiro Sumita and Sadao Kurohashi and Hitoshi Isahara},
title = {ASPEC: Asian Scientific Paper Excerpt Corpus},
booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2016)},
year = {2016},
month = {may},
date = {26-31},
address = {Portorož, Slovenia},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-9-1},
language = {english},
pages = {2204-2208}
}
For questions, comments, etc., please send an email to "aspec -at- jst.go.jp".
JST (Japan Science and Technology Agency)
Last Modified: 2021-02-01