ASPEC (Asian Scientific Paper Excerpt Corpus)

Notice:

ASPEC data can be used during the fiscal year (April 1 to March 31) by applying on an annual basis. You can also continue to use the data by submitting an application to the contact email address at least one month before the end of the fiscal year.

On April 1, 2021, the Terms of Use will be revised to simplify the application process.
Accordingly, those who wish to use the data from FY2021 onward are requested to apply through this new website.
If you already use the data and would like to continue using it in FY2021, please also apply again through this website.

INTRODUCTION

ASPEC (Asian Scientific Paper Excerpt Corpus) was constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). It consists of a Japanese-English paper abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of 680K parallel sentences (ASPEC-JC). This corpus is one of the achievements of the Japanese-Chinese machine translation project, which ran in Japan from 2006 to 2010 (see this Japanese page). Before using ASPEC, please read and accept the terms of the following license agreement (English form or Japanese form).

With the increasing number of scientific papers published worldwide, there is a demand for machine translation of scientific papers. ASPEC is the first parallel corpus to focus on this. ASPEC aims to promote machine translation research in the domain of scientific papers.

DETAIL

ASPEC includes:

ASPEC-JE: Japanese-English paper abstract corpus
ASPEC-JC: Japanese-Chinese paper excerpt corpus

The numbers of sentences are as follows:

Parallel Corpus	Data Type	File Name	Number of Sentences
ASPEC-JE	TRAIN	train1.txt	1,000,000
ASPEC-JE	TRAIN	train2.txt	1,000,000
ASPEC-JE	TRAIN	train3.txt	1,008,500
ASPEC-JE	DEV	dev.txt	1,790
ASPEC-JE	DEVTEST	devtest.txt	1,784
ASPEC-JE	TEST	test.txt	1,812
ASPEC-JC	TRAIN	train.txt	672,315
ASPEC-JC	DEV	dev.txt	2,090
ASPEC-JC	DEVTEST	devtest.txt	2,148
ASPEC-JC	TEST	test.txt	2,107
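
After downloading, the counts in the table can serve as a quick integrity check. The following is a minimal Python sketch, not an official tool: it assumes each file holds one sentence pair per line and that the files sit in a hypothetical directory layout root/<corpus name>/<file name>. Neither assumption is stated on this page, and the function names are made up for illustration.

```python
from pathlib import Path

# Sentence counts copied from the table above.
EXPECTED = {
    "ASPEC-JE": {
        "train1.txt": 1_000_000,
        "train2.txt": 1_000_000,
        "train3.txt": 1_008_500,
        "dev.txt": 1_790,
        "devtest.txt": 1_784,
        "test.txt": 1_812,
    },
    "ASPEC-JC": {
        "train.txt": 672_315,
        "dev.txt": 2_090,
        "devtest.txt": 2_148,
        "test.txt": 2_107,
    },
}


def count_lines(path):
    """Count the lines in a file (assumption: one sentence pair per line)."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_corpus(root, corpus):
    """Return {file name: (expected, actual)} for files whose count differs.

    Assumes the hypothetical layout root/<corpus>/<file name>.
    """
    mismatches = {}
    for name, expected in EXPECTED[corpus].items():
        actual = count_lines(Path(root) / corpus / name)
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


def total_sentences(corpus):
    """Sum the table's sentence counts over all files of one corpus."""
    return sum(EXPECTED[corpus].values())
```

An empty dictionary from check_corpus means every file matched the table. Summing the table this way gives 3,013,886 sentences for ASPEC-JE and 678,660 for ASPEC-JC, consistent with the rounded figures of 3M and 680K quoted in the introduction.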

ASPEC-JE was constructed from Japanese-English scientific paper abstracts, which are the property of the Japan Science and Technology Agency (JST). The National Institute of Information and Communications Technology (NICT) created the 1-to-1 sentence alignments using the method of Utiyama and Isahara (MT Summit XI, 2007).

ASPEC-JC was constructed by manually translating Japanese scientific papers into Chinese. The Japanese scientific papers are either the property of the Japan Science and Technology Agency (JST) or stored in Japan's Largest Electronic Journal Platform for Academic Societies (J-STAGE). The unit of manual translation is the paragraph, and the paragraphs are selected so as to maximize the coverage of word types.

CAUTION: This page summarizes the points you should pay attention to when you use this corpus.

WORKSHOP ON ASIAN TRANSLATION

WAT (Workshop on Asian Translation), an open evaluation campaign for machine translation of scientific papers that uses ASPEC, has been held every year since 2014.

HOW TO OBTAIN

Please read the Terms of Use and click the “Agree to the Terms of Use and go to the application form” button.

Complete the application form and press the send button.
After receiving your application, JST will send the download link for the ASPEC data to your e-mail address. Please note that the link is not sent automatically, and it may take a few days for us to review your application.

AGREEMENT

Please cite the following paper when you use ASPEC.

@InProceedings{NAKAZAWA16.621,
author = {Toshiaki Nakazawa and Manabu Yaguchi and Kiyotaka Uchimoto and Masao Utiyama and Eiichiro Sumita and Sadao Kurohashi and Hitoshi Isahara},
title = {ASPEC: Asian Scientific Paper Excerpt Corpus},
booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2016)},
year = {2016},
month = {may},
date = {26-31},
address = {Portorož, Slovenia},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-9-1},
language = {english},
pages = {2204-2208}
}

CONTACT

For questions, comments, etc., please send an email to "aspec -at- jst.go.jp".

CHANGE LOG

2021/03/15
Updated the Terms of Use and application process. This website has also been moved to the JST server.

2015/02/26
Updated the WAT information.

2014/06/19
Updated the AGREEMENT to be more accommodating to researchers working for companies.

2014/03/07
Added the announcement about the special treatment for applicants up to March 2014.

2014/01/30
Added the mailing address for the original copy of the agreement.

2014/01/22
Site opened.

Last Modified: 2021-02-01