This dataset is a cleaned version of JESC, produced by correcting misspelled English words and applying word segmentation with:
English => spaCy English Tokenizer
Japanese => Janome Tokenizer
Dataset | Phrase Pairs |
train | 2371921 |
test | 1992 |
dev | 1992 |
The dataset is in .tsv format, so you can read it by splitting each line on the tab character ('\t').
# Python Code
import pandas as pd
df = pd.read_csv('./train', sep='\t')
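With roughly 2.4 million training pairs, it can be convenient to stream the file instead of loading it all at once. The sketch below does this with only the standard library. It is not part of the original dataset tooling: the helper name `iter_pairs` is hypothetical, and it assumes each line holds exactly two tab-separated fields and that the file has no header row. Quote handling is turned off because subtitle text may contain unbalanced double quotes.

```python
import csv
from typing import Iterator, Tuple


def iter_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (first_column, second_column) tuples from a two-column TSV file.

    Assumption: one phrase pair per line, no header row.
    """
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps stray double quotes in subtitles as literal text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip blank or malformed lines rather than failing on them.
            if len(row) != 2:
                continue
            yield row[0], row[1]
```

For example, `for first, second in iter_pairs('./train'): ...` walks the training split one pair at a time. Which column holds English and which holds Japanese should be checked against the files themselves.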
JESC aims to support the research and development of machine translation systems, information extraction, and other language processing techniques.
JESC is the product of a collaboration between Stanford University, Google Brain, and Rakuten Institute of Technology. It was created by crawling the internet for movie and TV subtitles and aligning their captions. It is one of the largest freely available EN-JA corpora, and covers the poorly represented domain of colloquial language.
You can download the scripts, tools, and crawlers used to create this dataset on GitHub.
You can read the paper here.
A large corpus consisting of 2.8 million sentences.
Translations of casual language, colloquialisms, expository writing, and narrative discourse. These are domains that are hard to find in JA-EN MT.
Pre-processed data, including tokenized train/dev/test splits.
Code for making your own crawled datasets and tools for manipulating MT data.
Many thanks to Stanford University, Google Brain, and Rakuten Institute of Technology, and especially to the authors: Pryzant R., Chung Y., Jurafsky D., and Britz D.
Official Site
Japanese-English Subtitle Corpus
@ARTICLE{pryzant_jesc_2018,
  author = {{Pryzant}, R. and {Chung}, Y. and {Jurafsky}, D. and {Britz}, D.},
  title = "{JESC: Japanese-English Subtitle Corpus}",
  journal = {Language Resources and evaluation Conference (LREC)},
  keywords = {Computer Science - Computation and Language},
  year = 2018
}
