公开数据集
数据结构 ? 1.46G
Data Structure ?
* 以上分析是由系统提取分析形成的结果,具体实际数据为准。
README.md
For a detailed description of this corpus, please read:
Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, pdf.
Please cite the paper, if you use this corpus in your work. See also the extended (but earlier) version of the report (ps,pdf).
The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavik (Bulgarian, Czech, Polish, Slovak, Slovene), Finni-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
The goal of the extraction and processing was to generate sentence aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor we identified sentence boundaries. We sentence aligned the data using a tool based on the Church and Gale algorithm.
Release v7
On 15 May 2012 we released a further expanded and improved version of the corpus. Previous versions are available here. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.Changes since v6
- added 01/2011 - 11/2011 data, now up to around 60 million words per language
- further refined preprocessing, cleaning
) mark-up on a separate line. The data is stored in one file per day, and in smaller units for newer data.
Some documents have the SPEAKER tag attribute LANGUAGE which indicates what language the original speaker was using.
To use the parallel corpora with tools like GIZA++, you want to:
- tokenize the text (required)
- lowercase the text (recommended)
- strip empty lines and their correspondences (required)
- remove lines with XML-Tags (starting with "<") (required)
Size of the Corpus
Sizes for single-language data after removing XML.
Language | Sentences | Words |
---|---|---|
Bulgarian | 411,636 | - |
Czech | 668,595 | 13,195,311 |
Danish | 2,323,099 | 47,761,381 |
German | 2,176,537 | 47,236,849 |
Greek | 1,517,141 | - |
English | 2,218,201 | 53,974,751 |
Spanish | 2,123,835 | 54,806,927 |
Estonian | 692,210 | 11,358,009 |
Finnish | 2,119,515 | 33,708,706 |
French | 2,190,579 | 54,202,850 |
Hungarian | 658,824 | 12,606,986 |
Italian | 2,081,669 | 50,259,169 |
Lithuanian | 678,665 | 11,512,131 |
Latvian | 666,026 | 12,085,228 |
Dutch | 2,333,816 | 53,487,257 |
Polish | 387,490 | 7,087,016 |
Portuguese | 2,121,889 | 52,300,149 |
Romanian | 402,904 | 9,663,544 |
Slovak | 674,359 | 13,116,301 |
Slovene | 634,488 | 12,665,974 |
Swedish | 2,241,386 | 45,665,947 |
Sizes for parallel corpora after sentence aligning and removing XML.
Parallel Corpus (L1-L2) | Sentences | L1 Words | English Words |
---|---|---|---|
Bulgarian-English | 406,934 | - | 9,886,291 |
Czech-English | 646,605 | 12,999,455 | 15,625,264 |
Danish-English | 1,968,800 | 44,654,417 | 48,574,988 |
German-English | 1,920,209 | 44,548,491 | 47,818,827 |
Greek-English | 1,235,976 | - | 31,929,703 |
Spanish-English | 1,965,734 | 51,575,748 | 49,093,806 |
Estonian-English | 651,746 | 11,214,221 | 15,685,733 |
Finnish-English | 1,924,942 | 32,266,343 | 47,460,063 |
French-English | 2,007,723 | 51,388,643 | 50,196,035 |
Hungarian-English | 624,934 | 12,420,276 | 15,096,358 |
Italian-English | 1,909,115 | 47,402,927 | 49,666,692 |
Lithuanian-English | 635,146 | 11,294,690 | 15,341,983 |
Latvian-English | 637,599 | 11,928,716 | 15,411,980 |
Dutch-English | 1,997,775 | 50,602,994 | 49,469,373 |
Polish-English | 632,565 | 12,815,544 | 15,268,824 |
Portuguese-English | 1,960,407 | 49,147,826 | 49,216,896 |
Romanian-English | 399,375 | 9,628,010 | 9,710,331 |
Slovak-English | 640,715 | 12,942,434 | 15,442,233 |
Slovene-English | 623,490 | 12,525,644 | 15,021,497 |
Swedish-English | 1,862,234 | 41,508,712 | 45,703,795 |
Test Sets
Several test sets have been released for the Europarl corpus. In general, the Q4/2000 portion of the data (2000-10 to 2000-12) should be reserved for testing. All released test sets have been selected from this quarter. The shared tasks for the 2006 and 2007 ACL Workshops on Statistical Machine Translation provide test sets from the Europarl corpus.The original common test set from the Koehn/Och/Marcu ACL 2003 Paper is available in the archives.
Extended versions of these test sets are available in the evaluation Matrix of the EuroMatrix project.
Known Bugs
- Some special HTML entities and noisy characters are not removed from the data.
- Some recent Greek data has only parts of transcripts in the files.
Terms of Use
We are not aware of any copyright restrictions of the material. If you use this data in your research, please contact pkoehn@inf.ed.ac.uk. Please let us know if you find problems with the data or if you want the data for other language pairs. We recommend using the last quarter of 2000 for testing (2000-10 until 2000-12) for consistency in reporting research results on this data.Acknowledgments
The work was in part supported by the EuroMatrixPlus project funded by the European Commission (7th framework Programme).
帕依提提提温馨提示
该数据集正在整理中,为您准备了其他渠道,请您使用
- 分享你的想法
全部内容
数据使用声明:
- 1、该数据来自于互联网数据采集或服务商的提供,本平台为用户提供数据集的展示与浏览。
- 2、本平台仅作为数据集的基本信息展示、包括但不限于图像、文本、视频、音频等文件类型。
- 3、数据集基本信息来自数据原地址或数据提供方提供的信息,如数据集描述中有描述差异,请以数据原地址或服务商原地址为准。
- 1、本站中的所有数据集的版权都归属于原数据发布者或数据提供方所有。
- 1、如您需要转载本站数据,请保留原数据地址及相关版权声明。
- 1、如本站中的部分数据涉及侵权展示,请及时联系本站,我们会安排进行数据下线。