More Data Available in the Public DGT Translation Memory

2016-03-31

The 2016 update release of the DGT Translation Memory (DGT-TM) is now available for download.  

DGT-TM is an extraction of the translation memory of the European Institutions, built from the ‘Acquis Communautaire’ for all 24 official EU languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Irish, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish. It is produced by the European Commission’s Directorate General for Translation (DGT) and distributed by the Joint Research Centre (JRC).

This public data release is in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information.

Translation memories are sentences aligned with their manually produced translations. They can be fed into computer assisted translation software to support human translators in their work. As it is a large parallel corpus in electronic form, DGT-TM can furthermore be used by specialists in computational linguistics to train statistical machine translation software, to generate multilingual dictionaries, to train and test multilingual information extraction software, and more.

The ‘Acquis Communautaire’ is the entire body of European legislation, comprising all the treaties, regulations and directives adopted by the European Union (EU). Since each new country joining the EU is required to accept the whole Acquis Communautaire, this body of legislation has been translated into 23 official languages. For the 24th official EU language, Irish, the Acquis has not been translated on a regular basis, which is why DGT-TM includes less data in Irish. To build DGT-TM, the Acquis Communautaire was split into sentences and aligned automatically at sentence level. Small parts of the alignment data have been corrected by translators. The text data is accompanied by software that allows users to extract all sentences and their translations for any of the 276 possible language pair combinations.
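The figure of 276 language pairs follows from choosing 2 languages out of 24 without regard to order. A minimal sketch of that count is given here; the two-letter ISO 639-1 codes are used purely for illustration and are an assumption, not necessarily the labels used in the DGT-TM files or by the extraction software.

```python
from itertools import combinations
from math import comb

# ISO 639-1 codes for the 24 official EU languages, in the order listed above.
# Illustrative assumption: DGT-TM itself may label languages differently.
LANGUAGES = [
    "bg", "hr", "cs", "da", "nl", "en", "et", "de",
    "el", "fi", "fr", "ga", "hu", "it", "lv", "lt",
    "mt", "pl", "pt", "ro", "sk", "sl", "es", "sv",
]

# Every unordered pair of distinct languages, e.g. ("en", "fr").
pairs = list(combinations(LANGUAGES, 2))

# The enumerated count matches the closed form n * (n - 1) / 2.
n = len(LANGUAGES)
assert len(pairs) == comb(n, 2) == n * (n - 1) // 2

print(n, "languages,", len(pairs), "language pairs")
```

Pairs are counted without direction here, so English-French and French-English are the same combination.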

This year's release, DGT-TM-2016, adds 11 million translation units (roughly equivalent to sentences), or 184 million words, to the collection. This corresponds to about 480K new translation units per language, with Croatian (HR) data more than doubling in size. With this update, a total of 100 million translation units are now available for download, equivalent to 1.65 billion words. It follows the previous releases DGT-TM (2007), DGT-TM-2011, DGT-TM-2012, DGT-TM-2013, DGT-TM-2014 and DGT-TM-2015.

More language technology resources have been released on the JRC web site.

DGT-TM is also accessible through the EU open data portal.