Funding: EU – CEF (Connecting Europe Facility) / Telecommunications sector

Total eligible costs: 959,999.72 EUR

Estimated CEF contribution: 719,999.79 EUR

Duration: June 2020 – May 2022 (24 months)

Coordinator: Hungarian Research Centre for Linguistics (NYTK)

The overall objective of the Curated Multilingual Language Resources for CEF AT Action is to compile curated datasets in the seven languages targeted by the consortium (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) in domains relevant to the European Digital Service Infrastructures (DSIs), with a view to enhancing CEF Automated Translation (CEF AT).

The primary sources of data are the national corpora of the above-mentioned languages. The data will cover domains relevant to some of the CEF DSIs, such as eHealth, Europeana and eGovernment in general. The Action will deliver at least 14 million sentences (estimated to contain at least 140 million words) from domains including culture, education, health and science. Moreover, the Action will address a gap in machine translation technology, which crucially depends on the provision of domain-specific quality language resources for under-resourced languages.
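
The delivery targets quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in this description (14 million sentences, 140 million words, and the 20 million words per language given under Activity 1); the even split of sentences across languages is an assumption made purely for illustration and is not stated in the Action.

```python
# Figures taken from the project description.
LANGUAGES = ["Bulgarian", "Croatian", "Hungarian", "Polish",
             "Romanian", "Slovak", "Slovenian"]
MIN_SENTENCES_TOTAL = 14_000_000
MIN_WORDS_TOTAL = 140_000_000
MIN_WORDS_PER_LANGUAGE = 20_000_000  # stated under Activity 1

# Average sentence length implied by the two overall targets.
words_per_sentence = MIN_WORDS_TOTAL / MIN_SENTENCES_TOTAL

# Overall word count implied by the per-language minimum.
words_from_language_targets = MIN_WORDS_PER_LANGUAGE * len(LANGUAGES)

# ASSUMPTION (not in the source): sentences are split evenly across languages.
sentences_per_language = MIN_SENTENCES_TOTAL // len(LANGUAGES)

print(f"Implied words per sentence: {words_per_sentence:.1f}")
print(f"Words implied by per-language target: {words_from_language_targets:,}")
print(f"Sentences per language if split evenly: {sentences_per_language:,}")
```

Under these figures the overall word target is exactly the per-language minimum multiplied by the seven languages, and the estimate corresponds to an average of about ten words per sentence.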

By delivering seven large monolingual datasets, which will themselves facilitate the improvement of the CEF Automated Translation core service platform, the Action will enable international users to access information about the relevant EU Member States, including information about local companies and investment opportunities. The Action will thus also support economic growth in Europe by supporting the CEF AT core service platform for exchanging information in multiple languages.

The Action consists of the following activities:

  1. Aggregation and data preparation

    The aim of this activity is to collect the relevant parts of the national monolingual corpora for the seven languages covered by the consortium: Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian. The final corpus will contain at least 20 million words per language.

  2. Additional collection and IPR clearance

    The objective of this activity is to clarify the legal status of the data and, if necessary, to obtain the intellectual property rights (IPR) needed to distribute the texts in their original, unedited format where possible. At the same time, the activity will identify imbalances in domain distribution across the targeted languages and collect additional text data.

  3. Anonymisation

    This activity aims at removing or anonymising all personal and sensitive data in the language resources collected in Activities 1 and 2. This ensures that personal and sensitive data are anonymised in textual data obtained from a wide range of sources.

  4. Terminology enrichment

    This activity aims at (1) enriching the documents in the seven monolingual corpora with IATE (Interactive Terminology for Europe) terms and (2) enriching the terminology of the seven monolingual corpora. It will also focus on recognising words and multiword expressions that fulfil the criteria for domain-specific terms.

  5. Metadata harmonisation

    This activity aims at harmonising the individual metadata schemes used by the seven large-scale monolingual corpora. This requires finding a set of common attributes that describe the relevant text properties, e.g. text style and text domain. Based on these attributes, specific translation models will be trained and adapted to the domains of specific translation tasks.

  6. Dissemination

    The objective of this activity is to promote the Action's results and key achievements in line with the strategy adopted in the dedicated dissemination plan.

  7. Management

    The overall objective of this activity is to ensure efficient coordination between the consortium partners and the related activities.
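
As a rough illustration of how Activities 3 to 5 might be applied to a single document, the sketch below chains three small steps: placeholder-based anonymisation, term lookup and metadata renaming. Everything in it is hypothetical: the regular expressions, the two-entry term list standing in for an IATE export, the field mapping and all function names are assumptions made for this example and do not describe the consortium's actual tools. In particular, real anonymisation would also need named-entity recognition, and exact string matching ignores inflection, which matters for the languages covered here.

```python
import re

# HYPOTHETICAL patterns: they only catch e-mail addresses and phone-like numbers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

# HYPOTHETICAL stand-in for a term list exported from IATE.
TERMS = ["public health", "clinical trial"]

# HYPOTHETICAL mapping from corpus-specific field names to common attributes.
FIELD_MAP = {
    "genre": "text_style",
    "style": "text_style",
    "topic": "text_domain",
    "domain": "text_domain",
}


def anonymise(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def find_terms(text: str, terms: list) -> list:
    """Return the listed terms found in the text (case-insensitive, whole words)."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


def harmonise_metadata(record: dict) -> dict:
    """Rename known fields to the common attribute names; drop unknown fields."""
    return {FIELD_MAP[key]: value for key, value in record.items() if key in FIELD_MAP}


def process(text: str, metadata: dict) -> dict:
    """Apply the three steps to one document and return the enriched record."""
    clean_text = anonymise(text)
    return {
        "text": clean_text,
        "terms": find_terms(clean_text, TERMS),
        "metadata": harmonise_metadata(metadata),
    }


example = process(
    "For the Clinical Trial, contact jan.kowalski@example.org or +48 22 123 45 67.",
    {"genre": "news", "topic": "health", "id": 7},
)
print(example)
```

Running anonymisation before term lookup keeps the placeholders out of the term matches. Dropping unmapped metadata fields is only one possible design choice; a real harmonisation step might instead keep them under a separate key.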