Information access during the COVID-19 pandemic is hampered by the volume and uneven reliability of information, as well as by the many languages in which it is provided. Language technologies can help. The COVID-19 MLIA initiative, endorsed by the European Commission's DG CNECT and coordinated by the University of Padua and ELRA/ELDA, was launched in June 2020 to improve Multilingual Information Access in this specific context.
The 1st round of evaluation was completed in early 2021. The initiative has attracted considerable interest: 14 teams from 10 countries submitted runs for the 3 tasks. Many more teams had registered and are expected to join for rounds 2 and 3.
- For Task 1, Information Extraction, 4 teams took part in the 1st round (2 companies and 2 academic institutions). The languages covered were English, German, Modern Greek, Italian and Spanish. The main objective of this task is to identify relevant medical information in texts related to COVID-19.
- For Task 2, Multilingual Semantic Search, 4 academic participants submitted runs, covering English, French, German, Italian, Modern Greek, Spanish, Swedish and Ukrainian for both the monolingual and bilingual runs. The goal of the Multilingual Semantic Search task is to collect relevant information for the community, the general public and other stakeholders when they search for health content in different languages and with different levels of knowledge about the specific topic.
- For Task 3, Machine Translation, 8 teams, including eTranslation, took part in this round, covering translation from English into each of the following languages: German, French, Spanish, Italian, Modern Greek and Swedish. The goal of the Machine Translation task is to assess the ability of MT systems to translate texts related to COVID-19, which contain new terms and expressions.
Within the Data Acquisition task, data collection was carried out in 2 parts.
For Machine Translation, the parallel data was built from well-known web sources in the domain of Health and Medicine, and enriched with identified COVID-19 datasets. The size of the resulting corpora ranges from 810K to 1.1M sentence pairs depending on the language pair (English to German, French, Spanish, Italian, Modern Greek and Swedish). The processed language resources have been cleared and will be progressively made available as an evaluation package from the ELRC-Share repository.
For Information Extraction and Multilingual Semantic Search, the Europe Media Monitor (EMM) system developed by the European Commission Joint Research Centre was used and tuned to collect metadata automatically extracted from news articles related to COVID-19. This set of metadata is available as the 2020 Medisys COVID-19 Dataset on the Open Data Portal.
The second round will run in March and April 2021. For this round, the initiative is also looking into adding new topics, improving language coverage by including more less-resourced EU languages, and fostering cross-fertilization between the tasks.
Finally, similar initiatives tackling the information access issue are being conducted throughout the world. The CORD-19 dataset, a collection of health-related literature drawn from biomedical data sources with the support of the WHO (World Health Organization), is one of them. The TREC-COVID evaluation of search systems, run by NIST (National Institute of Standards and Technology) and using the CORD-19 documents, is another.
COVID-19 MLIA is supported by:
- University of Padua and ELRA/ELDA
- LISN (formerly LIMSI) for the Information Extraction Task
- University of Padua and CLARIN ERIC for the Multilingual Semantic Search Task
- Universitat Politècnica de València and Pangeanic for the Machine Translation Task
- ILSP and JRC for Data Acquisition and Engineering
All the resources produced during the evaluation rounds are available on the git repositories of the initiative, under the CC-BY-SA 4.0 license.