The portfolio of CEF language tools was recently complemented by a tool for automatic text anonymisation (as reported in our June newsletter). But how do individual Member States deal with this issue? Let us take a closer look at Polish examples and use cases. Protection of personal data also remains critical for traditional, human translators. The grey areas in this field led to the organisation of a multinational Translating Europe Workshop, initiated by a number of Polish and international stakeholders. MAPA project representatives also took part in the work of the expert group, so that the resulting guidelines for translators could pay due attention to anonymisation techniques.
During the 1st Webinar on Anonymisation and Pseudonymisation of Judicial Decisions, organised by DG JUST in March 2021, NeuroCourt was presented: a tool that makes the contents of Polish common court judgments publicly available. Developed since 2011 by a commercial company, the tool offers accuracy above 97% at the sentence/segment level, with a human-in-the-loop approach to publish just 2% of the judgments on the portal run by the Polish Ministry of Justice. However, weaknesses persist, including the need to improve the manual work and the reliability of automatic anonymisation, as well as the limited use of contemporary AI technologies.
In 2021, OPI PIB (a scientific unit building IT systems for the Ministry of Education and Science in Poland) decided to anonymise sensitive data processed in development environments, following an analysis of the Oracle Data Safe tool conducted by its Laboratory of Databases and Business Analysis Systems. To do so, OPI PIB fully implemented services for masking test data and for auditing production data. The main elements of the project were the creation of a strict policy for the anonymisation and auditing of sensitive data, the development of appropriate algorithms for this purpose, and the identification of the places in the databases where data subject to observation are located. OPI PIB analysts selected a set of attributes and developed appropriate algorithms to ensure that data processed in the cloud would be accepted by all applications and modules in which they are processed.
In contrast to the above-mentioned use cases, built with the participation of the commercial sector and foreign companies, the EZD RP project implemented by NASK, a National Research Institute supervised by the Chancellery of the Prime Minister of the Republic of Poland, deserves attention and appreciation. The project responds to the actual business needs of the administration: its main objective is to improve the work of government administration units by building and providing modern, universal digital back-office solutions for electronic document management. To this end, the NASK AI Technology Team developed several text-based operations and functionalities, including text classification, information extraction, and anonymisation. As NASK’s tool “Anonimizator” is essentially a single-language solution, the Institute performed quality tests under the MAPA project using a multilingual solution.
The report and the positive recommendations following the comparative tests were presented to the EC by the ELRC Public Services National Anchor Point during an external review of the project in January 2022. Due to the COVID-19 pandemic, a fully on-site dissemination activity could not be organised, but the project and its deliverables were brought to the attention of IT managers in public administration at the 7th Forum of IT Managers in Administration, a major national event, in April 2022. The joint efforts of the NASK team and the European project stakeholders will hopefully mark the beginning of an increased involvement of Polish public entities in EU co-funded projects in the fields of AI and NLP, including language data resources and large language models (LLMs), in view of the recently published call for tenders on a Common European Language Data Space.