Too often, research is hindered by a lack of cooperation between academia and industry. Different goals, priorities, ways of working, and, last but not least, financial conditions make it difficult to reconcile these two worlds and work together on a single project. However, when these obstacles are overcome, synergy emerges and great results follow.
This kind of collaboration recently happened in the Polish NLP community. The ML Research team at Allegro.pl (a popular e-commerce marketplace and the third largest company on the Warsaw Stock Exchange) started work on a BERT-based model for Polish natural language understanding (NLU) as part of their NLP infrastructure. The main issue that arose was the lack of a large, diverse, and high-quality corpus that could be used to train the model. These criteria are met by the National Corpus of Polish (NKJP), which consists of texts from many different sources, such as classic literature, books, newspapers, journals, transcripts of conversations, and texts crawled from the Internet.
The R&D NKJP project was a joint initiative of four scientific institutions: Institute of Computer Science at the Polish Academy of Sciences (ICS PAS, coordinator), Institute of Polish Language at the Polish Academy of Sciences, Polish Scientific Publishers PWN, and the Department of Computational and Corpus Linguistics at the University of Łódź, and was financed by the Ministry of Science and Higher Education.
NKJP can be explored through a dedicated search engine. However, the collection of source texts is not publicly available for copyright reasons and may only be used by the four members of the consortium. Thanks to the joint work of the Allegro and ICS PAS legal teams, as well as the consent obtained from PWN, the owner of a large part of the texts, all formal obstacles to using the corpus were overcome.
The cooperation resulted in training and open-sourcing HerBERT, a BERT-based model for Polish language understanding. The conducted experiments confirmed its high performance on a set of eleven diverse linguistic tasks, with HerBERT achieving the best results on eight of them. In particular, it is the best Polish NLU model according to the KLEJ benchmark. The model and its empirical evaluation are presented in the article by Mroczkowski et al. (2021, to appear at BSNLP).
Both HerBERT Base and HerBERT Large are released under the CC BY-SA 4.0 licence as part of the transformers library. Since its appearance in the HuggingFace repository, the model has been very popular: HerBERT Base has been downloaded over 13,500 times in the last month.
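Since the models are distributed through the transformers library, loading them takes only a few lines. The sketch below assumes the Base model is published on the HuggingFace Hub under the identifier `allegro/herbert-base-cased` (the Large variant would follow the same pattern); the example sentence is illustrative.

```python
# Minimal sketch of loading HerBERT with the transformers library.
# The Hub identifier "allegro/herbert-base-cased" is an assumption here;
# check the Allegro organization page on HuggingFace for the exact names.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

# Encode a Polish sentence and obtain contextual embeddings.
inputs = tokenizer(
    "Allegro to popularna platforma e-commerce.", return_tensors="pt"
)
outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence length, hidden size).
print(outputs.last_hidden_state.shape)
```

The resulting embeddings can then be fed into a task-specific head, e.g. a classifier fine-tuned on one of the KLEJ tasks.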