Getting stuck on a problem used to mean searching documentation or forums. Today, it often means opening an LLM. This works surprisingly well. But many real-world problems are not conversational: we may need to classify thousands of documents, cluster articles by topic, or build a retrieval system for Slovene legal texts. In these cases, the largest available model is often not the best choice. Smaller embedding models can be faster, cheaper and, when chosen correctly, more accurate.
The problem? Choosing the right embedding model for Slovene is difficult.
That is why we built Lestvica embeddingov za slovenščino (LES, roughly "embedding leaderboard for Slovene"), a Slovene-focused embedding benchmark based on the MTEB evaluation framework. LES evaluates embedding models across classification, clustering and retrieval tasks, using exclusively Slovene datasets. This allows us to measure how well different models capture semantic similarity in Slovene text.
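To make "semantic similarity" concrete: an embedding model maps a text to a vector, and two texts are considered similar when their vectors point in a similar direction, typically scored with cosine similarity. The sketch below illustrates the idea in plain Python with made-up three-dimensional vectors (real embedding models produce hundreds or thousands of dimensions, and the specific numbers here are purely illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Score the similarity of two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" standing in for a query and two documents.
query = [0.9, 0.1, 0.0]
doc_relevant = [0.8, 0.2, 0.1]   # points roughly the same way as the query
doc_unrelated = [0.0, 0.1, 0.9]  # points in a very different direction

print(cosine_similarity(query, doc_relevant))   # high score, near 1.0
print(cosine_similarity(query, doc_unrelated))  # low score, near 0.0
```

A retrieval task in the benchmark boils down to exactly this kind of comparison at scale: if a model's embeddings place relevant Slovene documents closer to the query than irrelevant ones, it scores well.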
In this first post, we introduce the LES benchmark and the datasets used in its evaluation. In upcoming posts, we will present the initial results, compare model performance on Slovene tasks, and discuss what drives these differences.