In our previous post we introduced the LES benchmark and described the datasets it contains. If you haven't read it yet, you can find it here.
LES: What the Benchmark Results Actually Tell You
In this blog post we explain why single-language benchmarks matter and how to interpret the metrics used in LES, starting with a small but surprisingly important detail: how models tokenize Slovene text.
Why Slovene Is a Special Case
Slovene is a morphologically rich language with limited representation in large multilingual corpora. Many embedding models were not designed with Slovene as a priority language. One of the biggest hidden issues is tokenization. Tokenizers determine how text is split into subword units before it is processed. If a tokenizer splits Slovene words into tiny fragments, sometimes nearly character-by-character, the model starts from a weaker representation [1]. And weaker representations usually lead to weaker downstream performance.
Let's look at a simple example.
A Simple Sentence, Three Languages
Original sentence:
"Today is a beautiful and sunny day."
Slovene translation:
"Danes je lep in sončen dan."
Spanish translation:
"Hoy es un día muy lindo y soleado."
What Happens During Tokenization?
Now we tokenize these sentences using different embedding models.
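A comparison like the one below can be reproduced with a short script. This is a minimal sketch, assuming the Hugging Face `transformers` library (plus `sentencepiece` for the SentencePiece-based tokenizers) is installed; the exact token strings may vary slightly between tokenizer versions.

```python
# Minimal sketch: inspect how different tokenizers split the same sentences.
# Requires `transformers` (and `sentencepiece`); output may vary by version.
from transformers import AutoTokenizer

sentences = [
    "Today is a beautiful and sunny day.",
    "Hoy es un día muy lindo y soleado.",
    "Danes je lep in sončen dan.",
]

model_names = [
    "intfloat/multilingual-e5-small",
    "EMBEDDIA/sloberta",
    "sentence-transformers/all-MiniLM-L6-v2",
    "DeepPavlov/rubert-base-cased-sentence",
]

for name in model_names:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(f"\n{name}")
    for sentence in sentences:
        # tokenize() returns the subword pieces without special tokens
        print(f"  {sentence!r} -> {tokenizer.tokenize(sentence)}")
```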
intfloat/multilingual-e5-small
| # | Original Sentence | Tokens |
|---|---|---|
| 1 | Today is a beautiful and sunny day. | ▁Today▁is▁a▁beautiful▁and▁sunny▁day. |
| 2 | Hoy es un día muy lindo y soleado. | ▁Hoy▁es▁un▁día▁muy▁lindo▁y▁soleado. |
| 3 | Danes je lep in sončen dan. | ▁Danes▁je▁lep▁in▁sončen▁dan. |
Clean splits. Mostly word-level units.
EMBEDDIA/SloBERTa
| # | Original Sentence | Tokens |
|---|---|---|
| 1 | Today is a beautiful and sunny day. | ▁Today▁is▁a▁beautiful▁and▁sunny▁day. |
| 2 | Hoy es un día muy lindo y soleado. | ▁Hoy▁es▁un▁día▁muy▁lindo▁y▁soleado. |
| 3 | Danes je lep in sončen dan. | ▁Danes▁je▁lep▁in▁sončen▁dan. |
The Slovene model handles Slovene very well, but struggles heavily with English and Spanish.
sentence-transformers/all-MiniLM-L6-v2
| # | Original Sentence | Tokens |
|---|---|---|
| 1 | Today is a beautiful and sunny day. | todayisabeautifulandsunnyday. |
| 2 | Hoy es un día muy lindo y soleado. | ho##yesundiamu##ylin##doysole##ado. |
| 3 | Danes je lep in sončen dan. | danesjele##pinson##cendan. |
The model handles English well and Slovene reasonably well, but struggles somewhat with Spanish.
DeepPavlov/rubert-base-cased-sentence
| # | Original Sentence | Tokens |
|---|---|---|
| 1 | Today is a beautiful and sunny day. | Todayisabe##aut##ifulandsu##n##nyday. |
| 2 | Hoy es un día muy lindo y soleado. | Ho##yesund##íamu##yli##nd##oysol##ead##o. |
| 3 | Danes je lep in sončen dan. | Dan##esjele##pinson##č##endan. |
Notice how frequently the Slovene words are broken into small fragments. The model does the same for Spanish, while English fares best of the three. When words are split into many small fragments, the model must reconstruct meaning from less informative pieces, which often leads to weaker semantic representations.
Does Tokenization Really Matter?
If tokenization really affects the quality of representations, those differences should eventually show up in downstream performance. Benchmark results give us a way to test that hypothesis.
| Model | Overall score (%, averaged across tasks) |
|---|---|
| intfloat/multilingual-e5-small | 51.56 |
| EMBEDDIA/sloberta | 28.29 |
| sentence-transformers/all-MiniLM-L6-v2 | 25.67 |
| DeepPavlov/rubert-base-cased-sentence | 18.64 |
The model with the most fragmented Slovene tokenization performs worst overall. But here is the important nuance:
- Good tokenization does not guarantee good performance.
- But poor tokenization almost always limits it.

Good tokenization is necessary, but not sufficient. This is exactly the kind of issue that language-specific benchmarks like LES help reveal.
The Trap of "Overall Best"
At first glance the leaderboard suggests a simple conclusion: pick the model with the highest overall score. But the overall number hides an important detail; it averages performance across very different tasks. When we break the results down by task type, a different picture emerges.
| Model | Overall | Classification | Clustering | Retrieval |
|---|---|---|---|---|
| intfloat/multilingual-e5-small | 50.41 | 47.64 | 20.85 | 82.72 |
| EMBEDDIA/sloberta | 28.29 | 49.81 | 18.55 | 16.52 |
| sentence-transformers/all-MiniLM-L6-v2 | 25.67 | 32.66 | 14.60 | 29.76 |
| DeepPavlov/rubert-base-cased-sentence | 18.64 | 32.55 | 14.94 | 8.43 |
Surprise:
- SloBERTa wins in classification
- E5 dominates in retrieval
- MiniLM outperforms SloBERTa in retrieval
The "best" model therefore depends entirely on the task you care about. If you are building a Slovene sentiment classifier, a monolingual model might be ideal. If you are building a retrieval system for RAG, contrastive models clearly win.
Why Do These Differences Exist?
The differences in the table are not random. They reflect how each model was originally trained. Training objectives shape what kinds of relationships the embedding space learns to represent, which in turn affects how well the model performs on different tasks.
Contrastive Models (E5)
Models such as E5 are trained with a contrastive objective that explicitly pulls related texts closer together in embedding space [3]. This makes them naturally strong at retrieval.
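As a rough sketch, contrastive objectives of this kind take an InfoNCE-style form: the model is rewarded for scoring a query $q$ closer to its related document $d^{+}$ than to a set of negatives $\mathcal{N}$, with $\mathrm{sim}$ a similarity function (typically cosine) and $\tau$ a temperature. The exact choice of negatives, similarity, and prefixes in E5 follows the technical report [3].

$$\mathcal{L} = -\log \frac{\exp\big(\mathrm{sim}(q, d^{+}) / \tau\big)}{\sum_{d \in \{d^{+}\} \cup \mathcal{N}} \exp\big(\mathrm{sim}(q, d) / \tau\big)}$$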
Masked Language Models (SloBERTa)
SloBERTa is trained with masked language modeling [2]. It learns internal linguistic structure but is not optimized for embedding alignment between queries and documents.
The training objective shapes the geometry of the embedding space, which determines how well related texts are placed near each other.
What LES Actually Evaluates
Understanding these differences requires evaluating models across multiple tasks and domains. This is exactly what LES was designed to do.
All models are evaluated in a zero-shot setting across:
- 7 classification datasets
- 7 clustering datasets
- 5 retrieval datasets
Domains include:
- Hate speech detection
- Sentiment analysis
- Parliamentary speech
- Academic theses
- News articles
- Wikipedia QA
- Legal documents
Task categories tell us what a model is evaluated on. The metrics determine how that performance is measured.
Depending on your application, different metrics may matter more than the overall score shown on the leaderboard. Final rankings in LES are computed by averaging metric scores per task and then averaging across tasks. However, understanding the underlying metrics is crucial if you want to choose the right model for your specific use case.
Classification
Classification evaluates how well embeddings separate categories such as sentiment, genre, hate speech, or academic field. All texts are first embedded. A logistic regression classifier (scikit-learn) is then trained on top of those embeddings. The classifier itself is simple, which ensures we are measuring the quality of the embeddings, not the sophistication of a downstream model.
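A minimal sketch of that setup is shown below, with placeholder Slovene texts and labels; the actual LES pipeline may differ in details such as splits, preprocessing, and classifier hyperparameters.

```python
# Sketch: embed texts, then train a simple logistic regression on top.
# Texts and labels below are placeholders, not an LES dataset.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

model = SentenceTransformer("intfloat/multilingual-e5-small")

train_texts = ["Film je bil odličen.", "To je bila izguba časa."]   # placeholder examples
train_labels = [1, 0]                                               # 1 = positive, 0 = negative
test_texts = ["Res lep večer, vse pohvale."]
test_labels = [1]

X_train = model.encode(train_texts)   # the embeddings are the only features
X_test = model.encode(test_texts)

clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
pred = clf.predict(X_test)

print("accuracy:", accuracy_score(test_labels, pred))
print("F1:", f1_score(test_labels, pred))
```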
Accuracy: The Big Picture
The primary metric is accuracy:

$$\mathrm{Accuracy} = \frac{\text{correct predictions}}{\text{all predictions}} = \frac{TP + TN}{TP + TN + FP + FN} \quad \text{(binary case)}$$

Accuracy measures the proportion of correct predictions. It treats false positives (FP) and false negatives (FN) equally. In balanced datasets, this is often sufficient. But many real-world datasets are not balanced, and this is where accuracy can be misleading: on a dataset where 95% of examples belong to one class, a classifier that always predicts that class already reaches 95% accuracy.
Precision and Recall: Understanding Mistakes
To better understand model behavior, we also report precision and recall.

Precision:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Precision answers: When the model predicts a class, how often is it correct? High precision means few false positives.

Recall:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

Recall answers: Of all the true instances of a class, how many did the model find? High recall means few false negatives.
These two metrics matter depending on your risk tolerance:
- Hate speech detection may prioritize recall (do not miss harmful content).
- Academic classification may prioritize precision (avoid incorrect labeling).
F1 Score: Balancing the Trade-Off
To balance precision and recall, we use the F1 score:

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

F1 penalizes models that are strong in one metric but weak in the other.
Clustering
Clustering evaluates something slightly different: Can embeddings naturally group similar texts together, without labels?
Here we reuse the classification datasets. Embeddings are grouped using KMeans, where the number of clusters equals the number of underlying classes. The evaluation metric is V-measure:

$$V_\beta = \frac{(1 + \beta) \cdot h \cdot c}{\beta \cdot h + c}$$

Where:

- Homogeneity ($h$) means each cluster contains only one class.
- Completeness ($c$) means all members of a class are assigned to the same cluster.

We set $\beta = 1$, meaning both are equally important.
Why does this matter? A model could trivially achieve perfect homogeneity by creating one cluster per data point. Or perfect completeness by placing everything in a single cluster. V-measure prevents these degenerate solutions by balancing both properties. Clustering reveals whether the embedding space itself has meaningful structure, even without supervision.
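A minimal sketch of the clustering setup, with placeholder texts and labels (the actual LES configuration may differ):

```python
# Sketch: embed texts, cluster with KMeans (one cluster per class),
# then score the clustering against gold labels with V-measure.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

model = SentenceTransformer("intfloat/multilingual-e5-small")

texts = [
    "Vlada je sprejela nov proračun.",        # politics (placeholder)
    "Nogometaši so osvojili prvenstvo.",      # sport (placeholder)
    "Parlament je potrdil zakon o davkih.",   # politics (placeholder)
    "Košarkarji so izgubili finale.",         # sport (placeholder)
]
true_labels = [0, 1, 0, 1]

embeddings = model.encode(texts)
n_clusters = len(set(true_labels))            # number of clusters = number of classes
pred_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)

print("V-measure:", v_measure_score(true_labels, pred_labels))
```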
Retrieval
Retrieval is arguably the most practically important task, especially in the era of Retrieval-Augmented Generation (RAG). Here the question is: Given a query, can the embedding space retrieve the most relevant document?
The retrieval datasets pair questions with answers and abstracts with their corresponding body texts. We embed both queries and documents and rank the documents by cosine similarity. But evaluating ranking quality requires more nuanced metrics.
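A minimal sketch of that setup, with a placeholder query and documents (E5-style models expect "query: " and "passage: " prefixes, per the E5 report [3]):

```python
# Sketch: embed a query and candidate documents, then rank the documents
# by cosine similarity. Texts are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

queries = ["query: Kdo je napisal Krst pri Savici?"]
documents = [
    "passage: Krst pri Savici je pesnitev Franceta Prešerna iz leta 1836.",
    "passage: Ljubljana je glavno mesto Slovenije.",
    "passage: Triglav je najvišja gora v Sloveniji.",
]

# With L2-normalized embeddings, the dot product equals cosine similarity.
q_emb = model.encode(queries, normalize_embeddings=True)
d_emb = model.encode(documents, normalize_embeddings=True)

scores = q_emb @ d_emb.T                  # shape: (n_queries, n_documents)
ranking = np.argsort(-scores, axis=1)     # document indices, best first
print(ranking[0], scores[0][ranking[0]])
```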
nDCG: Ranking Quality Matters
The primary metric is nDCG (Normalized Discounted Cumulative Gain):

$$\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}, \qquad \mathrm{nDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}$$

where $rel_i$ is the relevance of the document at rank $i$ and IDCG@k is the DCG of the ideal ranking. nDCG rewards models for placing highly relevant documents near the top of the ranking and penalizes relevant documents that appear lower in the list. In retrieval systems, ranking order is critical: a relevant document at position 1 is far more valuable than one at position 20.
MRR: How Soon Do We Get the Right Answer?
Mean Reciprocal Rank (MRR):

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$

where $\mathrm{rank}_i$ is the position of the first relevant document for query $i$. MRR focuses on the rank of the first correct result. If the correct document appears at position 1, the query receives the maximum score. If it appears at position 5, the contribution is smaller.
This metric is especially relevant for:
- Question answering systems
- Search interfaces
- RAG pipelines
MAP: Consistency Across All Results
Mean Average Precision (MAP) evaluates how well a model retrieves and ranks relevant documents across a set of queries:

$$\mathrm{AP}(q) = \frac{1}{|R_q|} \sum_{k=1}^{n} P(k) \cdot \mathrm{rel}(k), \qquad \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q)$$

where $R_q$ is the set of relevant documents for query $q$, $P(k)$ is the precision at cutoff $k$, and $\mathrm{rel}(k)$ indicates whether the document at rank $k$ is relevant. Unlike MRR, which only looks at the first correct answer, MAP rewards systems that find multiple relevant documents and rank them consistently near the top.
A high MAP score indicates the system:
- Ranks relevant documents early in the list
- Retrieves a large portion of relevant items
- Maintains strong performance across the entire query set
This metric is especially useful when there are multiple correct answers per query.
Why Multiple Metrics Matter
No single metric captures the full picture.
- Accuracy shows overall correctness.
- Precision and recall reveal error types.
- F1 balances trade-offs.
- V-measure evaluates structural quality.
- nDCG measures ranking strength.
- MRR captures early precision.
- MAP measures consistency.
Depending on your application, you might prioritize one over the others. That is why LES does not reduce performance to a single number, even if the leaderboard displays one. The details matter.
Final Thought
Embedding models are not interchangeable. Tokenizer design, training objectives, and language representation all shape downstream behavior. For smaller languages like Slovene, evaluation at the language level is not optional—it is necessary. LES is a step toward making that evaluation visible.
Try LES Yourself
Have a dataset we should include or a model to test? Contact us at info@valira.ai
References
- [1] Petrov, A., La Malfa, E., Torr, P. H. S., & Bibi, A. (2023). Language model tokenizers introduce unfairness between languages. Advances in Neural Information Processing Systems, 36. http://arxiv.org/abs/2305.15425
- [2] Ulčar, M., & Robnik-Šikonja, M. (2021). SloBERTa: Slovene monolingual large pretrained masked language model. In Information Society 2021 Ljubljana, Slovenia. https://aile3.ijs.si/dunja/SiKDD2021/Papers/Ulcar+Robnik.pdf
- [3] Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., & Wei, F. (2024). Multilingual E5 text embeddings: A technical report. arXiv. https://arxiv.org/abs/2402.05672