Entity-Aware Machine Translation Leaderboard

Overview

This leaderboard showcases the performance of various systems on the EA-MT shared task, which has been organized as part of the SemEval 2025 workshop.

  • The results are still provisional and subject to change.

Task Description

The task is to translate a given input sentence from the source language (English) into the target language. Each input sentence contains named entities that may be challenging for machine translation systems to handle: entities that are rare, ambiguous, or unknown to the system. Participants must develop machine translation systems that accurately translate these named entities into the target language.

Scoring

The leaderboard is based on three main scores:

  • M-ETA Score: A score that evaluates the translation quality of named entities in the input sentence.
  • COMET Score: A score that evaluates the translation quality at the sentence level.
  • Overall Score: The harmonic mean of the M-ETA and COMET scores.
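The combination of the two metrics can be sketched as a small helper. This is a minimal illustration of a harmonic mean, assuming both scores are reported on the same scale (as on this leaderboard, 0–100); the function name `overall_score` is ours, not part of the official evaluation code.

```python
def overall_score(m_eta: float, comet: float) -> float:
    """Harmonic mean of the M-ETA and COMET scores.

    The harmonic mean is dominated by the lower of the two values,
    so a system must do well on BOTH entity-level translation (M-ETA)
    and sentence-level quality (COMET) to get a high overall score.
    Assumes both inputs are on the same scale (e.g., 0-100).
    """
    if m_eta <= 0 or comet <= 0:
        return 0.0
    return 2 * m_eta * comet / (m_eta + comet)
```

For example, a system with a perfect COMET score but an M-ETA score of 50 gets an overall score of about 66.7, not 75: the harmonic mean penalizes the imbalance more than an arithmetic mean would.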

Legend

  • 🟠: Uses gold data, i.e., the gold Wikidata ID or information derived from it, at test time.
  • 🔍: Uses RAG (Retrieval-Augmented Generation) for named entity translation.
  • 🤖: Uses an LLM (Large Language Model) for named entity translation.
  • 📚: The system (LLM and/or MT model) is finetuned on additional data.

Filters and Controls

Use the dropdowns and checkboxes to filter the leaderboard scores by:

  • Team Name
  • System Name
  • LLM Name

Leaderboard Scores

You can view the leaderboard scores for each system using the three metrics described above (M-ETA, COMET, and Overall). Switch between the tabs to view the scores for each metric.

Note: You can sort the leaderboard by clicking on the column headers. For example, click on the "it_IT" column to sort by the Italian language scores.

Overall Score Leaderboard

| Rank | Team | System | Uses Gold | Uses RAG | Uses LLM | LLM Name | Finetuned | ar_AE | de_DE | es_ES | fr_FR | it_IT | ja_JP | ko_KR | th_TH | tr_TR | zh_TW | overall |
|------|------|--------|-----------|----------|----------|----------|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|---------|
| 10 | The Five Forbidden Entities | LoRA-nllb-distilled-200-distilled-600M | 🟠 | 🔍 | 🤖 | Llama-3.3-70B-Instruct + DeepSeek-R1 | 📚 | 92.68 | 90.03 | 92.54 | 92.92 | 94.39 | 93.34 | 92.77 | 92.35 | 89.54 | 87.36 | 91.79 |