Critical Questions Generation Leaderboard

Task

Critical Questions Generation is the task of automatically generating questions that can unmask the assumptions held by the premises of an argumentative text.

This leaderboard benchmarks the capacity of language technology systems to create Critical Questions (CQs), that is, questions that should be asked in order to judge whether an argument is acceptable or fallacious.

The task consists of generating 3 Useful Critical Questions per argumentative text.

All details on the task, the dataset, and the evaluation can be found in the paper "Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models" or in the Shared Task.

Data

The CQs-Gen dataset gathers 220 interventions from real debates, divided into two sets:

validation: contains 186 interventions and can be used for training or validation, as it has ~25 reference questions per intervention already evaluated according to their usefulness (Useful, Unhelpful, or Invalid).
test: contains 34 interventions. The reference questions of this set (~70) are kept private to avoid data contamination. The questions generated for the test set are what should be submitted to this leaderboard.

Evaluation

Each of the 3 newly generated questions is evaluated by comparing it to the reference questions of the test set using Semantic Text Similarity. The question inherits the label of the most similar reference, provided the similarity passes the threshold of 0.65. Questions for which no reference passes the threshold are considered Invalid. See the evaluation function here, or find more details in the paper.
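The labelling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation function: `toy_similarity` is a hypothetical stand-in based on character overlap (the real evaluation uses a Semantic Text Similarity model), the names `label_question` and `references` are invented for this example, and treating a score exactly equal to 0.65 as a match is an assumption.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

THRESHOLD = 0.65  # similarity threshold stated in the Evaluation section


def toy_similarity(a: str, b: str) -> float:
    """Character-overlap score in [0, 1]; a stand-in for a real STS model (assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def label_question(
    generated: str,
    references: List[Tuple[str, str]],  # pairs of (reference question, label)
    similarity: Callable[[str, str], float] = toy_similarity,
    threshold: float = THRESHOLD,
) -> str:
    """Inherit the label of the most similar reference, or return 'Invalid'."""
    best_label, best_score = "Invalid", 0.0
    for ref_text, ref_label in references:
        score = similarity(generated, ref_text)
        if score > best_score:
            best_label, best_score = ref_label, score
    # Assumption: a score equal to the threshold counts as a match.
    return best_label if best_score >= threshold else "Invalid"


# Hypothetical reference questions, for illustration only.
references = [
    ("What evidence shows that most new jobs come from small business?", "Useful"),
    ("Who is the speaker?", "Unhelpful"),
]

# An identical question has similarity 1.0, so it inherits the reference label.
print(label_question(references[0][0], references))
```

Passing the similarity function as a parameter keeps the thresholding logic separate from the model, so a sentence-embedding scorer could be swapped in without changing `label_question`.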

Leaderboard


Winner of the CQs-Gen 2025 Shared Task	Deepseek-r1 & Gemma2-27b	Midhun Kanadan	81.373	2025-09-15	10.8


gemma2-27b-mistral-7b_modernbert_reranking_prompt16_test	Gemma 2 & Mistral	Midhun Kanadan	81.373	2026-01-14	4.9
deepseek-r1-32b-gemma2-27b-reranking-modernbert	Deepseek-r1 & Gemma2-27b	Midhun Kanadan	75.49	2025-11-17	3.9
agentic-phi4-14b-gpt-4o_test	Phi-4 & GPT-4o	Midhun Kanadan	70.588	2025-10-08	5.9
Winner of the CQs-Gen 2025 Shared Task	GPT-4	ELLIS Alicante	67.647	2025-09-25	0
gemma2-27b_prompt16_zeoshot	Gemma 2	Midhun Kanadan	65.686	2025-10-01	4.9
gpt-4.1_prompt15_test_zero-shot	GPT-4.1	Midhun Kanadan	64.706	2025-10-27	2.9
mistral-7b_prompt13_test_zeroshot	Mistral	Midhun Kanadan	62.745	2025-10-31	2.9
claude-sonnet-with-prompt-tuned	claude-sonnet		60.784	2026-03-10	8.8
gemma2-27b_gemma2-9b_reranking	Gemma 2	Midhun Kanadan	57.843	2025-10-07	10.8
claude-3-5-sonnet-20241022 zero-shot	Claude 3.5 Sonnet	Original Paper	55.882	2025-09-26	8.8
multi-candidate-reranker	claude, gemini, gpt		55.882	2026-03-10	3.9
claude-baseline	claude		54.902	2026-03-09	10.8
GPT-4o zero-shot	GPT-4	Original Paper	53.922	2025-09-26	13.7
Meta-Llama-3-70B-Instruct zero-shot	Llama-3	Original Paper	53.873	2025-09-15	2.9
gemma-2-27b-it zero-shot	Gemma-2	Original Paper	53.272	2025-09-15	9.8
gemma-2-9b-it	Gemma 2	Original Paper	52.941	2025-09-26	7.8
o4-mini zero-shot	GPT-4	Original Paper	51.961	2025-09-26	9.8
gemma-3-27b-it zero-shot	Gemma 3	HiTZ	51.961	2025-10-01	10.8
cqsgen-qwen14b-refine-pipeline-rag	Qwen3	Vicomtech	50.98	2025-11-12	3.9
cqsgen-qwen14b-refine-pipeline-rag-prompt2	Qwen3	Vicomtech	50	2025-11-12	3.9
gemma-3-12b-it	Gemma 3	HiTZ	49.02	2025-10-01	17.6
claude-baseline-sonnet	claude		49.02	2026-03-09	12.7
Meta-Llama-3-8B-Instruct zero-shot	Llama 3	Original Paper	49.02	2025-09-26	10.8
Qwen2.5-VL-7B-Instruct zero-shot	Qwen 2.5	Original Paper	48.039	2025-09-26	15.7
DeepSeek-R1-Distill-Llama-70B zero-shot	DeepSeek	Original Paper	44.444	2025-09-26	13.5
cqsgen-qwen14b-refine-pipeline	Qwen3	Vicomtech	42.157	2025-11-12	12.7
DeepSeek-R1-Distill-Llama-8B zero-shot	DeepSeek-R1	Original Paper	36.185	2025-09-15	17.7

Submissions

Results can be submitted for the test set only.

We expect submissions to be JSON files with the following format:

{
    "CLINTON_1_1": {
        "intervention_id": "CLINTON_1_1",
        "intervention": "CLINTON: \"The central question in this election is really what kind of country we want to be and what kind of future we 'll build together\nToday is my granddaughter 's second birthday\nI think about this a lot\nwe have to build an economy that works for everyone , not just those at the top\nwe need new jobs , good jobs , with rising incomes\nI want us to invest in you\nI want us to invest in your future\njobs in infrastructure , in advanced manufacturing , innovation and technology , clean , renewable energy , and small business\nmost of the new jobs will come from small business\nWe also have to make the economy fairer\nThat starts with raising the national minimum wage and also guarantee , finally , equal pay for women 's work\nI also want to see more companies do profit-sharing\"",
        "dataset": "US2016",
        "cqs": [
            {
                "id": 0,
                "cq": "What does the author mean by \"build an economy that works for everyone, not just those at the top\"?"
            },
            {
                "id": 1,
                "cq": "What is the author's definition of \"new jobs\" and \"good jobs\"?"
            },
            {
                "id": 2,
                "cq": "How will the author's plan to \"make the economy fairer\" benefit the working class?"
            }
        ]
    },
...
}
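Before uploading, it may help to check the file's structure and to let a JSON library handle the escaping of inner quotes and line breaks. The sketch below is an assumption-laden helper, not part of the leaderboard: `validate_submission` and its checks are inferred from the example entry above (the leaderboard may accept or reject files on other grounds), and the intervention text is shortened for brevity.

```python
import json
from typing import Any, Dict, List

# Field names taken from the example entry above.
REQUIRED_KEYS = {"intervention_id", "intervention", "dataset", "cqs"}


def validate_submission(submission: Dict[str, Any], n_questions: int = 3) -> List[str]:
    """Return a list of problems; an empty list means the structure looks fine."""
    problems: List[str] = []
    for key, entry in submission.items():
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"{key}: missing fields {sorted(missing)}")
            continue
        if entry["intervention_id"] != key:
            problems.append(f"{key}: intervention_id does not match the top-level key")
        cqs = entry["cqs"]
        if len(cqs) != n_questions:
            problems.append(f"{key}: expected {n_questions} questions, found {len(cqs)}")
        for position, cq in enumerate(cqs):
            if cq.get("id") != position:
                problems.append(f"{key}: question {position} has id {cq.get('id')!r}")
            if not str(cq.get("cq", "")).strip():
                problems.append(f"{key}: question {position} is empty")
    return problems


# Shortened version of the example entry (intervention text abbreviated).
example = {
    "CLINTON_1_1": {
        "intervention_id": "CLINTON_1_1",
        "intervention": 'CLINTON: "I also want to see more companies do profit-sharing"',
        "dataset": "US2016",
        "cqs": [
            {"id": 0, "cq": 'What does the author mean by "build an economy that works for everyone, not just those at the top"?'},
            {"id": 1, "cq": 'What is the author\'s definition of "new jobs" and "good jobs"?'},
            {"id": 2, "cq": 'How will the author\'s plan to "make the economy fairer" benefit the working class?'},
        ],
    }
}

problems = validate_submission(example)

# json.dumps escapes inner quotes and newlines, which hand-written files often get wrong.
serialized = json.dumps(example, indent=4, ensure_ascii=False)
```

Writing `serialized` to a file (for instance with `open(path, "w", encoding="utf-8")`) would then produce a file in the format shown above.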

After clicking 'Submit Eval', wait a couple of minutes before trying to refresh.

If you find any issues, please email blanca.calvo@ehu.eus.

The submission form asks for the following fields:

Split: test
Submission name
Model family
System prompt example
URL to submission information
Team name
Contact email (will be stored privately, and used if there is an issue with your submission)
File
