Agentic → Multi-step tool use, planning, and self-correction. Models are scored on completing goal-directed tasks that require calling tools, reading results, and adapting — not just single-turn answers.
1 Opus 4.7 19,760
2 Step 3.7 Flash (NVFP4) 16,503
3 Claude Fable 5 15,905
Coding → Code generation, debugging, and repository-level edits across languages. Graded on correctness, idiomatic style, and whether the produced code actually runs against held-out tests.
1 Seed 2.0 Lite 7,390
2 GPT OSS 120B (BF16) 7,166
3 QWEN 3.7 MAX 7,115
Creative Writing → Open-ended prose, narrative voice, and stylistic range. Judged on coherence, originality, and the ability to sustain tone over long passages without collapsing into cliché.
1 STEP 3.7 FLASH (BF16) 7,710
2 MiniMax M3 (BF16) 7,420
3 KIMI K2.6 (BF16) 7,396
Engineering → Systems design, architecture reasoning, and applied technical problem-solving beyond raw coding — trade-off analysis, infra, and specification-level thinking.
1 STEP 3.7 FLASH (BF16) 14,811
2 STEP 3.7 FLASH (BF16) 14,614
3 GPT OSS 120B (BF16) 14,593
Factual Hallucination → Resistance to confidently stating false facts. Higher is better: a high score means the model rarely fabricates and reliably says 'I don't know' when it should.
1 GLM 4.5 Air (BF16) 7,125
2 MiniMax M3 (BF16) 6,960
3 Sonnet 4.6 6,959
Instruction Following → Adherence to explicit constraints — format, length, forbidden words, step ordering. Measures how literally and completely a model obeys a detailed brief.
1 Claude Fable 5 4,988
2 Opus 4.7 4,985
3 Opus 4.8 4,824
Knowledge → Breadth and depth of world knowledge across domains, from science to law to culture. A closed-book exam across thousands of expert-written questions.
1 STEP 3.7 FLASH (BF16) 7,394
2 STEP 3.7 FLASH (BF16) 7,287
3 Step 3.7 Flash (NVFP4) 7,071
Long Context Reasoning → Retrieval and reasoning over very long inputs (100k+ tokens). Scores reward finding the needle, synthesizing across distant passages, and resisting mid-context drift.
1 STEP 3.7 FLASH (BF16) 6,677
2 Step 3.7 Flash (NVFP4) 6,572
3 Opus 4.7 6,375
Math → Multi-step quantitative reasoning, proofs, and competition-style problems. Graded on the final answer and the validity of the chain that produced it.
1 QWEN 3.7 MAX 7,602
2 QWEN 3.6 27B (BF16) 7,567
3 QWEN 3.6 A3B KIMI2.6-Distilled (BF16) 7,413
RAG Hallucination → Faithfulness to provided source documents. Higher is better: the model answers strictly from supplied context and refuses to invent unsupported claims.
1 GLM 4.5 Air (BF16) 6,380
2 Sonnet 4.6 6,155
3 Claude Fable 5 6,147
Voice → Spoken-style interaction quality — latency-friendly phrasing, turn-taking, and natural conversational tone when responses are read aloud.
1 Step 3.7 Flash (NVFP4) 5,084
2 KIMI K2.6 (BF16) 5,053
3 QWEN 3.7 MAX 4,979