Test categories

63 dimensions of capability

Each category isolates one capability and scores every model against held-out questions with per-question rubrics: points are earned only for answers good enough to rely on.

Overall

The composite Nautilus Score across all Core Evaluation categories. Use it as a first-glance ranking, then drill into the category that matches your workload.
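As a rough sanity check on how the composite relates to the categories, the sketch below adds up the eleven category scores listed on this page for GLM 5.1 and compares the total with its reported Overall score. The page does not state how the Nautilus Score is aggregated, so treating it as a plain sum is an assumption; the variable names are illustrative only.

```python
# Category scores listed on this page for "GLM 5.1 · Reasoning: Default · BF16".
glm_category_scores = {
    "Agentic": 28_573,
    "Coding": 12_407,
    "Creative Writing": 14_479,
    "Engineering": 27_783,
    "Factual Hallucination": 12_027,
    "Instruction Following": 8_677,
    "Knowledge": 12_859,
    "Long Context Reasoning": 11_557,
    "Math": 14_274,
    "RAG Hallucination": 9_446,
    "Voice": 9_465,
}

# Overall score reported for the same model on this page.
REPORTED_OVERALL = 161_548

# Assumption: the composite is (approximately) a plain sum of category scores.
category_sum = sum(glm_category_scores.values())
gap = REPORTED_OVERALL - category_sum

print(category_sum)  # 161547
print(gap)           # 1
```

The listed categories account for all but one point of the reported composite. The remaining point may come from rounding of per-category values, but the page does not say.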

GLM 5.1 · Reasoning: Default · BF16

ZAI · Pollinations

161,548

OPUS 4.7 · Reasoning: Default

Anthropic · Claude Code

85,944

STEP 3.7 FLASH · Reasoning: Default · NVFP4

StepFun · Pablo Grant

84,587

Agentic

This category tests agent-style behaviors: tool-call generation, tool-use orchestration, multi-step task planning, debugging methodology, system design, and information extraction. The judge does NOT execute tools or code. For tool-call questions, criteria reward correctly shaped JSON, proper parameter extraction, and sensible decisions about when to call a tool versus when to ask. For planning and design questions, criteria reward structured thinking, realistic constraints, and attention to the non-obvious dimensions. For debugging questions, criteria reward methodology over guessing.
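To make "correctly shaped JSON" and "proper parameter extraction" concrete, here is a minimal sketch of a shape check that inspects a tool call without executing it. The tool name `get_weather`, its parameters, and the `name`/`arguments` layout are hypothetical; the benchmark's actual tool schemas and judging procedure are not described on this page.

```python
import json

# Hypothetical tool schema: parameter name -> (expected type, required?)
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {"city": (str, True), "units": (str, False)},
}


def check_tool_call(raw: str, tool: dict) -> list[str]:
    """Return a list of shape problems; an empty list means the call is well formed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value is not an object"]

    problems = []
    if call.get("name") != tool["name"]:
        problems.append(f"unexpected tool name: {call.get('name')!r}")

    args = call.get("arguments")
    if not isinstance(args, dict):
        return problems + ["'arguments' is missing or not an object"]

    # Check required parameters and types, then flag anything not in the schema.
    for param, (expected_type, required) in tool["parameters"].items():
        if param not in args:
            if required:
                problems.append(f"missing required parameter: {param}")
        elif not isinstance(args[param], expected_type):
            problems.append(f"wrong type for parameter: {param}")
    for extra in sorted(set(args) - set(tool["parameters"])):
        problems.append(f"unknown parameter: {extra}")
    return problems


good_call = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
bad_call = '{"name": "get_weather", "arguments": {"town": "Oslo"}}'

print(check_tool_call(good_call, WEATHER_TOOL))
print(check_tool_call(bad_call, WEATHER_TOOL))
```

A check like this covers only the structural half of the rubric. Whether the model should have called the tool at all, or asked first, still needs a judgment about the prompt.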

GLM 5.1 · Reasoning: Default · BF16

ZAI · Pollinations

28,573

OPUS 4.7 · Reasoning: Default

Anthropic · Claude Code

19,760

STEP 3.7 FLASH · Reasoning: Default · NVFP4

StepFun · Pablo Grant

16,503

Coding

This category tests language-level coding, snippet generation, code translation, debugging, and specific framework/library tasks (largely arena-hard-derived). The judge does NOT run code — assessments must be based on reading the produced code for apparent correctness, idiomatic usage, and adherence to the prompt's specific requirements. Where the prompt asks for a specific language, framework, or API, using the wrong one drops the score significantly even if the alternative would work. Where prompts contain ambiguity or impossible asks (e.g., 'graph navigation with linear complexity'), acknowledging the impossibility/ambiguity is rewarded.

GLM 5.1 · Reasoning: Default · BF16

ZAI · Pollinations

12,407

SEED 2.0 LITE · Reasoning: Default

Bytedance · Openrouter

7,390

GPT-OSS 120B · Reasoning: Default · BF16

OpenAI · Openrouter

7,166

Creative Writing

This category was inherited from arena-hard's 'creative_writing' label, but in practice contains a mix of actual creative writing (song lyrics, story scenes, fanfic), technical writing (recipes, guides, schemas), coding tasks, and educational explanations. The criteria for each question are tuned to what the question actually asks. For creative-writing prompts, criteria reward craft, voice, and constraint adherence. For technical prompts, criteria reward correctness and completeness. The judge does not execute code.

GLM 5.1 · Reasoning: Default · BF16

ZAI · Pollinations

14,479

STEP 3.7 FLASH · Reasoning: Default · BF16

StepFun · Pollinations

7,710

MINIMAX M3 · Reasoning: Default · BF16

MiniMax · Openrouter

7,420

Engineering

This category is the NAUTILUS-original modern production engineering set — frameworks named with versions (Next.js 14, Unity 2022.3.5f1, FastAPI, vLLM, tRPC), realistic incident scenarios (vLLM crash after second request, SQL slowdown 2ms→45s, React infinite re-render), and architectural design tasks (multi-tenant schema, rate limiter, structured logging). The judge does NOT execute code; assess by reading the code/solution for apparent correctness, idiomatic usage of the named framework, and engineering judgment. Criteria reward identifying the specific bug/root cause where one exists. NOTE: the first question (Star Trek Holodeck) is likely miscategorized — it's not an engineering task — but criteria are written to assess the answer as a game-master setup.

GLM 5.1 · Reasoning: Default · BF16

ZAI · Pollinations

27,783

STEP 3.7 FLASH MTP · Reasoning: Default · NVFP4

StepFun · Pablo Grant

14,840

STEP 3.7 FLASH · Reasoning: Default · BF16

StepFun · Pollinations

14,811

Factual Hallucination

This category tests resistance to factual hallucination. The criteria for each question embed the canonical correct answer so the judge has a clear reference rather than relying on memory. The worst outcome here is NOT refusal — it is confidently stated wrong facts. A model that says 'I don't know' or 'I'm not certain, but I believe X' generally scores higher than a model that confidently asserts wrong information. Where multiple defensible answers exist, the criteria specify the acceptable range. The judge should treat 'hallucination with confidence' as the floor condition and reward accurate answers, calibrated uncertainty where warranted, and contextual depth that demonstrates real understanding rather than rote recall.

GLM 5.1 · Reasoning: Default · BF16

ZAI · Pollinations

12,027

GLM 4.5 AIR · Reasoning: Default · BF16

ZAI · Openrouter

7,125

MINIMAX M3 · Reasoning: Default · BF16

MiniMax · Openrouter

6,960

Instruction Following

All questions in this category test the model's ability to follow explicit, often multi-constraint instructions in the prompt. The judge must verify constraint adherence by reading the answer carefully — counting words, checking for forbidden vocabulary, verifying structural requirements (paragraphs, list format, table format), and validating that specific required elements appear. Two questions in this set (hdc-v1-0245 'summarize OK' and hdc-v1-0244 'no e/a/t/s/n letters') are intentional traps testing whether the model recognizes impossible or near-impossible instructions; scoring for those rewards recognition and articulation of the conflict over naive attempts.
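The mechanical checks described above (counting words, scanning for forbidden vocabulary or letters) can be sketched in a few lines. This is only an illustration of the kind of verification involved, not the judge's actual procedure; the helper names and the sample answers are made up for the example.

```python
import re


def word_count(text: str) -> int:
    """Count whitespace-separated tokens, a simple stand-in for 'words'."""
    return len(text.split())


def forbidden_words_used(text: str, forbidden: set[str]) -> set[str]:
    """Return the forbidden words that appear in the text (case-insensitive)."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words & {w.lower() for w in forbidden}


def forbidden_letters_used(text: str, letters: str) -> set[str]:
    """Return the forbidden letters that appear anywhere in the text."""
    return set(text.lower()) & set(letters.lower())


# Hypothetical answers to a 'no e/a/t/s/n letters' style prompt.
answer = "Bold ducks hug low fog."
short_answer = "Go big."

print(word_count(answer))                                # 5
print(sorted(forbidden_letters_used(answer, "eatsn")))   # ['s']
print(sorted(forbidden_letters_used(short_answer, "eatsn")))  # []
```

Even a five-word sentence written to dodge the banned letters slips on a plural "s", which suggests why the trap questions reward a model that names the difficulty rather than one that attempts the task naively.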

GLM 5.1 · Reasoning: Default · BF16

ZAI · Pollinations

8,677

Fable 5 · Reasoning: Default

Anthropic · Openrouter

4,988

OPUS 4.7 · Reasoning: Default

Anthropic · Claude Code

4,985

Knowledge

This category is heterogeneous. Inherited from arena-hard's 'knowledge' label, in practice it spans domain-specific technical tasks (SAP, mbed, LS-DYNA, Snowpark), business documents (SOW, SMART goals, user stories, STP model), security research, creative writing tasks, and meta-evaluation comparisons. The judge does NOT execute code. Where a question is domain-specific, criteria reward demonstrated familiarity with the actual domain. Where it's a creative task, criteria reward craft and constraint adherence. Where it's a code-review task, criteria reward identifying real bugs over generic 'looks good.'

GLM 5.1 · Reasoning: Default · BF16

ZAI · Pollinations

12,859

STEP 3.7 FLASH · Reasoning: Default · BF16

StepFun · Openrouter

7,394

STEP 3.7 FLASH · Reasoning: Default · BF16

StepFun · Pollinations

7,287

Long Context Reasoning

This category tests reasoning over extended context — long code listings to review/modify, lengthy logic puzzles, multi-step math/physics, complex requirements specifications, OCR token dumps to parse, and book-or-topic synthesis tasks. The judge does NOT execute code; assess correctness from reading. Where prompts contain very long context (code listings, OCR dumps, PWS specs), criteria reward accurate ingestion of the specific details rather than generic responses that ignore the provided material.

GLM 5.1 · Reasoning: Default · BF16

ZAI · Pollinations

11,557

STEP 3.7 FLASH · Reasoning: Default · BF16

StepFun · Openrouter

6,677

STEP 3.7 FLASH · Reasoning: Default · NVFP4

StepFun · Pablo Grant

6,572

Math

This category was inherited from arena-hard's 'math' label, but in practice contains math, physics, programming, finance, business strategy, abstract algebra proofs, and pattern-recognition puzzles. The criteria for each question are tuned to what the question actually asks. The judge does not execute code or run calculators — assess correctness by reading the work shown. Where a problem has a definite numerical answer, the criteria embed that answer. Where a problem is open-ended (strategy, design choice), the criteria assess quality of reasoning rather than 'correct answer.'

GLM 5.1 · Reasoning: Default · BF16

ZAI · Pollinations

14,274

QWEN 3.7 Max · Reasoning: Default

Alibaba · Openrouter

7,602

QWEN 3.6 27B · Reasoning: Default · BF16

Alibaba · Openrouter

7,567

RAG Hallucination

This category tests RAG-grounded answering. The model is given documents and must answer ONLY from what those documents contain. Three failure modes are tested: (1) hallucinating facts not in the documents, (2) failing to acknowledge when information is missing, (3) failing to surface conflicts when documents contradict. The criteria for each question identify which type of question it is and what the correct grounded answer looks like. The worst outcome is confident fabrication of details not in the documents. A model that correctly says 'this information is not in the document' is doing the right thing for missing-info questions and should score well.

GLM 5.1 · Reasoning: Default · BF16

ZAI · Pollinations

9,446

GLM 4.5 AIR · Reasoning: Default · BF16

ZAI · Openrouter

6,380

SONNET 4.6 · Reasoning: High

Anthropic · Claude Code

6,155

Voice

This category tests creative voice, style fidelity, formal constraint adherence in creative work, and the model's ability to engage genuinely with creative prompts (including weird or edgy ones). The judge should score on craft and execution — not on whether the model 'should' have engaged. Refusal of a legitimate creative prompt without articulating why typically lands in the floor band. Where prompts include explicit form constraints (rhyme scheme, syllable count, line count, format tags), constraint adherence is a structural requirement — failing the form drops the score even if the content is interesting. Where prompts ask the impossible ('a style never conceived'), reward awareness of the paradox and genuine attempts at originality.

GLM 5.1 · Reasoning: Default · BF16

ZAI · Pollinations

9,465

STEP 3.7 FLASH · Reasoning: Default · NVFP4

StepFun · Pablo Grant

5,084

KIMI K2.6 · Reasoning: Default · BF16

Moonshot · Pollinations

5,053
