Methodology

How the Nautilus Score is built

One composite score, earned the same way by every model — frontier or open weight — against prompts none of them have seen. That's what keeps the leaderboard honest.

Not an arena. This isn't crowd-voted and it isn't a popularity contest — no audience preference, no vibes, no name recognition on the scale. Every model is graded against fixed, held-out rubrics on one question: can it actually do the work? It's a hard-nosed read on raw capability, and nothing else.

What it's for. Nautilus answers a practical question: for the task in front of you, which open-weight model gives you the best odds of success — to run as-is, or as the starting point for fine-tuning? Every open model is measured on the same scale as the publicly available proprietary frontier, so the comparison holds up in both directions. Sometimes a closed frontier model is the right call. But when you're going to fine-tune, you want to start from the open model that already gets you closest to the outcome you need — on weights you actually control.

Good is a floor, not a finish line. Mapping capability is critical — right up until the models all clear the same bar. Today, raw capability still separates the field, so measuring it precisely is the whole game. But capability is converging: once every model scores some X that's good enough for the job in front of you, X stops being the number that matters — what matters is simply whether a model clears it. At that point the score is less a ranking than a pulse: it's either there, or the model is dead in the water. Sitting far past the bar buys you nothing; the work only ever needs what the work needs. From there the decision turns on everything else: what you can run on your own hardware, afford at scale, own, fine-tune, and keep private. Nautilus is built for both eras — to measure capability rigorously while it still decides the field, and to surface which models are genuinely sufficient once it doesn't, so you can take the smallest, cheapest, most controllable one that clears the bar.

The Nautilus Score

Every model earns a Nautilus Score in each of eleven categories, plus a composite Overall weighted across them. Points are earned against a fixed, per-question rubric through a deliberately non-linear curve: a failing answer earns nothing, adequate-but-unreliable answers are heavily discounted, and the top of the scale is reserved for demonstrated excellence — where each additional point is worth progressively more. The curve pushes hard against mediocrity instead of averaging it away, and it rewards consistency: a high score is only reachable by clearing the bar repeatedly, not by pairing brilliant answers with weak ones. The result is a score you can read as a prediction — a model that scores high here is one whose next answer you can trust before you've seen it.
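As a rough illustration of the curve's shape, the Python sketch below maps a rubric score to earned points. The 0-100 rubric scale, the pass bar of 60, and the exponent of 2 are assumptions chosen for illustration only; the actual Nautilus curve and its parameters are not stated on this page.

```python
def curve_points(rubric_score: float, pass_bar: float = 60.0, exponent: float = 2.0) -> float:
    """Map a 0-100 rubric score to earned points on a 0-100 scale.

    pass_bar and exponent are hypothetical values, not Nautilus's real settings.
    """
    if not 0.0 <= rubric_score <= 100.0:
        raise ValueError("rubric_score must be between 0 and 100")
    if not 0.0 <= pass_bar < 100.0:
        raise ValueError("pass_bar must be in [0, 100)")
    # Below the bar, an answer earns nothing: a 30 and a 0 are treated alike.
    if rubric_score < pass_bar:
        return 0.0
    # Rescale the passing range to 0..1, then apply a convex curve so each
    # additional rubric point near the top is worth more than one near the bar.
    t = (rubric_score - pass_bar) / (100.0 - pass_bar)
    return 100.0 * t ** exponent


if __name__ == "__main__":
    for score in (30, 60, 80, 90, 100):
        print(score, curve_points(score))
```

Under these assumed settings, an answer that barely clears the bar earns very little, and the gain from 90 to 95 is larger than the gain from 70 to 75.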

Scores come from a validated, repeatable scoring method applied to a held-out question set — each item expertly isolated to a single capability, each with its own focused, evaluated rubric — run from a fixed, immutable system state, so the same answer always earns the same score.

Three views of one result

Every model can be read three ways, and the chart lets you switch between them:

N-Score — the Nautilus Score. The headline composite: both judges, the full weighting curve, the number that decides the board.
A-Score — the Nautilus Governor (Analytical). The primary judge's independent read — a rigorous, criteria-first grade against the rubric.
R-Score — the Nautilus Reflector (Reflective). A second, separately-sourced judge's independent read, reasoning over the same answer and rubric by a different mechanism.

The composite is built from two judges that share no lineage and grade by different mechanisms — so showing each one alone isn't a footnote, it's the audit. When the Analytical and Reflective scores land close, the result is consensus, not one grader's opinion. Switching between A and R is the fastest way to see that agreement for yourself.
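One way to read that agreement programmatically is to compare the two judges category by category. The sketch below is illustrative only: the function name, the dictionary layout, and the 5-point tolerance are assumptions, not part of the Nautilus tooling.

```python
def judge_disagreements(
    a_scores: dict[str, float],
    r_scores: dict[str, float],
    tolerance: float = 5.0,
) -> dict[str, float]:
    """Return categories where the A-Score and R-Score differ by more than `tolerance`.

    The tolerance is a hypothetical threshold picked for this example.
    """
    gaps: dict[str, float] = {}
    # Only compare categories that both judges have scored.
    for category in sorted(a_scores.keys() & r_scores.keys()):
        gap = abs(a_scores[category] - r_scores[category])
        if gap > tolerance:
            gaps[category] = gap
    return gaps


if __name__ == "__main__":
    analytical = {"coding": 82.0, "math": 70.0}
    reflective = {"coding": 80.0, "math": 58.0}
    print(judge_disagreements(analytical, reflective))
```

An empty result means the two reads land within the chosen tolerance everywhere, which is the consensus case described above.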

Why the score works this way

Every question carries its own rubric. Models are very good at finding the pattern in a scoring system and optimizing to it. A single shared rubric is exactly such a pattern — something a model can reverse-engineer and align to. Giving every question its own criteria removes the stable target, so a high score has to come from answering well, not from learning how the grader thinks.

A failing answer is worth zero, on purpose. In practice there's no difference between a 30 and a 0: both mark an answer you can't trust, and an answer you can't trust can't be used. Nautilus treats competence as a threshold, not a gradient — below the bar, a response moves nothing forward, so it earns nothing. Only answers good enough to actually rely on contribute to the score.

A refusal is an answer. When a model declines a question — whatever its stated reason — that refusal is recorded as its response and judged against the rubric like any other. A model that won't engage with a question cannot perform the task, which makes it strictly less useful than one that answers it correctly. If a model's accuracy suffers because it refuses questions others will take on, that reduction in coverage is measured and scored, not excused. Capability you can't access isn't capability.

Two judges, not one. A single grader is a single point of failure — its blind spots become the score's blind spots. Nautilus grades every answer with two independent judges, the Analytical and the Reflective, drawn from different model lineages and built on different evaluation mechanisms. Each reads an answer more than once — including deliberately strict and deliberately lenient passes — and the extremes are trimmed before the rest is reconciled. One harsh read, one generous read, or one anomaly can't decide a result; only agreement survives. The judges share no lineage with each other or with any model under test, so no contestant is ever graded by its own kind.

Multi-pass, not single-read. Each judge scores every answer several times. Alongside the core evaluation, the passes include calibrated extremes: one deliberately strict read and one deliberately lenient read. The outermost results are trimmed before the remainder is reconciled into a final score, so a single unusual read, in either direction, cannot move the number. What survives is what the remaining passes agree on: a consistent signal rather than a single moment of judgment.
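The trimming step can be sketched as a trimmed mean. This is a minimal illustration under stated assumptions: the page does not say how the remaining passes are reconciled, so averaging them (and averaging the two judges) is a stand-in, and the function names and pass counts are hypothetical.

```python
from statistics import mean


def reconcile_passes(pass_scores: list[float], trim: int = 1) -> float:
    """Drop the `trim` lowest and highest passes, then average the rest.

    Averaging is an assumed reconciliation rule, used here for illustration.
    """
    if trim < 0:
        raise ValueError("trim must be non-negative")
    if len(pass_scores) <= 2 * trim:
        raise ValueError("need more passes than are trimmed")
    ordered = sorted(pass_scores)
    kept = ordered[trim:len(ordered) - trim]
    return mean(kept)


def consensus(analytical_passes: list[float], reflective_passes: list[float]) -> float:
    """Combine both judges' trimmed results (simple average, also an assumption)."""
    return mean([reconcile_passes(analytical_passes), reconcile_passes(reflective_passes)])


if __name__ == "__main__":
    # One harsh read (40) and one generous read (100) are both discarded.
    print(reconcile_passes([40.0, 78.0, 80.0, 82.0, 100.0]))
```

Because the outermost reads are discarded, changing the strictest or most lenient pass leaves the result unchanged as long as it stays the outermost value.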

Open weights vs. proprietary

Models split into two classes, self-hostable and proprietary, encoded by shape and color so the distinction reads instantly. Quantized builds and the fixed baseline are marked separately:

Self-hostable

Open weights you can download and run on your own hardware — full precision. Plotted by file size.

Self-hostable (quantized)

The same weights at reduced precision — NVFP4, FP8, W4A16, and similar. A distinct color makes it easy to compare a full-precision build against its quantized sibling at a glance.

Proprietary

Frontier API models with no public weights. Grouped in the reserve band — no size estimate.

Baseline

A fixed reference model that anchors the scale, so scores stay comparable from one run to the next. Always visible, never colored like the field.

The build is the model

A score should reflect what you'll actually run. The same weights behave differently at different precisions, quantizations, and serving endpoints — so Nautilus tests builds, not abstractions. Where it matters, the same model is measured at more than one precision — for example, a full-precision version against a quantized build you could realistically self-host — and plotted as a direct comparison. The practical question isn't only "which model," it's "which build of it clears the bar on hardware you control." A quantized build that holds its score is often the smartest thing you can deploy: same capability, a fraction of the footprint.

Build labels carry the precision scheme (NVFP4, BF16, W4A16, FP8, and so on) and render consistently across chart, legend, and comparison table.

Reading the scatter

The flagship chart maps Nautilus Score (vertical) against model size (horizontal, log scale) — file size in GB or parameter count. Smaller-but-higher means more capability per gigabyte: the models that punch above their footprint sit in the upper-left.

A configurable GPU-fit zone lets you set a target card and quantity and shades the region that fits in that VRAM budget — weights only, so leave headroom for your engine and KV cache. Treat it as orientation, not a hard limit: unified-memory systems and the latest workstation GPUs keep moving where the real line sits, so the useful question isn't "which side of the line" but how much capability per gigabyte. Frontier models — which have no public size — live in a tinted reserve band on the right, ordered alphabetically or by score.
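The weights-only fit check amounts to simple arithmetic. The sketch below is an assumption-laden illustration, not the chart's actual logic: it pools VRAM across cards as if weights split evenly, and the optional headroom_fraction parameter is a hypothetical addition for reserving room for the engine and KV cache.

```python
def fits_budget(
    weights_gb: float,
    vram_per_gpu_gb: float,
    gpu_count: int = 1,
    headroom_fraction: float = 0.0,
) -> bool:
    """Return True if the model's weights fit in the pooled VRAM budget.

    headroom_fraction=0.0 reproduces a weights-only check; raising it reserves
    part of the budget for the serving engine and KV cache (hypothetical knob).
    """
    if weights_gb <= 0 or vram_per_gpu_gb <= 0 or gpu_count < 1:
        raise ValueError("sizes must be positive and gpu_count at least 1")
    if not 0.0 <= headroom_fraction < 1.0:
        raise ValueError("headroom_fraction must be in [0, 1)")
    budget_gb = vram_per_gpu_gb * gpu_count * (1.0 - headroom_fraction)
    return weights_gb <= budget_gb


def score_per_gb(score: float, weights_gb: float) -> float:
    """Capability per gigabyte: the quantity the scatter is meant to surface."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return score / weights_gb


if __name__ == "__main__":
    print(fits_budget(40.0, 24.0, gpu_count=2))
    print(fits_budget(40.0, 24.0, gpu_count=2, headroom_fraction=0.2))
```

A build that fits on a weights-only basis can stop fitting once headroom is reserved, which is why the shaded zone is best treated as orientation.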

The scoring pipeline

Curated prompt sets

Every category has a versioned bank of questions, each one expertly isolated to a single capability and paired with its own focused, evaluated rubric. Items are held out from public release to resist contamination, and the full set is locked to a version before any model runs against it.

Same questions, validated sources

Every model answers the same held-out questions. Self-hostable models run on Nautilus GPU nodes; others run through a variety of selected API sources — and the same model is cross-checked across more than one source, so a score reflects the model, not where it ran. Where a model runs inside a harness, it's also tested against its API version where possible, so the harness's effect on the model is measured rather than assumed.

Independent, repeatable scoring

Each answer is graded by two independent judges. The Nautilus Governor (Analytical) is the primary scorer; the Nautilus Reflector (Reflective) is a second, separately-sourced judge that grades the same answer against the same rubric. The two share no lineage with each other or with any model under test, and they assess by different mechanisms — so no single judge's blind spots can decide a score. Each judge reads every answer more than once, including deliberately strict and deliberately lenient passes; the extremes are trimmed and the remainder reconciled to a consensus, so one anomalous read can't move a result. The method is defensible and repeatable: the same answer, rubric, and immutable system state produce the same score every time, and the bar never moves between runs.

Review & aggregation

Per-question scores roll up into each category score and the weighted Overall. Every run is reviewed, finalized, dated, and locked to its prompt-and-rubric version, so any score on the board can be reproduced.
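The roll-up can be pictured as two small steps. In the sketch below, the per-category mean, the category names, and the weights are all assumptions made for illustration; the page does not specify the real aggregation rule or weights.

```python
def category_score(question_points: list[float]) -> float:
    """Average the earned points for one category (assumed aggregation rule)."""
    if not question_points:
        raise ValueError("a category needs at least one scored question")
    return sum(question_points) / len(question_points)


def overall_score(category_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of category scores; the weights are hypothetical."""
    missing = weights.keys() - category_scores.keys()
    if missing:
        raise ValueError(f"no score for weighted categories: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(category_scores[name] * weight for name, weight in weights.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    scores = {
        "coding": category_score([100.0, 60.0]),
        "reasoning": category_score([0.0, 50.0, 100.0]),
    }
    print(overall_score(scores, {"coding": 3.0, "reasoning": 1.0}))
```

Keeping the question points, category weights, and version label together is what would let a locked run be recomputed later.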

Head to head

Any set of models can be compared directly, category by category, with the per-category leader marked. The comparison shows earned weighted points — the same currency the Nautilus Score is built from — so a head-to-head reflects not just who scored higher overall, but exactly where each model wins. A model can top the composite and still lose individual categories to a specialist; the comparison is where that shows, and it's where you match a model to the specific shape of your task.

The comparison table displays summed weighted points, not the 0–100 N-Score — the large totals are raw currency, not percentages.
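A head-to-head of this kind can be sketched as a per-category maximum over summed weighted points. The model names, category names, and point totals below are invented for illustration, and the tie-breaking rule (first model listed wins) is an assumption.

```python
def category_leaders(points: dict[str, dict[str, float]]) -> dict[str, str]:
    """Return the leading model per category.

    points[model][category] holds summed weighted points (hypothetical data).
    Ties go to the model listed first in `points`.
    """
    categories = set()
    for per_category in points.values():
        categories.update(per_category)
    leaders: dict[str, str] = {}
    for category in sorted(categories):
        # Only models that were scored in this category can lead it.
        contenders = [model for model in points if category in points[model]]
        leaders[category] = max(contenders, key=lambda model: points[model][category])
    return leaders


if __name__ == "__main__":
    board = {
        "generalist": {"coding": 900.0, "math": 850.0},
        "specialist": {"coding": 700.0, "math": 910.0},
    }
    print(category_leaders(board))
```

In this made-up example the generalist has the higher total, yet the specialist still leads the math category, which is the pattern the comparison view is meant to expose.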

Audited against the frontier

The methodology is audited adversarially using a top-tier frontier model, which is tasked with attacking the design, surfacing gaps, and correcting overreach. Claims that couldn't survive that scrutiny were cut, not softened. It's a standing process, not a one-time stamp.

A note on what we don't publish

Questions and rubrics are set per test, and a model's score within a test never changes: the same answer, the same rubric, and the same immutable system state give the same result every time. New tests and new versions will bring new questions; scores are only ever compared within a version, never across versions.

We don't publish our questions or rubrics — and that isn't a gap in our openness, it's the point. A test you can read is a test you can train for; the moment a question set goes public, it stops measuring capability and starts measuring who studied the answer key. Publishing the items wouldn't make a score more honest — it would make it less true. So we keep the questions closed and the method open: how the scoring works, why it's built this way, the whole of this page. If you want to check us, check the method — and check the results against reality, the one audit a model can't study for in advance. This is a tool built to inform real decisions, not to win a public ranking.

The same applies to the judges: we name them — the Nautilus Governor (Analytical) and the Nautilus Reflector (Reflective) — and we explain how they work, how they're kept independent, and why two are better than one. We don't name the models behind them, for the same reason we don't publish the questions: a judge you can profile is a judge you can study toward. The method is open; the targets stay closed.

Want a specific model tested? Reach out through our parent site. If it's something we can run, and the result would help a community still finding its footing, we'll gladly add it.