Nautilus Deep
64 models · updated June 2026
Test categories

Eleven dimensions of capability

Each category isolates one capability and scores every model against held-out questions using per-question rubrics. Points are earned only for answers good enough to rely on. Open a category to explore the full field.