
# Model Selection

Each provider auto-selects a sensible default model with automatic fallback to a lighter model on quota or availability errors. Most users should never override the model parameter — the defaults are tuned for quality and the fallback chain handles failures.

## Defaults & Fallbacks

| Provider | Default | Fallback | Trigger |
| --- | --- | --- | --- |
| Gemini | `gemini-3.1-pro-preview` | `gemini-3-flash-preview` | `RESOURCE_EXHAUSTED` quota error or "exhausted your capacity" pattern (ADR-044) |
| Codex | `gpt-5.5` | `gpt-5.5-mini` | Quota errors (`rate_limit_exceeded`, 429, `insufficient_quota`) (ADR-028, bumped per ADR-067) |
| Ollama | `qwen2.5-coder:7b` | `qwen2.5-coder:1.5b` | Model-not-found error (e.g., 7b not pulled but 1.5b is) |

The fallback fires automatically inside the executor: your client sees a successful response with `usage.fellBack: true` in the structured output, and a `[Gemini stats: ... fell back]` annotation in the formatted text.
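
For example, a quota-triggered Gemini fallback might surface in `structuredContent` like this. This is an illustrative sketch: the field names follow the usage schema listed under Track What You're Spending, the values are hypothetical, and it assumes `model` reports the model that actually answered:

```json
{
  "usage": {
    "provider": "gemini",
    "model": "gemini-3-flash-preview",
    "inputTokens": 48210,
    "outputTokens": 1094,
    "cachedTokens": 0,
    "thinkingTokens": 512,
    "durationMs": 9340,
    "fellBack": true
  }
}
```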

## Choosing a Provider

Different providers excel at different things. Pick by what you're doing, not by which is "best":

| Task | Suggested provider | Why |
| --- | --- | --- |
| Whole-codebase review | Gemini | 1M+ token context fits things others can't |
| Targeted code reasoning, refactor critique | Codex | GPT-5.5's strength is dense code reasoning at moderate context size |
| Private / air-gapped analysis | Ollama | Runs locally; nothing leaves your machine |
| "What do they all think?" comparison | Multi-LLM (`multi-llm` tool or `/compare` skill) | Parallel dispatch; see all responses side-by-side |
| Code review with verified findings | `/multi-review` skill | Gemini + Codex in parallel, then verifies each finding against source |
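
As a sketch of a multi-perspective dispatch, a `multi-llm` call might mirror the `ask-llm` shape shown under Overriding the Model. The `providers` parameter name here is an assumption, not a documented schema; check the tool's actual arguments before relying on it:

```json
{
  "name": "multi-llm",
  "arguments": {
    "providers": ["gemini", "codex", "ollama"],
    "prompt": "..."
  }
}
```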

## Overriding the Model

Pass `model` explicitly when you have a reason to:

```text
Use ask-llm with provider gemini and model gemini-3-flash-preview to quickly check this CSS file
```

Or programmatically:

```json
{ "name": "ask-llm", "arguments": { "provider": "gemini", "model": "gemini-3-flash-preview", "prompt": "..." } }
```

For Codex, a common override is the mini model:

```text
Use ask-codex with model gpt-5.5-mini to summarize this commit
```

For Ollama, you can request any model you've pulled:

```bash
ollama pull deepseek-coder:6.7b
```

```text
Use ask-ollama with model deepseek-coder:6.7b to review this implementation
```

## Token Limits & Cost

| Provider | Context window | Cost model |
| --- | --- | --- |
| Gemini Pro | ~1M tokens (~250k LOC) | Free tier via OAuth, paid tiers via API key |
| Gemini Flash | ~1M tokens | Cheaper than Pro; fallback target for quota relief |
| Codex GPT-5.5 | Per OpenAI's published context window | Per OpenAI billing |
| Codex GPT-5.5-mini | Smaller context | Cheaper; fallback target |
| Ollama | Per model (e.g., 32k for qwen2.5-coder) | Free (runs locally) |

## Track What You're Spending

Token usage is exposed live via:

- Per-call: `result.structuredContent.usage` on every `ask-*` tool response (`provider`, `model`, `inputTokens`, `outputTokens`, `cachedTokens`, `thinkingTokens`, `durationMs`, `fellBack`); see ADR-054
- Per-session aggregate: call the `get-usage-stats` MCP tool (sketched below), or read the `usage://current-session` MCP Resource for a JSON snapshot
- In the REPL: type `/usage` for a markdown-formatted breakdown

Usage tracking is in-memory only: nothing persists to disk, and counters reset when the MCP server restarts.
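
A minimal aggregate check might look like the call below. The empty `arguments` object is an assumption; the real tool may accept filters, and the response shape is whatever the server returns:

```json
{ "name": "get-usage-stats", "arguments": {} }
```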

Recommendations by Use Case

- General code review → defaults are correct; let the fallback chain handle quota
- Whole-codebase analysis → `ask-gemini` (Pro); Flash auto-kicks in if Pro is exhausted
- Quick fixes, fast iteration → request Flash or `gpt-5.5-mini` explicitly to skip the Pro→fallback round-trip
- Privacy-sensitive code → `ask-ollama`; never leaves your machine
- Multi-perspective debate → `multi-llm` or the `/brainstorm` skill; Claude weighs verified vs. inferred
