Model Selection

Hosted providers (Gemini, Codex) auto-select a sensible default model with automatic fallback to a lighter model on quota errors. Ollama runs locally and uses exactly the model you pull — it never substitutes another. Most users should never override the model parameter — the defaults are tuned for quality.

Defaults & Fallbacks

| Provider | Default | Fallback | Trigger |
| --- | --- | --- | --- |
| Gemini | gemini-3.1-pro-preview | gemini-3.5-flash | RESOURCE_EXHAUSTED quota error or "exhausted your capacity" pattern (ADR-044) |
| Codex | gpt-5.5 | gpt-5.5-mini | Quota errors (rate_limit_exceeded, 429, insufficient_quota) (ADR-028, bumped per ADR-067) |
| Ollama | qwen3.6:27b | none | Local, so no fallback; a missing model returns a clear ollama pull error |

For Gemini and Codex, the fallback fires automatically inside the executor — your client sees a successful response with usage.fellBack: true in the structured output, and a [Gemini stats: ... fell back] annotation in the formatted text. Ollama never falls back, so its fellBack is always false.
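
The fallback behavior described above can be sketched roughly as follows. This is an illustration, not the executor's actual code: `MODELS`, `QUOTA_PATTERNS`, `is_quota_error`, and `run_with_fallback` are hypothetical names, and detecting triggers by substring match on the error message is an assumption.

```python
from typing import Callable, Optional

# Default and fallback models per provider, copied from the table above.
MODELS: dict[str, tuple[str, Optional[str]]] = {
    "gemini": ("gemini-3.1-pro-preview", "gemini-3.5-flash"),
    "codex": ("gpt-5.5", "gpt-5.5-mini"),
    "ollama": ("qwen3.6:27b", None),  # local: no fallback model
}

# Quota triggers from the table. Substring matching is an assumption.
QUOTA_PATTERNS: dict[str, tuple[str, ...]] = {
    "gemini": ("RESOURCE_EXHAUSTED", "exhausted your capacity"),
    "codex": ("rate_limit_exceeded", "429", "insufficient_quota"),
}


def is_quota_error(provider: str, message: str) -> bool:
    """Return True if the message matches a known quota trigger for the provider."""
    return any(pattern in message for pattern in QUOTA_PATTERNS.get(provider, ()))


def run_with_fallback(provider: str, call: Callable[[str], str]) -> dict:
    """Try the default model; on a quota error, retry once with the fallback."""
    default, fallback = MODELS[provider]
    try:
        return {"text": call(default), "usage": {"model": default, "fellBack": False}}
    except RuntimeError as err:
        # Non-quota errors, and providers without a fallback, propagate unchanged.
        if fallback is None or not is_quota_error(provider, str(err)):
            raise
        return {"text": call(fallback), "usage": {"model": fallback, "fellBack": True}}
```

Here `call` stands in for whatever function actually invokes the provider. A client only ever needs to inspect `usage.fellBack` on the response; the retry itself stays invisible to it.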

Choosing a Provider

Different providers excel at different things. Pick by what you're doing, not by which is "best":

| Task | Suggested provider | Why |
| --- | --- | --- |
| Targeted code reasoning, refactor critique | Codex | GPT-5.5's strength is dense code reasoning at moderate context size |
| Private / air-gapped analysis | Ollama | Runs locally, nothing leaves your machine |
| Subscription-backed second opinion, larger context | Antigravity | agy via your Google AI Pro/Ultra plan; the Gemini CLI successor |
| Whole-codebase review (enterprise seats) | Gemini | 1M+ token context fits what others can't; enterprise-gated from 2026-06-18 |
| "What do they all think?" comparison | Multi-LLM (multi-llm tool or /compare skill) | Parallel dispatch; see all responses side by side |
| Code review with verified findings | /multi-review skill | Runs Antigravity and Codex in parallel, then verifies each finding against source |

Overriding the Model

Pass model explicitly when you have a reason to:

```text
Use ask-llm with provider gemini and model gemini-3.5-flash to quickly check this CSS file
```

Or programmatically:

```json
{ "name": "ask-llm", "arguments": { "provider": "gemini", "model": "gemini-3.5-flash", "prompt": "..." } }
```
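
In MCP, a tool invocation travels as a JSON-RPC 2.0 `tools/call` request, and the object above is its `params` payload. The sketch below wraps that payload in the full envelope; `build_tool_call` is a hypothetical helper, and how the request reaches the server (stdio, HTTP) depends on your client.

```python
import json


def build_tool_call(request_id: int, name: str, arguments: dict) -> dict:
    """Wrap a tool name and its arguments in a JSON-RPC 2.0 tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = build_tool_call(
    1,
    "ask-llm",
    # The prompt is left elided, exactly as in the example above.
    {"provider": "gemini", "model": "gemini-3.5-flash", "prompt": "..."},
)
print(json.dumps(request, indent=2))
```

Omitting the `model` key from `arguments` leaves model selection to the provider default.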

For Codex, common overrides:

```text
Use ask-codex with model gpt-5.5-mini to summarize this commit
```

For Ollama, you can request any model you've pulled:

```bash
ollama pull deepseek-coder:6.7b
```

```text
Use ask-ollama with model deepseek-coder:6.7b to review this implementation
```

Token Limits & Cost

| Provider | Context window | Cost model |
| --- | --- | --- |
| Gemini Pro | ~1M tokens (~250k LOC) | Gemini Code Assist Standard/Enterprise seat (from 2026-06-18) |
| Gemini Flash | ~1M tokens | Cheaper than Pro; fallback target for quota relief |
| Codex GPT-5.5 | Per OpenAI's published context window | Per OpenAI billing |
| Codex GPT-5.5-mini | Smaller context | Cheaper; fallback target |
| Ollama | Per model (e.g., 256k for qwen3.6) | Free; runs locally |
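
The "~1M tokens (~250k LOC)" figure implies roughly 4 tokens per line of code. The sketch below uses that ratio to estimate whether a codebase fits a given context window. The function names and the 20% headroom reserved for the prompt and the response are assumptions, and real token counts vary with language and formatting.

```python
# Ratio implied by the table: ~1M tokens for ~250k lines of code.
TOKENS_PER_LOC = 1_000_000 / 250_000  # 4.0


def estimated_tokens(loc: int) -> int:
    """Rough token estimate for a codebase of `loc` lines."""
    return round(loc * TOKENS_PER_LOC)


def fits(loc: int, context_tokens: int, reserve: float = 0.2) -> bool:
    """True if the estimate fits after reserving headroom (assumed 20%)."""
    return estimated_tokens(loc) <= context_tokens * (1 - reserve)


print(estimated_tokens(180_000))   # estimate for a 180k-line codebase
print(fits(180_000, 1_000_000))    # against a ~1M-token window
print(fits(60_000, 256_000))       # against a 256k-token local model
```

Treat the result as a first-pass filter only; the per-call `inputTokens` value reported in usage is the authoritative count.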

Track What You're Spending

Token usage is exposed live via:

  • Per-call: result.structuredContent.usage on every ask-* tool response (provider, model, inputTokens, outputTokens, cachedTokens, thinkingTokens, durationMs, fellBack) — see ADR-054
  • Per-session aggregate: call the get-usage-stats MCP tool, or read the usage://current-session MCP Resource for a JSON snapshot
  • In the REPL: type /usage for a markdown-formatted breakdown

Usage data is held in memory only: nothing is persisted to disk, and the counters reset when the MCP server restarts.
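
Because the session totals disappear on restart, a client that wants a longer-lived record could collect the per-call usage objects itself. The sketch below totals them by provider. The field names come from the per-call list above, `summarize` is a hypothetical helper, and the sample values are invented purely for illustration.

```python
from collections import defaultdict

TOKEN_FIELDS = ("inputTokens", "outputTokens", "cachedTokens", "thinkingTokens")


def summarize(usages: list[dict]) -> dict[str, dict[str, int]]:
    """Total calls, fallbacks, and token counts per provider."""
    totals: dict[str, dict[str, int]] = defaultdict(
        lambda: {"calls": 0, "fallbacks": 0, **{field: 0 for field in TOKEN_FIELDS}}
    )
    for usage in usages:
        row = totals[usage["provider"]]
        row["calls"] += 1
        row["fallbacks"] += 1 if usage.get("fellBack") else 0
        for field in TOKEN_FIELDS:
            # Treat a missing or null counter as zero.
            row[field] += usage.get(field) or 0
    return dict(totals)


# Invented sample values, shaped like result.structuredContent.usage.
session = [
    {"provider": "gemini", "model": "gemini-3.5-flash", "inputTokens": 1200,
     "outputTokens": 300, "fellBack": True},
    {"provider": "codex", "model": "gpt-5.5", "inputTokens": 800,
     "outputTokens": 150, "fellBack": False},
]
for provider, row in summarize(session).items():
    print(provider, row)
```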

Recommendations by Use Case

  • General code review → defaults are correct; let the fallback chain handle quota
  • Whole-codebase analysis → ask-gemini (Pro) if you have an enterprise seat, otherwise ask-antigravity for large-context reads without per-token billing
  • Quick fixes, fast iteration → request Flash or gpt-5.5-mini explicitly to skip the Pro→fallback round-trip
  • Privacy-sensitive code → ask-ollama; nothing leaves your machine
  • Multi-perspective debate → multi-llm or /brainstorm skill; Claude weighs verified vs inferred

Released under the MIT License.