Compare output price first
Chat, coding, and agent workloads often spend more on generated tokens than prompt tokens, especially with retries.
Compare mainstream model API prices by input tokens, cached input, output tokens, context window, and billing caveats. Prices are a snapshot for developer planning; confirm them against the provider's pricing page before production.
Default unit is price per 1M tokens. USD and CNY rows are kept in their original billing currency so regional providers can be compared without hiding exchange-rate risk.
| PROVIDER | MODEL | INPUT / 1M | CACHED INPUT / 1M | OUTPUT / 1M | CONTEXT | BILLING NOTE | KEY CONSTRAINTS |
|---|---|---|---|---|---|---|---|
| OpenAI | GPT-5.5 | $5.00 | $0.50 | $30.00 | Standard pricing under 270K context | Batch can reduce token price, while data residency adds a surcharge. | Premium frontier model; output-heavy workloads become expensive quickly. |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $0.30 (cache hits) | $15.00 | Long-context coding and agent work | Cache writes and cache hits are priced separately. | Great for code and agents, but cache strategy matters for repeated context. |
| Google Gemini | Gemini 2.5 Pro | $1.25 / $2.50 | $0.125 / $0.25 | $10.00 / $15.00 | Tier changes above 200K prompt tokens | Free tier exists, but paid rates depend on prompt length and mode. | Long prompts double the input tier; budget RAG chunks carefully. |
| xAI | Grok 4.3 | $1.25 | Not listed | $2.50 | 1M tokens | Search tools and media APIs are billed outside text tokens. | Strong headline price, but verify tool charges for realtime search workloads. |
| DeepSeek | DeepSeek V4 Flash | $0.14 | $0.0028 | $0.28 | 1M context, 384K max output | OpenAI-compatible and Anthropic-compatible endpoints are both listed. | Very low price; confirm promotion windows and concurrency limits. |
| Alibaba Qwen | Qwen-Plus | ¥0.8 to ¥4.8 | Plan-dependent | ¥2 to ¥64 | Tiered up to 1M prompt length | Thinking mode and longer prompts move to higher output tiers. | Low base price, but tier jumps are large above 128K tokens. |
| Tencent Hunyuan | Hunyuan TurboS / T1 | ¥0.8 to ¥1.0 | Not listed | ¥2.0 to ¥4.0 | Model-dependent TokenHub billing | Public docs point developers to model-specific TokenHub pricing. | Regional billing and model aliases require a console check before launch. |
| Xiaomi MiMo | MiMo V2.5 Pro | ¥7.35 | ¥1.47 | ¥22.05 | Tier shown for prompts up to 256K | Also offers Token Plan packages; compare credit rules before coding-agent use. | Attractive for MiMo-specific workflows, but pricing style differs from global APIs. |
DeepSeek V4 Flash: use for extraction, routing, high-volume chat, and OpenAI-compatible fallback paths where raw cost matters.
Gemini 2.5 Pro: good for RAG, document analysis, and multimodal prototypes, but watch the higher tier above 200K prompt tokens.
Claude Sonnet 4.6: a strong default for coding assistants and agent loops when cache reads are intentionally reused.
China-region models (Qwen, Hunyuan, MiMo): use when latency, Chinese-language behavior, RMB billing, or domestic cloud integration matters.
Prompt caching is valuable for long system prompts, repositories, and documents, but not for one-off short calls.
Gemini, Qwen, MiMo, and similar providers may change price when prompt length crosses a threshold.
Search, code execution, grounding, images, voice, and batch modes can have separate billing from text tokens.
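To make the threshold effect concrete, here is a minimal sketch of tiered request pricing. It uses the Gemini 2.5 Pro figures from the table above as snapshot values. The names `Tier` and `request_cost` are illustrative, and the sketch assumes the whole request moves to the higher tier once the prompt exceeds 200K tokens. Cached input and tool charges are not modeled; check the provider's pricing page for the actual rule.

```python
from dataclasses import dataclass

# Snapshot values from the table's Gemini 2.5 Pro row (USD per 1M tokens).
TIER_THRESHOLD = 200_000  # prompt tokens


@dataclass(frozen=True)
class Tier:
    input_per_m: float
    output_per_m: float


LOW = Tier(input_per_m=1.25, output_per_m=10.00)
HIGH = Tier(input_per_m=2.50, output_per_m=15.00)


def request_cost(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request under two-tier pricing.

    Assumption: a prompt above the threshold bills the entire request
    (input and output) at the higher tier.
    """
    tier = HIGH if prompt_tokens > TIER_THRESHOLD else LOW
    return (
        prompt_tokens * tier.input_per_m + output_tokens * tier.output_per_m
    ) / 1_000_000


if __name__ == "__main__":
    # Compare a prompt just under the threshold with one just over it.
    for prompt in (200_000, 200_001):
        print(prompt, round(request_cost(prompt, 1_000), 4))
```

Under these assumptions, one extra prompt token near the threshold roughly doubles the cost of the request, which is why trimming RAG chunks to stay below the tier boundary can matter more than small per-token price differences.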
Add routing, observability, caching, key isolation, and fallback controls before production traffic.
Store embeddings and retrieval context for RAG, semantic search, and knowledge-base applications.
Run model orchestration, webhook handlers, and background AI jobs without managing servers.
Input tokens are the prompt and context you send to the model. Output tokens are generated by the model and usually cost more: the prompt can be processed in parallel, but each output token requires its own sequential inference step.
DeepSeek and several China-region models have the lowest headline token prices, and Grok 4.3 is competitive among USD-billed models. Which model is actually cheapest for a given workload depends on output length, cache hit rate, concurrency, and regional latency.
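A small sketch can show how cache hit rate and output length change the comparison. It uses three USD rows from the table above as snapshot prices. The function names and the `PRICES` dictionary are illustrative. Cache-write charges, concurrency limits, and latency are deliberately left out, so treat the result as a rough ordering rather than a bill.

```python
# Snapshot USD prices per 1M tokens, copied from the table above:
# (input, cached input, output)
PRICES = {
    "GPT-5.5": (5.00, 0.50, 30.00),
    "Claude Sonnet 4.6": (3.00, 0.30, 15.00),
    "DeepSeek V4 Flash": (0.14, 0.0028, 0.28),
}


def cost_per_request(
    model: str, input_tokens: int, output_tokens: int, cache_hit_rate: float
) -> float:
    """Blend fresh and cached input prices by the cache hit rate."""
    input_per_m, cached_per_m, output_per_m = PRICES[model]
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    return (
        fresh * input_per_m + cached * cached_per_m + output_tokens * output_per_m
    ) / 1_000_000


def rank_by_cost(
    input_tokens: int, output_tokens: int, cache_hit_rate: float
) -> list[str]:
    """Return model names ordered from cheapest to most expensive."""
    return sorted(
        PRICES,
        key=lambda m: cost_per_request(m, input_tokens, output_tokens, cache_hit_rate),
    )


if __name__ == "__main__":
    for model in rank_by_cost(10_000, 1_000, cache_hit_rate=0.5):
        print(model, cost_per_request(model, 10_000, 1_000, 0.5))
```

Running the same function with your own token counts and measured hit rate is more informative than comparing headline prices, because output-heavy workloads are dominated by the output column.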
You do not need a frontier model for every request. Use strong frontier models for hard reasoning, coding, or ambiguous tasks, but route extraction, classification, rewriting, and simple chat to cheaper models when quality is sufficient.
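The routing idea can be sketched as a lookup on task type. The model identifiers `cheap-model` and `frontier-model` and the task labels are hypothetical placeholders, not real API model names. The sketch assumes the caller already knows the task type; unknown types fall back to the stronger model as the conservative choice.

```python
# Hypothetical tier names; substitute real model IDs from your provider.
CHEAP_MODEL = "cheap-model"
FRONTIER_MODEL = "frontier-model"

# Routine task types that the text suggests sending to cheaper models.
CHEAP_TASKS = {"extraction", "classification", "rewriting", "simple_chat"}


def pick_model(task_type: str) -> str:
    """Route routine tasks to the cheap tier and everything else to the frontier tier."""
    return CHEAP_MODEL if task_type in CHEAP_TASKS else FRONTIER_MODEL


if __name__ == "__main__":
    for task in ("extraction", "coding", "unknown_task"):
        print(task, "->", pick_model(task))
```

In practice you would validate the cheap tier's quality on a sample of real requests before moving a task type into `CHEAP_TASKS`.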
Estimate average input tokens, cached input tokens, output tokens, retries, tool calls, and daily active users. Then set per-user caps and alert thresholds before public traffic arrives.
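The estimate above can be written as a short calculation. This is a sketch under stated assumptions: the `Workload` and `Prices` types, the `headroom` multiplier, and the sample numbers are all illustrative, retries are modeled as a flat fraction of extra calls, and tool-call charges are not included because they are billed separately from text tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Price per 1M tokens in the provider's billing currency."""
    input_per_m: float
    cached_per_m: float
    output_per_m: float


@dataclass(frozen=True)
class Workload:
    daily_active_users: int
    requests_per_user_per_day: float
    input_tokens: int         # average total input tokens per request
    cached_input_tokens: int  # portion of input_tokens served from cache
    output_tokens: int        # average output tokens per request
    retry_rate: float         # extra calls as a fraction, e.g. 0.1 = 10% more


def daily_cost(w: Workload, p: Prices) -> float:
    """Estimated spend per day across all users, including retries."""
    fresh = w.input_tokens - w.cached_input_tokens
    per_request = (
        fresh * p.input_per_m
        + w.cached_input_tokens * p.cached_per_m
        + w.output_tokens * p.output_per_m
    ) / 1_000_000
    calls = w.daily_active_users * w.requests_per_user_per_day * (1 + w.retry_rate)
    return per_request * calls


def per_user_daily_cap(w: Workload, p: Prices, headroom: float = 2.0) -> float:
    """Suggested per-user spending cap: average per-user cost times a headroom factor."""
    return headroom * daily_cost(w, p) / w.daily_active_users


if __name__ == "__main__":
    prices = Prices(input_per_m=3.00, cached_per_m=0.30, output_per_m=15.00)
    workload = Workload(1_000, 10, 4_000, 3_000, 500, retry_rate=0.1)
    print("daily:", daily_cost(workload, prices))
    print("per-user cap:", per_user_daily_cap(workload, prices))
```

An alert threshold can be derived the same way, for example by alerting when actual daily spend exceeds some multiple of `daily_cost`. The right multiple depends on how bursty your traffic is.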