LLM API pricing

Best Free LLM APIs (Gemini, Groq, Cohere) 2026

Compare mainstream LLM API prices by input tokens, cached input, output tokens, context window, and billing caveats. Prices are a snapshot for developer planning; check each provider's pricing page before going to production.

Pricing snapshot checked: 2026-05-23

Global LLM API price matrix

Prices are per 1M tokens. USD and CNY prices are kept in their original billing currency, so regional providers can be compared without hiding exchange-rate risk.

| Provider | Model | Input / 1M | Cached input / 1M | Output / 1M | Context / tier | Billing note | Key constraints |
|---|---|---|---|---|---|---|---|
| OpenAI | GPT-5.5 | $5.00 | $0.50 | $30.00 | Standard pricing under 270K context | Batch can reduce token price, while data residency adds a surcharge. | Premium frontier model; output-heavy workloads become expensive quickly. |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $0.30 (cache hits) | $15.00 | Long-context coding and agent work | Cache writes and cache hits are priced separately. | Great for code and agents, but cache strategy matters for repeated context. |
| Google Gemini | Gemini 2.5 Pro | $1.25 / $2.50 | $0.125 / $0.25 | $10.00 / $15.00 | Tier changes above 200K prompt tokens | Free tier exists, but paid rates depend on prompt length and mode. | Long prompts double the input tier; budget RAG chunks carefully. |
| xAI | Grok 4.3 | $1.25 | Not listed | $2.50 | 1M tokens | Search tools and media APIs are billed outside text tokens. | Strong headline price, but verify tool charges for realtime search workloads. |
| DeepSeek | DeepSeek V4 Flash | $0.14 | $0.0028 | $0.28 | 1M context, 384K max output | OpenAI-compatible and Anthropic-compatible endpoints are both listed. | Very low price; confirm promotion windows and concurrency limits. |
| Alibaba Qwen | Qwen-Plus | ¥0.8 to ¥4.8 | Plan-dependent | ¥2 to ¥64 | Tiered up to 1M prompt length | Thinking mode and longer prompts move to higher output tiers. | Low base price, but tier jumps are large above 128K tokens. |
| Tencent Hunyuan | Hunyuan TurboS / T1 | ¥0.8 to ¥1.0 | Not listed | ¥2.0 to ¥4.0 | Model-dependent TokenHub billing | Public docs point developers to model-specific TokenHub pricing. | Regional billing and model aliases require a console check before launch. |
| Xiaomi MiMo | MiMo V2.5 Pro | ¥7.35 | ¥1.47 | ¥22.05 | Tier shown for prompts up to 256K | Also offers Token Plan packages; compare credit rules before coding-agent use. | Attractive for MiMo-specific workflows, but pricing style differs from global APIs. |


How to read LLM API pricing

Compare output price first

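Every USD-priced model in the table charges more per output token than per input token, from 2x (Grok 4.3, DeepSeek V4 Flash) to 8x (Gemini 2.5 Pro at its base tier). Chat, coding, and agent workloads that produce long responses or retry often can therefore spend more on generated tokens than on prompt tokens.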
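A small sketch of that comparison, splitting one call's cost into its input and output parts. It uses GPT-5.5 list prices from the table; the token counts are an assumed example call, not measured data, and `request_cost` is a hypothetical helper name.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> tuple[float, float]:
    """Return (input_cost, output_cost) for one call; prices are per 1M tokens."""
    return (input_tokens * input_price / 1_000_000,
            output_tokens * output_price / 1_000_000)


# GPT-5.5 list prices from the table (USD per 1M tokens); the token counts
# are an illustrative call, not measured data.
inp, out = request_cost(3_000, 1_500, input_price=5.00, output_price=30.00)
print(f"input ${inp:.4f}, output ${out:.4f}, output share {out / (inp + out):.0%}")
# -> input $0.0150, output $0.0450, output share 75%

# Output spend overtakes input spend once
# output_tokens > input_tokens * (input_price / output_price).
breakeven_output_tokens = 3_000 * 5.00 / 30.00  # 500 tokens for a 3,000-token prompt
```

At a 6x output-to-input price ratio, output costs more than input whenever a response is longer than one sixth of its prompt. A retry repeats the whole call, so it multiplies both terms rather than just the output.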

Cache only helps if prompts repeat

Prompt caching pays off when the same long prefix, such as a system prompt, a repository, or a document, is sent across many calls. It does little for one-off short calls.
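One way to see why repetition matters is a blended input price that weights the cached and uncached rates by the cache hit rate. This is a simplified model, assuming a single hit rate per workload; the `write_premium` parameter is a hypothetical knob for providers that bill cache writes separately (the table notes this for Anthropic but lists no write price), and it defaults to zero.

```python
def effective_input_price(full: float, cached: float, hit_rate: float,
                          write_premium: float = 0.0) -> float:
    """Blended price per 1M input tokens for a given cache hit rate.

    hit_rate: fraction of input tokens served from cache (0 to 1).
    write_premium: extra price per 1M tokens on misses that are written to
    cache. Defaults to 0.0 because the table lists no cache-write price;
    providers that bill cache writes separately need a real value here.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached + (1.0 - hit_rate) * (full + write_premium)


# GPT-5.5 from the table: $5.00 full input, $0.50 cached input.
for rate in (0.0, 0.5, 0.9):
    print(f"hit rate {rate:.0%}: ${effective_input_price(5.00, 0.50, rate):.2f} per 1M")
```

A one-off short call has an effective hit rate of zero and pays the full $5.00, while a long prefix reused at a 90% hit rate drops to about $0.95 per 1M input tokens. With a positive `write_premium` and a low hit rate, the blended price rises above the uncached rate, which is when caching stops paying off.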

Watch tiered long-context pricing

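Gemini 2.5 Pro (above 200K prompt tokens), Qwen-Plus (large jumps above 128K), MiMo V2.5 Pro (the listed tier covers prompts up to 256K), and similar models change price when prompt length crosses a threshold.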
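The sketch below prices a request under threshold-based tiers, using the Gemini 2.5 Pro figures from the table. It assumes the tier is chosen by prompt length and then applies to the whole request, input and output alike; confirm that rule on the provider's pricing page. `Tier`, `GEMINI_25_PRO`, and `tiered_cost` are hypothetical names.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    max_prompt_tokens: float  # inclusive upper bound on prompt length
    input_price: float        # per 1M tokens
    output_price: float       # per 1M tokens


# Gemini 2.5 Pro figures from the table. Assumption: the tier is chosen by
# prompt length and applies to the whole request, input and output alike.
GEMINI_25_PRO = [Tier(200_000, 1.25, 10.00), Tier(float("inf"), 2.50, 15.00)]


def tiered_cost(prompt_tokens: int, output_tokens: int, tiers: list[Tier]) -> float:
    """Cost of one request under threshold pricing; tiers sorted by ascending bound."""
    for tier in tiers:
        if prompt_tokens <= tier.max_prompt_tokens:
            return (prompt_tokens * tier.input_price
                    + output_tokens * tier.output_price) / 1_000_000
    raise ValueError("no tier covers this prompt length")


print(f"{tiered_cost(200_000, 2_000, GEMINI_25_PRO):.4f}")  # at the threshold
print(f"{tiered_cost(200_001, 2_000, GEMINI_25_PRO):.4f}")  # one token over
```

Under this assumption, one extra prompt token moves the request from about $0.27 to about $0.53, so keeping retrieved RAG context at or under the threshold matters more than trimming a few tokens elsewhere.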

Include tools in the budget

Search, code execution, grounding, images, and voice are often billed separately from text tokens, and batch modes have their own rates.
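A budget that only counts tokens can miss most of the cost of a tool-heavy call. The sketch adds per-use tool fees to the token cost. Grok 4.3 token prices come from the table, but the `web_search` name and its $0.005 per-use fee are HYPOTHETICAL placeholders, not a real xAI price; `call_cost` is an illustrative helper.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float,
              tool_calls: dict[str, int], tool_fees: dict[str, float]) -> float:
    """Token cost plus per-use tool fees for one request.

    tool_fees maps a tool name to its price per use. A tool missing from
    tool_fees raises KeyError on purpose, so an unpriced tool cannot
    silently count as free.
    """
    tokens = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    tools = sum(tool_fees[name] * count for name, count in tool_calls.items())
    return tokens + tools


# Grok 4.3 token prices from the table. The $0.005 search fee is a
# HYPOTHETICAL placeholder; take real tool prices from the provider's page.
HYPOTHETICAL_FEES = {"web_search": 0.005}
total = call_cost(4_000, 800, 1.25, 2.50, {"web_search": 2}, HYPOTHETICAL_FEES)
print(f"${total:.4f}")
```

With these assumed numbers, the text tokens cost $0.007 while two searches add $0.010, so the tools outweigh the tokens.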


LLM pricing FAQ

Why does LLM API pricing separate input and output tokens?

Input tokens are the prompt and context you send to the model. Output tokens are generated by the model one at a time, which takes more compute per token than reading the prompt, so they usually cost more.

Which provider is cheapest for high-volume text generation?

DeepSeek V4 Flash ($0.14 input and $0.28 output per 1M tokens) and several China-region models have very low headline token prices, and Grok 4.3 is also competitive among USD-billed models. The real answer depends on output length, cache hit rate, concurrency limits, and regional latency.
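To make "it depends" concrete, the sketch ranks three USD-billed models from the table for two assumed workloads. It assumes Grok 4.3's cached input bills at the full input rate (the table lists no cached price) and uses Gemini 2.5 Pro's base tier, which covers prompts up to 200K tokens. Concurrency and latency are not modeled. `PRICES`, `workload_cost`, and `rank` are hypothetical names.

```python
# USD list prices per 1M tokens from the table: (input, cached input, output).
# Grok 4.3 has no listed cached price, so cached input is assumed to bill at
# the full input rate. Gemini 2.5 Pro uses its base (<=200K prompt) tier.
PRICES = {
    "DeepSeek V4 Flash": (0.14, 0.0028, 0.28),
    "Grok 4.3": (1.25, 1.25, 2.50),
    "Gemini 2.5 Pro": (1.25, 0.125, 10.00),
}


def workload_cost(prices: tuple[float, float, float], input_tokens: int,
                  output_tokens: int, hit_rate: float) -> float:
    """Cost of one request given the fraction of input served from cache."""
    full, cached, out = prices
    cached_tokens = input_tokens * hit_rate
    fresh_tokens = input_tokens - cached_tokens
    return (fresh_tokens * full + cached_tokens * cached
            + output_tokens * out) / 1_000_000


def rank(input_tokens: int, output_tokens: int, hit_rate: float) -> list[str]:
    """Model names ordered from cheapest to most expensive for this workload."""
    costs = {name: workload_cost(p, input_tokens, output_tokens, hit_rate)
             for name, p in PRICES.items()}
    return sorted(costs, key=costs.get)


# Long cached context with short answers vs. short prompts with long answers.
print(rank(100_000, 500, hit_rate=0.9))
print(rank(1_000, 8_000, hit_rate=0.0))
```

DeepSeek V4 Flash stays cheapest in both cases, but Gemini 2.5 Pro and Grok 4.3 swap places: Gemini's cached-input discount wins on long, repeated context, while Grok's lower output price wins when answers are long.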

Should I choose the most expensive frontier model by default?

No. Use strong frontier models for hard reasoning, coding, or ambiguous tasks, but route extraction, classification, rewriting, and simple chat to cheaper models when quality is sufficient.
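A routing rule along those lines can be as simple as a lookup. In this sketch the task labels are hypothetical, and pairing DeepSeek V4 Flash (cheap) with GPT-5.5 (frontier) is only an example drawn from the table, not a recommendation. A task moves to the cheap model only after it appears in `passed_quality_check`, standing in for your own evaluation results.

```python
# Task labels and model choices are illustrative; the model names come from
# the price table, but this pairing is an example, not a recommendation.
ROUTINE_TASKS = {"extraction", "classification", "rewriting", "simple_chat"}
CHEAP_MODEL = "DeepSeek V4 Flash"
FRONTIER_MODEL = "GPT-5.5"


def pick_model(task: str, passed_quality_check: set[str]) -> str:
    """Route routine tasks to the cheap model once they pass a quality check.

    passed_quality_check holds the task labels whose outputs on CHEAP_MODEL
    were judged good enough by your own evals; everything else, including
    hard reasoning and coding, stays on the frontier model.
    """
    if task in ROUTINE_TASKS and task in passed_quality_check:
        return CHEAP_MODEL
    return FRONTIER_MODEL


validated = {"extraction", "classification"}
for task in ("classification", "rewriting", "coding"):
    print(task, "->", pick_model(task, validated))
```

Here "rewriting" stays on the frontier model until it passes a quality check, which matches the "when quality is sufficient" condition rather than routing on task type alone.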

How can I estimate monthly LLM API cost before launch?

Estimate average input tokens, cached input tokens, output tokens, retries, tool calls, and daily active users. Then set per-user caps and alert thresholds before public traffic arrives.
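The estimate above can be sketched as a per-request cost multiplied by traffic and retries, plus two guard functions for a monthly alert threshold and a per-user daily cap. Claude Sonnet 4.6 prices come from the table. Cache-write charges are not modeled, retries are assumed to repeat the whole call (tools included), and the traffic figures and 80% alert fraction are hypothetical inputs.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Average per-request usage; every field is an estimate you supply."""
    daily_active_users: int
    requests_per_user_per_day: float
    input_tokens: int            # fresh (uncached) input per request
    cached_input_tokens: int     # input served from cache per request
    output_tokens: int           # generated tokens per request
    retry_rate: float = 0.0      # extra calls per request, e.g. 0.1 = 10%
    tool_cost_per_request: float = 0.0  # average tool fees, same currency


def monthly_cost(u: Usage, input_price: float, cached_price: float,
                 output_price: float, days: int = 30) -> float:
    """Projected monthly spend; token prices are per 1M tokens.
    Retries are assumed to repeat the whole call, tools included."""
    per_request = (u.input_tokens * input_price
                   + u.cached_input_tokens * cached_price
                   + u.output_tokens * output_price) / 1_000_000
    per_request += u.tool_cost_per_request
    requests = u.daily_active_users * u.requests_per_user_per_day * days
    return per_request * requests * (1 + u.retry_rate)


def budget_status(spent: float, monthly_budget: float, alert_at: float = 0.8) -> str:
    """'block' at or above the budget, 'alert' past the alert fraction, else 'ok'."""
    if spent >= monthly_budget:
        return "block"
    if spent >= alert_at * monthly_budget:
        return "alert"
    return "ok"


def within_user_cap(spent_today: float, next_request: float, daily_cap: float) -> bool:
    """Allow a request only if it keeps the user at or under the daily cap."""
    return spent_today + next_request <= daily_cap


# Claude Sonnet 4.6 prices from the table ($3.00 input, $0.30 cache hits,
# $15.00 output). Cache-write charges are not modeled; traffic is hypothetical.
usage = Usage(daily_active_users=1_000, requests_per_user_per_day=5,
              input_tokens=2_000, cached_input_tokens=6_000,
              output_tokens=500, retry_rate=0.1)
estimate = monthly_cost(usage, 3.00, 0.30, 15.00)
print(f"${estimate:,.2f} per month")
print(budget_status(spent=2_100, monthly_budget=estimate))
```

With these inputs, one request costs about $0.0153 and the month comes to roughly $2,524.50 including 10% retries. Checking `budget_status` on every billing update and `within_user_cap` before each call puts both limits in place before public traffic arrives.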