LLM API pricing


Compare mainstream LLM API prices by input tokens, cached input tokens, output tokens, context window, and billing caveats. Prices are a snapshot for developer planning; check each provider's pricing page before going to production.

Pricing snapshot checked: 2026-05-23

Global LLM API price matrix

Prices are per 1M tokens unless noted. USD ($) and CNY (¥) rows stay in their original billing currency, so regional providers can be compared without hiding exchange-rate risk.
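Keeping prices in their billing currency means any cross-currency comparison needs an explicit exchange rate. A minimal sketch, assuming a hypothetical `Price` type and a caller-supplied `cny_to_usd` rate (the 0.14 below is a placeholder, not a quoted market rate):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """A per-1M-token price kept in its original billing currency."""
    amount: float
    currency: str  # "USD" or "CNY" in this sketch


def to_usd(price: Price, cny_to_usd: float) -> Price:
    """Convert a CNY price to USD with an explicit, caller-supplied rate.

    The rate is a required argument so the exchange-rate assumption is
    visible at every call site instead of hidden in a constant.
    """
    if price.currency == "USD":
        return price
    if price.currency == "CNY":
        return Price(price.amount * cny_to_usd, "USD")
    raise ValueError(f"unsupported currency: {price.currency}")


def cheaper(a: Price, b: Price) -> Price:
    """Return the cheaper price; refuse to compare mixed currencies."""
    if a.currency != b.currency:
        raise ValueError("convert both prices to one currency before comparing")
    return a if a.amount <= b.amount else b


mimo_input = Price(7.35, "CNY")  # MiMo V2.5 Pro input price from the table
placeholder_rate = 0.14          # placeholder only; use your finance source's rate
print(to_usd(mimo_input, placeholder_rate))
```

Raising on mixed currencies, rather than converting silently, keeps the exchange-rate risk in view whenever a ¥ row is compared with a $ row.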

PROVIDER	MODEL	INPUT / 1M	CACHED INPUT / 1M	OUTPUT / 1M	CONTEXT	BILLING NOTE	KEY CONSTRAINTS
OpenAI	GPT-5.5	$5.00	$0.50	$30.00	Standard pricing under 270K context	Batch can reduce token price, while data residency adds a surcharge.	Premium frontier model; output-heavy workloads become expensive quickly.
Anthropic	Claude Sonnet 4.6	$3.00	$0.30 (cache hits)	$15.00	Long-context coding and agent work	Cache writes and cache hits are priced separately.	Great for code and agents, but cache strategy matters for repeated context.
Google Gemini	Gemini 2.5 Pro	$1.25 / $2.50	$0.125 / $0.25	$10.00 / $15.00	Tier changes above 200K prompt tokens	Free tier exists, but paid rates depend on prompt length and mode.	Long prompts double the input tier; budget RAG chunks carefully.
xAI	Grok 4.3	$1.25	Not listed	$2.50	1M tokens	Search tools and media APIs are billed outside text tokens.	Strong headline price, but verify tool charges for realtime search workloads.
DeepSeek	DeepSeek V4 Flash	$0.14	$0.0028	$0.28	1M context, 384K max output	OpenAI-compatible and Anthropic-compatible endpoints are both listed.	Very low price; confirm promotion windows and concurrency limits.
Alibaba Qwen	Qwen-Plus	¥0.8 to ¥4.8	Plan-dependent	¥2 to ¥64	Tiered up to 1M prompt length	Thinking mode and longer prompts move to higher output tiers.	Low base price, but tier jumps are large above 128K tokens.
Tencent Hunyuan	Hunyuan TurboS / T1	¥0.8 to ¥1.0	Not listed	¥2.0 to ¥4.0	Model-dependent TokenHub billing	Public docs point developers to model-specific TokenHub pricing.	Regional billing and model aliases require a console check before launch.
Xiaomi MiMo	MiMo V2.5 Pro	¥7.35	¥1.47	¥22.05	Tier shown for prompts up to 256K	Also offers Token Plan packages; compare credit rules before coding-agent use.	Attractive for MiMo-specific workflows, but pricing style differs from global APIs.

Practical picks by workload

Lowest transparent token cost

DeepSeek V4 Flash

Use for extraction, routing, high-volume chat, and OpenAI-compatible fallback paths where raw cost matters.

Balanced long-context reasoning

Gemini 2.5 Pro

Good for RAG, document analysis, and multimodal prototypes, but watch the higher tier above 200K prompt tokens.

Coding and agent workflows

Claude Sonnet 4.6

A strong default for coding assistants and agent loops, especially when repeated context (system prompts, repository files) is cached deliberately so later calls are billed at the cache-hit rate.

China-region value stack

Qwen / Hunyuan

Use when latency, Chinese-language behavior, RMB billing, or domestic cloud integration matters.

How to read LLM API pricing

Compare output price first

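For every USD model in the table, output tokens cost 2x to 8x as much as input tokens, so chat, coding, and agent workloads often spend more on generated tokens than on prompt tokens, especially when retries regenerate full answers.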
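A quick per-request split shows why output price deserves the first look. This sketch uses the GPT-5.5 list prices from the table; the 2,000-input / 500-output request shape is an illustrative assumption, not measured traffic:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> tuple[float, float]:
    """Return (input_cost, output_cost) for one request, in the price currency."""
    input_cost = input_tokens * input_per_m / 1_000_000
    output_cost = output_tokens * output_per_m / 1_000_000
    return input_cost, output_cost


# GPT-5.5 from the table: $5.00 input, $30.00 output per 1M tokens.
inp, out = request_cost(2_000, 500, 5.00, 30.00)
print(f"input ${inp:.4f}, output ${out:.4f}, output share {out / (inp + out):.0%}")
```

Even with four times as many prompt tokens as generated tokens, the output side is the larger line item here, and each retry pays it again.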

Cache only helps if prompts repeat

Prompt caching is valuable for long system prompts, repositories, and documents, but not for one-off short calls.
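The benefit of caching can be expressed as a blended input price that depends on the cache hit rate. A minimal sketch using the Claude Sonnet 4.6 figures from the table; it assumes cache-write surcharges and storage fees are zero, which understates real cost for providers that bill cache writes separately:

```python
def effective_input_price(input_per_m: float, cached_per_m: float,
                          hit_rate: float) -> float:
    """Blended input price per 1M tokens for a given cache hit rate.

    Assumption: cache writes and cache storage are ignored. Add them
    separately for providers that charge for them.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_per_m + (1.0 - hit_rate) * input_per_m


# Claude Sonnet 4.6 from the table: $3.00 input, $0.30 cache hits.
for rate in (0.0, 0.5, 0.8):
    print(rate, round(effective_input_price(3.00, 0.30, rate), 3))
```

One-off short calls sit at a hit rate near zero, so the blended price stays at the full input rate; a long system prompt reused across many calls pushes it toward the cache-hit price.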

Watch tiered long-context pricing

Gemini, Qwen, MiMo, and similar providers may charge a higher rate once prompt length crosses a threshold, such as 200K prompt tokens for Gemini 2.5 Pro or 128K for Qwen-Plus.
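A tiered price can be modeled as a list of prompt-length bounds. This sketch uses the Gemini 2.5 Pro figures from the table and assumes that the prompt length selects one tier for the whole request (input and output) and that the 200K boundary is inclusive; both are assumptions to confirm on the provider's pricing page:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    max_prompt_tokens: float  # inclusive upper bound on prompt length
    input_per_m: float
    output_per_m: float


# Gemini 2.5 Pro figures from the table (USD per 1M tokens).
GEMINI_25_PRO = [
    Tier(200_000, 1.25, 10.00),
    Tier(float("inf"), 2.50, 15.00),
]


def tiered_cost(prompt_tokens: int, output_tokens: int, tiers: list[Tier]) -> float:
    """Price a request at the first tier whose bound covers the prompt.

    Assumption: the selected tier applies to every token in the request,
    not only to the tokens above the threshold.
    """
    for tier in tiers:
        if prompt_tokens <= tier.max_prompt_tokens:
            return (prompt_tokens * tier.input_per_m
                    + output_tokens * tier.output_per_m) / 1_000_000
    raise ValueError("no tier covers this prompt length")


print(tiered_cost(150_000, 2_000, GEMINI_25_PRO))
print(tiered_cost(250_000, 2_000, GEMINI_25_PRO))
```

Under these assumptions the 250K-token request costs more than three times the 150K one, although its prompt is only about 1.7 times longer, which is why RAG context budgets should stay under the threshold when possible.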

Include tools in the budget

Search, code execution, grounding, images, voice, and batch modes can have separate billing from text tokens.
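Separately billed tools can be added as explicit per-call line items next to token cost. In this sketch the tool names and fees are hypothetical placeholders; real figures come from each provider's tool or grounding price list:

```python
def request_budget(token_cost: float, tool_calls: dict[str, int],
                   tool_fees: dict[str, float]) -> float:
    """Add separately billed tool charges to the token cost of one request.

    tool_fees maps a tool name to a per-call fee in the same currency as
    token_cost. A called tool with no recorded fee raises KeyError, so an
    unpriced tool cannot silently count as free.
    """
    missing = set(tool_calls) - set(tool_fees)
    if missing:
        raise KeyError(f"no fee recorded for tools: {sorted(missing)}")
    return token_cost + sum(n * tool_fees[name] for name, n in tool_calls.items())


# Hypothetical per-call fees, for illustration only.
fees = {"web_search": 0.01, "code_exec": 0.002}
print(request_budget(0.004, {"web_search": 3, "code_exec": 1}, fees))
```

With these placeholder numbers, three search calls cost several times the request's own token cost, which is the kind of gap the "Include tools in the budget" advice is about.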

Related categories

AI Gateway

Add routing, observability, caching, key isolation, and fallback controls before production traffic.
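Fallback control, one of the gateway features listed above, reduces to trying providers in order until one succeeds. A minimal sketch, assuming each provider is wrapped in a caller-supplied function; `flaky_primary` and `cheap_fallback` are hypothetical stand-ins, not real SDK calls:

```python
from typing import Callable


class AllProvidersFailed(Exception):
    """Raised when every provider in the fallback chain fails."""


def call_with_fallback(prompt: str,
                       providers: list[tuple[str, Callable[[str], str]]]) -> tuple[str, str]:
    """Try providers in order; return (provider_name, reply) for the first success.

    Errors are collected so the final exception shows why each provider failed.
    """
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real gateway would narrow this to timeouts, 429s, 5xx
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real client calls.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def cheap_fallback(prompt: str) -> str:
    return f"echo: {prompt}"


print(call_with_fallback("hello", [("primary", flaky_primary), ("fallback", cheap_fallback)]))
```

Returning the provider name with the reply lets observability and cost tracking attribute each response to the model that actually served it.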

Vector Databases

Store embeddings and retrieval context for RAG, semantic search, and knowledge-base applications.

Serverless Functions

Run model orchestration, webhook handlers, and background AI jobs without managing servers.

LLM pricing FAQ

Why does LLM API pricing separate input and output tokens?

Input tokens are the prompt and context you send to the model. Output tokens are generated by the model and usually cost more, because the model produces them one token at a time, while the prompt can be processed in parallel.

Which provider is cheapest for high-volume text generation?

DeepSeek and several China-region models have very low headline token prices, while Grok 4.3 is also competitive in USD billing. The real answer depends on output length, cache hit rate, concurrency, and regional latency.
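Ranking the USD rows from the table for a specific request shape makes the "it depends" concrete. This sketch prices Gemini 2.5 Pro at its base tier only, leaves out the CNY rows to avoid a currency conversion, and assumes that a model with no listed cache price bills cached tokens at the full input rate:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UsdPricing:
    model: str
    input_per_m: float
    cached_per_m: Optional[float]  # None where the table says "Not listed"
    output_per_m: float


USD_ROWS = [
    UsdPricing("GPT-5.5", 5.00, 0.50, 30.00),
    UsdPricing("Claude Sonnet 4.6", 3.00, 0.30, 15.00),
    UsdPricing("Gemini 2.5 Pro", 1.25, 0.125, 10.00),  # base tier, prompts up to 200K
    UsdPricing("Grok 4.3", 1.25, None, 2.50),
    UsdPricing("DeepSeek V4 Flash", 0.14, 0.0028, 0.28),
]


def request_cost(p: UsdPricing, input_tokens: int, cached_tokens: int,
                 output_tokens: int) -> float:
    """USD cost of one request; cached_tokens is the cached share of input_tokens."""
    # Assumption: with no listed cache price, cached tokens bill at the full input rate.
    cached_rate = p.cached_per_m if p.cached_per_m is not None else p.input_per_m
    uncached = input_tokens - cached_tokens
    return (uncached * p.input_per_m
            + cached_tokens * cached_rate
            + output_tokens * p.output_per_m) / 1_000_000


def rank(rows: list[UsdPricing], input_tokens: int, cached_tokens: int,
         output_tokens: int) -> list[tuple[str, float]]:
    """Return (model, cost) pairs sorted from cheapest to most expensive."""
    costs = [(r.model, request_cost(r, input_tokens, cached_tokens, output_tokens))
             for r in rows]
    return sorted(costs, key=lambda pair: pair[1])


for model, cost in rank(USD_ROWS, 3_000, 1_000, 800):
    print(f"{model:<20} ${cost:.6f}")
```

Changing the request shape can reorder the middle of the list: with 100,000 input tokens, 90,000 of them cached, and 100 output tokens, Gemini 2.5 Pro comes out cheaper than Grok 4.3 under the cache assumption above. Concurrency limits and regional latency are not captured by this calculation.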

Should I choose the most expensive frontier model by default?

No. Use strong frontier models for hard reasoning, coding, or ambiguous tasks, but route extraction, classification, rewriting, and simple chat to cheaper models when quality is sufficient.
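The routing rule in this answer can start as a simple lookup. In this sketch the task labels, the `cheap_quality_ok` flag (standing in for an offline quality check), and the model assignments are hypothetical illustrations drawn from the "Practical picks" section, not a tested routing policy:

```python
# Hypothetical task labels; production routers often use a classifier or heuristics.
CHEAP_TASKS = {"extraction", "classification", "rewriting", "simple_chat"}
CODING_TASKS = {"coding", "agent"}


def pick_model(task: str, cheap_quality_ok: bool = True) -> str:
    """Route routine tasks to a low-cost model and hard tasks to stronger ones.

    cheap_quality_ok is a hypothetical flag: set it to False when evaluation
    shows the cheap model falls short on this task.
    """
    if task in CHEAP_TASKS and cheap_quality_ok:
        return "DeepSeek V4 Flash"
    if task in CODING_TASKS:
        return "Claude Sonnet 4.6"
    return "GPT-5.5"  # hard reasoning, ambiguous, or unrecognized tasks


for task in ("classification", "coding", "hard_reasoning"):
    print(task, "->", pick_model(task))
```

Defaulting unknown tasks to the frontier model trades cost for safety; a cost-first system might invert that default and escalate only on low-confidence answers.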

How can I estimate monthly LLM API cost before launch?

Estimate average input tokens, cached input tokens, output tokens, retries, tool calls, and daily active users. Then set per-user caps and alert thresholds before public traffic arrives.
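The estimate described in this answer can be sketched as a small calculator. The workload numbers below are illustrative assumptions, the prices are the Claude Sonnet 4.6 rows from the table (cache-write fees ignored), and the cap multiplier and alert fraction are hypothetical defaults to tune:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    daily_active_users: int
    requests_per_user_per_day: float
    input_tokens: int             # uncached prompt tokens per request
    cached_input_tokens: int      # prompt tokens served from cache per request
    output_tokens: int
    retry_rate: float             # extra attempts per request, e.g. 0.05 = 5%
    tool_cost_per_request: float  # separately billed tools, in USD


@dataclass(frozen=True)
class Pricing:  # USD per 1M tokens
    input_per_m: float
    cached_per_m: float
    output_per_m: float


def monthly_cost(w: Workload, p: Pricing, days: int = 30) -> float:
    token_cost = (w.input_tokens * p.input_per_m
                  + w.cached_input_tokens * p.cached_per_m
                  + w.output_tokens * p.output_per_m) / 1_000_000
    # Assumption: a retry repeats the full request, including tool calls.
    per_request = (token_cost + w.tool_cost_per_request) * (1 + w.retry_rate)
    requests = w.daily_active_users * w.requests_per_user_per_day * days
    return per_request * requests


def budget_guards(monthly: float, w: Workload, days: int = 30,
                  cap_multiplier: float = 3.0,
                  alert_fraction: float = 0.8) -> dict[str, float]:
    """Derive a per-user daily cap and a monthly spend alert threshold."""
    per_user_daily = monthly / days / w.daily_active_users
    return {
        "per_user_daily_cap": per_user_daily * cap_multiplier,
        "monthly_alert_at": monthly * alert_fraction,
    }


workload = Workload(1_000, 10, 2_000, 6_000, 600, 0.05, 0.0)
sonnet = Pricing(3.00, 0.30, 15.00)
estimate = monthly_cost(workload, sonnet)
print(round(estimate, 2), budget_guards(estimate, workload))
```

Setting the per-user cap at a multiple of the expected average leaves room for heavy but legitimate users while still stopping runaway loops before they reach the monthly alert.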