Ollama Cloud Pricing: GPU-Time Billing for Hosted Models
Summary
Ollama has launched a tiered cloud offering (Free; Pro at $20/month or $200/year; Max) alongside its support for locally run models. Usage is billed by GPU time rather than a fixed token quota, so efficiency gains from better hardware directly benefit users over time. The free tier covers light usage with 1 concurrent cloud model; Pro adds 50x more cloud usage and 3 concurrent models.
Key Insights
- Pricing model is GPU-time-based, not token-capped — as hardware improves, you get more from the same plan. This is a meaningful differentiator vs. OpenAI/Anthropic fixed-rate token billing.
- Concurrency tiers matter for agent workflows: Free = 1, Pro = 3, Max = 10 simultaneous cloud models. Agentic pipelines needing parallel model calls require at least Pro.
- Privacy guarantees: no logging, no training on prompt/response data. NVIDIA Cloud Providers (NCPs) host the models under zero-retention contracts.
- Session limits reset every 5 hours; weekly limits reset every 7 days — relevant for planning continuous automation runs.
- “Additional usage at competitive per-token rates, including cache-aware pricing” is listed as coming soon, which will enable pay-as-you-go overflow.
- Cloud models run native weights (not quantized down) on NVIDIA Blackwell and Vera Rubin hardware, with NVFP4 acceleration where available.
- 40,000+ community integrations mean drop-in compatibility with most LLM toolchains.
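The concurrency tiers above are the key constraint for agentic pipelines. A minimal sketch of how a client might respect them, using a semaphore to cap in-flight calls at the Pro tier's limit of 3 (the `call_model` function is a hypothetical stand-in for a real Ollama cloud request, not part of any Ollama SDK):

```python
import asyncio

PRO_CONCURRENCY = 3  # Pro tier: up to 3 simultaneous cloud models

async def call_model(prompt: str) -> str:
    # Placeholder for a real API call; just echoes after a short delay.
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"

async def run_batch(prompts, limit=PRO_CONCURRENCY):
    sem = asyncio.Semaphore(limit)

    async def guarded(prompt):
        async with sem:  # blocks once `limit` calls are already in flight
            return await call_model(prompt)

    return await asyncio.gather(*(guarded(p) for p in prompts))

# All 8 prompts complete, but no more than 3 run at once.
results = asyncio.run(run_batch([f"task {i}" for i in range(8)]))
```

On the Free tier the same pattern with `limit=1` serializes calls; upgrading only changes the semaphore size, not the pipeline code.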
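For continuous automation, the 5-hour session window matters most. Assuming a rolling window that restarts from a known start time (the exact reset anchoring is an assumption; Ollama may anchor resets differently), remaining time before the next reset can be computed as:

```python
from datetime import datetime, timedelta

SESSION_WINDOW = timedelta(hours=5)  # session limits reset every 5 hours
WEEKLY_WINDOW = timedelta(days=7)    # weekly limits reset every 7 days

def time_until_reset(window_start: datetime, now: datetime,
                     window: timedelta = SESSION_WINDOW) -> timedelta:
    """Time left in the current limit window, assuming rolling resets."""
    elapsed = now - window_start
    # elapsed % window gives position within the current window
    return window - (elapsed % window)

start = datetime(2025, 1, 6, 9, 0)
now = datetime(2025, 1, 6, 12, 30)
remaining = time_until_reset(start, now)  # 1h30m left in this 5h window
```

A long automation run can check `time_until_reset` before dispatching a batch and defer work that would straddle a reset.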
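The drop-in compatibility claim rests on Ollama exposing an OpenAI-style chat-completions interface, so existing toolchains can switch by changing the base URL. A sketch of the request shape (the localhost URL, port, and model name are illustrative assumptions, not confirmed by the source):

```python
import json

# Assumed OpenAI-compatible endpoint exposed by a local Ollama instance.
BASE_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, user_msg: str) -> str:
    """Serialize an OpenAI-style chat payload as the request body."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }
    return json.dumps(payload)

# Hypothetical cloud model name for illustration only.
body = build_chat_request("gpt-oss:120b-cloud", "hello")
```

Because the payload shape matches what OpenAI-client toolchains already emit, pointing them at an Ollama endpoint needs no request rewriting.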