Ollama Cloud Pricing: GPU-Time Billing for Hosted Models

ollama · llm · self-hosting · cloud-inference · local-ai · pricing · open-models

Summary

Ollama has launched a tiered cloud offering (Free, Pro at $20/month or $200/year, and Max) alongside its local-run model support. Usage is metered by GPU time rather than a fixed token allowance, so efficiency gains from better hardware translate directly into more usage per plan over time. The free tier covers light usage with 1 concurrent cloud model; Pro adds 50x more cloud usage and 3 concurrent models.
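The difference between GPU-time and token-capped billing can be made concrete with a little arithmetic. The sketch below uses entirely hypothetical numbers (the budget and throughput figures are assumptions, not Ollama's actual rates) to show why a fixed GPU-time budget yields more tokens as hardware throughput improves:

```python
# Illustrative only: hypothetical numbers, not Ollama's actual rates.
def tokens_per_plan(gpu_seconds_budget: float, tokens_per_gpu_second: float) -> float:
    """Under GPU-time billing, a fixed time budget yields more tokens
    as hardware throughput improves."""
    return gpu_seconds_budget * tokens_per_gpu_second

budget = 10_000.0                          # assumed monthly GPU-seconds on a plan
today = tokens_per_plan(budget, 500.0)     # assumed 500 tokens per GPU-second
faster = tokens_per_plan(budget, 1_000.0)  # hardware doubles in throughput

assert faster == 2 * today  # same plan, twice the output
```

Under token-capped billing the cap stays fixed regardless of hardware; under GPU-time billing the effective cap rises with throughput, which is the differentiator the pricing page leans on.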

Key Insight

  • Pricing model is GPU-time-based, not token-capped — as hardware improves, you get more from the same plan. This is a meaningful differentiator vs. OpenAI/Anthropic fixed-rate token billing.
  • Concurrency tiers matter for agent workflows: Free = 1, Pro = 3, Max = 10 simultaneous cloud models. Agentic pipelines needing parallel model calls require at least Pro.
  • Privacy guarantees: no logging, no training on prompt/response data. NVIDIA Cloud Providers (NCPs) host the models under zero-retention contracts.
  • Session limits reset every 5 hours; weekly limits reset every 7 days — relevant for planning continuous automation runs.
  • “Additional usage at competitive per-token rates, including cache-aware pricing” is listed as coming soon, which will enable pay-as-you-go overflow.
  • Cloud models run native weights (not quantized down), on Blackwell/Vera Rubin NVIDIA hardware with NVFP4 acceleration where available.
  • 40,000+ community integrations mean drop-in compatibility with most LLM toolchains.
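For agentic pipelines, the concurrency tiers translate into a cap on in-flight requests. A minimal sketch of respecting that cap when fanning out calls, assuming a Pro plan: `call_model` here is a hypothetical stand-in for a real Ollama cloud request, and the limit is enforced client-side with a bounded thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

PRO_CONCURRENCY = 3  # Pro tier allows 3 simultaneous cloud models

def call_model(prompt: str) -> str:
    # Hypothetical placeholder: a real client would send the prompt
    # to an Ollama cloud model here and return its response.
    return f"response to: {prompt}"

def fan_out(prompts: list[str]) -> list[str]:
    # max_workers caps in-flight requests at the plan's concurrency limit;
    # extra prompts queue until a worker frees up.
    with ThreadPoolExecutor(max_workers=PRO_CONCURRENCY) as pool:
        return list(pool.map(call_model, prompts))

results = fan_out(["summarize A", "summarize B", "summarize C", "summarize D"])
```

On the Free tier the same pattern applies with `max_workers=1`; anything needing genuine parallelism across models requires at least Pro, as noted above.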