Most conversations about self-hosting large language models start with ideology rather than arithmetic. Developers who prefer open source will find reasons to self-host; teams already in AWS or Azure will find reasons to stay on the API. Neither instinct is wrong, but neither is a cost analysis. What actually matters is your token volume, your latency tolerance, your data residency obligations, and whether your team has the operational bandwidth to run GPU infrastructure. This post walks through the real math – not cherry-picked scenarios – so you can figure out which side of the line you sit on. The numbers shift more often than most vendors like to admit, and the crossover point depends heavily on factors specific to Canadian buyers: data residency rules under PIPEDA, the availability of Canadian cloud regions, and CAD/USD exchange exposure on per-token pricing.
The Baseline: What You’re Actually Comparing
SaaS API costs are quoted in USD per million tokens. Self-hosting costs are capital or cloud-instance costs amortized over time, plus electricity, plus your ops team’s hours. These are fundamentally different cost structures, and the mistake most teams make is comparing them at a single point in time rather than over a projected usage curve.
A useful framing: SaaS APIs are variable costs that scale linearly with usage. Self-hosting is a mostly-fixed cost with a much flatter marginal cost per token once you’re running. The crossover – where self-hosting becomes cheaper – happens when your token volume is high enough that the variable SaaS cost exceeds your fixed infrastructure amortization plus operating overhead.
Let’s make this concrete. As of mid-2025, mid-tier SaaS API pricing for capable models (think GPT-4o class, Claude 3.5 Sonnet class) runs roughly $3-$15 USD per million input tokens and $12-$60 USD per million output tokens. Budget-tier models (GPT-4o-mini class) run closer to $0.15-$0.60 USD per million input tokens. These numbers move, but the ranges are representative.
On the self-hosting side, a single NVIDIA H100 SXM instance on a major cloud provider runs approximately $3.00-$4.50 USD per GPU-hour. A bare-metal H100 leased from a colocation facility in Canada runs closer to $2.00-$2.80 USD per GPU-hour when amortized over a 1-year contract. Owned hardware – an H100 PCIe card – costs roughly $25,000-$35,000 CAD new in 2025, plus server chassis, networking, and power.
Running the Actual Numbers
Let’s construct a realistic scenario. Suppose your application generates 500 million tokens per month – a moderate internal tool, not a high-traffic consumer product. Let’s use a 3:1 input-to-output ratio, which is fairly typical for RAG-based applications.
That gives you approximately 375 million input tokens and 125 million output tokens per month.
At a mid-tier SaaS rate of $5 USD/M input and $20 USD/M output:
Input cost: 375M × $5.00 = $1,875 USD/month
Output cost: 125M × $20.00 = $2,500 USD/month
Total SaaS: $4,375 USD/month
At 1.37 CAD/USD: ~$5,994 CAD/month
Now compare that to a single leased H100 in a Canadian colocation facility. One H100 can serve a quantized 70B-parameter model (Llama 3.1 70B, for example) at roughly 2,000-4,000 tokens per second in batch throughput, depending on quantization. For interactive use with typical concurrency, assume you’re averaging 500 useful tokens per second. That’s:
500 tokens/sec × 3600 sec/hr × 720 hr/month = 1,296,000,000 tokens/month theoretical capacity
At 500 million tokens actually consumed, you’re at roughly 38% utilization – reasonable. The H100 lease at $2.50 USD/hr:
GPU cost: 720 hrs × $2.50 = $1,800 USD/month
Ops time: ~8 hrs/month × $75/hr CAD = $600 CAD/month (conservative)
Inference server overhead: ~$50 USD/month (storage, networking)
Total self-host: ~$1,850 USD + $600 CAD ≈ ~$3,135 CAD/month at 1.37 CAD/USD
At 500 million tokens per month, self-hosting wins by roughly $2,850 CAD per month. But cut that volume in half – to 250 million tokens – and the SaaS option drops to about $3,000 CAD/month while your self-hosted costs barely change, and the advantage all but disappears. The GPU doesn’t get cheaper when you use it less.
What we found surprising when first running these numbers: the crossover typically sits somewhere between 150 and 300 million tokens per month for mid-tier model quality, not at some exotic petascale threshold. That’s within reach for internal enterprise tools.
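Here is a minimal sketch of that arithmetic in Python, using only the illustrative figures from the scenario above – the $5/$20 per-million SaaS rates, the $2.50/hr H100 lease, 8 hours of ops time, and 1.37 CAD/USD. Swap in your own numbers before drawing conclusions.
# Back-of-envelope crossover calculation. All rates are the illustrative
# figures from the scenario above, not vendor quotes.
CAD_PER_USD = 1.37
INPUT_USD_PER_M, OUTPUT_USD_PER_M = 5.00, 20.00    # mid-tier SaaS rates
INPUT_SHARE = 0.75                                  # 3:1 input-to-output ratio

def saas_cost_cad(tokens_per_month: float) -> float:
    input_m = tokens_per_month * INPUT_SHARE / 1e6
    output_m = tokens_per_month * (1 - INPUT_SHARE) / 1e6
    return (input_m * INPUT_USD_PER_M + output_m * OUTPUT_USD_PER_M) * CAD_PER_USD

# Self-hosted side is mostly fixed: one leased H100, storage/network overhead,
# and ops hours billed in CAD.
selfhost_cad = (720 * 2.50 + 50) * CAD_PER_USD + 8 * 75   # ≈ $3,135 CAD/month

for volume in (250e6, 500e6, 1000e6):
    print(f"{volume/1e6:>5.0f}M tokens/month: "
          f"SaaS ${saas_cost_cad(volume):,.0f} CAD vs self-host ${selfhost_cad:,.0f} CAD")

# Volume at which the SaaS bill equals the (roughly fixed) self-host bill.
blended_usd_per_m = INPUT_SHARE * INPUT_USD_PER_M + (1 - INPUT_SHARE) * OUTPUT_USD_PER_M
crossover_m = selfhost_cad / CAD_PER_USD / blended_usd_per_m
print(f"Crossover ≈ {crossover_m:.0f}M tokens/month")
With these inputs the crossover lands around 260 million tokens per month, comfortably inside the 150-300 million range quoted above.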
The Hidden Costs Neither Side Advertises
SaaS API Hidden Costs
- Context window inflation. Many applications stuff large system prompts and retrieved documents into every request, and a 4,000-token system prompt on every call adds up fast: at 100,000 requests a month, that’s 400 million extra tokens monthly from prompt overhead alone (see the sketch after this list).
- CAD/USD exposure. All major SaaS LLM APIs price in USD, so Canadian buyers carry currency risk. Between late 2021 and early 2025, the CAD/USD rate moved from roughly 1.25 to roughly 1.42 – an effective price increase of nearly 14% with no change in posted API rates.
- Rate limit engineering. Building retry logic, queue management, and fallback handling for API rate limits takes real engineering hours. This is rarely zero.
- Data egress from Canada. If your data leaves Canadian borders for processing, you carry PIPEDA compliance obligations and potentially sector-specific requirements (PHIPA for health data in Ontario, etc.). Some organizations require legal review for every new API integration. That review costs time and money.
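To put rough numbers on the first two bullets, here is a short sketch reusing the illustrative figures from the scenario above; the request volume, prompt size, and exchange-rate endpoints are examples, not measurements.
# Prompt overhead: a fixed 4,000-token system prompt at 100,000 requests/month.
overhead_tokens = 100_000 * 4_000                    # 400M extra input tokens
overhead_cost_usd = overhead_tokens / 1e6 * 5.00     # at $5 USD per million input tokens
print(f"Prompt overhead: {overhead_tokens/1e6:.0f}M tokens ≈ ${overhead_cost_usd:,.0f} USD/month")

# FX exposure: the same $4,375 USD monthly bill at two exchange rates.
for fx in (1.25, 1.42):
    print(f"At {fx:.2f} CAD/USD: ${4_375 * fx:,.0f} CAD/month")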
Self-Hosting Hidden Costs
- Model maintenance. Models need updates. Fine-tunes need re-running when the base model updates. Someone has to own this work.
- Cold-start latency and availability. A single H100 has no redundancy. If it goes down at 2am, someone gets paged. SaaS APIs come with SLAs; your on-prem box doesn’t unless you build that redundancy yourself.
- Quantization quality loss. To fit a 70B model on one or two H100s in production, you’ll likely run 4-bit or 8-bit quantization. Our reading suggests this costs roughly 3-8% on standard benchmarks versus full precision – usually acceptable, but worth testing against your specific task.
- Inference server engineering. Running vLLM, llama.cpp, or TGI in production means understanding batching strategies, memory management, and API compatibility layers. This isn’t beginner work.
Data Residency: The Canadian Angle That Changes the Equation
For many Canadian organizations – especially in healthcare, finance, and government – data residency isn’t a preference, it’s a requirement. Under PIPEDA and, increasingly, under provincial equivalents, sending personal information to US-based processors requires explicit accountability measures and cross-border transfer agreements, and in some cases is prohibited outright by sector-specific rules.
Major SaaS API providers do offer Canadian data residency in some cases – Azure OpenAI has Canada Central and Canada East regions, for example – but not all models are available in all regions, and regional availability lags behind US options by months to years. If you need a specific model in a Canadian region and it isn’t there yet, you either wait, use a different model, or accept cross-border transfer with the associated compliance work.
Self-hosting in a Canadian colocation facility or Canadian cloud region (there are several options in Calgary, Toronto, and Montreal) gives you deterministic data residency with no ambiguity. The data doesn’t leave. For organizations that would otherwise spend 20-40 hours of legal and compliance time per API integration, that cost avoidance alone can shift the math substantially.
A rough estimate: if your legal team bills at $300/hr CAD and a new AI API integration requires 15 hours of review, that’s $4,500 CAD per integration in legal cost. If you’re integrating multiple SaaS APIs over a year, that compounds. A single self-hosted deployment reviewed once – covering all internal use – is a different compliance posture entirely.
Practical Setup: What Self-Hosting Actually Looks Like in 2025
Assuming you’ve decided the math works, here’s what a minimal production-grade self-hosted setup looks like. This is not a beginner tutorial, but it should give you a sense of the scope of work.
Among the Canadian teams we’re aware of, the most common inference stack is vLLM on a Linux host (Ubuntu 22.04 LTS), serving a quantized model via an OpenAI-compatible API endpoint. This lets you swap self-hosted models in behind the same client code as your SaaS API, which simplifies the transition.
# Install vLLM (requires Python 3.10+, CUDA 12.1+)
pip install vllm==0.5.4
# Serve a 4-bit (AWQ) quantized Llama 3.1 70B model across two GPUs.
# Note: --quantization awq expects weights that are already AWQ-quantized, so
# point --model at an AWQ build of this checkpoint (several are published on
# Hugging Face); the full-precision repo below is shown as a placeholder.
# --tensor-parallel-size 2 assumes two GPUs; set it to 1 if you're running the
# single-H100 setup costed above.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000
That gives you an OpenAI-compatible endpoint at http://localhost:8000/v1. Your existing code that calls openai.chat.completions.create() needs only a base URL change and a dummy API key.
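As a sketch, assuming the standard openai Python client (v1 or later), the swap looks like this; the model name must match whatever --model value vLLM is serving.
from openai import OpenAI

# Point the stock OpenAI client at the self-hosted vLLM endpoint.
# vLLM doesn't validate the key, so any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # must match the served --model
    messages=[{"role": "user", "content": "In one sentence, what is PIPEDA?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)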
For monitoring, Prometheus + Grafana with vLLM’s built-in metrics endpoint (/metrics) gives you token throughput, queue depth, GPU utilization, and latency percentiles. You should set alerts on GPU memory saturation and request queue depth – those are your early warning signs before the service degrades.
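If you want a crude sanity check before the full Prometheus + Grafana stack is wired up, a polling sketch like the following can flag trouble early. The two metric names used here are assumptions based on vLLM’s Prometheus exporter and should be verified against your deployment’s actual /metrics output.
import time
import requests  # third-party HTTP client (pip install requests)

METRICS_URL = "http://localhost:8000/metrics"
# Metric names below are assumptions; confirm against your /metrics output,
# since exporter names can differ between vLLM versions.
THRESHOLDS = {
    "vllm:num_requests_waiting": 10,    # queued requests before alerting
    "vllm:gpu_cache_usage_perc": 0.95,  # KV-cache saturation before alerting
}

def sample_metrics() -> dict:
    values = {}
    for line in requests.get(METRICS_URL, timeout=5).text.splitlines():
        if line.startswith("#"):
            continue  # skip Prometheus HELP/TYPE comment lines
        for name in THRESHOLDS:
            if line.startswith(name):
                values[name] = float(line.rsplit(" ", 1)[-1])
    return values

while True:
    readings = sample_metrics()
    for name, limit in THRESHOLDS.items():
        value = readings.get(name)
        if value is not None and value > limit:
            print(f"ALERT: {name}={value} exceeds {limit}")
    time.sleep(30)
In production you would replace the print with a page to whoever owns the box, or simply let Prometheus alerting rules do this job.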
For availability, the minimum viable setup adds a second GPU node behind a simple load balancer (nginx or HAProxy work fine). That gives you rolling restart capability without full downtime. True HA with automatic failover is more complex and adds cost – factor that in if your use case requires it.
Decision Framework: Which Side Are You On?
Based on the arithmetic above, here’s a practical decision tree. It’s not definitive – your specific numbers will vary – but it covers the primary factors.
- Do you have hard data residency requirements that no SaaS provider can currently satisfy in Canada for your required model? If yes, self-hosting is likely necessary regardless of cost.
- Are you consuming more than 200 million tokens per month today, or confidently projecting that within 6 months? If yes, run the actual numbers for your model tier. Self-hosting likely wins above 300 million tokens/month for mid-tier quality.
- Do you have at least one engineer who has run GPU workloads in production? If no, add $2,000-$5,000 CAD/month equivalent in ramp-up cost and risk. SaaS is cheaper than an outage caused by unfamiliar infrastructure.
- Is your usage highly variable – spikes 10x above baseline? SaaS APIs absorb spikes cheaply. Self-hosted infrastructure sized for your spike is wasteful the rest of the time.
- Is your primary use case evaluation, prototyping, or low-volume production? SaaS APIs win clearly below 50 million tokens/month for most model tiers. The operational overhead of self-hosting isn’t justified.
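If it helps to wire these checks into a planning notebook, the same questions can be expressed as a rough screening function; the thresholds are the ones from the list above and are starting points, not hard rules.
def self_hosting_screen(
    hard_residency_requirement: bool,
    tokens_per_month: float,
    has_gpu_ops_experience: bool,
    peak_to_baseline_ratio: float,
) -> str:
    # Thresholds mirror the decision list above; adjust them to your own numbers.
    if hard_residency_requirement:
        return "self-host: compliance-driven, largely regardless of cost"
    if tokens_per_month < 50e6:
        return "SaaS: volume too low to justify the operational overhead"
    if peak_to_baseline_ratio >= 10:
        return "SaaS: spiky usage; infrastructure sized for peaks sits idle"
    if not has_gpu_ops_experience:
        return "SaaS for now: budget ramp-up cost and risk before self-hosting"
    if tokens_per_month >= 300e6:
        return "self-host: run the detailed numbers for your model tier"
    if tokens_per_month >= 200e6:
        return "borderline: model both options with your actual token mix"
    return "SaaS: instrument usage and revisit at the 6-month mark"

print(self_hosting_screen(False, 500e6, True, 2.0))
# -> self-host: run the detailed numbers for your model tier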
The honest answer for most small Canadian teams in 2025: start on SaaS, instrument your token usage carefully from day one, and revisit the calculation at the 6-month mark with real data. The crossover point isn’t a mystery – it’s just arithmetic you need real numbers to do properly.
From our experience, the teams that regret self-hosting earliest are those that made the decision based on a single month’s usage spike rather than a stable baseline – and those that underestimated how much an ops-capable engineer’s time actually costs when it’s redirected away from product work.
– Auburn AI editorial, Calgary AB