Skip to content

// Cost · 10 min read

WHAT A PRIVATE LLM STACK
ACTUALLY COSTS IN 2026.

There's a wide gulf between vendor pricing decks and what a private LLM stack really costs to stand up. We've shipped enough of these to write the unsentimental version. All numbers in USD; multiply by 1.35 for CAD or by 0.92 for EUR.

By the Marapone team · Updated 2026

The four cost buckets nobody itemizes

Most "AI cost" estimates collapse into a single line. The honest decomposition has four:

  • Hardware — the box that runs the model.
  • Build — engineering time to ingest your data, fit the workflows, ship the UI.
  • Ongoing ops — backups, model updates, occasional troubleshooting.
  • Optional services — retainer for upgrades, training, new modules.

Each one has a real number in 2026. Here they are.

1. Hardware: $0 to $25k, depending on scale

Workstation deployment ($0 - $5k). A MacBook Pro M3 Max with 64GB unified memory runs a quantized 70B model adequately for one user. If you already own the laptop, the marginal cost is $0. If you don't, $4-5k.

On-prem server ($15-25k). A single tower or 2U server with two RTX 6000 Ada cards (96GB total VRAM), 256GB system RAM, 8TB NVMe storage. This runs the whole stack for a 30-50 person office with comfortable headroom. About a one-day install.

Cloud option ($1-3/hour active GPU time). If you'd rather not buy hardware, an A100/H100 cloud instance runs ~$2/hour. For an office of 30, average ~6 hours/day of active inference, that's about $360/month. Cheaper than buying for the first 18-24 months; more expensive over five years.

A common surprise:

Open-weight models in late 2025 are good enough that you don't need an H100 for most construction workflows. A used dual-RTX-3090 rig for $3-4k handles RFI triage, daily logs, and most retrieval-heavy work just fine.

2. Build: $9k to $25k for a focused scope

This is the line vendors compete on. The Marapone three-module construction bundle (RFI, daily logs, change-order risk) is $9,500 one-time. A bespoke build with custom integrations into Procore, Bluebeam, and Outlook lands around $15-20k. A multi-module enterprise scope can hit $25k.

What makes builds expensive isn't model selection. It's:

  • OCR pipelines for messy drawing scans.
  • Custom ingest from PM tools that don't have clean APIs.
  • Per-trade rule fitting against your specific risk register.
  • UI customization for your branding or existing portal.

3. Ongoing ops: $50-$300/month

Most of this is electricity and minor cloud spend. Concrete numbers:

  • On-prem server power: ~$30-60/month at typical inference load.
  • Backups (offsite encrypted): $20-100/month depending on data volume.
  • Domain + TLS certs (if exposed externally): negligible.
  • Occasional minor upgrades when an open-weight model upgrade ships: zero direct cost.

If you skip backups, ops cost is essentially the electricity. Don't skip backups.

4. Optional retainer: $500-$2,000/month

This buys you us being on call: model swaps when better open-weight models drop, new module add-ons, occasional incident support. Not required — every Marapone build ships with self-serve runbooks — but it exists for shops that prefer a phone number.

Five-year worked example

A 30-person GC adopts the Marapone three-module bundle on an on-prem server. Indicative five-year stack:

  • Hardware: $20,000 once.
  • Build: $9,500 once.
  • Ops: ~$2,400/year × 5 = $12,000.
  • Optional retainer (assume yes, $1,000/mo): $60,000.

Five-year total: ~$101,500 with full retainer support, or ~$41,500 without.

Comparison: 30 users at $50/seat/month for the same five years = $90,000 in pure SaaS, with no asset on your balance sheet at the end and continued payments thereafter.

The lines that aren't on this page

Three costs people sometimes worry about that turn out small:

Model licensing. Llama 3.1 and Qwen 2.5 are free for commercial use under their open-weight licences. Verify your scale fits within Llama's enterprise threshold; for any reasonable construction shop, it does.

Token costs. Zero. The model runs on your hardware. There's no per-token bill from anyone.

Data egress. Zero. Documents never leave your network.

When the numbers don't work

Honest cases where a private stack is the wrong answer:

  • You're under 5 users — SaaS pencils out and the build cost amortizes too slowly.
  • You only need general writing/Q&A, not workflow automation — ChatGPT Enterprise wins.
  • You don't have anyone who can babysit a server, even a little — stick with SaaS.

Outside those, the math usually favours owning. The exact crossover depends on your headcount and SaaS pricing.

WANT YOUR REAL NUMBERS?
WE'LL QUOTE THE WHOLE STACK.

Tell us your headcount, sources, and target workflows. We'll write back a line-by-line cost estimate.

Get a Quote →