
The LLM Deployment Playbook: Cloud, Local, or Both?

You’ve built something cool with an LLM. Maybe it’s summarizing issue logs, maybe it’s crunching legal documents, maybe it’s writing code on command. 

Either way, it works (60% of the time, every time). And now you’re wondering whether you should keep using a cloud API or switch to running your own model locally.

That’s the moment a lot of teams hit. And the answer isn’t as obvious as it used to be.

Cloud LLMs: Less control, more convenience

Sending prompts to a cloud API feels easy. Because it is. You write a string, you get a string back. The inference fairy does the hard part somewhere in Northern Virginia.

But there are tradeoffs.

First, there’s latency. It’s better than it used to be. Still not local. Still not real-time. And if your users are a few continents away from the nearest datacenter, they’ll notice.

Second, there’s cost. Token-based pricing makes sense right up until it doesn’t. When every additional sentence is a line item, you start getting cagey with your prompts. Developers end up trimming context and cutting corners just to stay under budget.

Then there’s model drift. Providers update weights silently. Prompt that worked great last week? Now it stumbles. Sometimes it’s faster, sometimes smarter, occasionally just weirder. 

You’re also trusting a black box. You can’t inspect it. You can’t fine-tune it. And you’ll never really know whether your data is as secure as providers claim.

On-prem setup: it’s now easier than ever to create a mess

Running your own model once meant wrestling with research code, complex setups, and endless dependencies. That’s no longer the case.

Today, you can spin up a quantized Llama 3 model in minutes on your computer, or even on something like a Raspberry Pi (if you’re patient enough). Projects like vLLM, TGI, and llama.cpp deliver inference that’s fast, efficient, and surprisingly user-friendly.
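
To make that concrete, here’s a minimal sketch using llama-cpp-python. The GGUF file name, the model choice, and the prompt are placeholders for whatever you’ve actually downloaded; treat it as a starting point, not a reference setup.

```python
# Minimal local inference sketch with llama-cpp-python.
# Assumes a quantized GGUF file on disk (the path below is a placeholder).
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3-8b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload every layer to the GPU if one is available
)

out = llm("Summarize the following issue log:\n...", max_tokens=256)
print(out["choices"][0]["text"])
```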

You get speed, control, transparency and no data leaks. Plus batch processing throughput that makes finance teams smile.

But this freedom is not without cost.

You must maintain it. Debug quirks. Explain to stakeholders why your 13-billion-parameter model just started hallucinating more than someone who confused their morning coffee with ayahuasca. 

And the GPU bill? It’s still real. Even on bare metal. Just smaller.

The hardware reality check

Here’s what nobody tells you about local inference: the hardware requirements are both more and less demanding than expected.

A decent RTX 3060 with 12GB VRAM can run 7B models comfortably, and even squeeze in some 13B models with 4-bit quantization. For around $300, you can grab a used Tesla P40 with 24GB of VRAM, though you’ll need to deal with server-grade cooling and power requirements that could drain Lake Superior.

The sweet spot seems to be around 13B parameters. Smaller models feel too limited for serious work. Larger ones demand hardware that makes your accountant nervous. That 30B model everyone raves about? It needs more VRAM than most people have lying around.
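
If you want to sanity-check your own hardware, a back-of-envelope estimate goes a long way. The sketch below uses a rough 20 percent overhead factor for the KV cache and runtime; it’s a rule of thumb, not a spec.

```python
# Rough VRAM estimate: parameters x bytes-per-weight, plus ~20% overhead
# for the KV cache and runtime. A sanity check, not a guarantee.
def vram_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * (bits_per_weight / 8)
    return weight_bytes * overhead / 1e9

print(vram_gb(7, 4))    # ~4.2 GB -> fits a 12 GB card easily
print(vram_gb(13, 4))   # ~7.8 GB -> tight but workable on 12 GB
print(vram_gb(30, 4))   # ~18 GB  -> needs a 24 GB card like the P40
```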

RAM matters too, but not the way you’d think. You need enough system RAM to load the model weights, but once they’re on the GPU, the bottleneck shifts to the GPU’s memory bandwidth. DDR5 helps, but most setups work fine with what they have.

When local makes sense

Privacy-sensitive workloads are the obvious candidate. Legal documents, medical records, proprietary code: anything you wouldn’t want sitting in someone else’s logs. Even if cloud providers promise data isolation, local means truly local.

High-volume, predictable workloads favor local deployment. If you’re processing thousands of similar requests daily, the economics flip quickly. Cloud APIs charge per token. Your local setup charges per kilowatt-hour.
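
Here’s a rough sketch of that break-even math. Every number in it is a placeholder, so swap in your provider’s token price, your electricity rate, and your own hardware amortization before drawing conclusions.

```python
# Hypothetical monthly cost comparison: cloud tokens vs. local electricity
# plus amortized hardware. All figures are placeholders.
tokens_per_day = 5_000_000
cloud_price_per_1k_tokens = 0.002      # USD, assumed blended input/output rate
gpu_watts = 300
electricity_per_kwh = 0.15             # USD
hours_per_day = 8
hardware_cost = 1_500                  # USD, amortized over 24 months

cloud_monthly = tokens_per_day * 30 / 1000 * cloud_price_per_1k_tokens
local_monthly = (gpu_watts / 1000) * hours_per_day * 30 * electricity_per_kwh \
                + hardware_cost / 24

print(f"cloud: ${cloud_monthly:,.0f}/month")   # ~$300
print(f"local: ${local_monthly:,.0f}/month")   # ~$73
```

It ignores the ops time you’ll spend babysitting the box, which is the hardest cost to quantify, but it shows how quickly per-token pricing adds up at steady volume.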

Customization needs push toward local too. Fine-tuning a model for your specific domain is much easier when you control the entire stack. That specialized medical terminology or industry jargon? Far simpler to train into a model you own.
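
As a rough illustration, attaching a LoRA adapter with Hugging Face’s peft library looks something like this. The model ID and hyperparameters are assumptions, not a recommended recipe.

```python
# Minimal LoRA fine-tuning setup with transformers + peft.
# Model ID and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B"  # swap in whatever you run locally
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, typical for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # usually well under 1% of the base weights
```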

Development and experimentation benefit from local control. No API limits, no rate throttling, no wondering if your weird edge case will work tomorrow after a provider update.

When cloud still wins

Sometimes getting something working quickly matters more than long-term costs or control. The fastest path from idea to demo goes through an API key.

Variable or unpredictable workloads make cloud pricing more attractive. If your usage spikes unpredictably or you have long quiet periods, paying only for what you use beats maintaining idle hardware.

Cutting-edge capabilities tend to appear in cloud services first. Want the latest multimodal model or specialized reasoning capabilities? You’ll probably find them in a cloud API months before equivalent open source models appear.

Team coordination becomes easier with cloud services. No need to worry about hardware provisioning, model updates, or infrastructure maintenance. Your developers can focus on application logic instead of becoming accidental ML engineers.

The hybrid middle ground

Smart teams increasingly split the difference. Core functionality runs locally for cost and control. Edge cases, experimentation, and overflow capacity route to cloud APIs.

That daily batch job processing internal documents? Perfect for your local 13B model. The customer-facing chatbot that needs to handle anything users throw at it? Maybe keep that on a cloud API with broader capabilities.

Tools like Ollama make it easier to develop locally and deploy anywhere. Start with a local model for development, then decide per workload whether to keep it local or move to cloud for production.
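
A bare-bones version of that routing might look like the sketch below. It assumes an Ollama server on its default port; the sensitivity check and the cloud endpoint are placeholders for whatever your stack actually uses.

```python
# Hypothetical hybrid router: sensitive work stays on the local Ollama
# instance, everything else goes to a cloud API. A sketch, not a product.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"    # Ollama's default endpoint
CLOUD_URL = "https://api.example.com/v1/completions"  # placeholder cloud endpoint

def complete(prompt: str, sensitive: bool) -> str:
    if sensitive:
        resp = requests.post(OLLAMA_URL, json={
            "model": "llama3",   # whichever model you've pulled locally
            "prompt": prompt,
            "stream": False,
        })
        return resp.json()["response"]
    resp = requests.post(
        CLOUD_URL,
        json={"prompt": prompt},
        headers={"Authorization": "Bearer <key>"},  # placeholder auth
    )
    return resp.json()["text"]   # response shape depends on your provider

print(complete("Summarize this internal contract...", sensitive=True))
```

In practice you’d add retries, streaming, and a proper policy for what counts as sensitive, but the shape of the decision stays the same.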

Final thoughts

Companies like Netflix, The New York Times, Walmart, and Stellantis have already integrated AI agents into their operations. Major tech companies like Apple, Amazon, IBM, Intel, and NVIDIA are developing their own LLMs, even if they’re only used internally. 

The scale of adoption is significant. Around 58 percent of businesses have started adopting LLMs in different workflows, and the global LLM market is expected to reach $40.8 billion by 2029.

The cloud versus local choice comes down to three questions: How much control do you need? How predictable is your usage? And how much ML operations complexity can your team handle?

High control needs, predictable usage, and technical teams point toward local deployment. Low control requirements, variable usage, and teams focused on application development point toward cloud APIs.

But the most honest answer is that both approaches will probably coexist in your stack. The question isn’t whether to choose cloud or local – it’s which workloads belong where.

The infrastructure is finally mature enough that you don’t have to make a binary choice. Use the right tool for each job, and don’t let ideological preferences override practical considerations.

Your users won’t care where the intelligence comes from, as long as it’s fast, reliable, and gives them what they need.
