In the rush to adopt generative AI, most enterprises took the path of least resistance: API calls to providers like OpenAI or Anthropic. That was the right move for prototyping, but as AI applications move into production, the "API-first" honeymoon is starting to fade.
Concerns about data privacy, vendor lock-in, and unpredictable costs are forcing engineering teams to rethink their strategy. The old assumption that high-performance AI requires dependence on third-party black boxes no longer holds. With the rise of high-quality open-weights models such as Llama 3, Mistral, Qwen, and specialized vision-language models, teams now have credible alternatives.
The biggest remaining obstacle is infrastructure. Managing Kubernetes clusters, provisioning NVIDIA GPUs, and tuning inference engines is a full-time job that many product teams cannot justify. That is where serverless GPUs become compelling. By using Modal, we can move beyond managed APIs toward an architecture that offers much of the flexibility of private model serving with the elasticity of the cloud.
The Technical Core: Orchestrating Privacy at Scale
Adopting open-weights models is the foundation of a more private AI strategy, but the harder problem is making them fast and secure without building a full platform team. Pairing vLLM with Modal's serverless orchestration gives you much of the convenience of a managed API while preserving far more control over how data is handled.
Infrastructure as Code (IaC)
Traditionally, deploying an LLM meant managing Dockerfiles, Kubernetes manifests, and NVIDIA driver compatibility. Modal compresses much of that complexity into a Python-native definition. Your infrastructure can live alongside your application code, with CUDA versions, system libraries, and Python dependencies specified directly in the script. When you deploy, Modal builds the container and schedules it onto its GPU fleet.
The Engine: vLLM
Serving a model is not just about loading weights into VRAM; it is about maximizing throughput. We use vLLM as the inference engine for two main reasons:
- PagedAttention: Much like virtual memory in an operating system, it manages the KV cache efficiently, reducing memory waste and allowing for larger effective batch sizes.
- Batching continuo: It does not wait for an entire batch to finish. Instead, it injects new requests as capacity opens up, which reduces latency under load.
Solving the Cold Start: Modal Volumes
One of the main challenges with serverless LLMs is the time it takes to pull large models from remote hubs. Modal Volumes reduce that penalty by acting as a persistent cache. With a Volume mounted, model weights stay close to the runtime, so containers can move from "cold" to active inference much faster.
This is what makes scale-to-zero practical. Containers can hibernate during periods of inactivity, which avoids paying for idle GPU time. When a new request arrives, Modal can provision a new container on demand, preserving elasticity without requiring a 24/7 cluster.
Pushing Cold Starts Lower: Memory Snapshots
For real-world deployments, Volumes are often only the first step. Modal's memory snapshots push startup time lower by serializing warmed CPU and GPU memory after the server has loaded weights, compiled kernels, and completed warmup requests. In practice, you are not just caching files on disk; you are preserving a much more ready-to-serve runtime state.
This pattern works especially well with vLLM sleep mode. You start the server, warm it up, put it to sleep, snapshot the container, and then wake it quickly on the next cold start. It adds some implementation complexity, but it is one of the most effective ways to make serverless inference feel responsive in production.
Protecting the Endpoint: Proxy Auth Tokens
A private model is not truly private if its HTTP endpoint is publicly callable. Modal's Proxy Auth Tokens let you require credentials at the platform edge before a request ever reaches your app. Instead of exposing an unauthenticated .modal.run URL to the open internet, you can enforce token-based access with a single decorator argument.
This is a simple but important production control. It gives internal tools, backend services, and trusted clients a straightforward way to call the model while blocking unauthorized traffic before it touches your inference stack.
