Stop Sending Your Data Away: How to Build a Private, Scalable LLM Infrastructure

In the previous article, we focused on private AI infrastructure and how to run a private model at scale. In this one, we show what that infrastructure can actually do: build a production-ready document classification pipeline with no fine-tuning and no training data.

The Naive Approach

The most common mistake when building with LLMs is the "one-shot trap": you take a file, define a list of categories, and ask the model a direct question like this:

Classify this file into one of these categories:
- driver license
- identity card
- passport
- other
Return only the category name.

On a clean demo set, this can look surprisingly good. A modern Vision-Language Model can often infer enough from the first page to produce a plausible label. That is usually the point where teams start believing they have a classifier.

In production, however, that illusion does not last very long. Real document streams are messy: a single file can contain multiple pages, multiple documents, rotated scans, blank pages and so on. In those conditions, the question "what category is this file?" is too broad. It forces the model to solve many smaller problems at once.

That immediately creates three bottlenecks:

Task Overload: A file is often a container, not a single document. One PDF can include multiple document types, blank pages, photos, or scans that should be separated before classification. If the model is asked for one label too early, the input has already been framed incorrectly.
No Structural Decomposition: A file is often a container, not a single document. One PDF can include multiple document types, blank pages, photos, or scans that should be separated before classification. If the model is asked for one label too early, the input has already been framed incorrectly.
High Latency and Cost: You end up spending expensive model inference on a broad, noisy problem when most of the work should have been broken into cheaper, narrower decisions. Sending 30,000 tokens to a 27B model just to get a one-word answer such as "Invoice" is a massive waste of compute.

A Better Framing

The fix is simple: break the problem into smaller problems.

You can get strong results from a smaller private model if you break the workflow into narrow decisions and let deterministic code orchestrate the sequence. The model should not be asked to "understand the whole intake flow." It should be asked a series of constrained questions:

Is this page empty, a photo, or a document?
Does this image contain more than one document in it?
Is the page rotated?
What text can be extracted from it?
Which pages belong together?

Each of these questions is easier than the full classification problem. Each can be answered with a short prompt and a constrained output format. When those local decisions are combined, the overall system becomes much more reliable than a monolithic classifier.

This is exactly the pattern that makes private models practical. You are not competing with frontier models on open-ended reasoning. You are designing a system in which the model only needs to make bounded, high-signal decisions.

The Pipeline

In our implementation, the classifier is built as a staged pipeline. If you follow the actual execution flow, it looks like this:

1. Render every input into page units

PDFs are rendered page by page. Images are converted into a standard in-memory image format. At the end of this stage, the system has a list of page units that can all be processed the same way.

2. Detect multi-instance pages and crop them

Each page is checked to see whether it contains more than one visible item. If it does, the system extracts separate crops so each document or image can be processed independently.This is one of the clearest reasons the naive classifier breaks: if a single upload contains both a driving licence and an identity card, asking for one label is already the wrong question.

3. Detect page type and drop empty pages

The next decision is not the final business label. First, the system decides whether each unit is a document, a photo, or an empty page.Empty pages are removed. Photos are immediately routed to a photo category. Document pages continue through the document pipeline.

4. Detect and correct rotation

Only document pages go through orientation detection. If a page is rotated, it is corrected before OCR runs. That matters because downstream extraction becomes much more reliable when the page is upright.

5. Run OCR and assign an initial category

After rotation, the system runs OCR on each document page. It extracts structured evidence from the page and then assigns an initial category.In this implementation, the semantic work is front-loaded into the vision stages and OCR, while the final business label is assigned conservatively from extracted evidence using deterministic rules. If the evidence is weak, the page is left as unknown. That is an important production principle: it is better to abstain than to force a confident but wrong label.

5. Run OCR and assign an initial category

If some pages are still unknown, the pipeline runs a grouping step to determine which pages belong together. The grouping result is validated, and then unknown pages can inherit context from neighboring pages in the same subset when there is strong enough evidence.

This grouping stage is not the primary classifier. It is a recovery mechanism that helps ambiguous pages after the first pass. Once every page has a usable category, the pipeline consolidates pages by final label so related outputs can be handled together downstream.

This is what turns the classifier from a prompt into a workflow. The model is not making one grand decision; it is making a sequence of smaller ones, and the surrounding code gives those decisions structure, validation, fallback behavior, and a recovery path.

How to Build a Production-Ready Zero-Shot Document Classifier

The Naive Approach

A Better Framing

The Pipeline

1. Render every input into page units

2. Detect multi-instance pages and crop them

3. Detect page type and drop empty pages

4. Detect and correct rotation

5. Run OCR and assign an initial category

5. Run OCR and assign an initial category

Why This Works with Small Private Models

A Practical Mental Model

Conclusion

Continue exploring the future of mobility and innovation