In the previous article, we focused on private AI infrastructure and how to run a private model at scale. In this one, we show what that infrastructure can actually do by building a production-ready document classification pipeline with no fine-tuning and no training data.

The Naive Approach

The most common mistake when building with LLMs is the "one-shot trap": you take a file, define a list of categories, and ask the model a direct question like this:

Classify this file into one of these categories:
- driver license
- identity card
- passport
- other
Return only the category name.
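
In code, the one-shot trap amounts to something like the sketch below. It is illustrative only: `ask_model` is a hypothetical stand-in for whatever vision-language model client you use (it is not a real library call), and the function names and the fallback to "other" are assumptions made for this example.

```python
from typing import Callable, Sequence

# Category names taken from the prompt above.
CATEGORIES = ("driver license", "identity card", "passport", "other")


def build_naive_prompt(categories: Sequence[str]) -> str:
    """Render the one-shot prompt from a list of category names."""
    bullet_list = "\n".join(f"- {name}" for name in categories)
    return (
        "Classify this file into one of these categories:\n"
        f"{bullet_list}\n"
        "Return only the category name."
    )


def classify_file_naively(
    file_bytes: bytes,
    ask_model: Callable[[str, bytes], str],
    categories: Sequence[str] = CATEGORIES,
) -> str:
    """Send the whole file with one prompt and trust the single label returned.

    `ask_model` is hypothetical: any callable that takes a prompt and the raw
    file bytes and returns the model's text answer.
    """
    raw_answer = ask_model(build_naive_prompt(categories), file_bytes)
    label = raw_answer.strip().lower()
    # Assumption for this sketch: unknown answers fall back to "other".
    return label if label in categories else "other"


def fake_model(prompt: str, file_bytes: bytes) -> str:
    """Toy stand-in used only to exercise the function without a real model."""
    return " Passport\n"
```

Calling `classify_file_naively(b"...", fake_model)` returns `"passport"`. Note what the function signature implies: one file in, exactly one label out, no matter how many pages or documents the file contains.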


On a clean demo set, this can look surprisingly good. A modern Vision-Language Model can often infer enough from the first page to produce a plausible label. That is usually the point where teams start believing they have a classifier.

In production, however, that illusion does not last. Real document streams are messy: a single file can contain multiple pages, several distinct documents, rotated scans, or blank pages. Under those conditions, the question "what category is this file?" is too broad, because it forces the model to solve many smaller problems at once.

That immediately creates three bottlenecks: