The Pipeline
In our implementation, the classifier is built as a staged pipeline. If you follow the actual execution flow, it looks like this:
1. Render every input into page units
PDFs are rendered page by page. Images are converted into a standard in-memory image format. At the end of this stage, the system has a list of page units that can all be processed the same way.
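As a rough illustration of this normalization step, here is a minimal sketch. The `PageUnit` type and the injected `render_pdf` and `load_image` callables are hypothetical names chosen for the example, not the implementation's actual API. In practice you would back them with whatever PDF and imaging libraries you already use.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Iterable, List


@dataclass
class PageUnit:
    """One uniformly processable page (hypothetical structure)."""
    source: str      # original file name
    page_index: int  # 0-based position within the source file
    image: Any       # in-memory image; concrete type depends on your imaging library


def to_page_units(
    paths: Iterable[str],
    render_pdf: Callable[[str], List[Any]],  # assumed: returns one image per PDF page
    load_image: Callable[[str], Any],        # assumed: returns a single image
) -> List[PageUnit]:
    units: List[PageUnit] = []
    for path in paths:
        # PDFs expand into several pages; any other file is treated as one image.
        if Path(path).suffix.lower() == ".pdf":
            images = render_pdf(path)
        else:
            images = [load_image(path)]
        for index, image in enumerate(images):
            units.append(PageUnit(source=str(path), page_index=index, image=image))
    return units
```

Everything after this point only ever sees `PageUnit` objects, which is what lets the later stages stay simple.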
2. Detect multi-instance pages and crop them
Each page is checked to see whether it contains more than one visible item. If it does, the system extracts separate crops so each document or image can be processed independently.
This is one of the clearest reasons the naive classifier breaks: if a single upload contains both a driving licence and an identity card, asking for one label is already the wrong question.
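A minimal sketch of the split, assuming a detector that returns one bounding box per visible item. Both `detect_boxes` and `crop` are hypothetical callables. The sketch only shows the control flow around them.

```python
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (left, top, right, bottom)


def split_instances(
    image: Any,
    detect_boxes: Callable[[Any], List[Box]],  # assumed: model-backed detector
    crop: Callable[[Any, Box], Any],           # assumed: imaging-library crop
) -> List[Any]:
    boxes = detect_boxes(image)
    # Zero or one detected item: keep the page untouched rather than risk a bad crop.
    if len(boxes) <= 1:
        return [image]
    return [crop(image, box) for box in boxes]
```

Returning the original page when the detector finds nothing keeps this stage conservative: a missed detection costs a possible mixed page, not a lost one.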
3. Detect page type and drop empty pages
The next decision is not the final business label. First, the system decides whether each unit is a document, a photo, or an empty page.
Empty pages are removed. Photos are immediately routed to a photo category. Document pages continue through the document pipeline.
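The routing described here can be sketched as a three-way split. The `PageType` enum and the `detect_type` callable are illustrative assumptions. The real detector would be a model call with its output constrained to these three values.

```python
from enum import Enum
from typing import Any, Callable, Iterable, List, Tuple


class PageType(Enum):
    DOCUMENT = "document"
    PHOTO = "photo"
    EMPTY = "empty"


def route_by_type(
    units: Iterable[Any],
    detect_type: Callable[[Any], PageType],  # assumed: model-backed type detector
) -> Tuple[List[Any], List[Any]]:
    """Return (document_units, photo_units); empty pages are dropped."""
    documents: List[Any] = []
    photos: List[Any] = []
    for unit in units:
        page_type = detect_type(unit)
        if page_type is PageType.EMPTY:
            continue  # empty pages leave the pipeline here
        if page_type is PageType.PHOTO:
            photos.append(unit)
        else:
            documents.append(unit)
    return documents, photos
```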
4. Detect and correct rotation
Only document pages go through orientation detection. If a page is rotated, it is corrected before OCR runs. That matters because downstream extraction becomes much more reliable when the page is upright.
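A sketch of the correction step, under two stated assumptions: the hypothetical `detect_angle` reports how far the page content is turned clockwise, in quarter turns, and the hypothetical `rotate` turns an image clockwise by the given number of degrees. Any answer outside the four allowed angles is treated as a detection failure and the page is left alone.

```python
from typing import Any, Callable

VALID_ANGLES = {0, 90, 180, 270}


def correct_rotation(
    image: Any,
    detect_angle: Callable[[Any], int],  # assumed: clockwise rotation of the content
    rotate: Callable[[Any, int], Any],   # assumed: rotates clockwise by N degrees
) -> Any:
    angle = detect_angle(image)
    # Unexpected output or an already upright page: leave the image as it is.
    if angle not in VALID_ANGLES or angle == 0:
        return image
    # Undo the detected rotation by completing the full turn.
    return rotate(image, (360 - angle) % 360)
```

If your imaging library rotates counter-clockwise, pass the detected angle directly instead of its complement.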
5. Run OCR and assign an initial category
After rotation, the system runs OCR on each document page. It extracts structured evidence from the page and then assigns an initial category.
In this implementation, the semantic work is front-loaded into the vision stages and OCR, while the final business label is assigned conservatively from extracted evidence using deterministic rules. If the evidence is weak, the page is left as unknown. That is an important production principle: it is better to abstain than to force a confident but wrong label.
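To make the abstain-first idea concrete, here is a minimal sketch of rule-based labeling over OCR text. The keyword sets, the two-hit threshold, and the tie rule are all invented for illustration. Real rules would be built from the fields your OCR stage actually extracts.

```python
# Hypothetical keyword rules; a real system would use its own extracted fields.
RULES = {
    "driving_licence": {"driving licence", "vehicle categories"},
    "identity_card": {"identity card", "nationality"},
}
MIN_HITS = 2  # assumed threshold for "strong enough" evidence


def assign_category(ocr_text: str) -> str:
    text = ocr_text.lower()
    scores = {
        category: sum(keyword in text for keyword in keywords)
        for category, keywords in RULES.items()
    }
    ranked = sorted(scores.values(), reverse=True)
    # Abstain when evidence is weak or when two categories tie for first place.
    if ranked[0] < MIN_HITS or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return "unknown"
    return max(scores, key=scores.get)
```

For example, `assign_category("DRIVING LICENCE Vehicle categories: B")` returns `"driving_licence"`, while a page containing only the words "identity card" stays `"unknown"` because one hit is below the threshold.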
6. Group related pages, recover unknowns, and consolidate by category
If some pages are still unknown, the pipeline runs a grouping step to determine which pages belong together. The grouping result is validated, and then unknown pages can inherit context from neighboring pages in the same subset when there is strong enough evidence.
This grouping stage is not the primary classifier. It is a recovery mechanism that helps ambiguous pages after the first pass. Once every page has a usable category, the pipeline consolidates pages by final label so related outputs can be handled together downstream.
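The recovery and consolidation steps can be sketched as follows. This assumes the grouping step yields one group id per page, and it interprets "strong enough evidence" as every labeled page in the group agreeing on a single category. Both are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def recover_unknowns(categories: List[str], groups: List[Hashable]) -> List[str]:
    # Validation: a grouping result that does not cover every page is ignored.
    if len(groups) != len(categories):
        return list(categories)
    known_by_group = defaultdict(set)
    for category, group in zip(categories, groups):
        if category != "unknown":
            known_by_group[group].add(category)
    recovered = []
    for category, group in zip(categories, groups):
        known = known_by_group[group]
        # Inherit only when the group's labeled pages agree unanimously.
        if category == "unknown" and len(known) == 1:
            recovered.append(next(iter(known)))
        else:
            recovered.append(category)
    return recovered


def consolidate(categories: List[str]) -> Dict[str, List[int]]:
    """Map each final category to the indices of its pages."""
    by_category = defaultdict(list)
    for index, category in enumerate(categories):
        by_category[category].append(index)
    return dict(by_category)
```

A page that sits alone in its group, or in a group with conflicting labels, stays unknown rather than inheriting a guess.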
This is what turns the classifier from a prompt into a workflow. The model is not making one grand decision; it is making a sequence of smaller ones, and the surrounding code gives those decisions structure, validation, fallback behavior, and a recovery path.
Why This Works with Small Private Models
This architecture works because the stages are real model calls, each with a narrow prompt and a constrained output. Instead of asking one model to solve the whole intake flow, you ask it to make a sequence of bounded decisions.
That makes smaller models practical. The model handles judgment, while deterministic code handles validation, routing, fallback behavior, and conservative final labeling.
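One way to picture that division of labor is a small validator wrapped around every model reply. The function name and the normalization steps below are assumptions for illustration. The point is that free-form model text never reaches the routing code unchecked.

```python
from typing import Optional, Set


def parse_choice(raw_reply: str, allowed: Set[str], fallback: Optional[str]) -> Optional[str]:
    """Accept the reply only if it is exactly one of the allowed answers."""
    # Tolerate case, surrounding whitespace, and trailing punctuation or quotes.
    answer = raw_reply.strip().lower().strip(".\"'")
    return answer if answer in allowed else fallback
```

With `allowed = {"document", "photo", "empty"}`, a reply of `" Photo.\n"` is accepted as `"photo"`, while a rambling reply such as `"I think it is a document"` is rejected and replaced by the fallback the caller chose.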
It also makes the system safer. Pages can be dropped, routed as photos, or left as unknown until more context is available, instead of forcing an answer too early.
And it is compatible with private deployment. No stage requires task-specific fine-tuning, and failures stay controlled: if crop detection, rotation, OCR, or grouping fails, the pipeline can still continue conservatively.
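A minimal sketch of how a stage failure can stay contained, assuming each stage is a plain callable and each call site supplies its own conservative fallback. The `run_stage` helper and the logger name are hypothetical.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("pipeline")  # hypothetical logger name


def run_stage(name: str, stage: Callable[[Any], Any], value: Any, fallback: Any) -> Any:
    """Run one stage; on any error, log it and continue with the fallback."""
    try:
        return stage(value)
    except Exception:
        logger.exception("stage %s failed; continuing with fallback", name)
        return fallback
```

For rotation, the natural fallback is the unrotated page. For OCR, it could be empty text, which the labeling rules would then turn into an unknown page instead of a wrong label.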
That, in our experience, is the real lesson.
The question is rarely "how do we make a small model behave like a huge one?" The better question is "how do we redesign the problem so a smaller model can succeed reliably?"
A Practical Mental Model
If you are building your own classifier, think of the system in three layers.
- Normalization: Turn messy files into consistent processing units.
- Zero-shot micro-decisions: Use the model for bounded tasks such as type detection, segmentation, rotation, OCR, and grouping.
- Conservative business routing: Convert extracted evidence into final categories, and keep a safe path for uncertain cases.
This is a much more realistic production architecture than a single giant classification prompt. It is easier to debug, easier to evaluate, and easier to improve over time because every stage can be measured independently.
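Measuring a stage independently can be as simple as the sketch below, which assumes you keep a small set of labeled input and expected-output pairs per stage. The helper name is hypothetical.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def stage_accuracy(
    stage: Callable[[Any], Any],
    labelled_examples: Sequence[Tuple[Any, Any]],  # (input, expected output) pairs
) -> Optional[float]:
    """Fraction of examples where the stage output matches the expected output."""
    if not labelled_examples:
        return None  # nothing to measure
    correct = sum(stage(item) == expected for item, expected in labelled_examples)
    return correct / len(labelled_examples)
```

Running this per stage (type detection, rotation, labeling) shows which step is responsible when end-to-end quality drops.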