Computer Vision in Production: Real Problems, Real Solutions
How I build vision features into real products — document parsing, object detection, OCR pipelines — and the gap between demo accuracy and production accuracy.
Computer vision wasn't something I planned to work with. It found me through the problems I was trying to solve. Users wanted to photograph their medical device screens instead of typing numbers. Finance features needed to extract data from uploaded bank statements. Creative tools needed to process images from users with phones, not studios.
Every time, the lesson was the same: demo accuracy and production accuracy are different numbers, and the gap between them is your primary engineering challenge.
Where Vision Shows Up in Real Products
| Product | Problem | Vision approach |
|---|---|---|
| GlucosePro | Users photograph CGM device screens | Object detection + OCR |
| AviWealth | Upload bank statements → extract transactions | Document layout parsing |
| Nishabdham | Generate thumbnails for poetry recordings | Key frame extraction + CLIP |
| General | User-uploaded images need classification | CLIP zero-shot |
In each case, I didn't choose vision because I wanted to work with vision. The user's natural input was visual, and building a text-based alternative would have made the product worse. Users photograph things. Building systems that work with photographs is the right call.
The Tool Stack
| Task | Tool | Why I chose it |
|---|---|---|
| Document parsing / invoice extraction | LayoutLMv3 / Donut | Purpose-built for structured document understanding |
| General OCR (clean text) | Tesseract | Fast, free, good enough for high-contrast printed text |
| OCR (difficult conditions) | PaddleOCR | Better on rotated text, low contrast, non-Latin scripts |
| Object detection | YOLOv8 (Ultralytics) | Best accuracy-to-deployment ratio; excellent Python API |
| Image classification (no labels) | CLIP zero-shot | Works without training data; great for prototyping |
| Image classification (fine-tuned) | CLIP or EfficientNet | When zero-shot accuracy isn't good enough |
| Key frame extraction | PySceneDetect | Fast scene detection for thumbnail generation |
| Image annotation/labeling | Label Studio | Open-source; self-hostable; good export formats |
| Data augmentation | Albumentations | Fast, composable augmentations for training data |
| GPU inference | Modal | Managed GPU; pay-per-second; no idle costs |
Core Concepts Every Builder Needs
The Vision Pipeline
Every production vision system follows the same basic pattern:
```mermaid
flowchart LR
    A[User image] --> B[Preprocessing]
    B --> C[Detection / Localization]
    C --> D[Crop / Region of Interest]
    D --> E[Recognition / Extraction]
    E --> F[Post-processing + Validation]
    F --> G{Valid output?}
    G -->|Yes| H[Return to user]
    G -->|No| I[Graceful fallback\n+ correction UI]
```

Each stage can fail independently. Building robust production systems means handling failures at every stage, not just at the end.
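The staged pattern can be sketched as a minimal pipeline runner (hypothetical `StageResult` and `run_pipeline` names, not code from any of the products above) that stops at the first failing stage, so the caller knows exactly where to offer a fallback:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class StageResult:
    ok: bool
    value: Any = None
    reason: str = ""

def run_pipeline(image: Any, stages: list[tuple[str, Callable]]) -> dict:
    """Run stages in order; stop at the first failure so the caller
    can trigger the fallback / correction UI for that specific stage."""
    value = image
    for name, stage in stages:
        result = stage(value)
        if not result.ok:
            return {"status": "failed", "stage": name, "reason": result.reason}
        value = result.value
    return {"status": "ok", "value": value}
```

Each real stage (detection, cropping, OCR, validation) becomes one entry in `stages`, and a failure report names the stage, which is what the correction UI needs.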
OCR vs Document Understanding
OCR (Optical Character Recognition): Extracts raw text from an image. Doesn't understand structure or meaning. Tesseract gives you "42.7 Total: $" with no understanding of which number is the total.
Document Understanding: Extracts structured data while understanding document layout (tables, headers, form fields). Tools like LayoutLMv3 and Donut know that a number to the right of "Total:" is the total amount, not just a random number.
For anything beyond simple text extraction — invoices, bank statements, forms — document understanding is what you actually need.
Model Accuracy ≠ Feature Reliability
A model that's 94% accurate means 6% of predictions are wrong. In production:
- Users don't know which 6% is wrong without checking
- The error distribution concentrates on hard cases — bad lighting, unusual formats, edge cases
- These are exactly the cases your most frustrated users will encounter
For every vision feature I build, the critical design question is: what happens in the 6%? A usable fallback is often more important than improving from 94% to 96%.
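As a concrete sketch of "what happens in the 6%": every prediction can pass through a small gate that decides between auto-accepting, asking for confirmation, and falling back to manual entry. The threshold and range values here are illustrative, not production numbers:

```python
def gate_prediction(reading, confidence, threshold=0.8, plausible=(2.0, 30.0)):
    """Route a model output: auto-accept, ask for confirmation,
    or fall back to manual entry. Thresholds are illustrative."""
    lo, hi = plausible
    if reading is None or not (lo <= reading <= hi):
        # Implausible or missing value: don't show it at all
        return {"action": "manual_entry"}
    if confidence < threshold:
        # Borderline: pre-fill the value but make the user confirm
        return {"action": "confirm", "prefill": reading}
    return {"action": "auto_accept", "value": reading}
```

The point of the gate is that the 6% never silently reaches the user as a wrong value; it reaches them as a question.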
Production Example: GlucosePro CGM Reading Capture
The user flow: photograph the CGM device screen → extract the glucose reading → log it with timestamp.
The challenge: CGM screens vary by device, the text is small and sometimes against a curved surface, and users photograph them in all lighting conditions.
Stage 1: Device Detection (YOLOv8)
I needed to crop to just the device screen before running OCR. Without this, Tesseract would try to read everything in the photo — clothing patterns, background text, noise.
```python
from ultralytics import YOLO
import cv2
import numpy as np

class CGMDetector:
    def __init__(self, model_path: str):
        self.model = YOLO(model_path)
        self.conf_threshold = 0.5

    def detect_screen(self, image_path: str) -> dict | None:
        """
        Detect CGM device screen region in a photo.
        Returns bounding box or None if no device detected.
        """
        results = self.model(image_path, conf=self.conf_threshold)
        if not results[0].boxes:
            return None
        # Get highest-confidence detection
        best = max(results[0].boxes, key=lambda b: b.conf.item())
        return {
            "bbox": best.xyxy[0].tolist(),  # [x1, y1, x2, y2]
            "confidence": best.conf.item(),
        }

    def crop_screen(self, image_path: str, bbox: list, padding: int = 10) -> np.ndarray:
        """Crop image to detected screen with slight padding."""
        img = cv2.imread(image_path)
        x1, y1, x2, y2 = [int(c) for c in bbox]
        x1 = max(0, x1 - padding)
        y1 = max(0, y1 - padding)
        x2 = min(img.shape[1], x2 + padding)
        y2 = min(img.shape[0], y2 + padding)
        return img[y1:y2, x1:x2]
```

Training data: I collected 200 photos (my own devices + stock images + beta user contributions), labeled them in Label Studio (~4 hours), and augmented to 400 examples.
```python
import albumentations as A

# Augmentations specifically simulate real-world conditions
augmentation = A.Compose([
    A.RandomBrightnessContrast(brightness_limit=0.3, p=0.5),  # Poor lighting
    A.GaussNoise(var_limit=(10, 50), p=0.3),                  # Camera noise
    A.Blur(blur_limit=3, p=0.2),                              # Hand shake
    A.Perspective(scale=(0.05, 0.1), p=0.3),                  # Angled photos
])
```

Stage 2: Preprocessing for OCR
```python
def preprocess_for_ocr(cropped_image: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(cropped_image, cv2.COLOR_BGR2GRAY)
    # Upscale 2x — Tesseract performs better on larger text
    scaled = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    # Adaptive thresholding — handles uneven lighting better than global threshold
    thresh = cv2.adaptiveThreshold(
        scaled, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2,
    )
    return cv2.fastNlMeansDenoising(thresh, h=10)
```

Stage 3: OCR + Validation
```python
import pytesseract

def extract_glucose_reading(preprocessed: np.ndarray) -> dict:
    config = "--psm 8 --oem 3 -c tessedit_char_whitelist=0123456789."
    text = pytesseract.image_to_string(preprocessed, config=config).strip()
    try:
        reading = float(text)
        if 2.0 <= reading <= 30.0:  # Plausible blood glucose range in mmol/L
            return {"reading": reading, "confidence": "high"}
        else:
            return {"reading": None, "confidence": "failed", "reason": f"Out of range: {reading}"}
    except ValueError:
        return {"reading": None, "confidence": "failed", "reason": "Parse error", "raw_text": text}
```

The Results
| Condition | Accuracy |
|---|---|
| Controlled (good lighting, direct angle) | 94% |
| Typical real user conditions | 78% |
The correction flow mattered as much as the model. When OCR failed, users saw their photo alongside an edit field pre-filled with the best OCR result, and corrected it in about 5 seconds. Without that flow, a 22% failure rate is a feature-breaking bug; with it, it's minor friction.
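One way to sketch that flow (a hypothetical response schema, not the actual GlucosePro API): the server always returns something the client can render, with the raw OCR text as the prefill when extraction failed.

```python
def build_capture_response(ocr_result: dict) -> dict:
    """Shape the server response so the client renders either a
    one-tap confirmation or a pre-filled correction field.
    Hypothetical schema for illustration."""
    if ocr_result.get("confidence") == "high":
        return {"mode": "confirm", "reading": ocr_result["reading"]}
    # Failed OCR: surface whatever text was read as the editable prefill
    return {
        "mode": "correct",
        "prefill": ocr_result.get("raw_text", ""),
        "reason": ocr_result.get("reason", "unreadable"),
    }
```

The key design choice is that "failed" is a first-class response mode, not an error code the client has to special-case.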
Document Parsing: Bank Statements
AviWealth needed to extract transactions from bank statement PDFs/photos. LayoutLMv3 handles this better than raw OCR because it understands document structure.
```python
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
from PIL import Image
import torch

class DocumentParser:
    def __init__(self):
        self.processor = LayoutLMv3Processor.from_pretrained(
            "microsoft/layoutlmv3-base"
        )
        self.model = LayoutLMv3ForTokenClassification.from_pretrained(
            "your-fine-tuned-bank-statement-model"
        )

    def extract_transactions(self, image_path: str) -> list[dict]:
        image = Image.open(image_path).convert("RGB")
        # Processor handles OCR + layout encoding automatically
        encoding = self.processor(
            image, return_tensors="pt", truncation=True
        )
        with torch.no_grad():
            outputs = self.model(**encoding)
        predictions = outputs.logits.argmax(-1).squeeze().tolist()
        return self._decode_to_transactions(predictions, encoding)
```

The honest reality: Bank statement formats vary enormously. I have specific parsers for the 8 most common Australian bank formats, and a fallback generic parser. Generic achieves ~70% extraction accuracy. Specific parsers achieve 90-95%. Invest in format-specific handling for your highest-volume document types.
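The dispatch between format-specific and generic parsers can be sketched as a small registry (illustrative names; the real parsers are bank-specific functions, not shown here):

```python
from typing import Any, Callable

class StatementParserRegistry:
    """Map bank identifiers to format-specific parsers, with a
    generic fallback for anything unrecognized."""

    def __init__(self, generic_parser: Callable):
        self._parsers: dict[str, Callable] = {}
        self._generic = generic_parser

    def register(self, bank_id: str, parser: Callable) -> None:
        self._parsers[bank_id] = parser

    def parse(self, bank_id: str, document: Any) -> dict:
        parser = self._parsers.get(bank_id, self._generic)
        return {
            "parser": "specific" if bank_id in self._parsers else "generic",
            "transactions": parser(document),
        }
```

Recording which parser ran is deliberate: it lets you track per-format accuracy and decide which format deserves the next specific parser.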
Production Deployment
Vision models are heavy. Design for async:
```python
# Modal for GPU inference — pay-per-second, no idle costs
import modal

stub = modal.Stub("vision-service")

image = modal.Image.debian_slim().pip_install(
    "ultralytics", "pytesseract", "opencv-python-headless"
)

@stub.function(image=image, gpu="T4", memory=4096, timeout=60)
def process_image(image_bytes: bytes, task: str) -> dict:
    import cv2
    import numpy as np

    nparr = np.frombuffer(image_bytes, np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    if task == "cgm_reading":
        return extract_cgm_reading(img)
    elif task == "bank_statement":
        return parse_document(img)
    raise ValueError(f"Unknown task: {task}")
```

Design principle: Never make users wait for GPU cold starts (10-30 seconds). Show "analyzing your photo..." and process in background. Update with results asynchronously.
What I Learned the Hard Way
"Real user images" is a completely different category. Every model demo uses clean, well-lit, high-contrast inputs. Real users photograph things at night, upside-down, through glass, with smeared lenses. My "terrible photos" test set — intentionally bad photos of test content — is the most valuable dataset I have. Test ugly inputs before building a production system.
OCR quality is a UX problem, not just an accuracy problem. When OCR fails, users need a graceful way to correct the result. Build the correction flow before spending time on accuracy improvements. The correction UX has higher ROI than most model improvements.
Document parsing is harder than it looks. LayoutLMv3 is impressive in demos. Bank statement parsing in the wild hits edge cases every week — unusual table formats, scanned vs digital-native PDFs, multi-page documents with inconsistent headers. Budget 3x your initial estimate.
The compute cost of vision is significant. Image models are 10-50x more expensive to run than text models at equivalent throughput. Design the UX for async processing: "Your photo is being analyzed" rather than blocking the user.
Collecting 200 labeled images is doable. The assumption that "I don't have training data" is often wrong. For the CGM detector, 200 labeled images took 4 hours total — 2 hours photographing and 2 hours labeling in Label Studio. This is accessible, not research-scale work.