Machine Learning in Production: A Practitioner's Guide
How I actually use classical ML in real products — when to reach for it, what stack to use, and the lessons that come from shipping, not studying.
I'm a practitioner, not a researcher. Everything here comes from building ML features into real products — AviWealth, GlucosePro, Thinki.sh — not from reading papers or running benchmarks.
The question I get asked most: "Should I use ML or just use an LLM?" The answer is usually both, applied to different parts of the problem. Understanding when each tool fits is the thing that separates expensive over-engineering from the right call.
When ML Beats LLMs (and Vice Versa)
The mental model I use:
```mermaid
flowchart TD
    A[New prediction task] --> B{Have labeled data?}
    B -->|Yes, 1000+ examples| C{Structured input?}
    B -->|No or little data| D[Start with prompting]
    C -->|Tabular/structured| E[Classical ML first]
    C -->|Unstructured text/image| F{Volume + latency critical?}
    F -->|Yes| G[Fine-tune a model]
    F -->|No| D
    D --> H{Works well enough?}
    H -->|Yes| I[Ship it]
    H -->|No| J{What's the bottleneck?}
    J -->|Consistency| G
    J -->|Knowledge| K[RAG]
    J -->|Task structure| E
```

Reach for classical ML when:
- You have labeled training data (thousands of examples, not dozens)
- The input is structured/tabular (numbers, categories, dates)
- You need sub-50ms latency at scale
- The prediction task is narrow and stable (doesn't change often)
- Explainability matters — stakeholders need to understand why
Reach for LLMs when:
- The input is unstructured text and volume is moderate
- You need flexibility over consistency
- You don't have (or can't afford to build) a labeled dataset
- The task changes frequently and retraining would be constant
- You need to handle a long tail of edge cases
The honest version: Start with prompting. It's faster and often good enough. When it's not — when you need consistency, speed, or you're hitting API costs at scale — that's when ML earns its place.
The Tools I Use and Why
| Problem | Tool | Why I chose it |
|---|---|---|
| Tabular prediction (churn, fraud, classification) | CatBoost | Handles mixed types without preprocessing; best defaults out of the box |
| Time-series forecasting | Prophet + NeuralForecast | Prophet for trend/seasonality decomposition; NeuralForecast when you need neural accuracy |
| Anomaly detection | Isolation Forest | Unsupervised; works when you don't have labeled anomalies |
| Recommendation (collaborative filtering) | Implicit (matrix factorization) | Strong baseline before anything more complex |
| Feature engineering | Featuretools + pandas | Automated feature generation for tabular data |
| Data versioning | LakeFS | Git for datasets — reproducibility without a full data platform |
| Experiment tracking | Weights & Biases | Visual training runs; better than MLflow for small teams |
| Model serving | BentoML → AWS Lambda | Low-ops path from model artifact to HTTP endpoint |
| Drift monitoring | Evidently AI | Production data vs training data comparison with minimal setup |
Core Concepts Every Builder Needs
Bias-Variance Tradeoff
The fundamental tension in ML: a model that's too simple underfits (high bias, misses real patterns), a model that's too complex overfits (high variance, memorizes training data and fails on new data).
In practice, this means:
- Gradient boosting (CatBoost/XGBoost) — good balance; handles heterogeneous tabular data well
- Deep networks — high capacity, tend to overfit without regularization
- Linear models — low variance, underfit complex relationships but highly interpretable
For most product ML problems, CatBoost is the right first answer. It regularizes well by default and performs strongly on medium-sized tabular datasets without hyperparameter tuning.
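The underfit/overfit tension is easy to see on synthetic data. A minimal sklearn sketch (toy dataset, not from the projects above) comparing a depth-1 tree, a mid-depth tree, and an unbounded tree:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset standing in for a tabular product problem
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 5, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"depth={depth}: train={tree.score(X_tr, y_tr):.2f} "
          f"test={tree.score(X_te, y_te):.2f}")
# depth=1 scores low on both splits (underfit); depth=None hits 1.00 on
# train but drops on test (overfit); moderate depth usually lands between
```

Gradient boosting wins on tabular data largely because it manages this tradeoff for you: many shallow trees, each correcting the last.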
Evaluation Metrics That Actually Matter
Accuracy is almost always the wrong metric. The right metric depends on the cost asymmetry:
| Scenario | What to optimize | Why |
|---|---|---|
| Fraud detection | Recall (catch all fraud) | False negatives (missed fraud) are expensive |
| Spam filter | Precision (avoid false positives) | False positives (blocking legitimate mail) damage user trust |
| Medical screening | AUC-ROC | Balance across all thresholds; choose threshold based on clinical context |
| Recommendation | NDCG, precision@k | Rank quality matters more than classification accuracy |
| Anomaly detection | F1 at operating threshold | Balance precision and recall at the threshold you'll deploy |
I define the evaluation metric before I train anything. It forces clarity about what "better" means for this specific use case.
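Deciding the metric first also decides the operating threshold. A hedged sketch of picking a threshold from the precision-recall curve (toy data; the 90% precision floor is an illustrative spam-filter-style requirement, not one from the projects above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# precision[i] and recall[i] correspond to thresholds[i]
# (the arrays have one extra trailing element, hence the [:-1])
precision, recall, thresholds = precision_recall_curve(y_te, probs)

# Lowest threshold that keeps precision >= 0.9 — i.e., maximize
# recall subject to the precision floor the product requires
ok = precision[:-1] >= 0.9
threshold = float(thresholds[ok][0]) if ok.any() else 0.5
print(f"operating threshold: {threshold:.2f}")
```

The point is that "which metric" and "which threshold" are product decisions, encoded once, before any model comparison happens.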
The Data Problem
Most ML failures I've seen are data problems. The model is fine. The data is the problem.
Common failure modes:
- Label leakage — a feature encodes information that isn't available until after the outcome you're predicting (e.g., using "account closed" as a feature to predict "account will close")
- Training/serving skew — the feature distribution at training time doesn't match deployment time
- Survivorship bias — your training data only includes customers who stayed, so you can't predict churn accurately
- Class imbalance — 0.1% fraud rate means a model that predicts "not fraud" for everything gets 99.9% accuracy
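The class-imbalance trap is worth seeing concretely. A toy sketch with a 0.1% positive rate and a majority-class baseline:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 10,000 transactions, 10 of them fraud (0.1%)
y = np.zeros(10_000, dtype=int)
y[:10] = 1
X = np.zeros((10_000, 1))  # features are irrelevant to this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(accuracy_score(y, pred))  # 0.999 — looks excellent
print(recall_score(y, pred))    # 0.0   — catches zero fraud
```

This is also why a dumb baseline belongs in every evaluation: any real model has to beat the "predict the majority class" number on the metric that matters, not on accuracy.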
Production Pattern: The AviWealth Anomaly Detector
AviWealth's core value is helping immigrants understand their Australian finances. One feature: flagging months where spending patterns are unusually off.
The wrong first approach:
```python
# Naive: flag anything above 2 standard deviations
threshold = mean + 2 * std
flagged = monthly_spend > threshold
```

This fires too often (normal variance triggers it constantly) and misses real patterns (three modestly elevated months in a row signal something, but none individually breaches the threshold).
What I actually built:
```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# SPEND_CATEGORIES, load_model, and identify_top_contributors
# are defined elsewhere in the module

def build_anomaly_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Build features that capture spend patterns, not just spend levels.
    """
    features = pd.DataFrame()
    for category in SPEND_CATEGORIES:
        # Current month vs 90-day baseline
        features[f'{category}_vs_baseline'] = (
            df[f'{category}_spend_30d'] /
            (df[f'{category}_spend_90d'] / 3 + 1)  # +1 to avoid div by zero
        )
        # Month-over-month change
        features[f'{category}_mom_change'] = (
            df[f'{category}_spend_30d'] - df[f'{category}_spend_30d'].shift(1)
        )
        # Seasonality adjustment (same month last year)
        features[f'{category}_seasonal_ratio'] = (
            df[f'{category}_spend_30d'] /
            (df[f'{category}_spend_ytd_avg'] + 1)
        )
    return features

def train_anomaly_detector(user_history: pd.DataFrame):
    features = build_anomaly_features(user_history)
    scaler = StandardScaler()
    scaled = scaler.fit_transform(features)
    model = IsolationForest(
        contamination=0.05,  # Expect ~5% anomalous months
        random_state=42
    )
    model.fit(scaled)
    return model, scaler

# Serving: runs as a Lambda triggered on monthly aggregation
def score_month(user_id: str, month_features: dict) -> dict:
    model, scaler = load_model(user_id)
    features = pd.DataFrame([month_features])
    scaled = scaler.transform(features)
    score = model.score_samples(scaled)[0]  # More negative = more anomalous
    return {
        "anomaly_score": float(score),
        "flagged": score < -0.3,  # Threshold tuned on beta user feedback
        "top_contributors": identify_top_contributors(features, model)
    }
```

Results:
- Runs in 120ms per user per month
- False positive rate in beta: 8% (users accepted it)
- One beta user found a fraudulent direct debit through the alert
The isolation forest was specifically chosen because I didn't have labeled "anomalous months" — it's unsupervised. If I'd had labeled data, I would have used a classifier.
The ML Pipeline in Practice
```mermaid
flowchart LR
    A[Raw Data] --> B[Feature Engineering]
    B --> C[Train/Val/Test Split]
    C --> D[Model Training]
    D --> E[Evaluation vs Baseline]
    E --> F{Good enough?}
    F -->|No| G[Feature analysis + iterate]
    G --> B
    F -->|Yes| H[Package with BentoML]
    H --> I[Shadow deployment]
    I --> J{Matches expectations?}
    J -->|Yes| K[Production rollout]
    J -->|No| L[Investigate serving skew]
    K --> M[Drift monitoring]
    M --> N{Drift detected?}
    N -->|Yes| D
```

Shadow deployment is the step most people skip. Before routing real traffic to a new model, run it in parallel with the existing system for 2-4 weeks. Compare outputs without affecting users. This catches serving skew and edge cases you'll never see in test data.
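The pattern itself is small. A minimal sketch (hypothetical helper names; `predict` stands in for whatever interface your models expose): serve the incumbent, score the candidate on the same input, log the disagreement, and never let the shadow break the live path.

```python
def score_with_shadow(features, prod_model, shadow_model, shadow_log):
    """Return the production prediction; record the shadow model's
    answer on the same input for offline comparison."""
    prod_pred = prod_model.predict(features)
    try:
        shadow_pred = shadow_model.predict(features)
        shadow_log.append({
            "prod": prod_pred,
            "shadow": shadow_pred,
            "agree": prod_pred == shadow_pred,
        })
    except Exception:
        # A failing shadow model must never affect users
        pass
    return prod_pred
```

After a few weeks, the agreement rate and the logged disagreements tell you whether the candidate is safe to promote.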
Deployment: BentoML → Lambda
My default serving pattern for models that don't need real-time:
```python
# service.py
import bentoml
import pandas as pd

@bentoml.service
class AnomalyDetector:
    def __init__(self):
        self.model = bentoml.sklearn.load_model("anomaly_detector:latest")
        self.scaler = bentoml.sklearn.load_model("anomaly_scaler:latest")

    @bentoml.api
    def score(self, features: dict) -> dict:
        X = pd.DataFrame([features])
        X_scaled = self.scaler.transform(X)
        score = self.model.score_samples(X_scaled)[0]
        return {
            "anomaly_score": float(score),
            "flagged": bool(score < -0.3)
        }
```

```shell
# Build container
bentoml build
bentoml containerize anomaly_detector:latest

# Deploy to Lambda via ECR
aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
docker push $ECR_URI/anomaly-detector:latest
```

For models that need <50ms latency at high QPS, I serve on a dedicated instance instead of Lambda. Lambda cold starts add 200-500ms, which is unacceptable for synchronous user-facing predictions.
Drift Monitoring with Evidently
The model I didn't set up monitoring for degraded silently for 3 months. Setup cost: 2 hours. Cost of not having it: weeks of debugging.
```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

def run_weekly_drift_check(reference_data: pd.DataFrame, production_data: pd.DataFrame):
    report = Report(metrics=[
        DataDriftPreset(),
        DataQualityPreset()
    ])
    report.run(
        reference_data=reference_data,
        current_data=production_data
    )
    drift_results = report.as_dict()

    # Alert if >20% of features have drifted
    drifted_features = drift_results["metrics"][0]["result"]["number_of_drifted_columns"]
    total_features = drift_results["metrics"][0]["result"]["number_of_columns"]
    drift_ratio = drifted_features / total_features

    if drift_ratio > 0.2:
        alert_slack(f"Model drift detected: {drifted_features}/{total_features} features drifted")
    return drift_results
```

What I Learned the Hard Way
Start with a rule, then upgrade. My first anomaly detector was "flag if this month's spend is 40% above the 3-month average." One day to build. It caught 70% of what the ML model catches. The model took 3 weeks. For most use cases, ship the rule, learn from real user feedback, then build the model.
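The rule version really is one function. A sketch of the kind of baseline described (pandas; `monthly_spend` is assumed to be one user's month-by-month total, oldest first):

```python
import pandas as pd

def flag_unusual_spend(monthly_spend: pd.Series) -> pd.Series:
    """Flag any month more than 40% above the trailing 3-month average."""
    baseline = monthly_spend.rolling(window=3).mean().shift(1)
    return monthly_spend > baseline * 1.4

spend = pd.Series([1000, 1100, 1050, 1900])
print(flag_unusual_spend(spend).tolist())  # [False, False, False, True]
```

The first three months have no complete trailing baseline, so they're never flagged — which is exactly the kind of behavior that's trivial to reason about in a rule and opaque in a model.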
Data quality > model quality. Every hour I've spent improving features has outperformed every hour I've spent tuning hyperparameters. A better model can't compensate for bad features; good features make even simple models perform well.
The model is not the product. I spent 3 weeks improving model accuracy by 4 percentage points. Then I spent 3 days improving how the alert was displayed — clearer message, specific category breakdown, "this is unusual because..." explanation. The UX improvement drove 5x more adoption than the model improvement.
Monitor drift from day one. I didn't set up Evidently until 3 months after launch. By then, the model had silently degraded because user spending patterns shifted when I changed the expense categorization logic. Drift monitoring from launch would have caught it in week 2.
Don't fine-tune what you can prompt. I spent two weeks fine-tuning a classification model for a task that a well-structured prompt on Claude handled in a day. Prompting should always be the first attempt when the input is text.