Machine Learning in Production: A Practitioner's Guide
How I actually use classical ML in real products — when to reach for it, what stack to use, and the lessons that come from shipping, not studying.
I'm a practitioner, not a researcher. Everything here comes from building ML features into real products — AviWealth, GlucosePro, Thinki.sh — not from reading papers or running benchmarks.
The question I get asked most: "Should I use ML or just use an LLM?" The answer is usually both, applied to different parts of the problem. Understanding when each tool fits is the thing that separates expensive over-engineering from the right call.
When ML Beats LLMs (and Vice Versa)
The mental model I use:
```mermaid
flowchart TD
    A[New prediction task] --> B{Have labeled data?}
    B -->|Yes, 1000+ examples| C{Structured input?}
    B -->|No or little data| D[Start with prompting]
    C -->|Tabular/structured| E[Classical ML first]
    C -->|Unstructured text/image| F{Volume + latency critical?}
    F -->|Yes| G[Fine-tune a model]
    F -->|No| D
    D --> H{Works well enough?}
    H -->|Yes| I[Ship it]
    H -->|No| J{What's the bottleneck?}
    J -->|Consistency| G
    J -->|Knowledge| K[RAG]
    J -->|Task structure| E
```

Reach for classical ML when:
- You have labeled training data (thousands of examples, not dozens)
- The input is structured/tabular (numbers, categories, dates)
- You need sub-50ms latency at scale
- The prediction task is narrow and stable (doesn't change often)
- Explainability matters — stakeholders need to understand why
Reach for LLMs when:
- The input is unstructured text and volume is moderate
- You need flexibility over consistency
- You don't have (or can't afford to build) a labeled dataset
- The task changes frequently and retraining would be constant
- You need to handle a long tail of edge cases
The honest version: Start with prompting. It's faster and often good enough. When it's not — when you need consistency, speed, or you're hitting API costs at scale — that's when ML earns its place.
The Tools I Use and Why
| Problem | Tool | Why I chose it |
|---|---|---|
| Tabular prediction (churn, fraud, classification) | CatBoost | Handles mixed types without preprocessing; best defaults out of the box |
| Time-series forecasting | Prophet + NeuralForecast | Prophet for trend/seasonality decomposition; NeuralForecast when you need neural accuracy |
| Anomaly detection | Isolation Forest | Unsupervised; works when you don't have labeled anomalies |
| Recommendation (collaborative filtering) | Implicit (matrix factorization) | Strong baseline before anything more complex |
| Feature engineering | Featuretools + pandas | Automated feature generation for tabular data |
| Data versioning | LakeFS | Git for datasets — reproducibility without a full data platform |
| Experiment tracking | Weights & Biases | Visual training runs; better than MLflow for small teams |
| Model serving | BentoML → AWS Lambda | Low-ops path from model artifact to HTTP endpoint |
| Drift monitoring | Evidently AI | Production data vs training data comparison with minimal setup |
Core Concepts Every Builder Needs
Bias-Variance Tradeoff
The fundamental tension in ML: a model that's too simple underfits (high bias, misses real patterns), a model that's too complex overfits (high variance, memorizes training data and fails on new data).
In practice, this means:
- Gradient boosting (CatBoost/XGBoost) — good balance; handles heterogeneous tabular data well
- Deep networks — high capacity, tend to overfit without regularization
- Linear models — low variance, underfit complex relationships but highly interpretable
For most product ML problems, CatBoost is the right first answer. It regularizes well by default and performs strongly on medium-sized tabular datasets without hyperparameter tuning.
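The underfit/overfit tension is easy to see on synthetic data. A minimal sklearn sketch (toy dataset, not from the projects above) comparing a depth-1 tree, a mid-depth tree, and an unbounded tree:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset standing in for a tabular product problem
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 5, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"depth={depth}: train={tree.score(X_tr, y_tr):.2f} "
          f"test={tree.score(X_te, y_te):.2f}")
# depth=1 scores low on both splits (underfit); depth=None hits 1.00 on
# train but drops on test (overfit); moderate depth usually lands between
```

Gradient boosting wins on tabular data largely because it manages this tradeoff for you: many shallow trees, each correcting the last.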
Evaluation Metrics That Actually Matter
Accuracy is almost always the wrong metric. The right metric depends on the cost asymmetry:
| Scenario | What to optimize | Why |
|---|---|---|
| Fraud detection | Recall (catch all fraud) | False negatives (missed fraud) are expensive |
| Spam filter | Precision (avoid false positives) | False positives (blocking legitimate mail) damage user trust |
| Medical screening | AUC-ROC | Balance across all thresholds; choose threshold based on clinical context |
| Recommendation | NDCG, precision@k | Rank quality matters more than classification accuracy |
| Anomaly detection | F1 at operating threshold | Balance precision and recall at the threshold you'll deploy |
I define the evaluation metric before I train anything. It forces clarity about what "better" means for this specific use case.
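Deciding the metric first also decides the operating threshold. A hedged sketch of picking a threshold from the precision-recall curve (toy data; the 90% precision floor is an illustrative spam-filter-style requirement, not one from the projects above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# precision[i] and recall[i] correspond to thresholds[i]
# (the arrays have one extra trailing element, hence the [:-1])
precision, recall, thresholds = precision_recall_curve(y_te, probs)

# Lowest threshold that keeps precision >= 0.9 — i.e., maximize
# recall subject to the precision floor the product requires
ok = precision[:-1] >= 0.9
threshold = float(thresholds[ok][0]) if ok.any() else 0.5
print(f"operating threshold: {threshold:.2f}")
```

The point is that "which metric" and "which threshold" are product decisions, encoded once, before any model comparison happens.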
The Data Problem
Most ML failures I've seen are data problems. The model is fine. The data is the problem.
Common failure modes:
- Label leakage — a feature encodes information that isn't available until after the outcome you're predicting (e.g., using "account closed" as a feature to predict "account will close")
- Training/serving skew — the feature distribution at training time doesn't match deployment time
- Survivorship bias — your training data only includes customers who stayed, so you can't predict churn accurately
- Class imbalance — 0.1% fraud rate means a model that predicts "not fraud" for everything gets 99.9% accuracy
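The class-imbalance trap is worth seeing concretely. A toy sketch with a 0.1% positive rate and a majority-class baseline:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 10,000 transactions, 10 of them fraud (0.1%)
y = np.zeros(10_000, dtype=int)
y[:10] = 1
X = np.zeros((10_000, 1))  # features are irrelevant to this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(accuracy_score(y, pred))  # 0.999 — looks excellent
print(recall_score(y, pred))    # 0.0   — catches zero fraud
```

This is also why a dumb baseline belongs in every evaluation: any real model has to beat the "predict the majority class" number on the metric that matters, not on accuracy.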
Production Pattern: The AviWealth Anomaly Detector
AviWealth's core value is helping immigrants understand their Australian finances. One feature: flagging months where spending patterns are unusually off.
The wrong first approach:
```python
# Naive: flag anything above 2 standard deviations
threshold = mean + 2 * std
flagged = monthly_spend > threshold
```

This fires too often (normal variance triggers it constantly) and misses real patterns (three modestly elevated months in a row signal something, but none individually breaches the threshold).
What I actually built:
```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# SPEND_CATEGORIES, load_model, and identify_top_contributors
# are defined elsewhere in the module

def build_anomaly_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Build features that capture spend patterns, not just spend levels.
    """
    features = pd.DataFrame()
    for category in SPEND_CATEGORIES:
        # Current month vs 90-day baseline
        features[f'{category}_vs_baseline'] = (
            df[f'{category}_spend_30d'] /
            (df[f'{category}_spend_90d'] / 3 + 1)  # +1 to avoid div by zero
        )
        # Month-over-month change
        features[f'{category}_mom_change'] = (
            df[f'{category}_spend_30d'] - df[f'{category}_spend_30d'].shift(1)
        )
        # Seasonality adjustment (same month last year)
        features[f'{category}_seasonal_ratio'] = (
            df[f'{category}_spend_30d'] /
            (df[f'{category}_spend_ytd_avg'] + 1)
        )
    return features

def train_anomaly_detector(user_history: pd.DataFrame):
    features = build_anomaly_features(user_history)
    scaler = StandardScaler()
    scaled = scaler.fit_transform(features)
    model = IsolationForest(
        contamination=0.05,  # Expect ~5% anomalous months
        random_state=42
    )
    model.fit(scaled)
    return model, scaler

# Serving: runs as a Lambda triggered on monthly aggregation
def score_month(user_id: str, month_features: dict) -> dict:
    model, scaler = load_model(user_id)
    features = pd.DataFrame([month_features])
    scaled = scaler.transform(features)
    score = model.score_samples(scaled)[0]  # More negative = more anomalous
    return {
        "anomaly_score": float(score),
        "flagged": score < -0.3,  # Threshold tuned on beta user feedback
        "top_contributors": identify_top_contributors(features, model)
    }
```

Results:
- Runs in 120ms per user per month
- False positive rate in beta: 8% (users accepted it)
- One beta user found a fraudulent direct debit through the alert
The isolation forest was specifically chosen because I didn't have labeled "anomalous months" — it's unsupervised. If I'd had labeled data, I would have used a classifier.
The ML Pipeline in Practice
```mermaid
flowchart LR
    A[Raw Data] --> B[Feature Engineering]
    B --> C[Train/Val/Test Split]
    C --> D[Model Training]
    D --> E[Evaluation vs Baseline]
    E --> F{Good enough?}
    F -->|No| G[Feature analysis + iterate]
    G --> B
    F -->|Yes| H[Package with BentoML]
    H --> I[Shadow deployment]
    I --> J{Matches expectations?}
    J -->|Yes| K[Production rollout]
    J -->|No| L[Investigate serving skew]
    K --> M[Drift monitoring]
    M --> N{Drift detected?}
    N -->|Yes| D
```

Shadow deployment is the step most people skip. Before routing real traffic to a new model, run it in parallel with the existing system for 2-4 weeks. Compare outputs without affecting users. This catches serving skew and edge cases you'll never see in test data.
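The pattern itself is small. A minimal sketch (hypothetical helper names; `predict` stands in for whatever interface your models expose): serve the incumbent, score the candidate on the same input, log the disagreement, and never let the shadow break the live path.

```python
def score_with_shadow(features, prod_model, shadow_model, shadow_log):
    """Return the production prediction; record the shadow model's
    answer on the same input for offline comparison."""
    prod_pred = prod_model.predict(features)
    try:
        shadow_pred = shadow_model.predict(features)
        shadow_log.append({
            "prod": prod_pred,
            "shadow": shadow_pred,
            "agree": prod_pred == shadow_pred,
        })
    except Exception:
        # A failing shadow model must never affect users
        pass
    return prod_pred
```

After a few weeks, the agreement rate and the logged disagreements tell you whether the candidate is safe to promote.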
Deployment: BentoML → Lambda
My default serving pattern for models that don't need real-time:
```python
# service.py
import bentoml
import pandas as pd

@bentoml.service
class AnomalyDetector:
    def __init__(self):
        self.model = bentoml.sklearn.load_model("anomaly_detector:latest")
        self.scaler = bentoml.sklearn.load_model("anomaly_scaler:latest")

    @bentoml.api
    def score(self, features: dict) -> dict:
        X = pd.DataFrame([features])
        X_scaled = self.scaler.transform(X)
        score = self.model.score_samples(X_scaled)[0]
        return {
            "anomaly_score": float(score),
            "flagged": bool(score < -0.3)
        }
```

```shell
# Build container
bentoml build
bentoml containerize anomaly_detector:latest

# Deploy to Lambda via ECR
aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
docker push $ECR_URI/anomaly-detector:latest
```

For models that need <50ms latency at high QPS, I serve on a dedicated instance instead of Lambda. Lambda cold starts add 200-500ms, which is unacceptable for synchronous user-facing predictions.
Drift Monitoring with Evidently
The model I didn't set up monitoring for degraded silently for 3 months. Setup cost: 2 hours. Cost of not having it: weeks of debugging.
```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

def run_weekly_drift_check(reference_data: pd.DataFrame, production_data: pd.DataFrame):
    report = Report(metrics=[
        DataDriftPreset(),
        DataQualityPreset()
    ])
    report.run(
        reference_data=reference_data,
        current_data=production_data
    )
    drift_results = report.as_dict()

    # Alert if >20% of features have drifted
    drifted_features = drift_results["metrics"][0]["result"]["number_of_drifted_columns"]
    total_features = drift_results["metrics"][0]["result"]["number_of_columns"]
    drift_ratio = drifted_features / total_features

    if drift_ratio > 0.2:
        alert_slack(f"Model drift detected: {drifted_features}/{total_features} features drifted")
    return drift_results
```

What I Learned the Hard Way
Start with a rule, then upgrade. My first anomaly detector was "flag if this month's spend is 40% above the 3-month average." One day to build. It caught 70% of what the ML model catches. The model took 3 weeks. For most use cases, ship the rule, learn from real user feedback, then build the model.
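The rule version really is one function. A sketch of the kind of baseline described (pandas; `monthly_spend` is assumed to be one user's month-by-month total, oldest first):

```python
import pandas as pd

def flag_unusual_spend(monthly_spend: pd.Series) -> pd.Series:
    """Flag any month more than 40% above the trailing 3-month average."""
    baseline = monthly_spend.rolling(window=3).mean().shift(1)
    return monthly_spend > baseline * 1.4

spend = pd.Series([1000, 1100, 1050, 1900])
print(flag_unusual_spend(spend).tolist())  # [False, False, False, True]
```

The first three months have no complete trailing baseline, so they're never flagged — which is exactly the kind of behavior that's trivial to reason about in a rule and opaque in a model.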
Data quality > model quality. Every hour I've spent improving features has outperformed every hour I've spent tuning hyperparameters. A better model can't compensate for bad features; good features make even simple models perform well.
The model is not the product. I spent 3 weeks improving model accuracy by 4 percentage points. Then I spent 3 days improving how the alert was displayed — clearer message, specific category breakdown, "this is unusual because..." explanation. The UX improvement drove 5x more adoption than the model improvement.
Monitor drift from day one. I didn't set up Evidently until 3 months after launch. By then, the model had silently degraded because user spending patterns shifted when I changed the expense categorization logic. Drift monitoring from launch would have caught it in week 2.
Don't fine-tune what you can prompt. I spent two weeks fine-tuning a classification model for a task that a well-structured prompt on Claude handled in a day. Prompting should always be the first attempt when the input is text.