Observability Tools: An Opinionated Stack
The tools I actually use for observability — logs, metrics, traces, and cost — with honest takes on when each earns its place and when you should skip it.
There are too many observability tools. Every vendor promises a single pane of glass, complete visibility, and AI-powered insights. Most of it is noise.
This is my actual stack — what I use, why I use it, and what I've found genuinely valuable vs what I've added out of best-practice FOMO.
The Core Principle
Before picking tools, establish the principle: buy observability, don't build it.
Observability infrastructure is not a differentiator. Your logging format isn't a competitive advantage. The time spent building custom dashboards is time not spent on the product.
Pick a managed stack that gives you:
- Structured log ingestion
- Metrics with alerting
- Distributed traces
- Cost you can afford as you scale
The tools below are what I've landed on after going through too many alternatives.
The Stack
flowchart LR
  subgraph Application
    A[Your Service] -->|OTel SDK| C[OpenTelemetry Collector]
  end
  subgraph Telemetry Pipeline
    C -->|Traces| H[Honeycomb]
    C -->|Metrics| P[Prometheus]
    C -->|Logs| L[Loki]
  end
  subgraph Visualization
    P --> G[Grafana]
    L --> G
    H --> H2[Honeycomb UI]
  end
  subgraph Alerting
    G -->|SLO alerts| PD[PagerDuty]
    H -->|Trace-based alerts| PD
    PD -->|Page| OC[On-call engineer]
  end
OpenTelemetry: The Foundation
What it is: The vendor-neutral standard for collecting traces, metrics, and logs from your application.
Why it matters: It's the exit ramp from vendor lock-in. Instrument your code once with OTel, then point the collector at any backend. When you want to switch from Datadog to Honeycomb (or vice versa), you change collector config, not application code.
Setup (Node.js):
// instrumentation.ts — load before everything else with NODE_OPTIONS
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { resourceFromAttributes } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME ?? 'unknown',
    [ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '0.0.0',
    environment: process.env.NODE_ENV ?? 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
    headers: { 'x-honeycomb-team': process.env.HONEYCOMB_API_KEY ?? '' },
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
  })],
});
sdk.start();
# package.json start script
NODE_OPTIONS='--require ./instrumentation.js' node dist/index.js
My take: Use OTel from day one, even if you're only pointing at one backend. The marginal cost is zero and the optionality is valuable.
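The vendor-swap claim above is really just collector configuration. Here's a sketch of a collector config wiring up the pipelines from the diagram — endpoints and exporter names are illustrative and depend on your collector distribution and version (the Loki exporter in particular has been shifting toward OTLP ingestion in recent contrib releases), so check your version's docs before copying:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/honeycomb:
    endpoint: "api.honeycomb.io:443"
    headers:
      x-honeycomb-team: "${HONEYCOMB_API_KEY}"
  prometheus:
    endpoint: "0.0.0.0:8889"   # scraped by Prometheus
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/honeycomb]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      exporters: [loki]
```

Switching backends means editing the `exporters` section and redeploying the collector; the application keeps speaking OTLP and never knows the difference.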
Traces: Honeycomb
What it is: Distributed tracing backend optimized for high-cardinality queries and exploratory debugging.
Why I use it over alternatives: Honeycomb's data model treats every trace event as a structured event you can query with any field. This means I can ask: "Show me traces where user_id=123 AND response_time > 2000ms AND error occurred in the payment service" — in real time, without pre-aggregating.
Datadog can do this too, but Honeycomb is 3-5x cheaper for the same query capability.
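To see why the event model matters, here's that query expressed over plain structured objects — a toy illustration of the concept, not Honeycomb's actual API. Every field on the event is a first-class filter, with no pre-declared dimensions:

```typescript
// Toy model: each trace span is a flat, structured event.
// Any field can be filtered on ad hoc, without pre-aggregation.
interface TraceEvent {
  traceId: string;
  service: string;
  userId: string;
  durationMs: number;
  error: boolean;
}

const events: TraceEvent[] = [
  { traceId: 'a1', service: 'payment', userId: '123', durationMs: 2400, error: true },
  { traceId: 'b2', service: 'payment', userId: '123', durationMs: 90, error: false },
  { traceId: 'c3', service: 'orders', userId: '456', durationMs: 3100, error: true },
];

// "user_id=123 AND response_time > 2000ms AND error in the payment service"
const matches = events.filter(
  (e) => e.userId === '123' && e.durationMs > 2000 && e.error && e.service === 'payment'
);

console.log(matches.map((e) => e.traceId)); // → [ 'a1' ]
```

The catch is that this only works if you attach the high-cardinality fields (user IDs, tenant IDs, feature flags) to your spans in the first place — instrument them at the point where you have the context.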
When it earns its place: The first time you're debugging a production issue and you can trace a specific user's broken request through 8 services in 30 seconds instead of 3 hours of log grepping, you'll understand why this is worth paying for.
Honest limitation: Honeycomb's SLO and alerting UI is less mature than Grafana's. I use Grafana for SLO dashboards and Honeycomb for trace-based investigation.
Metrics: Prometheus + Grafana
What it is: Prometheus is the time-series metrics database. Grafana is the visualization layer.
Why this combination: It's the industry standard for a reason. Deep ecosystem (every tool has a Prometheus exporter), powerful query language (PromQL), and Grafana can visualize any data source — Prometheus, Loki, Tempo, InfluxDB, Postgres.
Setup for a Node.js service:
// metrics.ts — shared metrics configuration
import client from 'prom-client';
// Collect default Node.js metrics (memory, GC, event loop lag)
client.collectDefaultMetrics({ prefix: 'node_' });
// HTTP request metrics
export const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in ms',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
});
export const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});
// Expose metrics endpoint for Prometheus scraping (assumes an Express `app` in scope)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
Grafana dashboard essentials:
Every service should have:
- Request rate (req/s) over time
- Error rate (%) over time — alert when > 1%
- Latency p50/p95/p99 over time
- Error budget burn rate gauge
- Infrastructure metrics (CPU, memory, active connections)
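Most of these panels reduce to one PromQL query over the histogram defined above. Metric names match the prom-client code; the windows and label matchers are illustrative, not tuned:

```promql
# Request rate (req/s)
sum(rate(http_request_duration_ms_count[5m]))

# Error rate (%) — alert when > 1%
100 * sum(rate(http_request_duration_ms_count{status_code=~"5.."}[5m]))
    / sum(rate(http_request_duration_ms_count[5m]))

# p95 latency (ms), computed from histogram buckets
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le))
```

Note that `histogram_quantile` estimates the quantile from bucket boundaries, so pick buckets that bracket your SLO threshold — a p95 target of 500ms is useless if your nearest buckets are 250 and 2500.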
Managed vs self-hosted: For small teams, Grafana Cloud free tier covers most needs. For production with higher cardinality, either self-host on a $20/month VM or pay for Grafana Cloud's paid tier. Don't self-host Prometheus in production without dedicated DevOps time.
Logs: Loki + S3
What it is: Loki is a log aggregation system from Grafana Labs. Unlike Elasticsearch, it indexes only metadata (labels), not the full log content — which makes it dramatically cheaper at scale.
Why Loki over Elasticsearch/OpenSearch: Cost. Elasticsearch indexes everything, which means high storage and memory costs. Loki stores raw logs in S3 and only indexes labels. For equivalent log volume, Loki costs roughly 1/10th of Elasticsearch.
The tradeoff: Full-text search is slower on Loki. If you're searching for a specific string inside a log message, Loki has to grep through the raw storage rather than querying an index. For most debugging workflows this is fine. For compliance or security log analysis with frequent full-text searches, Elasticsearch is worth the cost.
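The tradeoff shows up directly in LogQL: label matchers are served from the index, while line filters have to scan raw chunks in object storage. Illustrative queries, assuming `service`, `environment`, and `level` labels:

```logql
# Fast: label matchers hit the index
{service="order-service", environment="production", level="error"}

# Slower: the |= line filter greps raw chunks in S3
{service="order-service"} |= "payment declined"
```

The practical rule: put low-cardinality routing fields (service, environment, level) in labels, and leave everything else in the log line for ad hoc filtering.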
Log shipping with Vector:
# vector.yaml
sources:
  app_logs:
    type: file
    include: ["/var/log/app/*.log"]
transforms:
  parse_json:
    type: remap
    inputs: ["app_logs"]
    source: |
      . = parse_json!(.message)
      .timestamp = now()
sinks:
  loki:
    type: loki
    inputs: ["parse_json"]
    endpoint: "http://loki:3100"
    labels:
      service: "{{ service }}"
      environment: "{{ environment }}"
      level: "{{ level }}"
Key principle: Log structured JSON from your application. If you're writing console.log("User logged in"), change it to logger.info({ event: "user.login", userId, ip }). Structured logs are queryable; unstructured logs are grep-able at best.
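To make the principle concrete, here's a minimal, dependency-free sketch of a structured logger — in practice you'd use Pino or Winston (as in the table at the end), but the essential behavior is just one JSON object per line on stdout, which Vector can then parse and ship:

```typescript
// Minimal structured logger sketch: one JSON object per line to stdout.
// Base fields (e.g. service name) are merged into every log line.
type Fields = Record<string, unknown>;

function makeLogger(base: Fields) {
  const log = (level: string, fields: Fields): string => {
    const line = JSON.stringify({
      level,
      timestamp: new Date().toISOString(),
      ...base,
      ...fields,
    });
    process.stdout.write(line + '\n');
    return line;
  };
  return {
    info: (fields: Fields) => log('info', fields),
    error: (fields: Fields) => log('error', fields),
  };
}

const logger = makeLogger({ service: 'order-service' });
logger.info({ event: 'user.login', userId: '123', ip: '10.0.0.1' });
// Emits something like:
// {"level":"info","timestamp":"...","service":"order-service","event":"user.login","userId":"123","ip":"10.0.0.1"}
```

Every field you emit here becomes a queryable dimension downstream; every string you interpolate into a message becomes something you can only grep for.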
Alerting: PagerDuty
What it is: On-call management and alert routing.
Why not just use Slack: Slack alerts get buried. PagerDuty wakes people up. When a P1 fires at 3 AM, you need a system that actually pages someone, escalates if they don't respond, and has clear runbook links. PagerDuty does this well.
Alert routing setup:
# pagerduty-rules.yaml (conceptual)
services:
  - name: "Order Service"
    escalation_policy: "Backend On-Call"
    alert_grouping:
      type: "time"
      timeout: 2  # Group alerts within 2 minutes
    integrations:
      - name: "Grafana"
        type: "events_api_v2"
      - name: "Honeycomb"
        type: "events_api_v2"
What to page on (P1/P2):
- Error rate > 5% for 5+ minutes
- p95 latency > 2x SLO for 10+ minutes
- Error budget burn rate > 14x (a 30-day budget would be gone in about 2 days)
- Complete service unavailability
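The 14x threshold isn't arbitrary: burn rate is the observed error rate divided by the rate your SLO allows, so a burn rate of 14x consumes a 30-day budget in 30/14 ≈ 2.1 days. A small sketch of the arithmetic (function names are mine):

```typescript
// Burn rate = observed error rate / error rate the SLO budget allows.
// E.g. a 99.9% SLO allows a 0.1% error rate; observing 1.44% errors
// means you're burning budget 14.4x faster than sustainable.
function burnRate(observedErrorRate: number, sloTarget: number): number {
  const budgetRate = 1 - sloTarget; // e.g. 0.001 for a 99.9% SLO
  return observedErrorRate / budgetRate;
}

// At a given burn rate, how long until the window's budget is gone?
function daysToExhaustion(rate: number, windowDays = 30): number {
  return windowDays / rate;
}

const rate = burnRate(0.0144, 0.999);
console.log(rate.toFixed(1), daysToExhaustion(rate).toFixed(1)); // → 14.4 2.1
```

This is why burn-rate alerts beat raw error-rate alerts: they automatically scale urgency to how fast you're actually spending budget.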
What to notify but not page on:
- Error rate 1-5% (investigate next business day)
- Latency elevation below SLO breach
- Cost anomaly (important but not urgent)
Frontend Observability: Replay.io
What it is: Session replay with DevTools-level debugging. Unlike FullStory or LogRocket, Replay.io records a deterministic replay — you can open DevTools on the recording and step through the JavaScript execution.
When I use it: When users report frontend bugs that I can't reproduce. I send affected users a link to record their session, they trigger the bug, and I get a full replay with console logs, network requests, and the ability to add console.logs retroactively.
Honest take: This is a premium debugging tool, not basic observability. Skip it if budget is tight. Use it if frontend bugs are a recurring pain point that's hard to reproduce.
Cost Observability
Infrastructure cost is an SLO, not a finance problem. Here's how I track it.
AWS Cost Anomaly Detection (free with AWS):
# Create monitor and alert via CLI
aws ce create-anomaly-monitor \
  --anomaly-monitor '{"MonitorName":"ProductMonitor","MonitorType":"DIMENSIONAL","MonitorDimension":"SERVICE"}'
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "CostSpike",
    "MonitorArnList": ["arn:aws:ce::123456789:anomalymonitor/XXXXX"],
    "Subscribers": [{"Address": "your-sns-topic-arn", "Type": "SNS"}],
    "Threshold": 20,
    "Frequency": "IMMEDIATE"
  }'
Tagging strategy for cost allocation:
// Every AWS resource gets these tags
const standardTags = {
  Product: 'promptlib',        // Which product
  Environment: 'production',   // Env
  Team: 'backend',             // Owning team
  CostCenter: 'engineering',   // Finance allocation
};
Tag every resource. Without tags, you get one bill with no breakdown. With tags, you get per-product, per-environment cost that you can trend and alert on.
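Once the tags exist (and are activated as cost allocation tags in the Billing console — they don't count until you do), Cost Explorer can split the bill by any of them. A sketch with the CLI; the date range and tag key are examples:

```shell
# Monthly spend broken down by the Product tag
aws ce get-cost-and-usage \
  --time-period Start=2024-05-01,End=2024-06-01 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=TAG,Key=Product
```

Run this weekly (or wire it into a dashboard) and per-product cost becomes a trend you watch, not a surprise at month end.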
The Minimum Viable Observability Stack
If you're starting from zero and have limited time:
| Priority | Tool | Cost | Time to set up |
|---|---|---|---|
| 1 | Structured logging (Pino/Winston → stdout) | Free | 1 hour |
| 2 | Error tracking (Sentry) | Free tier available | 30 minutes |
| 3 | Uptime monitoring (Better Uptime) | Free tier | 15 minutes |
| 4 | Prometheus metrics + Grafana Cloud | Free tier | 2 hours |
| 5 | Honeycomb tracing | Free tier (20GB/month) | 2 hours |
Start with 1-3. You can debug most production issues with structured logs and error tracking. Add metrics and tracing when you have multiple services or when debugging takes more than an hour on a regular basis.
Don't buy Datadog until you know exactly what you need from it. At $15-25/host/month plus per-metric and per-log-GB charges, it adds up fast for small teams. Prometheus + Grafana Cloud + Honeycomb costs a fraction and covers 90% of the same use cases.