Observability Tools: An Opinionated Stack
The tools I actually use for observability — logs, metrics, traces, and cost — with honest takes on when each earns its place and when you should skip it.
There are too many observability tools. Every vendor promises a single pane of glass, complete visibility, and AI-powered insights. Most of it is noise.
This is my actual stack — what I use, why I use it, and what I've found genuinely valuable vs what I've added out of best-practice FOMO.
The Core Principle
Before picking tools, establish the principle: buy observability, don't build it.
Observability infrastructure is not a differentiator. Your logging format isn't a competitive advantage. The time spent building custom dashboards is time not spent on the product.
Pick a managed stack that gives you:
- Structured log ingestion
- Metrics with alerting
- Distributed traces
- Cost you can afford as you scale
The tools below are what I've landed on after going through too many alternatives.
The Stack
flowchart LR
  subgraph Application
    A[Your Service] -->|OTel SDK| C[OpenTelemetry Collector]
  end
  subgraph Telemetry Pipeline
    C -->|Traces| H[Honeycomb]
    C -->|Metrics| P[Prometheus]
    C -->|Logs| L[Loki]
  end
  subgraph Visualization
    P --> G[Grafana]
    L --> G
    H --> H2[Honeycomb UI]
  end
  subgraph Alerting
    G -->|SLO alerts| PD[PagerDuty]
    H -->|Trace-based alerts| PD
    PD -->|Page| OC[On-call engineer]
  end
OpenTelemetry: The Foundation
What it is: The vendor-neutral standard for collecting traces, metrics, and logs from your application.
Why it matters: It's the exit ramp from vendor lock-in. Instrument your code once with OTel, then point the collector at any backend. When you want to switch from Datadog to Honeycomb (or vice versa), you change collector config, not application code.
Setup (Node.js):
// instrumentation.ts — load before everything else with NODE_OPTIONS
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { resourceFromAttributes } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME ?? 'unknown',
    [ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '0.0.0',
    environment: process.env.NODE_ENV ?? 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
    headers: { 'x-honeycomb-team': process.env.HONEYCOMB_API_KEY ?? '' },
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
  })],
});
sdk.start();
# package.json start script
NODE_OPTIONS='--require ./instrumentation.js' node dist/index.js
My take: Use OTel from day one, even if you're only pointing at one backend. The marginal cost is zero and the optionality is valuable.
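The vendor-swap claim above is really just collector configuration. Here's a sketch of a collector config wiring up the pipelines from the diagram — endpoints and exporter names are illustrative and depend on your collector distribution and version (the Loki exporter in particular has been shifting toward OTLP ingestion in recent contrib releases), so check your version's docs before copying:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/honeycomb:
    endpoint: "api.honeycomb.io:443"
    headers:
      x-honeycomb-team: "${HONEYCOMB_API_KEY}"
  prometheus:
    endpoint: "0.0.0.0:8889"   # scraped by Prometheus
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/honeycomb]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      exporters: [loki]
```

Switching backends means editing the `exporters` section and redeploying the collector; the application keeps speaking OTLP and never knows the difference.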
Traces: Honeycomb
What it is: Distributed tracing backend optimized for high-cardinality queries and exploratory debugging.
Why I use it over alternatives: Honeycomb's data model treats every trace event as a structured event you can query with any field. This means I can ask: "Show me traces where user_id=123 AND response_time > 2000ms AND error occurred in the payment service" — in real time, without pre-aggregating.
Datadog can do this too, but Honeycomb is 3-5x cheaper for the same query capability.
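To see why the event model matters, here's that query expressed over plain structured objects — a toy illustration of the concept, not Honeycomb's actual API. Every field on the event is a first-class filter, with no pre-declared dimensions:

```typescript
// Toy model: each trace span is a flat, structured event.
// Any field can be filtered on ad hoc, without pre-aggregation.
interface TraceEvent {
  traceId: string;
  service: string;
  userId: string;
  durationMs: number;
  error: boolean;
}

const events: TraceEvent[] = [
  { traceId: 'a1', service: 'payment', userId: '123', durationMs: 2400, error: true },
  { traceId: 'b2', service: 'payment', userId: '123', durationMs: 90, error: false },
  { traceId: 'c3', service: 'orders', userId: '456', durationMs: 3100, error: true },
];

// "user_id=123 AND response_time > 2000ms AND error in the payment service"
const matches = events.filter(
  (e) => e.userId === '123' && e.durationMs > 2000 && e.error && e.service === 'payment'
);

console.log(matches.map((e) => e.traceId)); // → [ 'a1' ]
```

The catch is that this only works if you attach the high-cardinality fields (user IDs, tenant IDs, feature flags) to your spans in the first place — instrument them at the point where you have the context.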
When it earns its place: The first time you're debugging a production issue and you can trace a specific user's broken request through 8 services in 30 seconds instead of 3 hours of log grepping, you'll understand why this is worth paying for.
Honest limitation: Honeycomb's SLO and alerting UI is less mature than Grafana's. I use Grafana for SLO dashboards and Honeycomb for trace-based investigation.
Metrics: Prometheus + Grafana
What it is: Prometheus is the time-series metrics database. Grafana is the visualization layer.
Why this combination: It's the industry standard for a reason. Deep ecosystem (every tool has a Prometheus exporter), powerful query language (PromQL), and Grafana can visualize any data source — Prometheus, Loki, Tempo, InfluxDB, Postgres.
Setup for a Node.js service:
// metrics.ts — shared metrics configuration
import client from 'prom-client';
// Collect default Node.js metrics (memory, GC, event loop lag)
client.collectDefaultMetrics({ prefix: 'node_' });
// HTTP request metrics
export const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in ms',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
});
export const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});
// Expose metrics endpoint for Prometheus scraping (assumes an Express `app` in scope)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
Grafana dashboard essentials:
Every service should have:
- Request rate (req/s) over time
- Error rate (%) over time — alert when > 1%
- Latency p50/p95/p99 over time
- Error budget burn rate gauge
- Infrastructure metrics (CPU, memory, active connections)
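Most of these panels reduce to one PromQL query over the histogram defined above. Metric names match the prom-client code; the windows and label matchers are illustrative, not tuned:

```promql
# Request rate (req/s)
sum(rate(http_request_duration_ms_count[5m]))

# Error rate (%) — alert when > 1%
100 * sum(rate(http_request_duration_ms_count{status_code=~"5.."}[5m]))
    / sum(rate(http_request_duration_ms_count[5m]))

# p95 latency (ms), computed from histogram buckets
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le))
```

Note that `histogram_quantile` estimates the quantile from bucket boundaries, so pick buckets that bracket your SLO threshold — a p95 target of 500ms is useless if your nearest buckets are 250 and 2500.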
Managed vs self-hosted: For small teams, Grafana Cloud free tier covers most needs. For production with higher cardinality, either self-host on a $20/month VM or pay for Grafana Cloud's paid tier. Don't self-host Prometheus in production without dedicated DevOps time.
Logs: Loki + S3
What it is: Loki is a log aggregation system from Grafana Labs. Unlike Elasticsearch, it indexes only metadata (labels), not the full log content — which makes it dramatically cheaper at scale.
Why Loki over Elasticsearch/OpenSearch: Cost. Elasticsearch indexes everything, which means high storage and memory costs. Loki stores raw logs in S3 and only indexes labels. For equivalent log volume, Loki costs roughly 1/10th of Elasticsearch.
The tradeoff: Full-text search is slower on Loki. If you're searching for a specific string inside a log message, Loki has to grep through the raw storage rather than querying an index. For most debugging workflows this is fine. For compliance or security log analysis with frequent full-text searches, Elasticsearch is worth the cost.
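The tradeoff shows up directly in LogQL: label matchers are served from the index, while line filters have to scan raw chunks in object storage. Illustrative queries, assuming `service`, `environment`, and `level` labels:

```logql
# Fast: label matchers hit the index
{service="order-service", environment="production", level="error"}

# Slower: the |= line filter greps raw chunks in S3
{service="order-service"} |= "payment declined"
```

The practical rule: put low-cardinality routing fields (service, environment, level) in labels, and leave everything else in the log line for ad hoc filtering.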
Log shipping with Vector:
# vector.yaml
sources:
  app_logs:
    type: file
    include: ["/var/log/app/*.log"]
transforms:
  parse_json:
    type: remap
    inputs: ["app_logs"]
    source: |
      . = parse_json!(.message)
      .timestamp = now()
sinks:
  loki:
    type: loki
    inputs: ["parse_json"]
    endpoint: "http://loki:3100"
    labels:
      service: "{{ service }}"
      environment: "{{ environment }}"
      level: "{{ level }}"
Key principle: Log structured JSON from your application. If you're writing console.log("User logged in"), change it to logger.info({ event: "user.login", userId, ip }). Structured logs are queryable; unstructured logs are grep-able at best.
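To make the principle concrete, here's a minimal, dependency-free sketch of a structured logger — in practice you'd use Pino or Winston (as in the table at the end), but the essential behavior is just one JSON object per line on stdout, which Vector can then parse and ship:

```typescript
// Minimal structured logger sketch: one JSON object per line to stdout.
// Base fields (e.g. service name) are merged into every log line.
type Fields = Record<string, unknown>;

function makeLogger(base: Fields) {
  const log = (level: string, fields: Fields): string => {
    const line = JSON.stringify({
      level,
      timestamp: new Date().toISOString(),
      ...base,
      ...fields,
    });
    process.stdout.write(line + '\n');
    return line;
  };
  return {
    info: (fields: Fields) => log('info', fields),
    error: (fields: Fields) => log('error', fields),
  };
}

const logger = makeLogger({ service: 'order-service' });
logger.info({ event: 'user.login', userId: '123', ip: '10.0.0.1' });
// Emits something like:
// {"level":"info","timestamp":"...","service":"order-service","event":"user.login","userId":"123","ip":"10.0.0.1"}
```

Every field you emit here becomes a queryable dimension downstream; every string you interpolate into a message becomes something you can only grep for.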
Alerting: PagerDuty
What it is: On-call management and alert routing.
Why not just use Slack: Slack alerts get buried. PagerDuty wakes people up. When a P1 fires at 3 AM, you need a system that actually pages someone, escalates if they don't respond, and has clear runbook links. PagerDuty does this well.
Alert routing setup:
# pagerduty-rules.yaml (conceptual)
services:
  - name: "Order Service"
    escalation_policy: "Backend On-Call"
    alert_grouping:
      type: "time"
      timeout: 2  # Group alerts within 2 minutes
    integrations:
      - name: "Grafana"
        type: "events_api_v2"
      - name: "Honeycomb"
        type: "events_api_v2"
What to page on (P1/P2):
- Error rate > 5% for 5+ minutes
- p95 latency > 2x SLO for 10+ minutes
- Error budget burn rate > 14x (a 30-day budget would be gone in about 2 days)
- Complete service unavailability
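The 14x threshold isn't arbitrary: burn rate is the observed error rate divided by the rate your SLO allows, so a burn rate of 14x consumes a 30-day budget in 30/14 ≈ 2.1 days. A small sketch of the arithmetic (function names are mine):

```typescript
// Burn rate = observed error rate / error rate the SLO budget allows.
// E.g. a 99.9% SLO allows a 0.1% error rate; observing 1.44% errors
// means you're burning budget 14.4x faster than sustainable.
function burnRate(observedErrorRate: number, sloTarget: number): number {
  const budgetRate = 1 - sloTarget; // e.g. 0.001 for a 99.9% SLO
  return observedErrorRate / budgetRate;
}

// At a given burn rate, how long until the window's budget is gone?
function daysToExhaustion(rate: number, windowDays = 30): number {
  return windowDays / rate;
}

const rate = burnRate(0.0144, 0.999);
console.log(rate.toFixed(1), daysToExhaustion(rate).toFixed(1)); // → 14.4 2.1
```

This is why burn-rate alerts beat raw error-rate alerts: they automatically scale urgency to how fast you're actually spending budget.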
What to notify but not page on:
- Error rate 1-5% (investigate next business day)
- Latency elevation below SLO breach
- Cost anomaly (important but not urgent)
Frontend Observability: Replay.io
What it is: Session replay with DevTools-level debugging. Unlike FullStory or LogRocket, Replay.io records a deterministic replay — you can open DevTools on the recording and step through the JavaScript execution.
When I use it: When users report frontend bugs that I can't reproduce. I send affected users a link to record their session, they trigger the bug, and I get a full replay with console logs, network requests, and the ability to add console.logs retroactively.
Honest take: This is a premium debugging tool, not basic observability. Skip it if budget is tight. Use it if frontend bugs are a recurring pain point that's hard to reproduce.
Cost Observability
Infrastructure cost is an SLO, not a finance problem. Here's how I track it.
AWS Cost Anomaly Detection (free with AWS):
# Create monitor and alert via CLI
aws ce create-anomaly-monitor \
  --anomaly-monitor '{"MonitorName":"ProductMonitor","MonitorType":"DIMENSIONAL","MonitorDimension":"SERVICE"}'
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "CostSpike",
    "MonitorArnList": ["arn:aws:ce::123456789:anomalymonitor/XXXXX"],
    "Subscribers": [{"Address": "your-sns-topic-arn", "Type": "SNS"}],
    "Threshold": 20,
    "Frequency": "IMMEDIATE"
  }'
Tagging strategy for cost allocation:
// Every AWS resource gets these tags
const standardTags = {
  Product: 'promptlib',        // Which product
  Environment: 'production',   // Env
  Team: 'backend',             // Owning team
  CostCenter: 'engineering',   // Finance allocation
};
Tag every resource. Without tags, you get one bill with no breakdown. With tags, you get per-product, per-environment cost that you can trend and alert on.
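Once the tags exist (and are activated as cost allocation tags in the Billing console — they don't count until you do), Cost Explorer can split the bill by any of them. A sketch with the CLI; the date range and tag key are examples:

```shell
# Monthly spend broken down by the Product tag
aws ce get-cost-and-usage \
  --time-period Start=2024-05-01,End=2024-06-01 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=TAG,Key=Product
```

Run this weekly (or wire it into a dashboard) and per-product cost becomes a trend you watch, not a surprise at month end.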
The Minimum Viable Observability Stack
If you're starting from zero and have limited time:
| Priority | Tool | Cost | Time to set up |
|---|---|---|---|
| 1 | Structured logging (Pino/Winston → stdout) | Free | 1 hour |
| 2 | Error tracking (Sentry) | Free tier available | 30 minutes |
| 3 | Uptime monitoring (Better Uptime) | Free tier | 15 minutes |
| 4 | Prometheus metrics + Grafana Cloud | Free tier | 2 hours |
| 5 | Honeycomb tracing | Free tier (20GB/month) | 2 hours |
Start with 1-3. You can debug most production issues with structured logs and error tracking. Add metrics and tracing when you have multiple services or when debugging takes more than an hour on a regular basis.
Don't buy Datadog until you know exactly what you need from it. At $15-25/host/month plus per-metric and per-log-GB charges, it adds up fast for small teams. Prometheus + Grafana Cloud + Honeycomb costs a fraction and covers 90% of the same use cases.