Awesome Observability Tools

A curated, opinionated list of tools and resources dedicated to Observability as an engineering capability — not just monitoring.

Observability is the ability to understand the internal state of a system by examining its outputs — metrics, logs, traces, and profiles — in order to explain why a system behaves the way it does, including in ways you didn’t anticipate.

This list focuses on modern observability practices while acknowledging legacy and operations-oriented tools when they meaningfully contribute to system understanding — because most real-world environments are hybrid.


🎯 Scope & Intent

This list is curated with the following principles:

  • Observability as an engineering capability, not a product to buy
  • Preference for field-proven tools over purely theoretical or early-stage solutions
  • Explicit consideration of system complexity, scale, cost, and operational constraints
  • Recognition that OpenTelemetry is becoming the standard instrumentation layer, while the surrounding ecosystem remains diverse
  • Awareness that AI is reshaping observability — from anomaly detection to natural-language querying

The goal is to help practitioners:

  • Choose tools adapted to their context, scale, and maturity
  • Combine tools coherently rather than stack them blindly
  • Build observability systems that actually support debugging, diagnosis, and decision-making

🧭 How to Read This List

Tools are organized by technical building blocks, but observability problems are usually expressed in terms of intent. A complementary reading by intent is recommended:

🔎 Observability by Intent

Intent | Building blocks | Key tools
Golden Signals / RED / USE | Metrics, dashboards, alerting | Prometheus, Grafana, VictoriaMetrics, Alertmanager
Latency propagation & dependency analysis | Distributed tracing | OpenTelemetry, Jaeger, Tempo, Zipkin
High-cardinality debugging & unknown-unknowns | Event-based observability, rich context | OpenTelemetry, tracing backends, structured logging
Low-overhead production profiling | Continuous profiling, eBPF | Parca, Pyroscope, async-profiler, bpftrace
Black-box & legacy systems | Metrics scraping, log-based analysis | StatsD, Graphite, Zabbix, Checkmk
Cost-efficient observability at scale | Telemetry pipelines, sampling | OTel Collector, Vector, Fluent Bit

Legend

Symbol | Meaning
⭐ | Widely adopted / reference solution
🟢 | Actively maintained
🔵 | Cloud-native / Kubernetes-friendly
🟠 | Commercial / paid offering
📚 | Extensive documentation
🚀 | High performance / low overhead
🧠 | Observability-first philosophy
🧰 | Operations heritage / legacy-friendly

Metrics Collection & Time-Series Storage

Scrapers, collectors, and time-series databases — the foundation of quantitative observability.

  • Prometheus ⭐🟢🔵📚🧠 — The de facto standard for cloud-native metrics. Pull-based model, dimensional data model, and PromQL. Excels at service-level monitoring; limited by single-node storage for very large deployments. [Go] [Apache-2.0] — GitHub
  • VictoriaMetrics ⭐🟢🚀🧠 — High-performance, cost-efficient Prometheus-compatible TSDB. Handles significantly higher cardinality and longer retention than vanilla Prometheus. Excellent choice when Prometheus query compatibility matters but scale exceeds single-node limits. [Go] [Apache-2.0] — GitHub
  • Thanos ⭐🟢🔵 — Adds long-term storage, global query view, and high availability to Prometheus. Sidecar architecture lets you keep existing Prometheus deployments while adding horizontal scale. [Go] [Apache-2.0] — GitHub
  • Mimir ⭐🟢🔵🚀 — Grafana’s horizontally scalable, highly available Prometheus-compatible TSDB. Designed from the ground up for multi-tenant, large-scale deployments. [Go] [AGPL-3.0] — GitHub
  • InfluxDB 🟢🟠 — Purpose-built time-series database with high write throughput. Strong ecosystem. v1 and v2 are written in Go; InfluxDB 3 Core uses a Rust-based engine and is available under permissive MIT/Apache-2.0 licensing. [Go/Rust] [MIT/Apache-2.0/Commercial] — GitHub
  • OpenTelemetry Collector ⭐🟢🔵🧠 — Vendor-neutral telemetry collection, processing, and export pipeline. The backbone of modern instrumentation architectures. [Go] [Apache-2.0] — GitHub
  • Grafana Alloy ⭐🟢🔵🧠 — OpenTelemetry-native telemetry collector from Grafana Labs (successor to Grafana Agent). Supports metrics, logs, traces, and profiles. Native integration with the Grafana stack. [Go] [Apache-2.0] — GitHub
  • Telegraf 🟢 — Plugin-driven agent for collecting and reporting metrics. 300+ input plugins make it versatile for heterogeneous environments. [Go] [MIT] — GitHub
  • StatsD 🧰 — Lightweight, UDP-based metrics aggregation daemon. Simple protocol, widely supported by applications. Still relevant in legacy environments. [Node.js] [MIT]
  • Graphite 🧰 — One of the original time-series storage and graphing systems. Whisper backend, Carbon collector. Historical significance but limited compared to modern alternatives. [Python] [Apache-2.0] — GitHub
  • Netdata ⭐🟢🚀 — Real-time, per-second system and application monitoring with built-in anomaly detection. Zero-configuration agent with impressive out-of-the-box dashboards. [C] [GPL-3.0] — GitHub
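
The Prometheus entry above mentions a pull-based, dimensional data model. As a rough illustration, the sketch below renders a counter in the Prometheus text exposition format, where each sample is a metric name plus a set of label pairs. The helper names (`render_sample`, `render_counter`) and the example metric are hypothetical; a real service would normally use an official client library rather than hand-rolled formatting.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict, value: float) -> str:
    """Render one sample line: name{label="value",...} value."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


def render_counter(name: str, help_text: str, samples: list) -> str:
    """Render HELP/TYPE metadata followed by one line per label combination.

    `samples` is a list of (labels_dict, value) tuples (hypothetical shape).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    lines += [render_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # What a scraper would see when it pulls the /metrics endpoint.
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests handled.",
        [({"method": "GET", "status": "200"}, 1027),
         ({"method": "POST", "status": "500"}, 3)],
    ))
```

Each distinct label combination becomes its own time series, which is why high-cardinality labels (user IDs, request IDs) drive storage cost in the TSDBs listed above.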

Distributed Tracing

Request-level visibility across service boundaries — essential for understanding latency, dependencies, and failure propagation in distributed systems.

  • OpenTelemetry ⭐🟢🔵📚🧠 — The emerging open standard for instrumenting distributed traces, metrics, and logs. Language-specific SDKs, auto-instrumentation agents, and the Collector form a complete pipeline. If you’re starting today, start here. [Multiple] [Apache-2.0] — GitHub
  • Jaeger ⭐🟢🔵📚 — CNCF graduated distributed tracing backend and UI. Mature, well-documented, strong Kubernetes integration. Originally from Uber. [Go] [Apache-2.0] — GitHub
  • Grafana Tempo ⭐🟢🔵 — High-scale, cost-efficient tracing backend that requires only object storage (no indexing infrastructure). Pairs naturally with Grafana, Loki, and Mimir. [Go] [AGPL-3.0] — GitHub
  • Zipkin 🟢📚 — One of the pioneering distributed tracing systems (Twitter, 2012). Still actively maintained with a loyal community. Simpler architecture than Jaeger, good for smaller deployments. [Java] [Apache-2.0] — GitHub
  • Apache SkyWalking ⭐🟢🔵 — Full observability platform with strong tracing capabilities. Popular in the Java/JVM ecosystem. Auto-instrumentation via bytecode injection. [Java] [Apache-2.0] — GitHub
  • SigNoz 🟢🔵🧠 — Open-source observability platform built natively on OpenTelemetry. Unified metrics, traces, and logs in a single UI. ClickHouse-backed storage. Strong alternative to commercial APM. [Go/TypeScript] [ELv2/MIT] — GitHub
  • Pinpoint 🧰 — Bytecode-instrumentation-based APM and tracing for Java and PHP. Zero-code-change approach. Popular in Korean and Asian enterprise environments. [Java] [Apache-2.0] — GitHub

Log Management & Log Pipelines

Collection, processing, indexing, and analysis of log data — still the most universal telemetry signal.

  • Grafana Loki ⭐🟢🔵📚🧠 — Label-based log aggregation that indexes metadata, not content. Dramatically cheaper than full-text indexing at scale. Pairs with Grafana for exploration. [Go] [AGPL-3.0] — GitHub
  • Fluent Bit ⭐🟢🔵🚀 — Lightweight, high-performance log processor and forwarder designed for edge and containerized environments. Tiny memory footprint. [C] [Apache-2.0] — GitHub
  • Fluentd 🟢🔵 — CNCF graduated unified logging layer with 1000+ plugins. Heavier than Fluent Bit but more flexible for complex routing. [Ruby/C] [Apache-2.0] — GitHub
  • Vector 🟢🚀🧠 — High-performance observability data pipeline for logs, metrics, and traces. Built in Rust for reliability and throughput. Excellent for consolidating telemetry pipelines. [Rust] [MPL-2.0] — GitHub
  • Elasticsearch ⭐🟢🟠🧰 — Distributed search and analytics engine. Powerful full-text search, but storage costs and operational complexity can be significant at scale. Relicensed from Apache-2.0 to SSPL/Elastic License in 2021; AGPL-3.0 was added as a further option in 2024. [Java] [AGPL-3.0/SSPL/ELv2] — GitHub
  • OpenSearch 🟢🔵 — Community-driven fork of Elasticsearch 7.10, created after the license change. Started by AWS and now governed under the Linux Foundation; Apache-2.0 licensed. Largely API-compatible with Elasticsearch 7.10, though the two projects have diverged since the fork. [Java] [Apache-2.0] — GitHub
  • Logstash 🧰 — Flexible log ingestion and transformation pipeline. Part of the Elastic Stack. Heavy JVM footprint. [Java] [Apache-2.0/Elastic License] — GitHub
  • Graylog 🟢🟠🧰 — Centralized log management with built-in alerting and dashboards. Good for teams that want a self-contained log platform. [Java] [SSPL/Commercial] — GitHub
  • rsyslog 🟢🚀🧰 — High-performance system logging daemon. Handles millions of messages per second. Essential in Linux infrastructure. [C] [GPL-3.0] — GitHub
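
The Loki entry above contrasts indexing metadata with indexing content. The toy model below shows the idea: only label sets are indexed, and the log lines themselves are filtered by brute force after stream selection. This is an illustration under simplifying assumptions, not Loki's implementation; the class name and methods are hypothetical.

```python
from collections import defaultdict


class LabelIndexedLogStore:
    """Toy log store: the index covers label sets only, never line content."""

    def __init__(self):
        # Key: frozenset of (label, value) pairs identifying one stream.
        self._streams = defaultdict(list)

    def push(self, labels: dict, line: str) -> None:
        """Append a line to the stream identified by its full label set."""
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict, contains: str = "") -> list:
        """Select streams by label match, then scan their lines for a substring."""
        wanted = set(selector.items())
        results = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # cheap: index lookup on labels
                results.extend(l for l in lines if contains in l)  # costly: scan
        return results


if __name__ == "__main__":
    store = LabelIndexedLogStore()
    store.push({"app": "api", "env": "prod"}, "error: upstream timeout")
    store.push({"app": "api", "env": "prod"}, "request served in 12ms")
    store.push({"app": "worker", "env": "prod"}, "error: job failed")
    print(store.query({"app": "api"}, contains="error"))
```

The trade-off is visible in `query`: storage and ingestion stay cheap because nothing is tokenized, while content searches cost a scan over every selected stream, so narrow label selectors matter.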

Observability Pipelines & Telemetry Processing

Transport, transformation, sampling, and routing of observability data — increasingly critical as telemetry volumes grow and costs need control.

  • OpenTelemetry Collector ⭐🟢🔵🧠 — The standard telemetry processing pipeline. Receivers, processors, and exporters for any signal to any backend. Supports tail-based sampling, attribute enrichment, and routing. [Go] [Apache-2.0] — GitHub
  • Vector 🟢🚀🧠 — End-to-end observability data routing and transformation. Programmable transforms (VRL language), strong at log-to-metric conversion and pipeline consolidation. [Rust] [MPL-2.0] — GitHub
  • Fluent Bit / Fluentd 🟢 — Log and telemetry forwarding pipelines with extensive plugin ecosystems. [Apache-2.0]
  • Logstash 🧰 — ETL-style processing for observability data. Powerful filter plugins but resource-intensive. [Java]
  • Cribl Stream 🟠🚀 — Commercial observability pipeline for routing, reducing, and enriching telemetry data before it reaches backends. Strong ROI story for organizations with high telemetry costs. [Commercial]
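
Tail-based sampling, mentioned in the OpenTelemetry Collector entry above, decides whether to keep a trace only after all of its spans have been seen. The sketch below shows one plausible policy: always keep traces with errors or slow spans, and keep a small random share of the rest. The `Span` type, thresholds, and function name are assumptions made for illustration; this is not the Collector's processor or its configuration model.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def tail_sample(spans, latency_threshold_ms=500.0, baseline_rate=0.1, rng=None):
    """Return the trace IDs to keep, deciding per complete trace.

    Thresholds are hypothetical defaults, not recommendations.
    """
    rng = rng or random.Random()

    # Buffer spans by trace so the decision sees the whole trace.
    traces = {}
    for span in spans:
        traces.setdefault(span.trace_id, []).append(span)

    kept = []
    for trace_id, trace_spans in traces.items():
        has_error = any(s.is_error for s in trace_spans)
        is_slow = max(s.duration_ms for s in trace_spans) >= latency_threshold_ms
        # Interesting traces are always kept; the rest are sampled at random.
        if has_error or is_slow or rng.random() < baseline_rate:
            kept.append(trace_id)
    return kept


if __name__ == "__main__":
    spans = [
        Span("t1", 40.0),
        Span("t1", 35.0),
        Span("t2", 900.0),
        Span("t3", 20.0, is_error=True),
    ]
    print(tail_sample(spans, baseline_rate=0.0))
```

Because every span of a trace has to be buffered until the decision is made, this approach trades memory in the pipeline for lower storage cost in the backend.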

Visualization & Dashboards

Exploration, visualization, and correlation of observability data — where signals become understanding.

  • Grafana ⭐🟢📚🧠 — The de facto standard for observability dashboards. Supports 100+ data sources, alerting, annotations, and increasingly sophisticated exploration features. The center of gravity for open-source observability UIs. [TypeScript/Go] [AGPL-3.0] — GitHub
  • Kibana 🟢🟠🧰 — Visualization and exploration for Elasticsearch data. Powerful for log exploration (Discover, Lens). Part of the Elastic Stack. [TypeScript] [AGPL-3.0/SSPL/ELv2] — GitHub
  • OpenSearch Dashboards 🟢🔵 — Open-source fork of Kibana for OpenSearch. [TypeScript] [Apache-2.0] — GitHub
  • Apache Superset 🟢 — Scalable analytics and dashboarding platform. SQL-first, strong at ad-hoc data exploration. [Python] [Apache-2.0] — GitHub
  • Redash 🧰 — SQL-first data visualization and collaboration. Connects to many data sources. Minimal maintenance since Databricks acquisition. [Python] [BSD-2-Clause] — GitHub
  • Perses 🟢🔵🧠 — CNCF sandbox project for dashboards-as-code. Native PromQL and TraceQL support. Designed for GitOps-driven observability. [Go/TypeScript] [Apache-2.0] — GitHub

Profiling & Continuous Performance Analysis

Always-on, low-overhead profiling in production — the emerging “fourth pillar” of observability alongside metrics, logs, and traces.

  • Parca ⭐🟢🔵🧠 — eBPF-based continuous profiling platform. Zero-instrumentation, always-on profiling with differential flame graphs. CNCF sandbox project. [Go] [Apache-2.0] — GitHub
  • Grafana Pyroscope ⭐🟢🔵🧠 — Continuous profiling with flame graph visualization. Supports multiple languages. Integrates naturally with the Grafana stack. [Go] [AGPL-3.0] — GitHub
  • async-profiler 🟢🚀 — Low-overhead sampling profiler for JVM. Captures CPU, allocation, lock contention, and wall-clock profiles. The reference tool for Java performance analysis. [Java/C++] [Apache-2.0]
  • perf 🧰🚀 — Linux kernel performance analysis tool. Hardware counters, tracepoints, and sampling. Foundational for system-level performance work. [C] [GPL-2.0]
  • bpftrace 🟢🚀🧠 — High-level tracing language for Linux eBPF. One-liners and scripts for dynamic kernel and user-space tracing. Invaluable for ad-hoc production investigation. [C++] [Apache-2.0] — GitHub
  • bcc (BPF Compiler Collection) 🟢🚀 — Toolkit for creating eBPF-based tracing and networking programs. Includes dozens of ready-to-use tools (execsnoop, biolatency, tcplife, etc.). [C/Python] [Apache-2.0]
  • Grafana Beyla 🟢🔵🧠🚀 — eBPF-based auto-instrumentation for HTTP and gRPC services. Zero-code, zero-configuration application observability. Generates RED metrics and distributed traces without SDK integration. [Go] [Apache-2.0] — GitHub
  • Perfetto 🟢 — System-wide tracing and profiling toolkit from Google. Designed for Android and Chrome but increasingly used for general system analysis. [C++] [Apache-2.0] — GitHub
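
Several entries above mention sampling profilers and differential flame graphs. A minimal sketch of the underlying data shape: stack samples are collapsed into the semicolon-separated "folded" format that flame graph tooling commonly consumes, and two profiles are compared by subtracting sample counts. Function names are hypothetical, and the sample stacks are made up for illustration.

```python
from collections import Counter


def fold_stacks(samples) -> Counter:
    """Collapse stack samples (root-to-leaf frame lists) into folded counts."""
    return Counter(";".join(stack) for stack in samples)


def to_folded_text(folded) -> str:
    """Render one 'frame;frame;frame count' line per unique stack."""
    return "\n".join(f"{stack} {count}" for stack, count in sorted(folded.items()))


def diff_profiles(before, after) -> dict:
    """Per-stack change in sample count; positive means more time after."""
    stacks = set(before) | set(after)
    return {s: after.get(s, 0) - before.get(s, 0) for s in stacks}


if __name__ == "__main__":
    baseline = fold_stacks([
        ["main", "handle", "parse"],
        ["main", "handle", "parse"],
        ["main", "gc"],
    ])
    candidate = fold_stacks([
        ["main", "handle", "parse"],
        ["main", "gc"],
        ["main", "gc"],
        ["main", "gc"],
    ])
    print(to_folded_text(baseline))
    print(diff_profiles(baseline, candidate))
```

With a fixed sampling interval, each count is proportional to time spent in that stack, which is what lets an always-on profiler stay cheap while still answering "where did the time go".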

Alerting & Incident Response

Alert management, on-call workflows, and incident coordination — the operational bridge between observability and action.

  • Alertmanager ⭐🟢📚 — Prometheus-native alert handling with grouping, silencing, inhibition, and routing. [Go] [Apache-2.0] — GitHub
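  • Grafana OnCall 🔵 — Open-source on-call management and alert routing. Integrates natively with Grafana alerting. Grafana Labs moved the open-source edition to maintenance mode in 2025, so check its current status before adopting. [Python] [AGPL-3.0] — GitHub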
  • Keep 🟢🔵🧠 — Open-source alert management platform. Consolidates alerts from multiple sources with workflow automation. [Python] [MIT] — GitHub
  • Alerta 🟢 — Unified alert correlation and management. Consolidates alerts from multiple monitoring systems. [Python] [Apache-2.0] — GitHub
  • PagerDuty 🟠 — Industry-standard incident response and on-call management platform. [Commercial]
  • Opsgenie 🟠 — Alerting and escalation platform. Part of Atlassian suite. [Commercial]
  • Rootly 🟠🧠 — AI-assisted incident management with automated timelines and postmortem generation. [Commercial]
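
Grouping and silencing, named in the Alertmanager entry above, can be pictured as two small operations on labeled alerts. The sketch below groups alerts by a chosen subset of labels and checks them against equality-only silences. The alert dictionaries and function names are hypothetical, and real Alertmanager matchers also support regular expressions and negation, which are omitted here.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts that share the same values for the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def is_silenced(alert, silences):
    """True if any silence matches; a silence is a dict of label == value matchers."""
    return any(
        all(alert["labels"].get(label) == value for label, value in matchers.items())
        for matchers in silences
    )


if __name__ == "__main__":
    alerts = [
        {"labels": {"alertname": "HighLatency", "cluster": "eu", "instance": "a"}},
        {"labels": {"alertname": "HighLatency", "cluster": "eu", "instance": "b"}},
        {"labels": {"alertname": "DiskFull", "cluster": "us", "instance": "c"}},
    ]
    active = [a for a in alerts if not is_silenced(a, [{"cluster": "us"}])]
    for key, members in group_alerts(active, ["alertname", "cluster"]).items():
        # One notification per group instead of one per instance.
        print(key, len(members))
```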

Observability Platforms (Integrated)

Full-stack platforms that combine metrics, logs, traces, and often profiling — trading flexibility for integration and convenience.

  • Datadog 🟠🧠 — Comprehensive SaaS observability platform with AI-powered features (Watchdog anomaly detection, automated root-cause analysis). Strong breadth, premium pricing. [Commercial]
  • Dynatrace 🟠🧠 — AI-driven observability with automatic topology discovery and root-cause analysis (Davis AI). Strong in enterprise and complex Java environments. [Commercial]
  • New Relic 🟠🧠 — Developer-centric observability with a generous free tier. NRQL query language, strong APM heritage. [Commercial]
  • Splunk Observability 🟠🧰 — Observability built on Splunk’s machine data analytics platform. Strong for organizations already invested in Splunk. [Commercial]
  • Elastic Observability 🟠🧰 — Observability solution built on the Elastic Stack (Elasticsearch, Kibana, APM). Self-managed and cloud options. [Commercial]
  • Honeycomb 🟠🧠 — Observability platform built around high-cardinality, high-dimensionality event data. Pioneers of the “observability vs. monitoring” distinction. BubbleUp feature for automated correlation. [Commercial]
  • Grafana Cloud 🟠🧠 — Managed Grafana stack (Mimir, Loki, Tempo, Pyroscope) with a generous free tier. Best of open-source with SaaS convenience. [Commercial]
  • Instana (IBM) 🟠🧠 — Automatic infrastructure and application discovery with real-time observability. Strong in containerized and microservice environments. [Commercial]
  • AppDynamics (Splunk/Cisco) 🟠🧰 — Enterprise APM with business transaction monitoring and code-level diagnostics. Merged into Splunk in 2025. [Commercial]
  • Chronosphere 🟠🧠 — Cloud-native observability platform focused on metrics at scale. Founded by Uber M3 creators. Strong cost control and cardinality management. [Commercial]
  • Lightstep / ServiceNow Cloud Observability 🟠🧠 — OpenTelemetry-native observability platform, now part of ServiceNow. [Commercial]

Monitoring Suites (Operations-Oriented)

Infrastructure-first and legacy monitoring systems — still widely deployed and relevant in hybrid environments.

  • Zabbix 🟢🧰 — Enterprise-grade monitoring platform with agent-based and agentless monitoring. Mature, highly configurable, strong in traditional infrastructure. [C] [GPL-2.0] — GitHub
  • Nagios 🟢🧰 — The grandfather of open-source monitoring. Check-based architecture. Enormous plugin ecosystem but showing its age. [C] [GPL-2.0] — GitHub
  • Icinga 🟢🧰 — Modern evolution of Nagios with better APIs, configuration management, and scalability. [C++] [GPL-2.0] — GitHub
  • Checkmk 🟢🟠🧰 — Infrastructure and application monitoring with auto-discovery. Scales well for large enterprise environments. [Python/C++] [GPL-2.0/Commercial] — GitHub

Service Mesh Observability

Observability primitives integrated into service mesh infrastructure — L7 visibility without application-level instrumentation.

  • Kiali 🟢🔵 — Observability console for Istio service mesh. Topology visualization, traffic flow, and health analysis. [Go] [Apache-2.0] — GitHub
  • Linkerd Viz 🟢🔵 — Built-in telemetry and dashboard for Linkerd service mesh. Lightweight, opinionated. [Go] [Apache-2.0] — GitHub
  • Hubble 🟢🔵🚀 — eBPF-powered network observability for Cilium. L3/L4/L7 flow visibility, DNS monitoring, and service dependency mapping — all without sidecars. [Go] [Apache-2.0] — GitHub

Database Observability

Query-level performance visibility for databases — often the critical path in application performance.

  • PMM (Percona Monitoring and Management) 🟢 — Open-source database performance monitoring for MySQL, PostgreSQL, and MongoDB. Query analytics, slow query analysis. [Go] [AGPL-3.0] — GitHub
  • pgwatch 🟢 — PostgreSQL-specific monitoring and metrics collection. [Go] [BSD-3-Clause]
  • pg_stat_monitor 🟢 — PostgreSQL extension for enhanced query performance monitoring. More granular than pg_stat_statements. [C] [Apache-2.0]
  • VividCortex / SolarWinds DPM 🟠 — SaaS query-level database performance monitoring. [Commercial]
  • Datadog DBM 🟠🧠 — Database monitoring with query-level explain plans, wait event analysis, and trace correlation. [Commercial]

Real User Monitoring (RUM) & Frontend Observability

Understanding performance as experienced by actual users — where infrastructure metrics end and user experience begins.

  • Sentry 🟢🧠 — Error tracking and performance monitoring for frontend and backend. Session replay, Web Vitals, and release health tracking. Relicensed from BSL to the Functional Source License (FSL) in 2023. [Python] [FSL/Commercial] — GitHub
  • Grafana Faro 🟢🔵🧠 — Open-source frontend observability SDK. Captures errors, performance, and user events, sends to Grafana stack. [TypeScript] [Apache-2.0] — GitHub
  • OpenTelemetry Browser SDK 🟢🧠 — OTel instrumentation for web applications. Captures page loads, resource timings, and user interactions. [TypeScript] [Apache-2.0]
  • LogRocket 🟠 — Session replay combined with frontend performance monitoring. [Commercial]
  • Plausible 🟢🧰 — Lightweight, privacy-friendly web analytics. [Elixir] [AGPL-3.0] — GitHub
  • Matomo 🟢🧰 — Privacy-focused, self-hosted web analytics. GDPR-friendly Google Analytics alternative. [PHP] [GPL-3.0] — GitHub

AI-Augmented Observability

Tools and capabilities that apply machine learning and AI to observability data — reducing noise, accelerating diagnosis, and enabling predictive operations.

This is an emerging and fast-moving space. The tools below represent current capabilities, not a stable category.

  • Dynatrace Davis AI 🟠🧠 — Deterministic and causal AI for automatic root-cause analysis. Topology-aware, goes beyond statistical correlation. [Commercial]
  • Datadog Watchdog 🟠🧠 — ML-driven anomaly detection across metrics, logs, and APM data. Automatic story generation for correlated anomalies. [Commercial]
  • Moogsoft 🟠🧠 — AIOps platform for alert correlation, noise reduction, and incident clustering. [Commercial]
  • New Relic AI 🟠🧠 — Applied intelligence with anomaly detection, incident correlation, and natural-language querying (NRAI). [Commercial]
  • Honeycomb BubbleUp 🟠🧠 — Automated outlier correlation across high-cardinality dimensions. Helps identify “what’s different” about slow requests without manual hypotheses. [Commercial]
  • Coroot 🟢🔵🧠 — Open-source eBPF-powered observability with automated service map discovery and anomaly detection. [Go] [Apache-2.0] — GitHub
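
Outlier correlation of the kind described in the BubbleUp entry above can be approximated by comparing how often each attribute value appears among slow requests versus a baseline. The sketch below is a deliberately naive version of that idea, not Honeycomb's algorithm; the function name and the event dictionaries are hypothetical.

```python
from collections import Counter


def attribute_skew(outliers, baseline, top_n=3):
    """Rank (attribute, value) pairs by how over-represented they are in outliers.

    Each event is a flat dict of hashable attribute values (assumed shape).
    """
    def shares(events):
        counts = Counter((key, val) for event in events for key, val in event.items())
        total = len(events) or 1
        return {pair: n / total for pair, n in counts.items()}

    outlier_share = shares(outliers)
    baseline_share = shares(baseline)
    diffs = {
        pair: share - baseline_share.get(pair, 0.0)
        for pair, share in outlier_share.items()
    }
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)[:top_n]


if __name__ == "__main__":
    slow = [{"region": "eu", "version": "2.1"}, {"region": "us", "version": "2.1"}]
    normal = [{"region": "eu", "version": "2.0"}, {"region": "us", "version": "2.0"}]
    # The version attribute separates slow requests from normal ones; region does not.
    print(attribute_skew(slow, normal, top_n=1))
```

This only works when events carry many attributes per request, which is the practical argument for wide, high-cardinality events over pre-aggregated metrics.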

SLO Management

Defining, tracking, and alerting on Service Level Objectives — the bridge between observability data and reliability commitments.

  • Sloth 🟢🔵🧠 — SLO generation for Prometheus. Define SLOs in YAML, generates multi-window multi-burn-rate alerts automatically. [Go] [Apache-2.0] — GitHub
  • Pyrra 🟢🔵🧠 — SLO management and alerting with a web UI. Kubernetes-native, generates Prometheus recording rules and alerts from SLO definitions. [Go] [Apache-2.0]
  • OpenSLO 🟢🧠 — Open specification for defining SLOs as code. Vendor-neutral, enables GitOps-driven SLO management. [YAML] [Apache-2.0] — GitHub
  • Nobl9 🟠🧠 — Enterprise SLO platform connecting multiple data sources to unified SLO tracking and error budget management. [Commercial]
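
The multi-window, multi-burn-rate alerts mentioned in the Sloth entry above rest on a simple ratio: how fast the error budget is being consumed relative to the rate that would exactly exhaust it over the SLO period. The sketch below assumes the commonly cited 14.4 threshold (2% of a 30-day budget spent in one hour) from the Google SRE Workbook; the function names are hypothetical, and the rules that Sloth or Pyrra actually generate may use different windows and thresholds.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window (for example 1h) shows the problem is significant;
    the short window (for example 5m) shows it is still happening.
    The 14.4 default is an assumed threshold, not a universal constant.
    """
    return (burn_rate(long_window_error_ratio, slo_target) > threshold
            and burn_rate(short_window_error_ratio, slo_target) > threshold)


if __name__ == "__main__":
    slo = 0.999  # 99.9% availability target, so the budget is 0.1% of requests
    print(burn_rate(0.02, slo))            # roughly 20x the sustainable rate
    print(should_page(0.02, 0.03, slo))    # both windows hot
    print(should_page(0.02, 0.0005, slo))  # already recovered in the short window
```

Requiring both windows keeps alerts from firing long after an incident has ended, which is the main weakness of a single long-window rule.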

Synthetic Monitoring

Proactive monitoring from the outside — validating availability, performance, and correctness before users are affected.

  • Checkly 🟢🔵🔗🧠 — Monitoring as code for APIs and browsers. Playwright-based synthetic checks with CI/CD integration. [TypeScript] [Free tier/Commercial] — GitHub
  • Grafana Synthetic Monitoring 🟢🔵 — Probe-based synthetic monitoring integrated into Grafana Cloud. Multi-location HTTP, DNS, TCP, and ICMP checks. [Commercial]
  • Uptime Kuma ⭐🟢🧪 — Self-hosted monitoring tool with a clean UI. HTTP, TCP, DNS, and keyword monitoring with notifications. Simple and effective. [JavaScript] [MIT] — GitHub
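
A synthetic check of the kind these tools run boils down to: request the endpoint, then assert on status, content, and latency. The sketch below uses only the standard library; the URL, keyword, and latency budget are placeholders, and `run_check` is a hypothetical helper rather than the API of any tool listed above. The fetch function is injectable so the logic can be exercised without network access.

```python
import time
import urllib.request


def http_fetch(url: str, timeout: float = 5.0):
    """Fetch a URL and return (status_code, body_text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def run_check(url: str, keyword: str, max_latency_s: float = 2.0, fetch=http_fetch):
    """Run one availability, correctness, and latency check against `url`."""
    started = time.monotonic()
    try:
        status, body = fetch(url)
    except Exception as exc:  # DNS failure, timeout, HTTP error status, ...
        return {"ok": False, "reason": f"request failed: {exc}"}
    elapsed = time.monotonic() - started

    if status != 200:
        return {"ok": False, "reason": f"unexpected status {status}"}
    if keyword not in body:
        return {"ok": False, "reason": f"keyword {keyword!r} not found"}
    if elapsed > max_latency_s:
        return {"ok": False, "reason": f"too slow: {elapsed:.2f}s"}
    return {"ok": True, "reason": "ok", "latency_s": elapsed}


if __name__ == "__main__":
    # Placeholder fetcher standing in for a real endpoint.
    fake = lambda url: (200, '{"status": "healthy"}')
    print(run_check("https://example.invalid/health", "healthy", fetch=fake))
```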

Learning Resources

Books

  • Observability Engineering — Charity Majors, Liz Fong-Jones, George Miranda (O’Reilly, 2022)
  • Distributed Systems Observability — Cindy Sridharan (O’Reilly, 2018) — Free online
  • Cloud Native Observability with OpenTelemetry — Alex Boten (Packt, 2022)
  • Site Reliability Engineering — Betsy Beyer et al. (O’Reilly / Google, 2016) — Free online
  • BPF Performance Tools — Brendan Gregg (Addison-Wesley, 2019) — The definitive reference for eBPF-based performance analysis
  • Systems Performance — Brendan Gregg (Addison-Wesley, 2nd ed. 2020) — Essential reading for anyone serious about performance engineering

Online Resources


Contributing

Contributions are welcome. Please read CONTRIBUTING.md before submitting changes.

License

Released under CC0 1.0 Universal.