A curated, opinionated list of tools and resources dedicated to Observability as an engineering capability, not just monitoring, spanning kernel-level tracing to full-stack platforms.
Observability is the ability to understand the internal state of a system by examining its outputs — metrics, logs, traces, and profiles — in order to explain why a system behaves the way it does, including in ways you didn’t anticipate.
This list focuses on modern observability practices while acknowledging legacy and operations-oriented tools when they meaningfully contribute to system understanding — because most real-world environments are hybrid.
🎯 Scope & Intent
This list is curated with the following principles:
Observability as an engineering capability, not a product to buy
Preference for field-proven tools over purely theoretical or early-stage solutions
Explicit consideration of system complexity, scale, cost, and operational constraints
Recognition that OpenTelemetry is converging as the standard instrumentation layer — but the ecosystem remains diverse
Awareness that AI is reshaping observability — from anomaly detection to natural-language querying
The goal is to help practitioners:
Choose tools adapted to their context, scale, and maturity
Combine tools coherently rather than stack them blindly
Build observability systems that actually support debugging, diagnosis, and decision-making
🧭 How to Read This List
Tools are organized by technical building blocks, but observability problems are usually expressed in terms of intent, so a complementary reading by intent is recommended.
Scrapers, collectors, and time-series databases — the foundation of quantitative observability.
Prometheus ⭐🟢🔵📚🧠 — The de facto standard for cloud-native metrics. Pull-based model, dimensional data model, and PromQL. Excels at service-level monitoring; limited by single-node storage for very large deployments. [Go] [Apache-2.0] — GitHub
VictoriaMetrics ⭐🟢🚀🧠 — High-performance, cost-efficient Prometheus-compatible TSDB. Handles significantly higher cardinality and longer retention than vanilla Prometheus. Excellent choice when Prometheus query compatibility matters but scale exceeds single-node limits. [Go] [Apache-2.0] — GitHub
Thanos ⭐🟢🔵 — Adds long-term storage, global query view, and high availability to Prometheus. Sidecar architecture lets you keep existing Prometheus deployments while adding horizontal scale. [Go] [Apache-2.0] — GitHub
Mimir ⭐🟢🔵🚀 — Grafana’s horizontally scalable, highly available Prometheus-compatible TSDB. Designed from the ground up for multi-tenant, large-scale deployments. [Go] [AGPL-3.0] — GitHub
InfluxDB 🟢🟠 — Purpose-built time-series database with high write throughput. Strong ecosystem. v1 and v2 are written in Go; v3 (InfluxDB 3 Core) is an open-source rewrite on a Rust-based engine. [Go/Rust] [MIT/Apache-2.0/Commercial] — GitHub
OpenTelemetry Collector ⭐🟢🔵🧠 — Vendor-neutral telemetry collection, processing, and export pipeline. The backbone of modern instrumentation architectures. [Go] [Apache-2.0] — GitHub
Grafana Alloy ⭐🟢🔵🧠 — OpenTelemetry-native telemetry collector from Grafana Labs (successor to Grafana Agent). Supports metrics, logs, traces, and profiles. Native integration with the Grafana stack. [Go] [Apache-2.0] — GitHub
Telegraf 🟢 — Plugin-driven agent for collecting and reporting metrics. 300+ input plugins make it versatile for heterogeneous environments. [Go] [MIT] — GitHub
StatsD 🧰 — Lightweight, UDP-based metrics aggregation daemon. Simple protocol, widely supported by applications. Still relevant in legacy environments. [Node.js] [MIT]
Graphite 🧰 — One of the original time-series storage and graphing systems. Whisper backend, Carbon collector. Historical significance but limited compared to modern alternatives. [Python] [Apache-2.0] — GitHub
Netdata ⭐🟢🚀 — Real-time, per-second system and application monitoring with built-in anomaly detection. Zero-configuration agent with impressive out-of-the-box dashboards. [C] [GPL-3.0] — GitHub
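To make the "dimensional data model" behind Prometheus-style metrics concrete, here is a minimal sketch of how one labeled sample could be rendered in the Prometheus text exposition format that scrapers pull over HTTP. The helper names `escape_label_value` and `render_sample` are hypothetical and exist only for illustration; real services would normally use an official client library instead.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as `name{key="value",...} value`.

    Each distinct label combination identifies a separate time series,
    which is why label cardinality drives storage cost.
    """
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


if __name__ == "__main__":
    # Hypothetical metric and labels, chosen only for the example.
    print(render_sample("http_requests_total", {"method": "GET", "status": "200"}, 42))
```

Labels are sorted here only to keep the output stable; the format itself does not require a particular label order.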
Distributed Tracing
Request-level visibility across service boundaries — essential for understanding latency, dependencies, and failure propagation in distributed systems.
OpenTelemetry ⭐🟢🔵📚🧠 — The converging open standard for distributed tracing, metrics, and logs instrumentation. Language-specific SDKs, auto-instrumentation agents, and the Collector form a complete pipeline. If you’re starting today, start here. [Multiple] [Apache-2.0] — GitHub
Jaeger ⭐🟢🔵📚 — CNCF graduated distributed tracing backend and UI. Mature, well-documented, strong Kubernetes integration. Originally from Uber. [Go] [Apache-2.0] — GitHub
Grafana Tempo ⭐🟢🔵 — High-scale, cost-efficient tracing backend that requires only object storage (no indexing infrastructure). Pairs naturally with Grafana, Loki, and Mimir. [Go] [AGPL-3.0] — GitHub
Zipkin 🟢📚 — One of the pioneering distributed tracing systems (Twitter, 2012). Still actively maintained with a loyal community. Simpler architecture than Jaeger, good for smaller deployments. [Java] [Apache-2.0] — GitHub
Apache SkyWalking ⭐🟢🔵 — Full observability platform with strong tracing capabilities. Popular in the Java/JVM ecosystem. Auto-instrumentation via bytecode injection. [Java] [Apache-2.0] — GitHub
SigNoz 🟢🔵🧠 — Open-source observability platform built natively on OpenTelemetry. Unified metrics, traces, and logs in a single UI. ClickHouse-backed storage. Strong alternative to commercial APM. [Go/TypeScript] [ELv2/MIT] — GitHub
Pinpoint 🧰 — Bytecode-instrumentation-based APM and tracing for Java and PHP. Zero-code-change approach. Popular in Korean and Asian enterprise environments. [Java] [Apache-2.0] — GitHub
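Visibility "across service boundaries" depends on each service forwarding trace context to the next one. The sketch below assumes the W3C Trace Context `traceparent` header (the default propagation format in OpenTelemetry SDKs) and shows how a service might continue an incoming trace with a fresh span ID. The function names are hypothetical; instrumentation libraries handle this for you in practice.

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random trace and span IDs."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace ID and flags, but mint a new span ID.

    A missing, malformed, or all-zero header starts a new trace instead.
    """
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        return new_traceparent()
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return new_traceparent()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    upstream = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(child_traceparent(upstream))
```

Because every hop reuses the same trace ID, a backend such as Jaeger or Tempo can stitch the spans from different services into one request timeline.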
Log Management & Log Pipelines
Collection, processing, indexing, and analysis of log data — still the most universal telemetry signal.
Grafana Loki ⭐🟢🔵📚🧠 — Label-based log aggregation that indexes metadata, not content. Dramatically cheaper than full-text indexing at scale. Pairs with Grafana for exploration. [Go] [AGPL-3.0] — GitHub
Fluent Bit ⭐🟢🔵🚀 — Lightweight, high-performance log processor and forwarder designed for edge and containerized environments. Tiny memory footprint. [C] [Apache-2.0] — GitHub
Fluentd 🟢🔵 — CNCF graduated unified logging layer with 1000+ plugins. Heavier than Fluent Bit but more flexible for complex routing. [Ruby/C] [Apache-2.0] — GitHub
Vector 🟢🚀🧠 — High-performance observability data pipeline for logs, metrics, and traces. Built in Rust for reliability and throughput. Excellent for consolidating telemetry pipelines. [Rust] [MPL-2.0] — GitHub
Elasticsearch ⭐🟢🟠🧰 — Distributed search and analytics engine. Powerful full-text search, but storage costs and operational complexity can be significant at scale. License changed from Apache-2.0 to SSPL/Elastic License in 2021; AGPL-3.0 was added as an option in 2024. [Java] [AGPL-3.0/SSPL/ELv2] — GitHub
OpenSearch 🟢🔵 — Community-driven fork of Elasticsearch 7.10.2 (post-license change), started by AWS and now hosted by the Linux Foundation. Apache-2.0 licensed. Largely API-compatible with Elasticsearch 7.10, though the two projects have diverged since the fork. [Java] [Apache-2.0] — GitHub
Logstash 🧰 — Flexible log ingestion and transformation pipeline. Part of the Elastic Stack. Heavy JVM footprint. [Java] [SSPL/Commercial] — GitHub
Graylog 🟢🟠🧰 — Centralized log management with built-in alerting and dashboards. Good for teams that want a self-contained log platform. [Java] [SSPL/Commercial] — GitHub
rsyslog 🟢🚀🧰 — High-performance system logging daemon. Handles millions of messages per second. Essential in Linux infrastructure. [C] [GPL-3.0] — GitHub
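The trade-off between label-based indexing (as in Loki) and full-text indexing (as in Elasticsearch) can be illustrated with a toy in-memory store. This is a conceptual sketch only, not how any of the listed tools is implemented: the class `LabelIndexedLogStore` and its methods are hypothetical. It indexes only the label set of each stream and falls back to a brute-force substring scan for content.

```python
from collections import defaultdict


class LabelIndexedLogStore:
    """Toy log store that indexes stream labels, never line content."""

    def __init__(self) -> None:
        # One entry per unique label set ("stream"); the value is its raw lines.
        self._streams: dict[frozenset, list[str]] = defaultdict(list)

    def push(self, labels: dict[str, str], line: str) -> None:
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict[str, str], contains: str = "") -> list[str]:
        """Select streams by label match, then scan their lines for a substring."""
        wanted = set(selector.items())
        results: list[str] = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # cheap: compares small label sets
                results.extend(line for line in lines if contains in line)
        return results


if __name__ == "__main__":
    store = LabelIndexedLogStore()
    store.push({"app": "api", "env": "prod"}, "GET /health 200")
    store.push({"app": "api", "env": "prod"}, "GET /orders 500 timeout")
    store.push({"app": "worker", "env": "prod"}, "job 17 timeout")
    print(store.query({"app": "api"}, contains="timeout"))
```

The index stays small because it grows with the number of streams rather than the number of words, at the price of scanning line content at query time.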
Observability Pipelines & Telemetry Processing
Transport, transformation, sampling, and routing of observability data — increasingly critical as telemetry volumes grow and costs need control.
OpenTelemetry Collector ⭐🟢🔵🧠 — The standard telemetry processing pipeline. Receivers, processors, and exporters for any signal to any backend. Supports tail-based sampling, attribute enrichment, and routing. [Go] [Apache-2.0] — GitHub
Vector 🟢🚀🧠 — End-to-end observability data routing and transformation. Programmable transforms (VRL language), strong at log-to-metric conversion and pipeline consolidation. [Rust] [MPL-2.0] — GitHub
Fluent Bit / Fluentd 🟢 — Log and telemetry forwarding pipelines with extensive plugin ecosystems. [Apache-2.0]
Logstash 🧰 — ETL-style processing for observability data. Powerful filter plugins but resource-intensive. [Java]
Cribl Stream 🟠🚀 — Commercial observability pipeline for routing, reducing, and enriching telemetry data before it reaches backends. Strong ROI story for organizations with high telemetry costs. [Commercial]
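Tail-based sampling, mentioned above as a pipeline capability, means deciding whether to keep a trace only after its spans have been collected. A minimal sketch of one possible policy follows, assuming three hypothetical rules: keep every trace with an error, keep every slow trace, and keep a fixed fraction of the rest. The `Span` type, thresholds, and function name are illustrative assumptions, not the configuration or behavior of any specific collector.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans: list[Span], slow_ms: float = 500.0, baseline_ratio: float = 0.1) -> bool:
    """Decide, once a trace is complete, whether to export it."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True  # always keep failures
    if max(span.duration_ms for span in spans) >= slow_ms:
        return True  # always keep slow requests
    # Deterministic baseline: hash the trace ID so every pipeline replica
    # reaches the same decision for the same trace.
    bucket = zlib.crc32(spans[0].trace_id.encode("utf-8")) % 10_000
    return bucket < baseline_ratio * 10_000


if __name__ == "__main__":
    failing = [Span("trace-a", 12.0), Span("trace-a", 30.0, is_error=True)]
    print(keep_trace(failing))
```

Head-based sampling cannot apply the first two rules, because the outcome of a request is unknown when its first span starts.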
Visualization & Dashboards
Exploration, visualization, and correlation of observability data — where signals become understanding.
Grafana ⭐🟢📚🧠 — The de facto standard for observability dashboards. Supports 100+ data sources, alerting, annotations, and increasingly sophisticated exploration features. The center of gravity for open-source observability UIs. [TypeScript/Go] [AGPL-3.0] — GitHub
Kibana 🟢🟠🧰 — Visualization and exploration for Elasticsearch data (OpenSearch ships its own fork, OpenSearch Dashboards). Powerful for log exploration (Discover, Lens). Part of the Elastic Stack. [TypeScript] [SSPL/Commercial] — GitHub
Apache Superset 🟢 — Scalable analytics and dashboarding platform. SQL-first, strong at ad-hoc data exploration. [Python] [Apache-2.0] — GitHub
Redash 🧰 — SQL-first data visualization and collaboration. Connects to many data sources. Minimal maintenance since Databricks acquisition. [Python] [BSD-2-Clause] — GitHub
Perses 🟢🔵🧠 — CNCF sandbox project for dashboards-as-code. Native PromQL and TraceQL support. Designed for GitOps-driven observability. [Go/TypeScript] [Apache-2.0] — GitHub
Profiling & Continuous Performance Analysis
Always-on, low-overhead profiling in production — the emerging “fourth pillar” of observability alongside metrics, logs, and traces.
Grafana Pyroscope ⭐🟢🔵🧠 — Continuous profiling with flame graph visualization. Supports multiple languages. Integrates naturally with the Grafana stack. [Go] [AGPL-3.0] — GitHub
async-profiler 🟢🚀 — Low-overhead sampling profiler for JVM. Captures CPU, allocation, lock contention, and wall-clock profiles. The reference tool for Java performance analysis. [Java/C++] [Apache-2.0]
perf 🧰🚀 — Linux kernel performance analysis tool. Hardware counters, tracepoints, and sampling. Foundational for system-level performance work. [C] [GPL-2.0]
bpftrace 🟢🚀🧠 — High-level tracing language for Linux eBPF. One-liners and scripts for dynamic kernel and user-space tracing. Invaluable for ad-hoc production investigation. [C++] [Apache-2.0] — GitHub
bcc (BPF Compiler Collection) 🟢🚀 — Toolkit for creating eBPF-based tracing and networking programs. Includes dozens of ready-to-use tools (execsnoop, biolatency, tcplife, etc.). [C/Python] [Apache-2.0]
Grafana Beyla 🟢🔵🧠🚀 — eBPF-based auto-instrumentation for HTTP and gRPC services. Zero-code, zero-configuration application observability. Generates RED metrics and distributed traces without SDK integration. [Go] [Apache-2.0] — GitHub
Perfetto 🟢 — System-wide tracing and profiling toolkit from Google. Designed for Android and Chrome but increasingly used for general system analysis. [C++] [Apache-2.0] — GitHub
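Sampling profilers and flame graphs both rest on the same aggregation step: many captured call stacks are collapsed into counts per unique stack. The sketch below assumes the "folded stacks" text format (frames joined by semicolons, followed by a sample count) popularized by Brendan Gregg's FlameGraph scripts. The function name `fold_stacks` and the sample data are hypothetical.

```python
from collections import Counter


def fold_stacks(samples: list[list[str]]) -> list[str]:
    """Collapse root-first stack samples into `frame;frame;frame count` lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


if __name__ == "__main__":
    # Each inner list is one sampled call stack, outermost frame first.
    samples = [
        ["main", "handle_request", "parse_json"],
        ["main", "handle_request", "parse_json"],
        ["main", "gc"],
    ]
    for line in fold_stacks(samples):
        print(line)
```

In a flame graph, the width of each frame is proportional to these counts, so the widest towers show where sampled time is concentrated.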
Alerting & Incident Response
Alert management, on-call workflows, and incident coordination — the operational bridge between observability and action.
Alertmanager ⭐🟢📚 — Prometheus-native alert handling with grouping, silencing, inhibition, and routing. [Go] [Apache-2.0] — GitHub
Grafana OnCall 🟢🔵 — Open-source on-call management and alert routing. Integrates natively with Grafana alerting. [Python] [AGPL-3.0] — GitHub
Keep 🟢🔵🧠 — Open-source alert management platform. Consolidates alerts from multiple sources with workflow automation. [Python] [MIT] — GitHub
Alerta 🟢 — Unified alert correlation and management. Consolidates alerts from multiple monitoring systems. [Python] [Apache-2.0] — GitHub
Opsgenie 🟠 — Alerting and escalation platform. Part of Atlassian suite. [Commercial]
Rootly 🟠🧠 — AI-assisted incident management with automated timelines and postmortem generation. [Commercial]
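Alert grouping, one of the Alertmanager features listed above, reduces notification noise by bundling alerts that share selected labels. The following is a simplified sketch of that idea, not Alertmanager's actual implementation: the function `group_alerts` is hypothetical, and `group_by` only mirrors the name of the corresponding routing option.

```python
from collections import defaultdict


def group_alerts(
    alerts: list[dict[str, str]], group_by: list[str]
) -> dict[tuple, list[dict[str, str]]]:
    """Bundle alerts whose values for the `group_by` labels are identical."""
    groups: dict[tuple, list[dict[str, str]]] = defaultdict(list)
    for labels in alerts:
        # A missing label is treated as an empty value (an assumption of this sketch).
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


if __name__ == "__main__":
    firing = [
        {"alertname": "HighLatency", "cluster": "eu-1", "instance": "api-1"},
        {"alertname": "HighLatency", "cluster": "eu-1", "instance": "api-2"},
        {"alertname": "DiskFull", "cluster": "eu-1", "instance": "db-1"},
    ]
    for key, members in group_alerts(firing, ["alertname", "cluster"]).items():
        print(key, len(members))
```

Here the two `HighLatency` alerts would produce one notification instead of two, which matters most during outages that trip hundreds of instance-level alerts at once.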
Observability Platforms (Integrated)
Full-stack platforms that combine metrics, logs, traces, and often profiling — trading flexibility for integration and convenience.
Datadog 🟠🧠 — SaaS observability platform with AI-powered features (Watchdog anomaly detection, automated root-cause analysis). Strong breadth, premium pricing. [Commercial]
Dynatrace 🟠🧠 — AI-driven observability with automatic topology discovery and root-cause analysis (Davis AI). Strong in enterprise and complex Java environments. [Commercial]
New Relic 🟠🧠 — Developer-centric observability with a generous free tier. NRQL query language, strong APM heritage. [Commercial]
Splunk Observability 🟠🧰 — Observability built on Splunk’s machine data analytics platform. Strong for organizations already invested in Splunk. [Commercial]
Elastic Observability 🟠🧰 — Observability solution built on the Elastic Stack (Elasticsearch, Kibana, APM). Self-managed and cloud options. [Commercial]
Honeycomb 🟠🧠 — Observability platform built around high-cardinality, high-dimensionality event data. Pioneers of the “observability vs. monitoring” distinction. BubbleUp feature for automated correlation. [Commercial]
Grafana Cloud 🟠🧠 — Managed Grafana stack (Mimir, Loki, Tempo, Pyroscope) with a generous free tier. Best of open-source with SaaS convenience. [Commercial]
Instana (IBM) 🟠🧠 — Automatic infrastructure and application discovery with real-time observability. Strong in containerized and microservice environments. [Commercial]
AppDynamics (Splunk/Cisco) 🟠🧰 — Enterprise APM with business transaction monitoring and code-level diagnostics. Merged into Splunk in 2025. [Commercial]
Chronosphere 🟠🧠 — Cloud-native observability platform focused on metrics at scale. Founded by Uber M3 creators. Strong cost control and cardinality management. [Commercial]
Sematext 🟢🟠🧠📚 — SaaS observability platform with OpenTelemetry-native support and automatic topology discovery. [Commercial]
Monitoring Suites (Operations-Oriented)
Infrastructure-first and legacy monitoring systems — still widely deployed and relevant in hybrid environments.
Zabbix 🟢🧰 — Enterprise-grade monitoring platform with agent-based and agentless monitoring. Mature, highly configurable, strong in traditional infrastructure. Licensed under AGPL-3.0 since 7.0 (earlier releases were GPL-2.0). [C] [AGPL-3.0] — GitHub
Nagios 🟢🧰 — The grandfather of open-source monitoring. Check-based architecture. Enormous plugin ecosystem but showing its age. [C] [GPL-2.0] — GitHub
Icinga 🟢🧰 — Modern evolution of Nagios with better APIs, configuration management, and scalability. [C++] [GPL-2.0] — GitHub
Checkmk 🟢🟠🧰 — Infrastructure and application monitoring with auto-discovery. Scales well for large enterprise environments. [Python/C++] [GPL-2.0/Commercial] — GitHub
Service Mesh Observability
Observability primitives integrated into service mesh infrastructure — L7 visibility without application-level instrumentation.
Kiali 🟢🔵 — Observability console for Istio service mesh. Topology visualization, traffic flow, and health analysis. [Go] [Apache-2.0] — GitHub
Linkerd Viz 🟢🔵 — Built-in telemetry and dashboard for Linkerd service mesh. Lightweight, opinionated. [Go] [Apache-2.0] — GitHub
Hubble 🟢🔵🚀 — eBPF-powered network observability for Cilium. L3/L4/L7 flow visibility, DNS monitoring, and service dependency mapping — all without sidecars. [Go] [Apache-2.0] — GitHub
Database Observability
Query-level performance visibility for databases — often the critical path in application performance.
PMM (Percona Monitoring and Management) 🟢 — Open-source database performance monitoring for MySQL, PostgreSQL, and MongoDB. Query analytics, slow query analysis. [Go] [AGPL-3.0] — GitHub
pgwatch 🟢 — PostgreSQL-specific monitoring and metrics collection. [Go] [BSD-3-Clause]
pg_stat_monitor 🟢 — PostgreSQL extension for enhanced query performance monitoring. More granular than pg_stat_statements. [C] [Apache-2.0]
Datadog DBM 🟠🧠 — Database monitoring with query-level explain plans, wait event analysis, and trace correlation. [Commercial]
Real User Monitoring (RUM) & Frontend Observability
Understanding performance as experienced by actual users — where infrastructure metrics end and user experience begins.
Sentry 🟢🧠 — Error tracking and performance monitoring for frontend and backend. Session replay, Web Vitals, and release health tracking. [Python] [FSL/Commercial] — GitHub
Grafana Faro 🟢🔵🧠 — Open-source frontend observability SDK. Captures errors, performance data, and user events, and sends them to the Grafana stack. [TypeScript] [Apache-2.0] — GitHub
OpenTelemetry Browser SDK 🟢🧠 — OTel instrumentation for web applications. Captures page loads, resource timings, and user interactions. [TypeScript] [Apache-2.0]
Matomo 🟢🧰 — Privacy-focused, self-hosted web analytics. GDPR-friendly Google Analytics alternative. [PHP] [GPL-3.0] — GitHub
AI-Augmented Observability
Tools and capabilities that apply machine learning and AI to observability data — reducing noise, accelerating diagnosis, and enabling predictive operations.
This is an emerging and fast-moving space. The tools below represent current capabilities, not a stable category.
Dynatrace Davis AI 🟠🧠 — Deterministic and causal AI for automatic root-cause analysis. Topology-aware, goes beyond statistical correlation. [Commercial]
Datadog Watchdog 🟠🧠 — ML-driven anomaly detection across metrics, logs, and APM data. Automatic story generation for correlated anomalies. [Commercial]
Moogsoft 🟠🧠 — AIOps platform for alert correlation, noise reduction, and incident clustering. [Commercial]
New Relic AI 🟠🧠 — Applied intelligence with anomaly detection, incident correlation, and natural-language querying (NRAI). [Commercial]
Honeycomb BubbleUp 🟠🧠 — Automated outlier correlation across high-cardinality dimensions. Helps identify “what’s different” about slow requests without manual hypotheses. [Commercial]
Coroot 🟢🔵🧠 — Open-source eBPF-powered observability with automated service map discovery and anomaly detection. [Go] [Apache-2.0] — GitHub
SLO Management
Defining, tracking, and alerting on Service Level Objectives — the bridge between observability data and reliability commitments.
Sloth 🟢🔵🧠 — SLO generation for Prometheus. You define SLOs in YAML, and Sloth generates multi-window, multi-burn-rate alerts automatically. [Go] [Apache-2.0] — GitHub
Pyrra 🟢🔵🧠 — SLO management and alerting with a web UI. Kubernetes-native, generates Prometheus recording rules and alerts from SLO definitions. [Go] [Apache-2.0]
OpenSLO 🟢🧠 — Open specification for defining SLOs as code. Vendor-neutral, enables GitOps-driven SLO management. [YAML] [Apache-2.0] — GitHub
Nobl9 🟠🧠 — Enterprise SLO platform connecting multiple data sources to unified SLO tracking and error budget management. [Commercial]
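The "multi-window multi-burn-rate" alerts that tools such as Sloth and Pyrra generate can be reduced to a small calculation. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); an alert fires only when both a long and a short window exceed the threshold. The sketch below assumes the commonly cited paging threshold of 14.4 from the Google SRE Workbook (2% of a 30-day budget consumed in one hour, paired with a 5-minute window). Function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,
) -> bool:
    """Fire only if the long window (e.g. 1h) AND the short window (e.g. 5m) burn fast.

    The long window shows the problem is significant; the short window shows
    it is still happening, so the alert resets quickly after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) > threshold
        and burn_rate(short_window_error_ratio, slo_target) > threshold
    )


if __name__ == "__main__":
    # Hypothetical numbers: 99.9% SLO, 2% errors over 1h and over the last 5m.
    print(should_page(0.02, 0.02, slo_target=0.999))
```

A complete setup usually adds slower tiers (lower thresholds over longer windows) that open tickets instead of paging.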
Synthetic Monitoring
Proactive monitoring from the outside — validating availability, performance, and correctness before users are affected.
Checkly 🟢🔵🔗🧠 — Monitoring as code for APIs and browsers. Playwright-based synthetic checks with CI/CD integration. [TypeScript] [Free tier/Commercial] — GitHub
Grafana Synthetic Monitoring 🟢🔵 — Probe-based synthetic monitoring integrated into Grafana Cloud. Multi-location HTTP, DNS, TCP, and ICMP checks. [Commercial]
Uptime Kuma ⭐🟢🧪 — Self-hosted monitoring tool with a clean UI. HTTP, TCP, DNS, and keyword monitoring with notifications. Simple and effective. [JavaScript] [MIT] — GitHub
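A synthetic check of the HTTP-plus-keyword kind described above boils down to three assertions: the endpoint answers, the body contains what it should, and the response arrives in time. Here is a minimal standard-library sketch; `run_check`, `CheckResult`, and the thresholds are hypothetical, and the `fetch` parameter exists only so the logic can be exercised without a network.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    ok: bool
    status: int
    latency_ms: float
    reason: str


def http_fetch(url: str, timeout: float = 5.0) -> tuple[int, str]:
    """Fetch a URL and return (status code, decoded body)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


def run_check(
    url: str,
    keyword: str,
    max_latency_ms: float = 1000.0,
    fetch: Callable[[str], tuple[int, str]] = http_fetch,
) -> CheckResult:
    """Validate availability, correctness (keyword), and latency of one endpoint."""
    started = time.perf_counter()
    try:
        status, body = fetch(url)
    except Exception as exc:  # DNS failure, timeout, HTTP error raised by urllib, ...
        latency_ms = (time.perf_counter() - started) * 1000
        return CheckResult(False, 0, latency_ms, f"request failed: {exc}")
    latency_ms = (time.perf_counter() - started) * 1000
    if status != 200:
        return CheckResult(False, status, latency_ms, f"unexpected status {status}")
    if keyword not in body:
        return CheckResult(False, status, latency_ms, f"keyword {keyword!r} not found")
    if latency_ms > max_latency_ms:
        return CheckResult(False, status, latency_ms, "too slow")
    return CheckResult(True, status, latency_ms, "ok")


if __name__ == "__main__":
    # Stubbed fetcher so the example runs offline; swap in http_fetch for real use.
    print(run_check("https://example.invalid/health", "healthy",
                    fetch=lambda url: (200, "status: healthy")))
```

Dedicated tools add what this sketch lacks: scheduling, multiple probe locations, retries, and notification routing.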
Cloud Native Observability with OpenTelemetry — Alex Boten (Packt, 2022)
Site Reliability Engineering — Betsy Beyer et al. (O’Reilly / Google, 2016) — Free online
BPF Performance Tools — Brendan Gregg (Addison-Wesley, 2019) — The definitive reference for eBPF-based performance analysis
Systems Performance — Brendan Gregg (Addison-Wesley, 2nd ed. 2020) — Essential reading for anyone serious about performance engineering
Practical Monitoring — Mike Julian (O’Reilly, 2017) — Vendor-neutral monitoring principles, anti-patterns, and on-call design
Understanding Software Dynamics — Richard L. Sites (Addison-Wesley, 2021) — Modern performance profiling with KUtrace, by the DEC Alpha co-architect
Designing Data-Intensive Applications — Martin Kleppmann (O’Reilly, 2017) — Essential architecture reference for distributed systems and data pipelines
Database Reliability Engineering — Laine Campbell, Charity Majors (O’Reilly, 2017) — SRE principles applied to database operations and performance