Monitoring And Observability Stack Overview

Complete overview of the monitoring and observability stack for Rackspace Genestack, providing telemetry collection, storage, visualization, and alerting across the entire infrastructure.

Component Guide Map

Genestack keeps monitoring configuration in service-specific directories so the Helm values and Kustomize overlays follow the same pattern as the rest of the platform:

/opt/genestack/base-helm-configs/<service>/
/etc/genestack/helm-configs/<service>/
/etc/genestack/kustomize/<service>/overlay/

The documentation still groups the stack conceptually so you can navigate it as one monitoring system:

Getting Started for install order and day-one validation
Prometheus for metrics storage and alerting
Loki for logs
Tempo for traces
Grafana for dashboards and datasources
OpenTelemetry for collectors and infrastructure telemetry receivers
OpenStack Exporter for OpenStack API availability probes
Pushgateway for short-lived job metrics
OpenTelemetry Metrics Reference

Architecture Overview

The observability stack is deployed in the monitoring namespace and provides comprehensive telemetry collection for:

Kubernetes cluster (nodes, pods, containers, resources)
OpenStack services (Nova, Neutron, Keystone, Cinder, Glance, and related APIs)
OpenStack compute domains via libvirt exporter metrics enriched with Nova/OpenStack metadata
Infrastructure services (MySQL, PostgreSQL, RabbitMQ, Memcached)
Application traces and distributed tracing

High-Level Architecture

┌────────────────────────────────────────────────────────────────────────────┐
│                          Monitoring Namespace                              │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│  ┌────────────────────────────────────────────────────────────────────┐    │
│  │                    OpenTelemetry Collectors                        │    │
│  │                                                                    │    │
│  │  ┌───────────────────┐              ┌──────────────────────┐       │    │
│  │  │ Daemon Collectors │              │ Deployment Collector │       │    │
│  │  │  (DaemonSet)      │              │    (Deployment)      │       │    │
│  │  │                   │              │                      │       │    │
│  │  │ • Metrics (OTLP)  │              │ • MySQL Metrics      │       │    │
│  │  │ • Traces (OTLP)   │              │ • PostgreSQL Metrics │       │    │
│  │  │ • Logs (filelog)  │              │ • RabbitMQ Metrics   │       │    │
│  │  │ • Host Metrics    │              │ • Memcached Metrics  │       │    │
│  │  │ • K8s Events      │              │ • HTTP Checks        │       │    │
│  │  │ • Libvirt Exporter│              │                      │       │    │
│  │  └─────────┬─────────┘              └──────────┬───────────┘       │    │
│  └────────────┼───────────────────────────────────┼───────────────────┘    │
│               │                                   │                        │
│               ├──────────────┬────────────────────┼──────────┐             │
│               │              │                    │          │             │
│               ▼              ▼                    ▼          ▼             │
│      ┌────────────┐  ┌────────────┐     ┌────────────┐  ┌──────────┐       │
│      │ Prometheus │  │    Loki    │     │   Tempo    │  │AlertMgr  │       │
│      │            │  │            │     │            │  │          │       │
│      │ • Metrics  │  │ • Logs     │     │ • Traces   │  │• Alerts  │       │
│      │ • Alerts   │  │ • Search   │     │ • Storage  │  │• Routes  │       │
│      │ • Rules    │  │ • Storage  │     │            │  │          │       │
│      └─────┬──────┘  └─────┬──────┘     └─────┬──────┘  └────┬─────┘       │
│            │               │                  │              │             │
│            └───────────────┴──────────────────┴──────────────┘             │
│                                     │                                      │
│                                     ▼                                      │
│                            ┌─────────────────┐                             │
│                            │    Grafana      │                             │
│                            │                 │                             │
│                            │ • Dashboards    │                             │
│                            │ • Visualization │                             │
│                            │ • Explore       │                             │
│                            │                 │                             │
│                            └─────────────────┘                             │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

Technology Stack

Core Components

Component	Purpose	Technology	Storage
OpenTelemetry	Telemetry collection and processing	OTel Collector Contrib	Stateless
Prometheus	Metrics storage and querying	Prometheus 2.x	Local TSDB (15d retention)
Loki	Log aggregation and querying	Grafana Loki 2.x	S3/Filesystem
Tempo	Distributed tracing	Grafana Tempo 2.x	S3/Filesystem
Grafana	Visualization and dashboards	Grafana 10.x	MariaDB
Alertmanager	Alert routing and notification	Prometheus Alertmanager	Stateless

Supporting Components

Component	Purpose	Details
kube-state-metrics	Kubernetes cluster state metrics	Exports cluster-level metrics (deployments, pods, nodes)
node-exporter	Node-level system metrics	CPU, memory, disk, network per node
libvirt exporter	Nova compute domain metrics	Exports VM state, CPU, memory, block, interface, vCPU, and job metrics with OpenStack metadata
MariaDB	Grafana persistent storage	Stores dashboards, users, datasources
ServiceMonitors	Prometheus target discovery	Auto-discovers scrape targets via CRDs

Data Flow

Metrics Flow

┌──────────────────────────────────────────────────────────────────────────┐
│                          Metrics Collection                              │
└──────────────────────────────────────────────────────────────────────────┘

Sources                     Collectors                Storage   Visualization
────────                    ──────────                ───────   ─────────────
Kubernetes Pods ─────┐
OpenStack Services ──┤
Host Metrics ────────┤                         ┌──► Prometheus ──► Grafana
K8s Events ──────────┤                         │    (TSDB)         (Dashboards)
Libvirt Exporter ────┘ ───────► OTel Daemon ───┤
                                 (DaemonSet)   │
                                               └──► Alerting / Rules

MySQL ───────────────┐
PostgreSQL ──────────┤
RabbitMQ ────────────┤ ───────► OTel Deployment ───────► Prometheus ──► Grafana
Memcached ───────────┘           (Single Pod)             (TSDB)

ServiceMonitors ────────────────────────────────────────► Prometheus ──► Grafana
(kube-state-metrics, node-exporter, other direct scrapes)

Logs Flow

┌──────────────────────────────────────────────────────────────────────────┐
│                           Logs Collection                                │
└──────────────────────────────────────────────────────────────────────────┘

Log Sources                 Collectors         Processing        Storage
───────────                 ──────────         ──────────        ───────
/var/log/pods/ ──────┐
(K8s containers)     │
                     ├──► OTel Daemon ──► Processors ──► Loki ──► Grafana
/var/log/pods/ ──────┤    (filelog)         • Parse       (Store) (Explore)
(OpenStack svcs)     │                      • Enrich              (Search)
                     │                      • Label
K8s Events ──────────┘                      • Filter

Traces Flow

┌──────────────────────────────────────────────────────────────────────────┐
│                          Traces Collection                               │
└──────────────────────────────────────────────────────────────────────────┘

Application Code         Collectors            Storage          Visualization
────────────────         ──────────            ───────          ─────────────
Python Apps  ────┐
(OpenTelemetry)  │
                 ├──► OTel Daemon  ──► Tempo  ──► Grafana
Java Apps ───────┤    (OTLP gRPC)      (S3)       (Trace View)
(Jaeger SDK)     │                                (Service Graph)
                 │
Go Apps ─────────┘
(Zipkin)

Components

OpenTelemetry Collectors

Purpose: Universal telemetry collection, processing, and export.

Daemon Collector (DaemonSet)

Deployment: One pod per Kubernetes node
Collects:
OTLP metrics and traces from applications
Jaeger and Zipkin traces (legacy protocols)
Container logs via filelog (/var/log/pods)
Host metrics (CPU, memory, disk, network)
Kubernetes events via k8sobjects
Libvirt exporter metrics from compute nodes, including:
- domain inventory and timeout status: libvirt_domains, libvirt_domain_timed_out
- domain metadata and state: libvirt_domain_openstack_info, libvirt_domain_info, libvirt_domain_info_state
- CPU and vCPU metrics: libvirt_domain_info_cpu_time_seconds_total, libvirt_domain_vcpu_*
- memory metrics: libvirt_domain_memory_stats_*
- block metrics: libvirt_domain_block_stats_*
- interface metrics: libvirt_domain_interface_stats_*
- migration and job progress metrics: libvirt_domain_job_info_*
Processes:
Kubernetes attributes enrichment (pod, namespace, labels)
Resource attribute manipulation
Batching and memory limiting
Exports:
Metrics → Prometheus (remote write)
Logs → Loki (OTLP/HTTP)
Traces → Tempo (OTLP/gRPC)

Deployment Collector (Single Pod)

Deployment: One pod on the control plane
Collects:
MySQL metrics (connections, queries, locks, buffer pool)
PostgreSQL metrics (backends, transactions, deadlocks, size)
RabbitMQ metrics (messages, queues, consumers, node health)
Memcached metrics (hit ratio, evictions, memory)
HTTP endpoint checks (OpenStack API health)
Exports:
Metrics → Prometheus (remote write)

Key Features:

Vendor-agnostic telemetry format (OTLP)
Built-in processors for data manipulation
Multiple protocol support (OTLP, Jaeger, Zipkin)
Automatic Kubernetes metadata enrichment
Compute-local libvirt scraping from the daemon collector, which keeps VM metrics aligned with the node that runs the guest

Prometheus

Purpose: Time-series metrics storage and querying.

Deployment:

StatefulSet with persistent storage (15 days retention)
Alertmanager for alert routing

Collects Metrics From:

OpenTelemetry collectors (remote write)
kube-state-metrics (Kubernetes cluster state)
node-exporter (node system metrics)
libvirt exporter metrics relayed through OpenTelemetry
ServiceMonitors (auto-discovered targets)

Data Model:

Time-series: metric_name{label="value"} value timestamp
Query language: PromQL
Storage: Local TSDB (optimized for time-series data)

Retention:

Default: 15 days
Configurable via retention.time and retention.size

Key Features:

Powerful query language (PromQL)
Multi-dimensional data model
Built-in alerting rules
Service discovery and scraping
High performance for time-series queries

Loki

Purpose: Log aggregation and querying system (like Prometheus but for logs).

Deployment:

Distributed architecture:
Write: Ingests logs from OTel
Read: Serves log queries
Backend: Long-term storage

Data Model:

Logs stored with labels (not indexed content)
Query language: LogQL (similar to PromQL)
Cost-effective: only indexes labels, not log content

Storage:

Short-term: local filesystem
Long-term: S3-compatible object storage

Key Features:

Label-based indexing (efficient storage)
LogQL for querying (grep-like but better)
Integrates with Grafana (Explore interface)
Multi-tenancy support
Automatic log parsing and labeling

Query Examples:

# All logs from OpenStack namespace
{namespace="openstack"}

# Errors from Nova service
{namespace="openstack", service_name="nova"} |= "ERROR"

# Rate of errors per minute
rate({namespace="openstack"} |= "ERROR" [5m])

Tempo

Purpose: Distributed tracing backend.

Deployment:

Scalable architecture (ingester, querier, compactor)
S3-compatible storage for traces

Data Model:

Traces composed of spans
Each span has:
Trace ID (unique per request)
Span ID (unique per operation)
Parent span ID (for hierarchy)
Timestamps, tags, logs

Key Features:

Cost-effective trace storage
TraceQL for querying traces
Integrates with Grafana and Loki
Automatic trace-to-logs correlation

Use Cases:

Debug slow API requests
Find bottlenecks in microservices
Trace request flow across services
Identify error propagation

Grafana

Purpose: Visualization and dashboarding platform.

Deployment:

Deployment with MariaDB backend
Persistent storage for dashboards and configuration

Datasources:

Prometheus: Metrics and alerting
Loki: Log searching and exploration
Tempo: Distributed tracing

Features:

Dashboards: Pre-built and custom visualizations
Explore: Ad-hoc querying interface
Alerting: Visual alert rule builder
Correlation: Jump from metrics → logs → traces

Pre-configured Dashboard Themes:

Kubernetes cluster monitoring
Node resource usage
OpenStack service health
Nova/libvirt compute domain health and capacity
Database performance
RabbitMQ queue monitoring

Alertmanager

Purpose: Alert routing and notification management.

Deployment:

StatefulSet for HA (high availability)
Receives alerts from Prometheus

Features:

Grouping: Combine similar alerts
Inhibition: Suppress dependent alerts
Silencing: Temporarily mute alerts
Routing: Send to different channels (Slack, PagerDuty, email)

Alert Flow:

Prometheus Rules ──► Alertmanager ──► Notification Channels
 (PromQL)              (Routing)          • Slack
                                           • PagerDuty
                                           • Email
                                           • Webhooks

What Gets Monitored

Kubernetes Infrastructure

Component	Metrics	Logs	Traces
Nodes	CPU, memory, disk, network	System logs	N/A
Pods	Resource usage, restarts, status	Container stdout/stderr	App traces
Containers	CPU, memory, I/O	Application logs	App traces
Services	Request rate, latency, errors	Service logs	Service traces
Cluster State	Deployments, StatefulSets, DaemonSets	K8s events	N/A

OpenStack Services

Service	Metrics	Logs	Examples
Nova API	API requests, scheduler activity, VM operations	nova-api, nova-scheduler logs	VM creation traces
Nova / Libvirt Compute	Domain count, state, CPU, memory, vCPU, disk I/O, network I/O, migration/job progress	nova-compute, libvirtd logs	VM lifecycle and compute-node troubleshooting
Neutron	Network operations, ports	neutron-server logs	Network provisioning
Keystone	Auth requests, token operations	keystone logs	Auth flow traces
Cinder	Volume operations	cinder-api logs	Volume attach traces
Glance	Image operations	glance-api logs	Image upload traces

All OpenStack services are logged; the above is representative, not exhaustive.

OpenStack Compute / Libvirt Metric Families

The libvirt exporter provides detailed domain metrics that are best understood as Nova compute visibility rather than generic host metrics.

Family	Examples	Typical Use
Domain inventory and metadata	`libvirt_domains`, `libvirt_domain_timed_out`, `libvirt_domain_openstack_info`, `libvirt_domain_info`, `libvirt_domain_info_state`	Count VMs, track state, map domains to instance name, flavor, project, and compute host
CPU and vCPU	`libvirt_domain_info_cpu_time_seconds_total`, `libvirt_domain_info_virtual_cpus`, `libvirt_domain_vcpu_current`, `libvirt_domain_vcpu_time_seconds_total`, `libvirt_domain_vcpu_wait_seconds_total`, `libvirt_domain_vcpu_delay_seconds_total`, `libvirt_domain_vcpu_state`	VM CPU usage, vCPU wait/delay analysis, scheduling pressure
Memory	`libvirt_domain_info_memory_usage_bytes`, `libvirt_domain_info_maximum_memory_bytes`, `libvirt_domain_memory_stats_*`	Guest memory usage, RSS, ballooning, faults, swap, usable/unused memory
Block I/O	`libvirt_domain_block_stats_*`, `libvirt_domain_block_stats_info`	Disk throughput, IOPS, read/write latency, flush latency, capacity
Interface I/O	`libvirt_domain_interface_stats_*`, `libvirt_domain_interface_stats_info`	Guest network throughput, packet rates, drops, errors
Migration and job metrics	`libvirt_domain_job_info_*`	Migration progress, elapsed time, remaining time, bytes processed

Libvirt Metadata and Labeling Notes

Use metadata from libvirt_domain_openstack_info for VM identity and OpenStack context:

instance_name: friendly VM name
instance_id: UUID for the Nova server
project_name and project_id: tenant/project context
flavor_name: Nova flavor context
host_name: compute host

Be careful with the scrape label instance: in daemon-local libvirt scraping it may only identify the exporter endpoint, such as localhost:9474, rather than the Nova server identity.

Infrastructure Services

Service	Metrics Collected	Key Metrics
MySQL/MariaDB	25+ metrics	Connections, queries, locks, buffer pool
PostgreSQL	30+ metrics	Backends, commits, deadlocks, cache hits
RabbitMQ	20+ metrics	Messages, queues, consumers, memory
Memcached	15+ metrics	Hit ratio, evictions, memory usage

Application Instrumentation

Applications can send telemetry using:

OpenTelemetry SDKs (Python, Java, Go, Node.js)
Jaeger client libraries (legacy)
Zipkin libraries (legacy)

Send to: opentelemetry-kube-stack-daemon-collector:4317 (OTLP gRPC)

Access Points

Grafana UI

URL: http://grafana.monitoring.svc.cluster.local
     or https://grafana.your-domain.com (if Ingress configured)

Default Credentials:
  Username: admin
  Password: (from secret: grafana-admin-password)

Access via kubectl port-forward:

kubectl -n monitoring port-forward svc/grafana 3000:80
# Open: http://localhost:3000

Use Cases:

View infrastructure dashboards
Explore Nova/libvirt compute dashboards
Correlate metrics with logs and traces

Prometheus UI

URL: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090

Access via kubectl port-forward:

kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090
# Open: http://localhost:9090

Use Cases:

Query metrics with PromQL
View active alerts
Check target health
Debug scraping issues
Validate libvirt exporter series and label joins

Alertmanager UI

URL: http://kube-prometheus-stack-alertmanager.monitoring.svc.cluster.local:9093

Access via kubectl port-forward:

kubectl -n monitoring port-forward svc/kube-prometheus-stack-alertmanager 9093:9093
# Open: http://localhost:9093

Use Cases:

View active alerts
Silence alerts
Check alert routing

Key Features

1. Unified Observability

Three Pillars in One Stack:

Metrics: numerical time-series data (what is happening)
Logs: event records (what happened)
Traces: request flows (how it happened)

Correlation:

Jump from metric spike → related logs → trace details
Unified view across all telemetry types
Follow Nova API activity into compute-node domain metrics and compute logs

2. Kubernetes-Native

Service Discovery: automatic target discovery via ServiceMonitors
Metadata Enrichment: automatic pod/namespace/label tagging
Resource Awareness: monitors Kubernetes resources (CPU, memory, limits)
CRD-Based: uses Kubernetes custom resources for configuration

3. OpenStack Aware

Service Logs: parses OpenStack log formats
API Monitoring: tracks OpenStack API health
Request Tracing: traces requests across OpenStack services
Compute Metrics: monitors libvirt/Nova domains with OpenStack metadata for instance, project, flavor, and compute host
Infrastructure Metrics: monitors databases and message queues

4. High Cardinality Support

Prometheus: handles millions of time series
Loki: efficient label-based log storage
Tempo: cost-effective trace storage at scale

5. Alerting and Notification

PrometheusRules: define alerts in YAML
Alertmanager: route alerts to multiple channels
Grafana Alerts: visual alert builder
Alert Hierarchy: parent/child relationships and inhibition

6. Multi-Tenant Ready

Namespace Isolation: clear separation of concerns
RBAC Integration: Kubernetes RBAC for access control
Datasource Permissions: Grafana team-based access
Tenant Context: OpenStack labels such as project, user, and flavor make per-tenant views possible without changing the scrape path

7. Long-Term Storage

Prometheus: 15 days in-cluster, can extend with Thanos
Loki: S3 backend for long-term log retention
Tempo: S3 backend for trace history

8. Cost-Effective

Loki: only indexes labels (not log content), lowering storage costs
Tempo: compressed trace storage in object storage
Prometheus: efficient TSDB for time-series data
OpenTelemetry: centralizes metric forwarding

Summary

What This Stack Provides

✅ Complete observability: metrics, logs, and traces in one unified stack
✅ Kubernetes-native: built for cloud-native environments
✅ OpenStack-aware: tailored for OpenStack deployments
✅ Compute-aware: includes libvirt/Nova domain metrics with OpenStack metadata
✅ Scalable: handles large-scale infrastructure
✅ Cost-effective: efficient storage and indexing
✅ Open source: no vendor lock-in, community-driven

Key Benefits

Faster Debugging: correlated telemetry (metrics → logs → traces)
Proactive Monitoring: alerts before users notice issues
Performance Optimization: identify bottlenecks and optimize
Compute Visibility: inspect per-VM CPU, memory, network, disk, and migration progress from the same monitoring stack
Compliance: audit trails and long-term retention
Team Collaboration: shared dashboards and knowledge

Monitoring And Observability Stack Overview

Table of Contents

Component Guide Map

Architecture Overview

High-Level Architecture

Technology Stack

Core Components

Supporting Components

Data Flow

Metrics Flow

Logs Flow

Traces Flow

Components

OpenTelemetry Collectors

Daemon Collector (DaemonSet)

Deployment Collector (Single Pod)

Prometheus

Loki

Tempo

Grafana

Alertmanager

What Gets Monitored

Kubernetes Infrastructure

OpenStack Services

OpenStack Compute / Libvirt Metric Families

Libvirt Metadata and Labeling Notes

Infrastructure Services

Application Instrumentation

Access Points

Grafana UI

Prometheus UI

Alertmanager UI

Key Features

1. Unified Observability

2. Kubernetes-Native

3. OpenStack Aware

4. High Cardinality Support

5. Alerting and Notification

6. Multi-Tenant Ready

7. Long-Term Storage

8. Cost-Effective

Summary

What This Stack Provides

Key Benefits