Getting Started with Genestack Monitoring
This guide walks you through setting up a complete observability stack for your Genestack deployment. The monitoring system provides unified telemetry collection across metrics, logs, and traces using industry-standard open-source tools.
Overview
The Genestack observability stack includes:
- OpenTelemetry - Universal telemetry collection and processing (metrics, logs, traces)
- Prometheus - Time-series database and metrics storage
- Loki - Log aggregation and querying system
- Tempo - Distributed tracing backend
- Grafana - Unified visualization platform for all telemetry types
- AlertManager - Alert routing and notification management
Architecture
All components are deployed in the monitoring namespace for simplified management and access control.
┌─────────────────────────────────────────────────────────────┐
│                    Monitoring Namespace                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│                  OpenTelemetry Collectors                   │
│                         ↓ Collect ↓                         │
│                 Metrics, Logs, Traces from:                 │
│          • Kubernetes (nodes, pods, containers)             │
│          • OpenStack services (Nova, Neutron, etc.)         │
│          • Infrastructure (MySQL, RabbitMQ, etc.)           │
│                                                             │
│                        ↓ Store in ↓                         │
│                                                             │
│     Prometheus (metrics) + Loki (logs) + Tempo (traces)     │
│                                                             │
│                      ↓ Visualize in ↓                       │
│                                                             │
│          Grafana (dashboards, alerts, exploration)          │
│                                                             │
└─────────────────────────────────────────────────────────────┘
What Gets Monitored
| Component | Metrics | Logs | Traces |
|---|---|---|---|
| Kubernetes | Node/pod/container resources | Container logs, K8s events | N/A |
| OpenStack | API performance | Service logs (Nova, Neutron, etc.) | Request flows |
| Databases | Connections, queries, locks | Query logs | Slow query traces |
| Message Queues | Messages, consumers, memory | RabbitMQ logs | Message processing |
| Cache | Hit ratio, evictions | Memcached logs | Cache operations |
Prerequisites
Before proceeding, ensure you have:
- ✅ A running Genestack deployment
- ✅ Helm 3.x installed
- ✅ kubectl access to your Kubernetes cluster
- ✅ Cluster admin or namespace admin permissions
- ✅ Sufficient cluster resources (see resource requirements)
Required Tools
Namespace Preparation
# Create the monitoring namespace
kubectl create namespace monitoring
# Label it for easy identification
kubectl label namespace monitoring \
name=monitoring \
monitoring=enabled
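The create command above returns an error if the namespace already exists, which matters when the preparation step is re-run. A minimal sketch of a repeatable variant follows; the function name ensure_namespace is a hypothetical helper, not part of Genestack.

```shell
# Create and label a namespace in a way that is safe to re-run.
# ensure_namespace is a hypothetical helper name, not a Genestack script.
ensure_namespace() {
  ns="$1"
  # Render the Namespace manifest client-side, then apply it.
  # "apply" succeeds whether or not the namespace already exists.
  kubectl create namespace "$ns" --dry-run=client -o yaml \
    | kubectl apply -f -
  # --overwrite lets the label command run again without an error.
  kubectl label namespace "$ns" name="$ns" monitoring=enabled --overwrite
}
```

Calling ensure_namespace monitoring produces the same namespace and labels as the commands above.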
Installation Steps
Follow these steps in order to build your complete observability stack.
Step 1: Install the Prometheus Stack
The kube-prometheus-stack is the foundation of your metrics infrastructure. It provides time-series storage, alerting, and scraping capabilities.
What Gets Installed
- Prometheus Operator - Manages Prometheus lifecycle via CRDs
- Prometheus Server - Collects and stores metrics (15-day retention)
- AlertManager - Routes alerts to notification channels
- Node Exporter - Exports node hardware and OS metrics
- Kube State Metrics - Exports Kubernetes cluster state metrics
- Grafana - Pre-configured with Prometheus datasource (optional, we'll install separately)
Installation
# Install using Genestack script
bin/install-kube-prometheus-stack.sh
# Or install manually with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
-n monitoring \
--create-namespace \
-f /etc/genestack/helm-configs/prometheus/prometheus-helm-overrides.yaml
Verify Installation
# Check pods are running
kubectl -n monitoring get pods -l 'app.kubernetes.io/name in (prometheus, alertmanager)'
# Expected output:
# prometheus-kube-prometheus-stack-prometheus-0 Running
# alertmanager-kube-prometheus-stack-alertmanager-0 Running
# Check Prometheus targets
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090 &
# Open: http://localhost:9090/targets
# You should see targets for node-exporter and kube-state-metrics
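If you script the installation, it can help to block until the stack has rolled out instead of checking pods by hand. The sketch below assumes the Helm release is named kube-prometheus-stack, so the StatefulSet names are inferred from the expected pod names above; wait_for_prometheus_stack is a hypothetical helper name.

```shell
# Block until the Prometheus and Alertmanager StatefulSets are rolled out.
# StatefulSet names are inferred from the expected pod names and assume
# the Helm release is called "kube-prometheus-stack".
wait_for_prometheus_stack() {
  ns="${1:-monitoring}"
  for sts in \
    prometheus-kube-prometheus-stack-prometheus \
    alertmanager-kube-prometheus-stack-alertmanager
  do
    # rollout status exits non-zero if the timeout is reached.
    kubectl -n "$ns" rollout status "statefulset/$sts" --timeout=300s || return 1
  done
}
```

The function returns non-zero as soon as one rollout times out, so it can gate the next installation step.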
📚 For detailed configuration, see the Prometheus installation guide.
Step 2: Install Loki
Loki provides cost-effective log aggregation using label-based indexing (similar to Prometheus but for logs).
What Gets Installed
- Loki Gateway - Receives logs from collectors
- Loki Write - Ingests and indexes logs
- Loki Read - Serves log queries
- Loki Backend - Handles compaction and retention
Installation
# Install using Genestack script
bin/install-loki.sh
# Or install manually with Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki \
-n monitoring \
-f /etc/genestack/helm-configs/loki/loki-helm-overrides.yaml
Verify Installation
# Check Loki pods
kubectl -n monitoring get pods -l app.kubernetes.io/name=loki
# Test Loki readiness
kubectl -n monitoring port-forward svc/loki-gateway 3100:80 &
curl http://localhost:3100/ready
# Expected: ready
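A freshly started Loki can answer /ready with an error for a short time before it reports ready, so a single curl may fail even on a healthy install. A small retry loop, sketched below, covers that window; wait_ready is a hypothetical helper and its default attempt count and delay are arbitrary choices.

```shell
# Poll a readiness URL until it answers successfully or attempts run out.
# wait_ready is a hypothetical helper; the defaults are arbitrary.
wait_ready() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error responses such as 503.
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

With the port-forward above running, wait_ready http://localhost:3100/ready waits up to roughly one minute with the defaults.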
Key Features
- Label-based indexing - Only indexes labels, not log content (cost-effective)
- LogQL - Query language similar to PromQL
- S3 backend - Long-term storage in object storage
- Multi-tenancy - Namespace isolation support
Step 3: Install Tempo
Tempo provides distributed tracing for tracking requests across microservices.
What Gets Installed
- Tempo Distributor - Receives traces from applications
- Tempo Ingester - Writes traces to storage
- Tempo Query Frontend - Serves trace queries
- Tempo Compactor - Compacts and manages trace data
Installation
# Install using Genestack script
bin/install-tempo.sh
# Or install manually with Helm
helm install tempo grafana/tempo \
-n monitoring \
-f /etc/genestack/helm-configs/tempo/tempo-helm-overrides.yaml
Verify Installation
# Check Tempo pods
kubectl -n monitoring get pods -l app.kubernetes.io/name=tempo
# Test Tempo readiness (local port 3200 avoids a clash with the Loki port-forward on 3100)
kubectl -n monitoring port-forward svc/tempo 3200:3100 &
curl http://localhost:3200/ready
# Expected: ready
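Backgrounded port-forwards keep running after a check finishes and continue to hold their local port, which can make a later port-forward fail with an address-in-use error. One way to avoid this, sketched under the assumption that a fixed two-second wait is enough for the tunnel to come up, is a wrapper that stops the forward once the command returns. The name with_port_forward is hypothetical.

```shell
# Run one command while a port-forward is active, then stop the forward.
# with_port_forward is a hypothetical helper name.
# Usage: with_port_forward <service> <local:remote> <command...>
with_port_forward() {
  svc="$1"
  ports="$2"
  shift 2
  kubectl -n monitoring port-forward "svc/$svc" "$ports" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2   # crude wait for the tunnel; an assumption, tune as needed
  "$@"
  rc=$?
  # Stop the port-forward and reap it so the local port is released.
  kill "$pf_pid" 2>/dev/null
  wait "$pf_pid" 2>/dev/null
  return "$rc"
}
```

For example, with_port_forward tempo 3200:3100 curl -fsS http://localhost:3200/ready runs the readiness check and leaves no port-forward behind.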
Key Features
- TraceQL - Query language for searching traces
- S3 backend - Cost-effective trace storage
- Trace correlation - Links traces to logs and metrics
- OpenTelemetry native - First-class OTLP support
Step 4: Install OpenTelemetry
OpenTelemetry provides unified telemetry collection, processing, and export for metrics, logs, and traces.
What Gets Installed
Daemon Collectors (DaemonSet - one per node)
Collects telemetry from each node:
- Container logs from /var/log/pods
- Host metrics (CPU, memory, disk, network)
- OTLP metrics and traces from applications
- Kubernetes events
Deployment Collector (Single pod)
Collects infrastructure metrics:
- MySQL/MariaDB metrics (connections, queries, locks)
- PostgreSQL metrics (backends, deadlocks, transactions)
- RabbitMQ metrics (messages, queues, consumers)
- Memcached metrics (hit ratio, evictions)
- HTTP endpoint health checks
Prerequisites
Before installing OpenTelemetry, copy required secrets from the openstack namespace:
# Copy RabbitMQ credentials
kubectl -n openstack get secret rabbitmq-default-user -o yaml \
| yq 'del(.metadata.creationTimestamp, .metadata.uid, .metadata.ownerReferences, .metadata.resourceVersion, .metadata.namespace)' \
| kubectl apply --namespace monitoring -f -
# Copy PostgreSQL credentials
kubectl get secret postgres.postgres-cluster.credentials.postgresql.acid.zalan.do \
-n openstack -o yaml \
| sed 's/namespace: openstack/namespace: monitoring/' \
| kubectl apply -f -
# Copy MariaDB monitoring credentials
kubectl get secret mariadb-monitoring \
-n openstack -o yaml \
| sed 's/namespace: openstack/namespace: monitoring/' \
| kubectl apply -f -
# Verify secrets are copied
kubectl -n monitoring get secrets | grep -E "rabbitmq|postgres|mariadb"
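The PostgreSQL and MariaDB copies above follow the same pattern, so they can be wrapped in one function. This is a sketch only: copy_secret and copy_db_secrets are hypothetical names, and the RabbitMQ secret is deliberately left out because its copy above strips extra metadata (such as ownerReferences) with yq rather than using sed.

```shell
# Copy one Secret between namespaces using the same sed rewrite as above.
# copy_secret is a hypothetical helper name.
copy_secret() {
  name="$1"
  src="${2:-openstack}"
  dst="${3:-monitoring}"
  kubectl -n "$src" get secret "$name" -o yaml \
    | sed "s/namespace: $src/namespace: $dst/" \
    | kubectl apply -f -
}

# Copy the database credentials; stop at the first failure.
copy_db_secrets() {
  for s in \
    postgres.postgres-cluster.credentials.postgresql.acid.zalan.do \
    mariadb-monitoring
  do
    copy_secret "$s" || return 1
  done
}
```

Running copy_db_secrets and then the verification command above should list both database secrets in the monitoring namespace.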
Installation
# Install using Genestack script
bin/install-opentelemetry-kube-stack.sh
# Or install manually with Helm
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install opentelemetry-kube-stack open-telemetry/opentelemetry-kube-stack \
-n monitoring \
-f /etc/genestack/helm-configs/opentelemetry/otel-helm-overrides.yaml
Verify Installation
# Check daemon collectors (one per node)
kubectl -n monitoring get pods -l app.kubernetes.io/instance=opentelemetry-kube-stack-daemon
# Check deployment collector
kubectl -n monitoring get pods -l app.kubernetes.io/instance=opentelemetry-kube-stack-deployment
# Verify logs are being collected
kubectl -n monitoring logs daemonset/opentelemetry-kube-stack-daemon-collector --tail=50
# Verify metrics are being sent to Prometheus
kubectl -n monitoring logs deployment/opentelemetry-kube-stack-deployment-collector --tail=50 | grep "prometheusremotewrite"
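Because the daemon collectors run one per node, comparing the number of running collector pods with the number of nodes is a quick coverage check. The sketch below reuses the label from the command above; check_daemon_coverage is a hypothetical name, and the comparison assumes every node is expected to schedule a collector (tainted nodes without a matching toleration would show up as a mismatch).

```shell
# Compare running daemon collector pods with the number of nodes.
# The label value is taken from the verification command above.
check_daemon_coverage() {
  nodes=$(kubectl get nodes --no-headers | wc -l | tr -d ' ')
  running=$(kubectl -n monitoring get pods \
    -l app.kubernetes.io/instance=opentelemetry-kube-stack-daemon \
    --field-selector=status.phase=Running --no-headers | wc -l | tr -d ' ')
  echo "daemon collectors running: $running / nodes: $nodes"
  # Exit status is zero only when the two counts match.
  [ "$running" -eq "$nodes" ]
}
```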
Key Features
- Vendor-neutral - Industry standard for telemetry collection
- Multi-protocol - Supports OTLP, Jaeger, Zipkin
- Processing pipeline - Transform, filter, batch telemetry data
- Auto-instrumentation - Kubernetes metadata enrichment
Configuration
OpenTelemetry exporters are configured to send data to:
- Prometheus - http://kube-prometheus-stack-prometheus.monitoring.svc:9090/api/v1/write
- Loki - http://loki-gateway.monitoring.svc/otlp
- Tempo - tempo.monitoring.svc:4317
All three endpoints are in the same monitoring namespace, which simplifies networking.
Step 5: Install Grafana
Grafana provides unified visualization for metrics, logs, and traces with pre-built and custom dashboards.
What Gets Installed
- Grafana Server - Web UI and API
- MariaDB - Database for dashboards and users
- Pre-configured Datasources:
- Prometheus (metrics)
- Loki (logs)
- Tempo (traces)
Installation
# Install using Genestack script
bin/install-grafana.sh
# Or install manually with Helm
helm install grafana grafana/grafana \
-n monitoring \
-f /etc/genestack/helm-configs/grafana/grafana-helm-overrides.yaml
Create Grafana Database
# Apply the MariaDB manifest
kubectl apply -f base-kustomize/grafana/base/grafana-database.yaml
# Wait for database to be ready
kubectl -n monitoring wait --for=condition=ready pod \
-l app.kubernetes.io/name=mariadb,app.kubernetes.io/instance=grafana \
--timeout=300s
Access Grafana
# Get admin password
kubectl -n monitoring get secret grafana -o jsonpath='{.data.admin-password}' | base64 -d
echo
# Port-forward to Grafana
kubectl -n monitoring port-forward svc/grafana 3000:80
# Open browser to http://localhost:3000
# Username: admin
# Password: (from above command)
Verify Datasources
After logging in:
1. Go to Configuration → Data Sources
2. Verify these datasources are configured and working:
- ✅ Prometheus - http://kube-prometheus-stack-prometheus.monitoring.svc:9090
- ✅ Loki - http://loki-gateway.monitoring.svc
- ✅ Tempo - http://tempo.monitoring.svc:3100
3. Click Test on each datasource to verify connectivity
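The same check can be scripted against the Grafana HTTP API. The sketch below assumes the port-forward from "Access Grafana" is running, that GRAFANA_PASSWORD holds the admin password retrieved earlier, and that the datasources are named Prometheus, Loki, and Tempo; adjust the names if your Helm values differ. Both function names are hypothetical.

```shell
# List the names of the datasources Grafana knows about.
# Assumes GRAFANA_PASSWORD is set; GRAFANA_URL defaults to the port-forward.
list_datasources() {
  curl -fsS -u "admin:${GRAFANA_PASSWORD}" \
    "${GRAFANA_URL:-http://localhost:3000}/api/datasources" \
    | grep -o '"name": *"[^"]*"' \
    | cut -d'"' -f4
}

# Fail if any of the three expected datasources is missing.
# The expected names are an assumption based on the list above.
check_datasources() {
  found=$(list_datasources) || return 1
  for want in Prometheus Loki Tempo; do
    echo "$found" | grep -qx "$want" || { echo "missing: $want" >&2; return 1; }
  done
}
```

This only confirms that the datasources exist; the Test button in the UI remains the connectivity check.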
Import Dashboards
# Import Kubernetes cluster monitoring dashboard
python3 scripts/import-grafana-dashboard.py \
--grafana-url http://localhost:3000 \
--dashboard-id 7249 \
--api-key <your-api-key>
# Or manually import via UI:
# Dashboards → Import → Enter dashboard ID from grafana.com
📚 For more information, see the Grafana documentation.
Step 6: Deploy Service-Specific Metric Exporters (Optional)
The OpenTelemetry deployment collector already provides metrics for:
- MySQL/MariaDB
- PostgreSQL
- RabbitMQ
- Memcached
However, if you need additional service-specific exporters, they are available for deployment.
Available Exporters And Additional Metrics Services
- OpenStack Exporter - OpenStack API metrics
- Custom Node Metrics - Custom Node Metrics
- OVN Metrics - OVN Metrics
- Push Gateway - Prometheus Push Gateway
When to Use Additional Exporters
Use additional exporters when:
- You need metrics not provided by OpenTelemetry receivers
- You want more granular metrics than OTel provides
- You're integrating third-party services with specific exporters
Note: Most infrastructure metrics are already collected by OpenTelemetry, so additional exporters may not be necessary.
Step 7: Configure AlertManager
AlertManager handles alert routing and notifications. Configure it to send alerts to your preferred channels.
Available Integrations
- Slack - Send alerts to Slack channels (guide)
- Email - SMTP-based email notifications
- PagerDuty - Incident management integration
- Webhooks - Custom HTTP endpoints
- OpsGenie - On-call management
- VictorOps - Incident response
Configure Slack Alerts (Example)
# Edit AlertManager configuration
kubectl -n monitoring edit secret alertmanager-kube-prometheus-stack-alertmanager
# Add Slack webhook configuration
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: 'Cluster Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
Test AlertManager
# Port-forward to AlertManager
kubectl -n monitoring port-forward svc/kube-prometheus-stack-alertmanager 9093:9093
# Open browser to http://localhost:9093
# You should see the AlertManager UI with configured receivers
📚 For detailed configuration, see AlertManager Slack integration.
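To exercise the routing without waiting for a real alert, you can post a synthetic alert directly to the AlertManager v2 API. This is a sketch that assumes the port-forward above is running; send_test_alert is a hypothetical helper and the label and annotation values are placeholders.

```shell
# Post a synthetic alert straight to the AlertManager v2 API.
# Label and annotation values are placeholders for illustration.
send_test_alert() {
  am_url="${1:-http://localhost:9093}"
  curl -fsS -X POST "$am_url/api/v2/alerts" \
    -H 'Content-Type: application/json' \
    -d '[{
          "labels": {"alertname": "ManualTest", "severity": "warning"},
          "annotations": {"description": "Manual test alert sent with curl"}
        }]'
}
```

If the Slack receiver is configured as the default route, the alert should reach the channel after the group_wait interval.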
Step 8: Customize Alerting Rules
Default Alerting Rules
Prometheus includes default alerting rules for common scenarios. View them:
# List PrometheusRule resources
kubectl -n monitoring get prometheusrule
# View specific rule
kubectl -n monitoring get prometheusrule kube-prometheus-stack-kubernetes-resources -o yaml
Custom Alerting Rules
Create custom alerting rules in /etc/genestack/helm-configs/prometheus/alerting_rules.yaml:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: genestack-custom-alerts
  namespace: monitoring
spec:
  groups:
    - name: database-alerts
      interval: 30s
      rules:
        # High MySQL connection usage
        # (fires above 80% of an assumed max_connections of 1000)
        - alert: MySQLConnectionsHigh
          expr: |
            (mysql_connection_count{state="open"} / 1000) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "MySQL connection usage high"
            description: "MySQL is using {{ $value | humanizePercentage }} of max connections"
        # PostgreSQL deadlocks
        - alert: PostgreSQLDeadlocks
          expr: |
            rate(postgresql_deadlocks[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "PostgreSQL deadlocks detected"
            description: "Database {{ $labels.database }} has deadlocks"
        # RabbitMQ message backlog
        - alert: RabbitMQBacklog
          expr: |
            rabbitmq_message_current{state="ready"} > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "RabbitMQ queue backlog"
            description: "Queue {{ $labels.queue }} has {{ $value }} messages ready"
    - name: openstack-alerts
      interval: 30s
      rules:
        # Nova API down
        - alert: NovaAPIDown
          expr: |
            httpcheck_error{http_url=~".*nova.*"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Nova API is down"
            description: "Nova API endpoint {{ $labels.http_url }} is unreachable"
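Apply the custom rules:
# Apply the PrometheusRule manifest
kubectl apply -f /etc/genestack/helm-configs/prometheus/alerting_rules.yaml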
Operator-Provided Alerting Rules
Many Genestack operators include built-in ServiceMonitor and PodMonitor resources with pre-configured alerts:
- RabbitMQ Operator - Queue depth, consumer count, node health (documentation)
- PostgreSQL Operator - Replication lag, connection limits, deadlocks
- MariaDB Operator - Replication status, slow queries
These operator-managed rules follow best practices and are automatically deployed with the operator.
Verification and Testing
End-to-End Verification
Run through this checklist to ensure everything is working:
1. Check All Pods Running
kubectl -n monitoring get pods
# Expected pods:
# - prometheus-kube-prometheus-stack-prometheus-0
# - alertmanager-kube-prometheus-stack-alertmanager-0
# - loki-write-*, loki-read-*, loki-backend-*
# - tempo-*
# - grafana-*
# - opentelemetry-kube-stack-daemon-collector-* (one per node)
# - opentelemetry-kube-stack-deployment-collector-*
2. Verify Metrics Collection
# Port-forward to Prometheus
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090 &
# Open http://localhost:9090/targets
# Verify targets are up and being scraped
# Query for metrics
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result[] | select(.value[1]=="1") | .metric.job' | sort -u
# Should see jobs like:
# "kube-state-metrics"
# "node-exporter"
# "opentelemetry-deployment"
# "rabbitmq"
# "mysql"
3. Verify Logs Collection
# Port-forward to Loki
kubectl -n monitoring port-forward svc/loki-gateway 3100:80 &
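# Query for recent logs (-G with --data-urlencode stops curl from treating {} as a glob pattern)
curl -s -G 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="kube-system"}' \
  --data-urlencode 'limit=5' | jq '.data.result'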
# Should return log entries
4. Verify Traces (If Applications are Instrumented)
# Port-forward to Tempo (local port 3200, because Loki is already forwarded on 3100)
kubectl -n monitoring port-forward svc/tempo 3200:3100 &
# Check Tempo is ready
curl http://localhost:3200/ready
# Expected: ready
5. Test Grafana Dashboards
# Access Grafana
kubectl -n monitoring port-forward svc/grafana 3000:80
# Open http://localhost:3000
# Navigate to Dashboards → Browse
# Open "Kubernetes / Compute Resources / Cluster"
# Verify metrics are displaying
6. Test Alerting
Create a test alert:
# test-alert.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: monitoring
spec:
  groups:
    - name: test
      rules:
        - alert: TestAlert
          expr: vector(1)
          labels:
            severity: warning
          annotations:
            summary: "This is a test alert"
# Apply test alert
kubectl apply -f test-alert.yaml
# Wait 30 seconds, then check AlertManager
kubectl -n monitoring port-forward svc/kube-prometheus-stack-alertmanager 9093:9093 &
# Open http://localhost:9093
# You should see the test alert firing
# Cleanup
kubectl -n monitoring delete prometheusrule test-alert
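Before running the cleanup step, the firing state can also be confirmed from the command line instead of the UI. The sketch below queries the AlertManager v2 API through the port-forward above; alert_is_firing is a hypothetical helper name.

```shell
# Succeed if AlertManager currently holds an alert with the given name.
# alert_is_firing is a hypothetical helper name.
alert_is_firing() {
  name="$1"
  am_url="${2:-http://localhost:9093}"
  # -G sends the URL-encoded filter as a query string on a GET request.
  curl -fsS -G "$am_url/api/v2/alerts" \
    --data-urlencode "filter=alertname=\"$name\"" \
    | grep -q "\"alertname\": *\"$name\""
}
```

For example, alert_is_firing TestAlert && echo firing prints firing once the rule has been evaluated and the alert has reached AlertManager.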
Resource Requirements
Minimum (Small Cluster - Dev/Test)
Total for monitoring namespace:
| Component | CPU | Memory | Storage |
|---|---|---|---|
| Prometheus | 2 cores | 4Gi | 50Gi |
| Loki | 1 core | 2Gi | 20Gi |
| Tempo | 500m | 1Gi | 10Gi |
| Grafana | 500m | 512Mi | 10Gi |
| OTel (per node) | 500m | 512Mi | - |
| OTel (deployment) | 500m | 512Mi | - |
| Total | ~6 cores | ~10Gi | 90Gi |
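
Only the per-node OTel row grows with cluster size, so the total depends on the node count. The sketch below adds up the figures from the table above for a given number of nodes; estimate_minimum is a hypothetical helper and the result is a rough estimate, not a sizing guarantee.

```shell
# Estimate minimum-profile CPU and memory for a cluster with N nodes.
# Figures come from the "Minimum" table; only the per-node OTel row scales.
estimate_minimum() {
  nodes="$1"
  # Fixed components: Prometheus, Loki, Tempo, Grafana, OTel deployment
  fixed_cpu_m=$((2000 + 1000 + 500 + 500 + 500))
  fixed_mem_mi=$((4096 + 2048 + 1024 + 512 + 512))
  # Per-node daemon collector: 500m CPU and 512Mi memory each
  cpu_m=$((fixed_cpu_m + 500 * nodes))
  mem_mi=$((fixed_mem_mi + 512 * nodes))
  echo "cpu=${cpu_m}m memory=${mem_mi}Mi"
}
```

For three nodes, estimate_minimum 3 prints cpu=6000m memory=9728Mi, which is consistent with the approximate totals in the table.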
Production (Large Cluster)
| Component | CPU | Memory | Storage |
|---|---|---|---|
| Prometheus | 4 cores | 16Gi | 500Gi |
| Loki | 4 cores | 8Gi | 100Gi + S3 |
| Tempo | 2 cores | 4Gi | 50Gi + S3 |
| Grafana | 1 core | 2Gi | 10Gi |
| OTel (per node) | 2 cores | 2Gi | - |
| OTel (deployment) | 1 core | 1Gi | - |
| Total | ~15 cores | ~35Gi | 660Gi |
Note: Storage can be reduced significantly by using S3 backends for Loki and Tempo.
Common Issues and Troubleshooting
No Metrics Appearing in Prometheus
Check:
# Verify OpenTelemetry is sending metrics
kubectl -n monitoring logs deployment/opentelemetry-kube-stack-deployment-collector | grep prometheusremotewrite
# Check Prometheus targets
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090
# Open: http://localhost:9090/targets
Fix: Verify that the exporter endpoints in the OpenTelemetry configuration point to services in the monitoring namespace.
No Logs in Loki
Check:
# Verify logsCollection preset is enabled
kubectl get opentelemetrycollector -n monitoring opentelemetry-kube-stack-daemon -o yaml | grep -A 5 "logsCollection"
# Check Loki gateway is reachable
kubectl -n monitoring exec daemonset/opentelemetry-kube-stack-daemon-collector -- wget -O- http://loki-gateway/ready
Fix: Ensure logsCollection.enabled: true in OpenTelemetry values.yaml.
Grafana Datasources Not Working
Check:
# Verify datasource URLs
# Should all reference monitoring namespace:
# - http://kube-prometheus-stack-prometheus.monitoring.svc:9090
# - http://loki-gateway.monitoring.svc
# - http://tempo.monitoring.svc:3100
Fix: Update datasource configuration in Grafana Helm values.
High Memory Usage
Check:
Fix:
- Increase memory limits in Helm values
- Reduce retention periods
- Enable S3 backends for long-term storage
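One possible way to find out which component is consuming memory, assuming the Kubernetes metrics API (for example metrics-server) is available in the cluster, is to sort pods by memory usage. The helper name top_memory_pods is hypothetical.

```shell
# Show the pods using the most memory in the monitoring namespace.
# Requires the Kubernetes metrics API (for example metrics-server).
top_memory_pods() {
  count="${1:-10}"
  # +1 keeps the header line in addition to the requested number of pods.
  kubectl -n monitoring top pods --sort-by=memory | head -n "$((count + 1))"
}
```

Calling top_memory_pods 5 lists the five heaviest pods, which indicates where a higher limit or a shorter retention period would help most.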
Next Steps
Once your monitoring stack is deployed:
1. Explore Grafana
- Dashboards → Browse pre-built Kubernetes dashboards
- Explore → Use Prometheus/Loki/Tempo query interfaces
- Alerting → View active alerts and alert rules
2. Verify Metrics Collection
# Check all targets are being scraped
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090
# Open: http://localhost:9090/targets
3. Test Log Searching
# Access Grafana → Explore → Select Loki datasource
# Query: {namespace="openstack"}
# Try: {namespace="openstack"} |= "ERROR"
4. Create Custom Dashboards
- Use Grafana dashboard editor
- Query Prometheus for metrics
- Query Loki for logs
- Combine multiple datasources in one dashboard
5. Tune Alert Thresholds
- Review default alerts in PrometheusRules
- Adjust thresholds based on your baseline
- Add custom alerts for your services
6. Set Up Long-Term Storage (Production)
- Configure S3 backends for Loki and Tempo
- Set up Thanos for long-term Prometheus metrics
- Configure backup strategies
7. Enable Security
- Configure TLS for external access
- Set up authentication (LDAP, OAuth)
- Configure NetworkPolicies for namespace isolation
- Rotate default credentials
Additional Resources
Documentation
- Observability Stack Overview - Comprehensive architecture guide
- Base Component Metrics Reference - K8s and related component metrics
- Additional Component Metrics Reference - Database and additional component metrics
- Prometheus Documentation - Prometheus-specific configuration
- Grafana Documentation - Grafana-specific configuration
- AlertManager Slack Integration - Slack notifications
External Documentation
- OpenTelemetry Documentation
- Prometheus Documentation
- Loki Documentation
- Tempo Documentation
- Grafana Documentation
Community
Summary
You now have a complete observability stack providing:
✅ Metrics - Time-series data from Kubernetes, OpenStack, and infrastructure
✅ Logs - Centralized log aggregation from all services
✅ Traces - Distributed tracing for request flows
✅ Alerting - Automated notifications for issues
✅ Visualization - Unified dashboards for all telemetry
The stack is:
- Unified - All components in monitoring namespace
- Scalable - Handles large-scale deployments
- Open Source - No vendor lock-in
- Production-Ready - Battle-tested technologies
Happy monitoring! 🎉