GBIM's Observability Playbook: Custom Metrics, Correlation IDs, and k6 Tests
GBIM implemented a robust observability stack by integrating custom Prometheus business metrics, end-to-end correlation IDs, and automated k6 smoke tests, providing granular insights beyond generic HTTP telemetry.
The team behind GBIM moved its monitoring beyond generic HTTP telemetry to a focused observability stack: custom Prometheus metrics such as gbm_auth_register_total, end-to-end correlation IDs for request tracing, and k6 smoke tests integrated with Prometheus and Grafana. The goal was to add explicit monitoring signals, previously absent, for critical user activities such as registration, account activation, and administrative approvals. With these signals the team can answer business questions directly, for example how many user registrations succeed and how many fail.
WHAT THEY DID
Custom Business Metrics for Operational Flows
GBIM's team enhanced its backend with custom Prometheus metrics, moving beyond standard HTTP telemetry. These metrics, defined in monitoring/metrics.py, track critical operational flows like authentication, account activation, and administrative actions. Specific counters and durations were implemented, including gbm_auth_register_total{role,outcome} to track user registrations by role and outcome, and gbm_auth_activation_total{outcome} for account activation statuses.
Other metrics include gbm_auth_reactivation_total{outcome}, gbm_auth_email_send_duration_seconds{event,outcome}, gbm_admin_account_verification_total{action,outcome}, and gbm_pengajuan_admin_status_update_total{status,outcome} (pengajuan is Indonesian for "submission"). These let Grafana dashboards display business-specific insights. For example, the system can now report how many registrations failed validation, or whether account activations frequently fail because of invalid or expired tokens. This granular data supports direct analysis of business outcomes rather than just system health. (Source: dev.to blog post, "Membangun Observability GBIM")
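A minimal sketch of how metrics like these could be declared and updated with the Python prometheus_client library. The metric and label names come from the post; the help strings, the label values ("member", "validation_error", "activation"), the two helper functions, and the dedicated registry are assumptions for illustration and are not taken from GBIM's monitoring/metrics.py.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Histogram

# A dedicated registry keeps this sketch isolated; an application would
# typically rely on the library's default registry instead.
registry = CollectorRegistry()

REGISTER_TOTAL = Counter(
    "gbm_auth_register_total",
    "User registration attempts by role and outcome.",  # help text is assumed
    ["role", "outcome"],
    registry=registry,
)

EMAIL_SEND_DURATION = Histogram(
    "gbm_auth_email_send_duration_seconds",
    "Time spent sending authentication emails.",  # help text is assumed
    ["event", "outcome"],
    registry=registry,
)


def record_registration(role: str, outcome: str) -> None:
    """Hypothetical helper: count one registration attempt."""
    REGISTER_TOTAL.labels(role=role, outcome=outcome).inc()


def timed_email_send(event: str, send) -> None:
    """Hypothetical helper: time a send callable and label the result."""
    start = time.perf_counter()
    outcome = "success"
    try:
        send()
    except Exception:
        outcome = "error"
        raise
    finally:
        # Record the duration whether the send succeeded or failed.
        EMAIL_SEND_DURATION.labels(event=event, outcome=outcome).observe(
            time.perf_counter() - start
        )


record_registration("member", "success")
record_registration("member", "validation_error")
timed_email_send("activation", lambda: None)
```

With counters shaped like this, a Grafana panel could break registrations down by outcome using a PromQL query such as sum by (outcome) (rate(gbm_auth_register_total[5m])).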
End-to-End Correlation ID Implementation
To improve error tracing and request visibility, GBIM implemented a consistent end-to-end correlation ID system. The frontend now sends an X-Correlation-ID header with each request. The backend is configured to validate this ID, generate one if absent, and then return it in the response. This ensures that every request, from its origin at the frontend through various backend services, carries a unique identifier.
Backend logs now include this ID as a corr_id field, added through a logging filter. Engineers can follow a single user interaction across multiple services and log entries by searching for one value, which shortens the time needed to diagnose issues. Before this change, tracing errors across the frontend and backend was difficult because IDs were propagated inconsistently. (Source: dev.to blog post, "Membangun Observability GBIM")
Automated k6 Smoke Tests with Telemetry Integration
GBIM developed k6 smoke tests designed to run as Kubernetes Jobs, providing automated and consistent performance monitoring. These k6 scripts are configured to send telemetry data directly to Prometheus via remote write. This integration ensures that the Grafana dashboards, which previously risked being empty, consistently display performance metrics from these synthetic tests.
The k6 tests simulate user journeys and critical API interactions. By running these tests regularly within the Kubernetes environment, the team gains continuous insight into application performance and availability. This proactive approach helps identify regressions before they impact actual users, providing a reliable signal for the overall system health. (Source: dev.to blog post, "Membangun Observability GBIM")
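As a rough illustration of the Job setup, the sketch below builds a Kubernetes Job manifest in Python and prints it as JSON, which kubectl accepts. The grafana/k6 image, the experimental-prometheus-rw output, and the K6_PROMETHEUS_RW_SERVER_URL variable are real k6 features; the Job name, ConfigMap name, script path, and Prometheus URL are hypothetical, and the post does not show GBIM's actual manifest.

```python
import json


def k6_smoke_job(name, script_configmap, remote_write_url):
    """Build a Kubernetes Job manifest that runs one k6 script to completion."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Do not retry: a failed smoke test should surface as a failed Job.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "k6",
                            "image": "grafana/k6",
                            # Send k6 metrics to Prometheus via remote write.
                            "args": [
                                "run",
                                "-o",
                                "experimental-prometheus-rw",
                                "/scripts/smoke.js",
                            ],
                            "env": [
                                {
                                    "name": "K6_PROMETHEUS_RW_SERVER_URL",
                                    "value": remote_write_url,
                                }
                            ],
                            "volumeMounts": [
                                {"name": "scripts", "mountPath": "/scripts"}
                            ],
                        }
                    ],
                    # The k6 script is assumed to live in a ConfigMap.
                    "volumes": [
                        {"name": "scripts", "configMap": {"name": script_configmap}}
                    ],
                }
            },
        },
    }


manifest = k6_smoke_job(
    name="gbim-k6-smoke",  # hypothetical name
    script_configmap="gbim-k6-scripts",  # hypothetical ConfigMap
    remote_write_url="http://prometheus:9090/api/v1/write",  # hypothetical URL
)
print(json.dumps(manifest, indent=2))
```

For this to work, Prometheus has to accept remote-write traffic, which it does only when started with the --web.enable-remote-write-receiver flag.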
Frontend Analytics with GA4 Helper
While the core observability stack focused on Prometheus, Grafana, k6, and request logs, the frontend also integrated an analytics event helper for Google Analytics 4 (GA4). This helper sends user activity signals, complementing the backend metrics with client-side behavioral data. The GA4 implementation is restricted to specific hosts and environments, ensuring data quality and compliance. This provides an additional layer of insight into user engagement, though the primary evidence for operational health and business outcomes remains rooted in the Prometheus-Grafana-k6 stack. (Source: dev.to blog post, "Membangun Observability GBIM")
WHAT WE'D CHANGE
The GBIM observability playbook provides a solid foundation, but certain aspects warrant modification for broader applicability or future scaling. The reliance on custom Prometheus metrics, while powerful for specific business outcomes, introduces a maintenance overhead. Each new operational flow or business question requires explicit code changes in monitoring/metrics.py and subsequent deployments. This can slow down iteration cycles for business intelligence or feature development, especially in fast-moving environments. An alternative approach might involve a more generalized event-logging strategy, where structured events are emitted and then processed by a dedicated observability pipeline (e.g., using OpenTelemetry collectors to transform events into metrics) without direct code coupling.
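To make the event-based alternative concrete, here is a small sketch in which application code only emits structured JSON events and a separate step folds them into counters. The event names, field names, and the RULES table are hypothetical, and the aggregate function merely stands in for what a collector pipeline would do; it is not OpenTelemetry Collector configuration.

```python
import json
from collections import Counter

# Hypothetical mapping from event name to the metric it feeds and the
# event fields that become labels. Adding a metric means adding a rule
# here, not changing application code.
RULES = {
    "auth.register": ("gbm_auth_register_total", ("role", "outcome")),
    "auth.activation": ("gbm_auth_activation_total", ("outcome",)),
}


def aggregate(event_lines):
    """Fold JSON event lines into counts keyed by (metric, label pairs)."""
    counts = Counter()
    for line in event_lines:
        event = json.loads(line)
        rule = RULES.get(event.get("event"))
        if rule is None:
            continue  # events without a rule are skipped, not treated as errors
        metric, label_names = rule
        labels = tuple(
            (name, str(event.get(name, "unknown"))) for name in label_names
        )
        counts[(metric, labels)] += 1
    return counts


events = [
    '{"event": "auth.register", "role": "member", "outcome": "success"}',
    '{"event": "auth.register", "role": "member", "outcome": "validation_error"}',
    '{"event": "auth.register", "role": "member", "outcome": "validation_error"}',
    '{"event": "auth.activation", "outcome": "invalid_token"}',
    '{"event": "page.view"}',
]
counts = aggregate(events)

# Print the result in a Prometheus-like text form for inspection.
for (metric, labels), value in sorted(counts.items()):
    label_text = ",".join(f'{key}="{val}"' for key, val in labels)
    print(f"{metric}{{{label_text}}} {value}")
```

The trade-off is that the mapping moves out of the application and into the pipeline, so it still needs an owner and a review process.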
The explicit mention of X-Correlation-ID in the frontend and backend indicates a custom implementation. While effective, this approach can become complex in microservices architectures with multiple hops and varying service languages. Adopting an industry-standard distributed tracing solution, such as OpenTelemetry or Jaeger, would provide more robust context propagation across service boundaries. These systems offer automatic instrumentation for many frameworks and languages, reducing the manual effort required for consistent correlation ID handling and providing richer trace data beyond simple IDs.
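For comparison with a bare correlation ID, the sketch below shows the traceparent header defined by the W3C Trace Context specification, which OpenTelemetry uses by default for propagation. An OpenTelemetry SDK would normally create and forward this header automatically; the functions here are hand-written for illustration only, and the parser is simplified to accept just the four-field layout of version 00.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def new_traceparent(sampled=True):
    """Start a new trace with a random trace ID and span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(value):
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(value):
    """Keep the trace ID but mint a new span ID for the next hop."""
    parsed = parse_traceparent(value)
    if parsed is None:
        return new_traceparent()
    trace_id, _parent_id, sampled = parsed
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

Unlike a single correlation ID, each hop gets its own span ID while the trace ID stays constant, which lets a tracing backend reconstruct the call tree rather than only group log lines.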
Finally, the k6 smoke tests running as Kubernetes Jobs are valuable for synthetic monitoring. However, the piece does not detail the alerting strategy tied to these tests. Without well-defined thresholds and automated alerts for performance degradations or failures detected by k6, the telemetry data remains passive. Integrating these test results with an incident management system and establishing clear SLOs (Service Level Objectives) based on k6 outputs would transform proactive monitoring into actionable incident prevention. This would ensure that the investment in k6 tests translates directly into improved system reliability.
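The arithmetic behind such a check is simple, as the sketch below shows for one smoke run. The thresholds (a 500 ms p95 latency limit and a 1% failure rate) and the sample data are hypothetical; the post gives no SLO values. In practice this logic would more likely live in k6 thresholds or Prometheus alerting rules than in a standalone script.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, for an integer pct."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def evaluate_smoke_run(
    durations_ms, failed_requests, p95_limit_ms=500, max_failure_rate=0.01
):
    """Return a list of human-readable SLO breaches for one smoke run."""
    total = len(durations_ms)
    failure_rate = failed_requests / total
    p95 = percentile(durations_ms, 95)
    breaches = []
    if p95 > p95_limit_ms:
        breaches.append(f"p95 latency {p95} ms exceeds {p95_limit_ms} ms")
    if failure_rate > max_failure_rate:
        breaches.append(
            f"failure rate {failure_rate:.1%} exceeds {max_failure_rate:.1%}"
        )
    return breaches


# Twenty requests: eighteen fast, two slow, one of them failed.
slow_run = evaluate_smoke_run([120] * 18 + [900, 950], failed_requests=1)
healthy_run = evaluate_smoke_run([120] * 20, failed_requests=0)

for breach in slow_run:
    print("ALERT:", breach)
```

An empty result means the run met both objectives; a non-empty one is what would be routed to an incident management system.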
LANDING
GBIM's approach demonstrates that effective observability extends beyond infrastructure monitoring to encompass direct business outcomes. By instrumenting critical operational flows with custom metrics and ensuring end-to-end request traceability, the team established a clear link between system performance and user experience. The integration of automated synthetic tests further solidifies this foundation, providing early warnings for potential issues. This playbook highlights the necessity of tailoring observability tools to specific business questions, transforming raw telemetry into actionable insights for product and engineering teams.
Every claim ties to a primary source. See our methodology.