Datadog Observability Implementation for E-commerce Platform

Challenge

A mid-size e-commerce company running on on-premise Kubernetes was experiencing persistent performance issues visible to end users. Response times varied wildly, from 200ms to over 3000ms for the same transactions, but the team couldn’t pinpoint the root cause.

Their monitoring was fragmented across seven separate tools with no correlation between them:

Proxmox for host metrics
Rancher for basic Kubernetes metrics and logs
Grafana for extended Kubernetes metrics
Loki for application logs
Azure Application Insights for backend APM
CloudFlare for CDN and traffic statistics
Google Analytics for business metrics

Key problems with this setup included aggressive sampling on the APM side (dropping critical performance data for cost reasons), no frontend instrumentation, no end-to-end transaction tracing from CloudFlare through to the backend, and partial ownership of monitoring systems by the infrastructure vendor, meaning historical data would be lost if the relationship ended.

The infrastructure itself was a Kubernetes cluster running on Proxmox across 3 physical servers in a colocation facility, with CloudFlare as the CDN and TLS termination layer.

Method

We conducted an infrastructure audit covering network analysis, Kubernetes configuration, backend performance, storage, and the existing monitoring architecture. Based on the findings, we implemented Datadog as a unified observability platform:

Infrastructure Monitoring: Kubernetes API metrics, database performance (SQL and Redis), host-level metrics from the Proxmox hypervisor, and storage performance monitoring for both block storage (ZFS over LVM) and object storage (Minio).

APM (Application Performance Monitoring): Replaced Azure Application Insights with Datadog APM, providing full distributed tracing across the backend application without the aggressive sampling that was hiding performance issues.

Network Observability: End-to-end traffic analysis from the on-premise Kubernetes cluster, through the hypervisor network layer, across the Nginx load balancer, and out to CloudFlare. This was the critical missing piece, as no previous tool was watching the full network path.

This project demonstrates our cloud observability services and Datadog consulting capabilities applied to a complex on-premise environment.

Results

Within hours of the initial Datadog deployment, the team gained visibility they had never had before. The first and most significant finding: a ~1000ms latency penalty on every single request between the colocation data centre and CloudFlare. A full second of network transit time that was invisible to all previous monitoring tools because none of them were watching that specific path.

This discovery reframed the entire performance conversation. The issue wasn’t application code, Kubernetes configuration, or database queries. It was a network routing problem between the colocation provider and CloudFlare that could only be seen with correlated end-to-end observability.

Additional findings from the audit included:

Storage performance degradation from ZFS virtualisation layers reducing physical disk throughput
Single points of failure in Minio (standalone mode) and NFS storage with no high availability
Limited autoscaling capability due to the Proxmox hypervisor not supporting node-level autoscaling
CPU clock speeds below 3GHz potentially limiting application performance under load

The engagement opened a broader conversation about the fundamental limitations of the on-premise setup for an e-commerce business that needs to scale rapidly for paid advertising campaigns. The client has since engaged with us on planning a migration to AWS using the AWS Migration Acceleration Program (MAP).

Datadog implementation reveals hidden network latency in on-premise e-commerce infrastructure

Technologies used

Challenge

Method

Results

Conclusions