Kubernetes Platform Modernisation for Live Events Streaming

About the Customer

Our client is an Edinburgh-based fintech company that provides global payment infrastructure for digital content. Their platform enables rights holders and media owners to monetise videos, live streams, podcasts, and articles through micropayments and pay-per-view access without requiring consumers to commit to subscriptions. The company partners with major sports and entertainment organisations to offer flexible, subscription-free content access to global audiences during live events.

The Challenge

The client’s existing Kubernetes platform had been designed for a previous line of business focused on video processing. As the company pivoted to real-time payment processing during live sporting events, the infrastructure presented critical challenges:

Traffic unpredictability. Live events generate massive, sudden traffic spikes that last hours and then drop to near-zero. The existing cluster sizing assumed steady-state video processing workloads, not burst payment processing.
Single-account blast radius. All environments (development, staging, production) ran in a single AWS account with no isolation. For a fintech platform processing real payments, an incident in development could impact production.
Limited developer velocity. Inherited CI/CD workflows were slow and fragile, there were no ephemeral preview environments, and a single shared staging environment became a bottleneck where developers queued for access.
No platform visibility. Monitoring was minimal. During live events, the team had no reliable signals for whether infrastructure was keeping up with demand or degrading silently.
Cost opacity. With everything in one account, billing attribution across environments was impossible. Over-provisioned resources from the video processing era were burning budget without justification.

The risk was clear: without modernisation, the next major live event could expose payment processing failures to a global audience.

The Solution

Multi-Account Architecture

We restructured the AWS environment into a multi-account Organisation with separate accounts for development, staging, and production. Service Control Policies enforce consistent guardrails across all accounts. Account-level isolation keeps development activity separated from production payment processing, so an incident in one environment stays contained to that account.

Kubernetes on Amazon EKS

The client had deep Kubernetes expertise, so we retained EKS and modernised the cluster configuration rather than migrating to a different orchestration model. We evaluated ECS Fargate as an alternative, but the cost of rewriting the existing Helm charts, manifests, and deployment tooling was not justified.

The modernisation focused on:

Reconfiguring node groups and workload resource profiles sized for payment processing rather than video processing
Implementing horizontal pod autoscaling tuned for live event traffic patterns
Improving cluster-level monitoring through Amazon Managed Service for Prometheus and CloudWatch Container Insights
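To illustrate what the autoscaling tuning works with, here is a minimal Python sketch of the core replica calculation described in the Kubernetes Horizontal Pod Autoscaler documentation (desired = ceil(current replicas x current metric / target metric), with a default tolerance of 0.1). The function name, metric values, and replica bounds are illustrative assumptions rather than the client's actual settings, and the sketch leaves out stabilisation windows and scaling behaviours.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Replica count requested by the basic HPA formula, clamped to bounds.

    All numbers passed in are hypothetical; real targets come from load tests.
    """
    ratio = current_metric / target_metric
    # Inside the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # minReplicas / maxReplicas always win over the computed value.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods at 90% CPU against a 60% target asks for more pods.
    print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=50))
```

For burst traffic, the levers that matter most in this formula are the target value (lower targets scale out earlier) and the maximum replica count, which has to be high enough for the largest expected event.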

Distributed tracing based on OpenTelemetry is being implemented, feeding into the Amazon Managed Service for Prometheus stack, with Amazon Managed Grafana dashboards providing visibility across payment processing flows during live events.

Data Layer

The platform runs Aurora Serverless v2 for both PostgreSQL and MySQL workloads, with RDS Proxy providing connection pooling during traffic spikes. Amazon MSK handles event streaming for real-time payment processing and revenue split calculations. ElastiCache provides the caching layer.

A significant part of our work was optimising the data layer’s cost profile. Through CloudWatch capacity analysis, we identified clusters that were over-provisioned around the clock (locking capacity despite low average usage) alongside clusters that were correctly sized and legitimately spiked during events. Each cluster required individual tuning based on actual usage patterns rather than blanket changes. We also identified legacy engine versions incurring unnecessary extended support surcharges.
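The distinction between clusters that are over-provisioned around the clock and clusters that legitimately spike can be sketched as a simple classification over utilisation samples. This is a minimal Python illustration only: the thresholds, labels, and sample data are assumptions, and in practice the samples would come from CloudWatch capacity metrics for each cluster.

```python
from statistics import mean


def classify_cluster(utilisation: list[float], low_avg: float = 0.25,
                     high_peak: float = 0.80) -> str:
    """Classify a cluster from utilisation samples (fraction of capacity used).

    Thresholds are hypothetical and would be tuned per cluster.
    """
    avg = mean(utilisation)
    peak = max(utilisation)
    if peak >= high_peak:
        # Capacity is genuinely needed during events, even if the average is low.
        return "sized-for-spikes"
    if avg < low_avg:
        # Low average and no meaningful peak: capacity is locked but unused.
        return "over-provisioned"
    return "steady"


if __name__ == "__main__":
    quiet = [0.10, 0.12, 0.15, 0.10]
    event_driven = [0.10, 0.10, 0.95, 0.10]
    print(classify_cluster(quiet), classify_cluster(event_driven))
```

Looking at the peak as well as the average is what prevents a blanket reduction from cutting capacity on a cluster that only looks idle between events.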

Infrastructure as Code

All AWS infrastructure is defined and managed through Terraform, from the multi-account Organisation structure and EKS clusters to Aurora configurations, MSK, IAM roles, security groups, and networking. Changes are version-controlled, peer-reviewed, and applied consistently across environments.

The IaC foundation is particularly critical for live event preparation. Infrastructure pre-scaling (node groups, database capacity ceilings, connection pool sizing) needs to happen reliably and reproducibly before each event rather than through manual console adjustments. Our Cloud Under Control services cover this approach in detail.
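As a rough sketch of how pre-scaling inputs could be derived reproducibly rather than by hand, the Python below turns an event forecast into pod and node counts. Every name and number here (EventForecast, the per-pod throughput, pods per node, and the headroom factor) is a hypothetical assumption for illustration; the resulting values would then be passed to Terraform as variables and applied through the normal review process.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EventForecast:
    """Hypothetical forecast for one live event."""
    name: str
    peak_payments_per_second: float


def prescale_plan(event: EventForecast, payments_per_pod_per_second: float,
                  pods_per_node: int, headroom: float = 0.25) -> dict:
    """Compute pod and node counts for the forecast peak plus headroom."""
    target_load = event.peak_payments_per_second * (1 + headroom)
    pods = math.ceil(target_load / payments_per_pod_per_second)
    nodes = math.ceil(pods / pods_per_node)
    return {"event": event.name, "pods": pods, "nodes": nodes}


if __name__ == "__main__":
    forecast = EventForecast("cup-final", peak_payments_per_second=2000)
    print(prescale_plan(forecast, payments_per_pod_per_second=50, pods_per_node=8))
```

Keeping the calculation in code means the same forecast always yields the same capacity request, which is the property manual console adjustments lack.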

While there are no heavy compliance requirements driving policy-as-code today, elements of an AWS Landing Zone with AWS Config rules are in the backlog. They will provide automated compliance validation as the platform scales into new event partnerships.

Security and Compliance

IAM Identity Center replaced scattered access management with centralised SSO and enforced MFA. Permission sets map to job functions, sessions expire automatically, and there are no long-lived IAM user credentials anywhere in the system. AWS Secrets Manager stores all database credentials, payment processing API keys, and integration secrets, eliminating hardcoded credentials entirely.

AWS KMS provides customer-managed encryption keys for data at rest across Aurora, S3, and EBS. Security groups enforce layered network isolation: the ALB is the only internet-facing component, EKS nodes only accept traffic from the ALB, Aurora only accepts connections from EKS, and MSK brokers are restricted to authorised workloads. All of these rules are defined in Terraform.
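The layered isolation described above lends itself to an automated check. The following Python sketch models ingress rules as a plain dictionary and verifies two properties: only the ALB is internet-facing, and no tier accepts traffic from an unexpected source. The component names, the dictionary model, and treating the EKS nodes as the authorised MSK workloads are all assumptions for illustration; a real check would read the rules from Terraform state or the AWS API.

```python
INTERNET = "0.0.0.0/0"

# Hypothetical model: component -> set of sources allowed to connect to it.
EXPECTED_INGRESS = {
    "alb": {INTERNET},
    "eks-nodes": {"alb"},
    "aurora": {"eks-nodes"},
    "msk": {"eks-nodes"},  # assumption: authorised workloads run on EKS
}


def internet_facing(rules: dict[str, set[str]]) -> list[str]:
    """Components that accept traffic directly from the internet."""
    return sorted(name for name, sources in rules.items() if INTERNET in sources)


def violations(rules: dict[str, set[str]],
               expected: dict[str, set[str]]) -> list[tuple[str, str]]:
    """(component, source) pairs present in rules but absent from the allow-list."""
    return sorted(
        (name, source)
        for name, sources in rules.items()
        for source in sources - expected.get(name, set())
    )


if __name__ == "__main__":
    drifted = {**EXPECTED_INGRESS, "aurora": {"eks-nodes", INTERNET}}
    print(internet_facing(drifted))
    print(violations(drifted, EXPECTED_INGRESS))
```

Run in CI against planned changes, a check of this shape would flag a rule that opens a database to the internet before it is applied.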

CloudTrail records all API activity across all accounts and regions. VPC endpoints route S3 and DynamoDB traffic internally rather than through NAT Gateways, reducing both the attack surface and data transfer costs.

Container image scanning is performed at the ECR level, catching vulnerabilities before images reach production. Kosli is integrated into the CI pipelines as a supply chain attestation and SAST tool, providing an audit trail of what was built, tested, and deployed. We are evaluating additional options to strengthen the shift-left security posture further.

The architecture positions the client for ISO 27001 certification, with multi-account isolation, KMS encryption, IAM Identity Center, WAF, and full audit trails already in place. See our security and compliance services for how we approach compliance readiness.

CI/CD and Developer Experience

We rebuilt the deployment pipelines using GitHub Actions, replacing inherited workflows that had been slowing the team down. The new pipelines handle automated testing, container image building, ECR push, and EKS deployment with rolling updates and health check validation.

The biggest quality-of-life improvement was the introduction of ephemeral preview environments: isolated, per-pull-request environments that spin up automatically and let developers validate changes in a realistic setup before merging. This eliminated the shared staging bottleneck entirely. See our CI/CD consulting services for more on this approach.
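One small building block of per-pull-request environments is a deterministic, collision-free name for each one. The Python sketch below assumes, purely for illustration, that each preview lives in its own Kubernetes namespace named after the pull request; the function name and the "preview" prefix are hypothetical. It produces a valid DNS-1123 label (lowercase alphanumerics and hyphens, at most 63 characters), which is what Kubernetes requires for namespace names.

```python
import re


def preview_namespace(pr_number: int, branch: str, prefix: str = "preview") -> str:
    """Derive a DNS-1123 compliant namespace name for a pull request preview."""
    # Collapse every run of characters outside [a-z0-9] into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    name = f"{prefix}-pr{pr_number}-{slug}" if slug else f"{prefix}-pr{pr_number}"
    # Namespace names are limited to 63 characters and must not end with a hyphen.
    return name[:63].rstrip("-")


if __name__ == "__main__":
    print(preview_namespace(412, "feature/Apple_Pay-checkout"))
```

Including the pull request number keeps names unique even when two branches reduce to the same slug, and makes teardown on merge a simple lookup.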

AI-Augmented Development

Beyond infrastructure modernisation, we are supporting the client with AI-based experiments across two areas: growth and lead generation workflows, and SDLC improvements to enhance developer experience for the engineering team. These experiments focus on reducing friction in day-to-day development tasks and accelerating feedback loops rather than replacing existing tooling.

Backup and Disaster Recovery

Aurora Serverless v2 provides continuous automated backups with point-in-time recovery, critical for a payment platform where data loss means financial discrepancies. MSK’s built-in partition replication ensures message durability across broker failures. EKS workloads can be redeployed from ECR at any time, and the entire infrastructure can be recreated from the Terraform codebase.

We established RTO and RPO targets with the client, with specific attention to what failure looks like during a live event. The single-region deployment (eu-west-1) with multi-AZ distribution was a conscious tradeoff between cost and availability, accepted based on current traffic patterns.

Operational runbooks for live event preparation and incident response are a significant focus for the next quarter as the second half of the year brings more intensive event schedules. We are evaluating tools like the AWS DevOps Agent for automated triaging, with the goal of moving from reactive incident response towards preventive remediations backed by thorough runbook coverage.

Results

Vulnerability resolution time: Mean time to resolution dropped from 48 hours to 8 hours, an 83% improvement.
Manual security overhead: Reduced by 93% through automated tooling and centralised access management.
Live event reliability: EKS auto-scaling handles traffic spikes dynamically, with pre-warming based on event schedules. The manual intervention and degradation seen during previous events have been eliminated.
Developer velocity: Ephemeral preview environments and AI-augmented development workflows shortened feedback loops. CI/CD pipelines handle deployment automation end-to-end with no manual steps.
Cost transparency: Right-sized infrastructure through empirical capacity analysis, with clear billing separation across environments for financial oversight.
Compliance readiness: Multi-account architecture, KMS encryption, IAM Identity Center, WAF, and full audit trails position the client for ISO 27001 certification.

Service Level Objectives: An availability percentage is being implemented as the Service Level Indicator, together with SLOs for core payment processing services that act as a quality gate. This establishes an SRE mode of working that ties platform reliability directly to business outcomes during live events.
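To show how an availability SLI and an SLO can act as a gate, here is a minimal Python sketch of an error budget calculation. The 99.9% target, the request counts, and the function names are hypothetical assumptions, not the client's agreed objectives.

```python
def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is breached."""
    if total == 0:
        return 1.0
    allowed_failures = total * (1.0 - slo_target)
    actual_failures = total - good
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


def gate_passes(good: int, total: int, slo_target: float,
                min_budget: float = 0.0) -> bool:
    """Allow a release only while at least min_budget of the budget remains."""
    return error_budget_remaining(good, total, slo_target) >= min_budget


if __name__ == "__main__":
    # 500 failed payments out of 1,000,000 against a hypothetical 99.9% SLO.
    print(round(error_budget_remaining(999_500, 1_000_000, 0.999), 3))
    print(gate_passes(998_000, 1_000_000, 0.999))
```

Expressing the gate as remaining budget rather than a raw percentage makes the trade-off explicit: a risky change shortly before a major event can be held back when most of the budget has already been spent.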

Lessons Learned

We initially underestimated the complexity of refactoring mature Infrastructure as Code alongside a complex event-driven architecture built on Kafka. The combination of deeply nested Terraform modules and tightly coupled MSK configurations held us back in the early weeks. We overcame this by stepping back, identifying the critical bottlenecks, and re-evaluating priorities to match the client’s immediate business expectations rather than attempting a full overhaul on all fronts simultaneously. The lesson: with a live platform processing payments during scheduled events, you sequence work around business risk rather than technical elegance.

What’s Next

The modernisation phase has transitioned into an ongoing retainer for CloudOps and platform maintenance. We continue to manage the EKS clusters, tune Aurora configurations around event schedules, optimise costs, and support the engineering team with deployments and troubleshooting.

As transaction volumes scale with new event partnerships, we are developing auto-scaling strategies that pre-warm infrastructure based on event schedules and historical traffic patterns, ensuring the platform is ready before the audience arrives.

AWS Services: Amazon EKS, Amazon ECR, Application Load Balancer, Amazon Aurora Serverless v2 (PostgreSQL, MySQL), Amazon RDS Proxy, Amazon MSK, Amazon ElastiCache, AWS Secrets Manager, AWS KMS, AWS IAM Identity Center, Amazon CloudWatch, Amazon Managed Service for Prometheus, AWS CloudTrail, Amazon S3, AWS Certificate Manager, Amazon Route 53, AWS Lambda, AWS WAF

Tools: Terraform, GitHub Actions, Kosli, Docker, Helm
