A fintech-for-media company modernised its AWS Kubernetes platform to handle unpredictable traffic spikes during live sports events. Multi-account architecture, EKS optimisation, Aurora Serverless v2 tuning, and ephemeral preview environments replaced a single-account setup with no blast radius containment.
Industry
Fintech, Content Distribution
Location
Edinburgh, United Kingdom
Time
07.2025 - Present (ongoing retainer)
Company
UK Startup, Fintech, Micro-Payment Platform
Our client is an Edinburgh-based fintech company that provides global payment infrastructure for digital content. Their platform enables rights holders and media owners to monetise videos, live streams, podcasts, and articles through micropayments and pay-per-view access without requiring consumers to commit to subscriptions. The company partners with major sports and entertainment organisations to offer flexible, subscription-free content access to global audiences during live events.
The client’s existing Kubernetes platform had been designed for a previous line of business focused on video processing. As the company pivoted to real-time payment processing during live sporting events, the infrastructure presented critical challenges:
The risk was clear: without modernisation, the next major live event could expose payment processing failures to a global audience.
We restructured the AWS environment into a multi-account Organisation with separate accounts for development, staging, and production. Service Control Policies enforce consistent guardrails across all accounts. Account-level isolation guarantees that development activity can never impact production payment processing.
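As an illustration of what such a guardrail can look like, the sketch below builds a Service Control Policy document in Python. The two rules shown (blocking accounts from leaving the Organisation, and denying requests outside an allowed region list) are assumptions chosen for the example, not the client's actual policies; the exempted global services are likewise illustrative.

```python
import json


def build_guardrail_scp(allowed_regions):
    """Return an illustrative SCP document as a dict (hypothetical rules)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Member accounts cannot detach themselves from the Organisation.
                "Sid": "DenyLeavingOrganization",
                "Effect": "Deny",
                "Action": "organizations:LeaveOrganization",
                "Resource": "*",
            },
            {
                # Deny everything outside the allowed regions, except a few
                # global services (assumed list) that must keep working.
                "Sid": "DenyOutsideAllowedRegions",
                "Effect": "Deny",
                "NotAction": ["iam:*", "organizations:*", "route53:*", "sts:*"],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(allowed_regions)
                    }
                },
            },
        ],
    }


scp = build_guardrail_scp(["eu-west-1"])
print(json.dumps(scp, indent=2))
```

Because SCPs only restrict and never grant permissions, deny statements like these cap what any role in a member account can do, regardless of its own IAM policies.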
The client had deep Kubernetes expertise, so we retained EKS and modernised the cluster configuration rather than migrating to a different orchestration model. We evaluated ECS Fargate as an alternative, but the cost of rewriting the existing Helm charts, manifests, and deployment tooling was not justified.
The modernisation focused on:
Distributed tracing is being implemented with OpenTelemetry and feeds into the Amazon Managed Service for Prometheus stack, with Amazon Managed Grafana dashboards providing visibility across payment processing flows during live events.
The platform runs Aurora Serverless v2 for both PostgreSQL and MySQL workloads, with RDS Proxy providing connection pooling during traffic spikes. Amazon MSK handles event streaming for real-time payment processing and revenue split calculations. ElastiCache provides the caching layer.
A significant part of our work was optimising the data layer’s cost profile. Through CloudWatch capacity analysis, we identified clusters that were over-provisioned around the clock (locking capacity despite low average usage) alongside clusters that were correctly sized and legitimately spiked during events. Each cluster required individual tuning based on actual usage patterns rather than blanket changes. We also identified legacy engine versions incurring unnecessary extended support surcharges.
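The distinction between the two kinds of cluster can be sketched as a simple classification over utilisation samples. This is a minimal illustration, assuming the samples are percentages of provisioned capacity exported from CloudWatch; the thresholds and the `classify` helper are hypothetical and would need tuning per cluster.

```python
from statistics import mean


def classify(samples, low_avg=20.0, high_peak=70.0):
    """Classify a cluster from utilisation samples (percent of capacity).

    Thresholds are illustrative assumptions, not recommended values.
    """
    avg = mean(samples)
    peak = max(samples)
    if avg < low_avg and peak < high_peak:
        # Low average and never busy: capacity is locked for no reason.
        return "over-provisioned"
    if avg < low_avg and peak >= high_peak:
        # Low average but genuine bursts: sized for events, leave alone.
        return "spiky-but-right-sized"
    return "steady"


quiet = classify([5, 8, 6, 10])
event_driven = classify([5, 6, 5, 90, 4, 5, 6, 5])
busy = classify([50, 60, 55])
print(quiet, event_driven, busy)
```

Average utilisation alone would flag both of the first two clusters for downsizing, which is why the peak is checked separately before any capacity is reduced.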
All AWS infrastructure is defined and managed through Terraform, from the multi-account Organisation structure and EKS clusters to Aurora configurations, MSK, IAM roles, security groups, and networking. Changes are version-controlled, peer-reviewed, and applied consistently across environments.
The IaC foundation is particularly critical for live event preparation. Infrastructure pre-scaling (node groups, database capacity ceilings, connection pool sizing) needs to happen reliably and reproducibly before each event rather than through manual console adjustments. Our Cloud Under Control services cover this approach in detail.
While no heavy compliance requirements are driving policy-as-code today, elements of an AWS Landing Zone with AWS Config rules are in the backlog to provide automated compliance validation as the platform scales into new event partnerships.
IAM Identity Center replaced scattered access management with centralised SSO and enforced MFA. Permission sets map to job functions, sessions expire automatically, and there are no long-lived IAM user credentials anywhere in the system. AWS Secrets Manager stores all database credentials, payment processing API keys, and integration secrets, eliminating hardcoded credentials entirely.
AWS KMS provides customer-managed encryption keys for data at rest across Aurora, S3, and EBS. Security groups enforce layered network isolation: the ALB is the only internet-facing component, EKS nodes only accept traffic from the ALB, Aurora only accepts connections from EKS, and MSK brokers are restricted to authorised workloads. All of these rules are defined in Terraform.
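The layered isolation can be modelled as a small allow-list, which is also a convenient shape for a regression check against the Terraform output. This is a sketch only: the tier names are hypothetical labels, and treating the EKS nodes as the "authorised workloads" for MSK is an assumption.

```python
# Maps each tier to the only sources allowed to open connections to it.
ALLOWED_INGRESS = {
    "alb": {"internet"},
    "eks-nodes": {"alb"},
    "aurora": {"eks-nodes"},
    "msk": {"eks-nodes"},  # assumption: authorised workloads run on EKS
}


def is_allowed(source, target, rules=ALLOWED_INGRESS):
    """Return True if traffic from source to target is permitted."""
    return source in rules.get(target, set())


def internet_facing(rules=ALLOWED_INGRESS):
    """List every tier that accepts traffic directly from the internet."""
    return sorted(t for t, sources in rules.items() if "internet" in sources)


print(internet_facing())
print(is_allowed("internet", "aurora"))
```

A check like `internet_facing() == ["alb"]` in CI would fail as soon as a rule change exposed a second tier to the internet.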
CloudTrail records all API activity across all accounts and regions. VPC endpoints route S3 and DynamoDB traffic internally rather than through NAT Gateways, reducing both the attack surface and data transfer costs.
Container image scanning is performed at the ECR level, catching vulnerabilities before images reach production. Kosli is integrated into the CI pipelines as a supply chain attestation and SAST tool, providing an audit trail of what was built, tested, and deployed. We are evaluating additional options to strengthen the shift-left security posture further.
The architecture positions the client for ISO 27001 certification, with multi-account isolation, KMS encryption, IAM Identity Center, WAF, and full audit trails already in place. See our security and compliance services for how we approach compliance readiness.
We rebuilt the deployment pipelines using GitHub Actions, replacing inherited workflows that had been slowing the team down. The new pipelines handle automated testing, container image building, ECR push, and EKS deployment with rolling updates and health check validation.
The biggest quality-of-life improvement was the introduction of ephemeral preview environments: isolated, per-pull-request environments that spin up automatically and let developers validate changes in a realistic setup before merging. This eliminated the shared staging bottleneck entirely. See our CI/CD consulting services for more on this approach.
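One small building block of per-pull-request environments is a deterministic name for each environment. The sketch below assumes each preview lives in its own Kubernetes namespace (the source does not say how isolation is implemented) and derives a valid namespace name from a repository name and pull request number; `preview_namespace` and its naming scheme are hypothetical.

```python
import re

MAX_LABEL_LENGTH = 63  # Kubernetes namespace names are DNS-1123 labels


def preview_namespace(repo, pr_number, prefix="preview"):
    """Build a namespace name such as 'preview-my-repo-pr-42'."""
    # Lowercase and collapse anything that is not a-z or 0-9 into '-'.
    slug = re.sub(r"[^a-z0-9]+", "-", repo.lower()).strip("-")
    suffix = f"-pr-{pr_number}"
    # Truncate the head so the PR suffix always survives intact.
    head = f"{prefix}-{slug}"[: MAX_LABEL_LENGTH - len(suffix)].rstrip("-")
    return head + suffix


name = preview_namespace("Payments_API", 42)
long_name = preview_namespace("x" * 100, 1234)
print(name)
```

Keeping the pull request number at the end means the teardown job can find and delete the environment from the number alone when the pull request is closed.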
Beyond infrastructure modernisation, we are supporting the client with AI-based experiments across two areas: growth and lead generation workflows, and SDLC improvements to enhance developer experience for the engineering team. These experiments focus on reducing friction in day-to-day development tasks and accelerating feedback loops rather than replacing existing tooling.
Aurora Serverless v2 provides continuous automated backups with point-in-time recovery, critical for a payment platform where data loss means financial discrepancies. MSK’s built-in partition replication ensures message durability across broker failures. EKS workloads can be redeployed from ECR at any time, and the entire infrastructure can be recreated from the Terraform codebase.
We established RTO and RPO targets with the client, with specific attention to what failure looks like during a live event. The single-region deployment (eu-west-1) with multi-AZ distribution was a conscious tradeoff between cost and availability, accepted based on current traffic patterns.
Operational runbooks for live event preparation and incident response are a significant focus for the next quarter, as the second half of the year brings a more intensive event schedule. We are evaluating tools like the AWS DevOps Agent for automated triaging, with the goal of moving from reactive incident response towards preventive remediation backed by thorough runbook coverage.
We initially underestimated the complexity of refactoring mature Infrastructure as Code alongside a complex event-driven architecture built on Kafka. The combination of deeply nested Terraform modules and tightly coupled MSK configurations held us back in the early weeks. We overcame this by stepping back, identifying the critical bottlenecks, and re-evaluating priorities to match the client’s immediate business expectations rather than attempting a full overhaul on all fronts simultaneously. The lesson: with a live platform processing payments during scheduled events, you sequence work around business risk rather than technical elegance.
The modernisation phase has transitioned into an ongoing retainer for CloudOps and platform maintenance. We continue to manage the EKS clusters, tune Aurora configurations around event schedules, optimise costs, and support the engineering team with deployments and troubleshooting.
As transaction volumes scale with new event partnerships, we are developing auto-scaling strategies that pre-warm infrastructure based on event schedules and historical traffic patterns, ensuring the platform is ready before the audience arrives.
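A schedule-driven pre-warm can be reduced to two questions: when to scale and how far. The sketch below is a minimal illustration under stated assumptions; the lead time, headroom factor, per-node throughput, and the `prewarm_plan` helper are all hypothetical, and the strategies under development may differ.

```python
import math
from datetime import datetime, timedelta, timezone


def prewarm_plan(event_start, expected_peak_rps, rps_per_node,
                 lead_time=timedelta(minutes=45), headroom=1.3, min_nodes=3):
    """Return when to scale and how many nodes to request (illustrative)."""
    # Size for the forecast peak plus headroom, never below the floor.
    needed = math.ceil(expected_peak_rps * headroom / rps_per_node)
    return {
        "scale_at": event_start - lead_time,
        "desired_nodes": max(min_nodes, needed),
    }


# Hypothetical event: kick-off at 19:00 UTC, forecast peak of 4000 req/s.
kickoff = datetime(2025, 9, 6, 19, 0, tzinfo=timezone.utc)
plan = prewarm_plan(kickoff, expected_peak_rps=4000, rps_per_node=250)
print(plan["scale_at"].isoformat(), plan["desired_nodes"])
```

The forecast peak would come from historical traffic for comparable events, and the same plan could drive database capacity ceilings and connection pool sizes as well as node counts.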
AWS Services: Amazon EKS, Amazon ECR, Application Load Balancer, Amazon Aurora Serverless v2 (PostgreSQL, MySQL), Amazon RDS Proxy, Amazon MSK, Amazon ElastiCache, AWS Secrets Manager, AWS KMS, AWS IAM Identity Center, Amazon CloudWatch, Amazon Managed Service for Prometheus, AWS CloudTrail, Amazon S3, AWS Certificate Manager, Amazon Route 53, AWS Lambda, AWS WAF
Tools: Terraform, GitHub Actions, Kosli, Docker, Helm
Vulnerability resolution time dropped by 83%, and manual security overhead was reduced by 93%. The platform now auto-scales reliably for live event traffic spikes with pre-warming based on event schedules, and the engineering team ships faster through ephemeral preview environments and AI-augmented development workflows.