Using Route 53 weighted routing to safely migrate production traffic without any downtime
The underlying work was a migration from AWS Lambda to ECS Fargate for compute, and to S3 with CloudFront for static content. It was deployed into a completely new, clean AWS account to ensure that everything was created using Terraform. Running both the legacy and new infrastructure in parallel meant that we could test against production data with internal users before sending any customer traffic to it.
To test with internal users we initially wanted to set up IP-based routing to send all traffic from the office to the new infrastructure, but as it looked like too much work I decided to keep it simple and set up a www2 subdomain that pointed at the new infrastructure. Next I did an @here in #general with a message to get everyone to play around with the new site. After a bunch of positive reactions and no initial problems, we decided to route 20% of our customer traffic through it. Below is an example of how the Route 53 records were set up. This setup allowed us to split traffic three ways: internal users, production traffic to legacy, and production traffic to the new infrastructure.
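As a rough illustration of that three-way split, here is a minimal Python sketch that builds the kind of change batch Route 53 accepts: one plain record for www2 and two weighted records sharing the www name. The domain names, targets, TTL, and function names are all hypothetical placeholders, not the values or tooling we actually used (the real records were managed with Terraform).

```python
# Hypothetical names: none of these hostnames come from the real migration.
PROD_NAME = "www.example.com."
INTERNAL_NAME = "www2.example.com."
LEGACY_TARGET = "legacy.example.net."
NEW_TARGET = "new-alb.example.net."


def simple_record(name, target, ttl=60):
    """A plain CNAME with no routing policy (used for the internal www2 host)."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        },
    }


def weighted_record(name, set_id, target, weight, ttl=60):
    """One member of a weighted group: same name and type, unique SetIdentifier."""
    if not 0 <= weight <= 255:
        raise ValueError("Route 53 weights must be between 0 and 255")
    record = simple_record(name, target, ttl)
    record["ResourceRecordSet"]["SetIdentifier"] = set_id
    record["ResourceRecordSet"]["Weight"] = weight
    return record


def build_change_batch(legacy_weight, new_weight):
    """Internal users via www2, production split between legacy and new."""
    return {
        "Comment": "Weighted migration from legacy to new infrastructure",
        "Changes": [
            simple_record(INTERNAL_NAME, NEW_TARGET),
            weighted_record(PROD_NAME, "legacy", LEGACY_TARGET, legacy_weight),
            weighted_record(PROD_NAME, "new", NEW_TARGET, new_weight),
        ],
    }
```

Calling build_change_batch(80, 20) gives the initial 20% split, and build_change_batch(0, 100) completes the migration. The resulting dictionary has the shape expected by the ChangeBatch argument of boto3's change_resource_record_sets call, although this sketch does not make that call itself.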
After the initial 20% production traffic configuration we noticed a few logging errors, but because we could revert completely to the legacy infrastructure if something broke, we treated them like any other bug. Within two hours we had tested the new infrastructure internally and were serving 100% of our production traffic through it.
This is how weighted routing looks in the AWS Console: you create identical records but supply a different weight for each. Route 53 sends each record a share of the traffic equal to its weight divided by the total weight of all the records, so using 20 and 80 makes it really simple.
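To make the weight arithmetic concrete, here is a small Python sketch of the calculation. The function name and the record labels are made up for illustration, and it deliberately does not model the case where every weight is zero.

```python
def traffic_share(weights):
    """Return each record's expected share of traffic: weight / total weight."""
    total = sum(weights.values())
    if total == 0:
        # This sketch does not model the all-zero case.
        raise ValueError("at least one record needs a non-zero weight")
    return {name: weight / total for name, weight in weights.items()}


# Weights of 80 and 20 add up to 100, so they read directly as percentages.
shares = traffic_share({"legacy": 80, "new": 20})
```

The weights do not have to add up to 100. Weights of 4 and 1 give the same 80/20 split, but picking numbers that sum to 100 saves you the mental division while you are in the middle of a migration.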
Below are two graphs from the migration of production traffic. The first one shows the number of Lambda invocations decreasing as we reduced the legacy route weight.
The second, the HTTP response codes graph, shows traffic to the new infrastructure's Application Load Balancer gradually increasing as we put more weight against it.
Overall the process went really smoothly and we managed to safely migrate all production traffic without any downtime.