
Blue-Green Deployments on Terraform (For 850 Million Monthly Active Players)

In this post, you'll get a sneak peek into how we improved our infrastructure deployment practices by utilising the “Blue-Green deployment approach”.

The data collection API is one of the most critical and highly loaded services in GameAnalytics’ backend infrastructure, responsible for receiving and storing raw game events from more than 850 million unique monthly players across 70,000 games. An outage of the service at that scale would lead to irreversible data loss and thousands of sad customers.

In this blog post, we will discuss how we improved our infrastructure deployment practices by utilising the “Blue-Green deployment approach” powered by Terraform – a recipe that helps us achieve 100% uptime for our Data Collectors while continuously delivering new releases.

Data Collectors

Collectors are responsible for receiving high volumes of raw game events from players around the globe and storing them for subsequent processing by our analytics systems. It’s a REST service that, on busy days, handles up to 4.5 million HTTP requests per minute. Collectors were written in Erlang back in the early days of GameAnalytics, with a set of strict requirements: the service needed to be fast, scalable, and predictable.

Due to the stateless nature of the service and the concurrency-oriented, fault-tolerant characteristics of the Erlang VM, developing the service itself was a relatively simple task. Producing a convenient and cost-efficient deployment approach, however, took us quite a long time.

At first, our deployment procedures left a lot to be desired. They weren’t fully automated and required our engineers to perform many manual tasks, including:

  1. Provisioning new sets of instances;
  2. Deploying new releases to the new instances;
  3. Adding the new instances to a load-balancer;
  4. Gradually removing old instances from the load-balancer;
  5. And, ultimately, terminating the old instances.

While some of the steps were later simplified with Fabric scripts, the process was still tedious and time-consuming. This kind of deployment was also not cost-efficient, as it required running two full-size clusters simultaneously for some time – there wasn’t any straightforward way to gradually switch traffic between instances running different releases. Another fundamental problem was the lack of a rapid way to roll back, as a rollback implied performing the same steps in reverse order. Again, this was slow and error-prone.

Over time the situation got worse: as the load grew, so did the number of instances and, consequently, the duration and complexity of deployments.

About “Blue-Green deployments”

In our search for an optimal deployment process, we decided to go with Blue-Green deployments. It’s a well-known solution that – in our opinion – is easy to understand, reliable, and flexible.

A classical Blue-Green infrastructure consists of two environments and a router that allows flipping traffic between them. In a typical Blue-Green deployment, an engineer deploys a new release to the idle environment and, once the software is ready, flips the switch so that all requests start going to the new environment. If trouble occurs, the traffic can be flipped back to the original cluster for a fast and reliable rollback.

Our Blue-Green infrastructure consists of two load-balancers pointing to individual auto-scaling groups. A set of weighted DNS record sets allows us to choose how much traffic each of the load-balancers should receive.

In Amazon Route 53, weighted records allow routing portions of traffic as small as 1/255 (roughly 0.4%), which helps reduce deployment risk even further, since traffic can be shifted precisely and in small increments.

Infrastructure as code

The infrastructure change wouldn’t have been complete without improving our tools. Although the Fabric + Boto kit is nice and easy to get started with in the early stages of a project, it doesn’t scale well and can become a bottleneck as the team and infrastructure grow.

When deciding on a new infrastructure management tool, we picked Terraform – an increasingly popular open-source tool that helps us make infrastructure changes safely and predictably through declarative configuration files. One of the key things that appeal to us about Terraform is that it allows treating configuration files exactly like code, meaning you can keep change history in Git, offer changes through Pull Requests, and collaborate with colleagues in a familiar fashion. Enough talking; let’s see it in action!

Our Blue-Green infrastructure required the following resources:

  • A pair of Route 53 CNAME records with a weighted routing policy;
  • A pair of Load-Balancers (LB) with respective Target Groups (TG); and
  • Two Auto-Scaling Groups (ASG).

In Terraform, resources are the components of your infrastructure. For instance, this is what a pair of DNS records with a weighted routing policy could look like:

resource "aws_route53_record" "api-blue" {
  zone_id = "${var.zone-id}"
  name    = "api"
  type    = "CNAME"
  ttl     = "${var.api-ttl}"

  weighted_routing_policy {
    weight = "${var.api-blue-weight}"
  }

  set_identifier = "api-blue"
  records        = "${var.api-blue-records}"
}

resource "aws_route53_record" "api-green" {
  zone_id = "${var.zone-id}"
  name    = "api"
  type    = "CNAME"
  ttl     = "${var.api-ttl}"

  weighted_routing_policy {
    weight = "${var.api-green-weight}"
  }

  set_identifier = "api-green"
  records        = "${var.api-green-records}"
}

A load-balancer with a Target Group:

resource "aws_alb" "collect-alb" {
  name               = "${var.env}-collect-${var.flavour}-lb"
  internal           = false
  load_balancer_type = "application"

  security_groups = "${var.collect-lb-security-group-id}"
  subnets         = "${var.collect-lb-subnets}"
}

resource "aws_alb_target_group" "collect-alb-tg" {
  name     = "${var.env}-collect-${var.flavour}-tg"
  port     = "${var.collect-listen-port}"
  protocol = "HTTPS"
  vpc_id   = "${var.vpc-id}"

  health_check {
  }
}
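
We’ve left the health_check block empty here, so the provider defaults apply. For illustration, an explicit configuration could look like the following (the path and thresholds are placeholder values, not our production settings):

health_check {
  # illustrative values; point these at your own health endpoint
  protocol            = "HTTPS"
  path                = "/health"
  port                = "traffic-port"
  interval            = 30
  healthy_threshold   = 3
  unhealthy_threshold = 3
}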

An auto-scaling group:

resource "aws_autoscaling_group" "collect-asg" {
  name             = "${var.env}-collect-${var.flavour}-asg"
  max_size         = "${var.max-size}"
  min_size         = "${var.min-size}"
  desired_capacity = "${var.desired-capacity}"

  health_check_grace_period = 180
  health_check_type         = "ELB"

  force_delete         = "${var.force-delete}"
  launch_configuration = "${var.collect-asg-lc-name}"
  vpc_zone_identifier  = ["${var.subnet-ids}"]

  target_group_arns = ["${var.tg-arns}"]
}

Multiple related resources can be grouped together in modules, which has the benefit of better reusability and, in our case, better code organisation.
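
As an illustration, the ASG resource shown earlier could live in a `collect-asg` module that declares its inputs roughly like this (a minimal sketch; the file layout and the optional default are assumptions):

# modules/collect-asg/variables.tf (illustrative)
variable "env" {}
variable "flavour" {}
variable "ami-id" {}
variable "min-size" {}
variable "max-size" {}
variable "desired-capacity" {}

variable "force-delete" {
  default = false
}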

If we wrap all our resources into modules, our final configuration should look something like this:

module "api-records" {
  source = "/modules/api-blue-green-records"

  api-blue-weight  = 100
  api-green-weight = 0
  api-ttl          = 60
}

module "collect-blue-asg" {
  source  = "/modules/collect-asg"
  env     = "live"
  flavour = "blue"
  ami-id  = "${var.ami-release-1}"

  max-size         = 85
  desired-capacity = 65
  min-size         = 65
}

module "collect-green-asg" {
  source  = "/modules/collect-asg"
  env     = "live"
  flavour = "green"
  ami-id  = "${var.ami-release-1}"

  max-size         = 85
  desired-capacity = 0
  min-size         = 0
}

In the configuration above, we are sending 100% of the traffic to the Blue ASG, which consists of 65 active instances launched from the Release 1 AMI (Amazon Machine Image). Now, let’s see how easily we can deploy Release v2.

Let’s assume we already have the Release v2 AMI prepared. All we need to do now is make sure our configuration is up to date with the real infrastructure by running `terraform init && terraform plan`, then update the `ami-id` attribute of the Green ASG with the new image ID and scale it out to a reasonable number of instances (let’s say 10, for the purpose of this deep dive).

module "api-records" {
  source = "/modules/api-blue-green-records"

  api-blue-weight  = 100
  api-green-weight = 0
  api-ttl          = 60
}

module "collect-blue-asg" {
  source  = "/modules/collect-asg"
  env     = "live"
  flavour = "blue"
  ami-id  = "${var.ami-release-1}"

  max-size         = 85
  desired-capacity = 65
  min-size         = 65
}

module "collect-green-asg" {
  source  = "/modules/collect-asg"
  env     = "live"
  flavour = "green"
  ami-id  = "${var.ami-release-2}"

  max-size         = 85
  desired-capacity = 10
  min-size         = 10
}

Build and change the infrastructure by running `terraform apply`. We should have Green instances up and running Release v2 within a couple of minutes.

Let’s send 5% of the traffic to the Green group.

module "api-records" {
  source = "/modules/api-blue-green-records"

  api-blue-weight  = 95
  api-green-weight = 5
  api-ttl          = 60
}

module "collect-blue-asg" {
  source  = "/modules/collect-asg"
  env     = "live"
  flavour = "blue"
  ami-id  = "${var.ami-release-1}"

  max-size         = 85
  desired-capacity = 65
  min-size         = 65
}

module "collect-green-asg" {
  source  = "/modules/collect-asg"
  env     = "live"
  flavour = "green"
  ami-id  = "${var.ami-release-2}"

  max-size         = 85
  desired-capacity = 10
  min-size         = 10
}

Apply the change and watch your new software in action. It’s that simple!

Pitfalls and future work

Below are some of the observations and difficulties we faced while getting there. Hopefully, this saves you time if you want to follow similar practices in your production environment.

Keep your TTL on point

The TTL (Time To Live) value of your DNS record affects how quickly and smoothly traffic responds to a weight change. We found a TTL value of 60 seconds to provide the best balance for our use case.

Warm-up your load-balancers

One of the problems we faced is that both Classic and Application load-balancers in AWS require pre-warming. Routing even 15% of our traffic to a cold load-balancer turned out to be too much, so we have to increase the weight in tiny increments. This is one of the inconveniences we are still trying to find a better solution for. It is possible to request a pre-warming procedure from AWS Support, but that is rarely practical given how often we deploy.

Create separate states for Terraform

While using a single remote state file might work fine in the early stages, this approach won’t scale and will likely become a bottleneck once multiple people want to deploy unrelated services independently. It’s easier to avoid this problem altogether by introducing separate states early.
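
One way to achieve that is to give each service (and environment) its own remote state, for example by using a dedicated key per service in an S3 backend. A minimal sketch (the bucket, key, and region below are placeholders, not our actual configuration):

terraform {
  backend "s3" {
    # placeholder names: one state file per service and environment
    bucket = "example-terraform-states"
    key    = "live/collectors/terraform.tfstate"
    region = "eu-west-1"
  }
}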

Clever auto-scaling

A set of clever auto-scaling policies can make Blue-Green deployments even sleeker by eliminating the need to manually specify the required number of instances. We’re currently working on this and will eventually write about it on our blog.
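
As a rough sketch of the direction we’re exploring, a target-tracking scaling policy attached to the collector ASG could adjust the group size automatically based on load (the metric and target value below are assumptions, not a final policy):

resource "aws_autoscaling_policy" "collect-cpu-target" {
  # illustrative target-tracking policy on average CPU utilisation
  name                   = "${var.env}-collect-${var.flavour}-cpu-target"
  autoscaling_group_name = "${aws_autoscaling_group.collect-asg.name}"
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }

    target_value = 60
  }
}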

We’re hiring!

If you’re a savvy developer looking to work in the cutting-edge tech industry, then we’re always on the lookout for bright, enthusiastic, and ambitious minds to join our growing engineering team. Check out the GameAnalytics careers page to see the benefits we offer and the roles available. Even if you don’t see an open position, drop us an email with your details – we’re always keen to chat!