The data collection API is one of the most critical and highly loaded services in GameAnalytics’ backend infrastructure. It currently receives and stores raw game events from 850+ million unique monthly players across nearly 70,000 games. An outage of the service at that scale would lead to irreversible data loss and thousands of sad customers.

In this blog post, we’re going to talk about how we improved our infrastructure deployment practices by utilising the “Blue-Green deployment approach” powered by Terraform – a recipe that helps us achieve 100% uptime for our Data Collectors, while continuously delivering new releases.

Data Collectors

Collectors are responsible for receiving high volumes of raw game events from players around the globe and storing them for subsequent processing by our analytics systems. It’s a REST service that, on busy days, handles up to 4.5 million HTTP requests per minute. Collectors were written in Erlang back in the early days of GameAnalytics, with a set of strict requirements in mind: they needed to be fast, scalable and predictable.

Due to the stateless nature of the service and the concurrency-oriented, fault-tolerant characteristics offered by the Erlang VM, development of this type of service was a simple task (relatively speaking). Even so, it took us quite a long time to come up with a convenient and cost-efficient deployment approach.

At first our deployment procedures left a lot to be desired. They weren’t fully automated and required our engineers to perform many manual tasks, including:

  1. Provisioning new sets of instances;
  2. Deploying new releases to the new instances;
  3. Adding the new instances to a load-balancer;
  4. Gradually removing old instances from the load-balancer;
  5. And, ultimately, terminating the old instances.

While some of the steps were later simplified with a set of Fabric scripts, the process was still tedious and time-consuming. This kind of deployment was also not cost-efficient, as it required running two full-size clusters simultaneously for some time – there wasn’t any easy way to gradually switch traffic between instances running different releases. Another fundamental problem was the lack of a rapid way to roll back: a rollback essentially meant performing the same steps in reverse order. Again, this was slow and error-prone.

Over time the situation got worse: as load grew, so did the number of instances and, consequently, the duration and complexity of the deployments.

About “Blue-Green deployments”

In our search for an optimal deployment process, we decided to go ahead with Blue-Green deployments. It’s a well-known solution that – in our opinion – is easy to understand, reliable, and provides flexibility.

A classical Blue-Green infrastructure consists of two environments and a router that allows flipping traffic between them. In a typical Blue-Green deployment, an engineer deploys a new release to the idle environment and, once the software is ready, flips the switch so that all requests start going to the new environment. If trouble occurs, the traffic can be flipped back to the original cluster for a fast and reliable rollback.

Our Blue-Green infrastructure consists of two load-balancers pointing to individual auto-scaling groups, and a set of weighted DNS record sets allowing us to choose how much traffic each of the load-balancers should receive.

In Amazon Route 53, each weighted record receives a share of traffic proportional to its weight relative to the sum of all weights in the set. This allows routing portions of traffic as small as 1/255 (or 0.4%), which reduces deployment risk even further, as traffic can be shifted precisely and in small increments.

Infrastructure as code

The infrastructure change wouldn’t have been complete without improving the tools we use. Although the Fabric + Boto kit is nice and easy to get started with in the early stages of a project, it doesn’t scale well and can become a bottleneck as the team and infrastructure grow.

When deciding on a new infrastructure management tool, we picked Terraform – an increasingly popular open source tool that helps us make infrastructure changes safely and predictably through declarative configuration files. One of the key things that appeals to us about Terraform is that it allows treating configuration files exactly like code, meaning you can keep the change history in Git, propose changes through Pull Requests, and collaborate with colleagues in a very familiar fashion. Enough talking, let’s see it in action!

Our Blue-Green infrastructure required the following resources:

  • A pair of Route 53 CNAME records with a weighted routing policy;
  • A pair of Load-Balancers (LB) with respective Target Groups (TG), and;
  • Two Auto-Scaling Groups (ASG).

In Terraform, resources are the components of your infrastructure. For instance, this is what a DNS record with two routing policies could look like:
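A minimal sketch of such a weighted record pair, written in Terraform 0.12+ syntax. The record name, the variable names and the initial weights are hypothetical placeholders rather than our production values; the 60-second TTL follows the value discussed under “Keep your TTL on point”.

```hcl
# Hypothetical inputs: the hosted zone and the DNS names of the two load-balancers.
variable "zone_id" { type = string }
variable "blue_lb_dns_name" { type = string }
variable "green_lb_dns_name" { type = string }

# Both records share the same name and type; set_identifier tells them apart.
resource "aws_route53_record" "blue" {
  zone_id        = var.zone_id
  name           = "api.example.com" # placeholder hostname
  type           = "CNAME"
  ttl            = 60
  set_identifier = "blue"
  records        = [var.blue_lb_dns_name]

  weighted_routing_policy {
    weight = 100 # Blue receives all traffic
  }
}

resource "aws_route53_record" "green" {
  zone_id        = var.zone_id
  name           = "api.example.com" # placeholder hostname
  type           = "CNAME"
  ttl            = 60
  set_identifier = "green"
  records        = [var.green_lb_dns_name]

  weighted_routing_policy {
    weight = 0 # Green is idle
  }
}
```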

A load-balancer with a Target Group:
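As a rough sketch, assuming an Application Load Balancer that forwards plain HTTP to the collectors. The resource names, port 8080 and the `/health` path are illustrative assumptions, not details of our setup.

```hcl
# Hypothetical inputs describing the network the load-balancer lives in.
variable "vpc_id" { type = string }
variable "subnet_ids" { type = list(string) }

resource "aws_lb" "blue" {
  name               = "collector-blue" # placeholder name
  load_balancer_type = "application"
  subnets            = var.subnet_ids
}

# The Target Group is what the auto-scaling group registers instances with.
resource "aws_lb_target_group" "blue" {
  name     = "collector-blue"
  port     = 8080 # assumed application port
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path = "/health" # assumed health-check endpoint
  }
}

# The listener forwards incoming requests to the Target Group.
resource "aws_lb_listener" "blue" {
  load_balancer_arn = aws_lb.blue.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.blue.arn
  }
}
```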

An auto-scaling group:
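A minimal sketch, assuming a launch template is used to describe the instances. The instance type, size limits and variable names are hypothetical; only the figure of 65 instances comes from the example discussed in this post.

```hcl
# Hypothetical inputs: the release AMI, the subnets and the Target Group to join.
variable "ami_id" { type = string }
variable "subnet_ids" { type = list(string) }
variable "target_group_arn" { type = string }

# Describes how each instance in the group is launched.
resource "aws_launch_template" "blue" {
  name_prefix   = "collector-blue-"
  image_id      = var.ami_id
  instance_type = "c5.large" # assumed instance type
}

resource "aws_autoscaling_group" "blue" {
  name                = "collector-blue"
  min_size            = 0
  max_size            = 100 # assumed upper bound
  desired_capacity    = 65
  vpc_zone_identifier = var.subnet_ids
  target_group_arns   = [var.target_group_arn]
  health_check_type   = "ELB" # replace instances the load-balancer marks unhealthy

  launch_template {
    id      = aws_launch_template.blue.id
    version = "$Latest"
  }
}
```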

Multiple related resources can be grouped together into modules, which improves reusability and, in our case, code organisation.

If we wrap all our resources into modules, our final configuration should look something like this:
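A sketch of what that top-level configuration might contain. The module path, the input names (`name`, `ami_id`, `instance_count`, `dns_weight`) and the AMI ID are hypothetical; the module is assumed to wrap one load-balancer, one auto-scaling group and one weighted DNS record.

```hcl
# Blue: the live environment.
module "blue" {
  source         = "./modules/collector_group" # hypothetical local module
  name           = "blue"
  ami_id         = "ami-00000000000000001" # placeholder for the Release v1 AMI
  instance_count = 65
  dns_weight     = 100
}

# Green: the idle environment, scaled to zero and receiving no traffic.
module "green" {
  source         = "./modules/collector_group"
  name           = "green"
  ami_id         = "ami-00000000000000001" # placeholder for the Release v1 AMI
  instance_count = 0
  dns_weight     = 0
}
```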

In the given configuration example we are sending 100% of traffic to the Blue ASG, which consists of 65 active instances launched from the Release v1 AMI (Amazon Machine Image). Now, let’s see how easily we can deploy Release v2.

Let’s assume we already have the Release v2 AMI prepared. All we need to do now is make sure our configuration file is up-to-date with the infrastructure by running terraform init && terraform plan, then update the ami-id attribute of the Green ASG with the new image ID and scale it out to a reasonable number of instances (let’s say 10, for the purpose of this deep dive).
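Under the same assumptions as before (a hypothetical `collector_group` module, with the attribute written `ami_id` in this sketch), the Green block after that edit might read:

```hcl
# Green now runs Release v2 on 10 instances but still receives no traffic.
module "green" {
  source         = "./modules/collector_group" # hypothetical local module
  name           = "green"
  ami_id         = "ami-00000000000000002" # placeholder for the Release v2 AMI
  instance_count = 10
  dns_weight     = 0
}
```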

Build and change infrastructure by running terraform apply. We should have Green instances up and running Release v2 in a couple of minutes.

Let’s send 5% of the traffic to the Green group.
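Because each record's share is its weight divided by the sum of the weights, weights of 95 and 5 give Green 5 / (95 + 5) = 5% of requests. A sketch, again using the hypothetical module inputs and placeholder AMI IDs:

```hcl
module "blue" {
  source         = "./modules/collector_group" # hypothetical local module
  name           = "blue"
  ami_id         = "ami-00000000000000001" # placeholder for the Release v1 AMI
  instance_count = 65
  dns_weight     = 95 # 95 / (95 + 5) = 95% of traffic
}

module "green" {
  source         = "./modules/collector_group"
  name           = "green"
  ami_id         = "ami-00000000000000002" # placeholder for the Release v2 AMI
  instance_count = 10
  dns_weight     = 5 # 5 / (95 + 5) = 5% of traffic
}
```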

Apply the change and watch your new software in action. It’s that simple!

Pitfalls and future work

Below are some of the observations and difficulties we faced while getting there. Hopefully this saves you a bit of time if you want to follow similar practices in your production environment.

Keep your TTL on point

The TTL (Time To Live) value of your DNS records affects how quickly and smoothly traffic responds to a weight change, because resolvers keep serving the cached answer until it expires. We found a TTL value of 60 seconds to provide the best balance for our use case.

Warm-up your load-balancers

One of the problems we faced was that Classic and Application load-balancers in AWS require pre-warming. Routing even 15% of our traffic to a cold load-balancer proved to be too much, so we have to increase it in tiny increments. This is one of the inconveniences we are still trying to find a solution for. It is possible to request pre-warming from AWS Support, but since we deploy often this is not practical.

Create separate states for Terraform

While using a single remote state file might work fine in the early stages, this approach won’t scale and will likely become a bottleneck when multiple people want to deploy unrelated services independently. It’s easier to avoid this problem altogether by introducing separate states early.
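One way to do this, sketched here under the assumption that state is stored in an S3 backend, is to give each service's configuration its own state key. The bucket name, key and region are hypothetical placeholders.

```hcl
# Lives in the collectors' own configuration directory; another service
# would use the same bucket with a different key, e.g. "other-service/terraform.tfstate".
terraform {
  backend "s3" {
    bucket = "example-terraform-state"      # hypothetical bucket
    key    = "collectors/terraform.tfstate" # one state file per service
    region = "us-east-1"                    # assumed region
  }
}
```

With separate keys, a plan or apply for one service only reads and locks that service's state, so unrelated deployments do not block each other.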

Clever auto-scaling

A set of clever auto-scaling policies can make Blue-Green deployments even sleeker by eliminating the need for manual indication of a required number of instances. This is something we’re currently working on, and will eventually write about in our blog.
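As an illustration only, and not a description of the policies we are building, a target-tracking policy is one way to let an ASG choose its own size. The metric, the 60% target and the variable name are assumptions made for this sketch.

```hcl
# Hypothetical input: the name of the auto-scaling group to attach the policy to.
variable "asg_name" { type = string }

# Adds or removes instances to keep average CPU utilisation near the target.
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "collector-cpu-target" # placeholder name
  autoscaling_group_name = var.asg_name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }

    target_value = 60 # assumed target, in percent
  }
}
```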

We’re hiring!

If you’re a savvy developer looking to work at the cutting edge of the tech industry, we’re always on the lookout for bright, enthusiastic, and ambitious minds to join our growing engineering team. Check out the GameAnalytics careers page to see the benefits we offer and the roles available. Even if you don’t see an open position, drop us an email with your details – we’re always keen to chat!
