Multi-region Active-Active k8s

An architecture for running multi-region active-active kubernetes clusters

Andy Czerwinski

Head of Digital Platform Engineering at British Gas

12 April 2021 - 5 min read

So you’re busy playing pong with your new monkey LinkedIn contact when a call comes in on the bat-phone (read: a Teams call).

“Andy, running our k8s solution across three availability zones in AWS isn’t resilient enough”, said the voice over the call, “we need to build a DR (Disaster Recovery) solution in a different region.”

Now, there had been a number of these discussions before we actually migrated to the new and shiny EKS cluster in AWS a couple of years ago, moving off our antiquated hosting solution powered by Nomad (kudos to HashiCorp, whose open-source cluster solution saved our bacon when we started rolling out microservices) and onto Kubernetes.

At the time, we were given a dispensation to migrate to AWS EKS in a single region, on the proviso that we moved to a multi-region solution as soon as DevOps-ily possible. Keeping quiet and our heads down meant we dodged that conversation for several years, before someone blew the dust off that old proviso and brought it back to people’s attention. The underlying reason is a Centrica policy that Disaster Recovery data centres must be x miles apart, and unfortunately the three AWS Availability Zones in London are too close together to be classified as DR. Highly available, yes; Disaster Recovery, no.

What does your infrastructure look like?

As with most projects that have been running for several years, we’d built up a few legacy systems (a Weblogic 10g cluster running server-side web applications, for one) that we had somehow managed to migrate across to k8s, which was a major feat in itself. Another bit of legacy was a massive Oracle database that underpinned our authentication and authorisation, as well as most of the journeys on the site. With the move to k8s we gained the ability to scale out our Apache web servers, which we couldn’t do on the old Nomad-based infrastructure, and to run Oracle on an RDS instance.

As you can tell from the picture, we had a lot of moving parts, some of which were so old that no one supported them anymore, or were coming close to their end of life (Oracle, Java, etc.).

Alongside the infrastructure and the move to EKS, we have also invested a lot of time and effort in automating our infrastructure deployments via Terraform, and our k8s pod deployments via Helm.

We’ve got a pretty robust way of building out and deploying infrastructure and all the supporting pods for the whole kit and caboodle. But you know what, it’s always nice to push the bounds of technology beyond the ordinary to the extraordinary.

So what did we propose?

Our main concept was to move beyond building out a DR environment that would sit in a dormant state and only kick in when the primary environment goes down in a big way. With most DR environments you’d be paying $$$ for lots of infrastructure that’s sitting there idle. Even scaling the pods down to their minimum replica counts (our Helm charts define minimums by default), the cost would still be pretty high.

So the first “architectural decision” was to move from an active-passive setup to an active-active solution, so that we get some benefit from the additional cluster. To make it work, though, the two regions would act as independent EKS clusters. The only things shared between the clusters would be the data that genuinely needs to be shared, as the microservices hosted on the clusters are stateless. Another decision was that microservices on one cluster can’t talk across region to the other cluster, which gets rid of the service discovery conundrum, i.e.

  • C1.MS1 can call C1.MS2
  • C1.MS1 can not call C2.MS2
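From a microservice’s point of view that rule is pretty much free: services only ever address their peers by cluster-local DNS names, so a cross-region call isn’t even expressible. A minimal sketch of the idea (the service name, namespace and port below are hypothetical, not our real ones):

```python
import os

import requests  # assumes the requests library is available in the service image

# Hypothetical example: MS1 calling MS2 inside the *same* cluster (C1.MS1 -> C1.MS2).
# Each cluster resolves this name via its own in-cluster DNS, so the call can never
# leave the region; there is no cross-cluster address for MS1 to discover.
MS2_URL = os.environ.get(
    "MS2_URL",
    "http://ms2.digital.svc.cluster.local:8080",  # hypothetical namespace and port
)


def fetch_account_summary(account_id: str) -> dict:
    """Call the local cluster's MS2; the other region's MS2 is simply not addressable."""
    response = requests.get(f"{MS2_URL}/accounts/{account_id}", timeout=2)
    response.raise_for_status()
    return response.json()
```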

For anything that needs to be synced across regions, we’d utilise AWS managed datastores that are designed to work across regions.

The solution was elegant, and extraordinary, utilising the power of AWS. The elephant in the room was the old legacy application that sat on Weblogic and relied heavily on the Oracle database, which in its current configuration wouldn’t allow us to run across regions. So for this to be considered, we needed to decommission the Weblogic services, so that we could move the authentication and authorisation aspects out of Oracle and into, for example, DynamoDB.
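For the data that does need to cross regions, something like a DynamoDB global table fits the model: each cluster reads and writes to the replica in its own region and AWS handles the replication. Here’s a rough sketch of the idea with boto3; the table name, key schema and regions are illustrative, not our real setup:

```python
import os
from typing import Optional

import boto3  # AWS SDK for Python

# Hypothetical global table, replicated by AWS across both regions.
# Each cluster talks only to the replica in its own region; DynamoDB
# handles the cross-region sync that the microservices no longer have to.
TABLE_NAME = os.environ.get("AUTH_TABLE", "digital-auth-sessions")
LOCAL_REGION = os.environ.get("AWS_REGION", "eu-west-2")  # e.g. London for C1, Ireland for C2

dynamodb = boto3.resource("dynamodb", region_name=LOCAL_REGION)
table = dynamodb.Table(TABLE_NAME)


def save_session(customer_id: str, session_token: str) -> None:
    # Written locally; the global table replicates the item to the other region.
    table.put_item(Item={"customer_id": customer_id, "session_token": session_token})


def load_session(customer_id: str) -> Optional[dict]:
    response = table.get_item(Key={"customer_id": customer_id})
    return response.get("Item")
```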

Other code, architectural and process changes

To fully realise the ambition, once we’d removed the legacy application there were a number of other code, architectural and process changes that would need to happen before we’d be able to go fully active-active:

  • Microservices must remove their dependency on Redis for maintaining state. Redis should only be used for caching, and if Redis is not available the microservice should get the data from the backend system (see the sketch after this list)
  • Move existing data into DynamoDB, or another AWS database that can replicate its data across regions, i.e. move away from Oracle and into DynamoDB to support multi-region
  • Move application state-specific data out of Redis and into DynamoDB, so applications can function without Redis being present, and ensure long-lived (non-cache) data is persisted correctly
  • Potentially, move all caching into ElastiCache Redis and out of the k8s-hosted instances of Redis, so that data in these stores is synced across regions
  • Ensure no FlushAll operations are required on Redis (better maintenance/management)
  • Implement Shipper to coordinate multi-cluster Helm deployments; if you’re deploying active-active you need to make sure that the code versions deployed in the pods on both clusters are the same
  • Improve management and operational usage of ElastiCache Redis
  • Move away from DBA-team-managed Oracle RDS to DevOps-team-managed DynamoDB, or similar multi-region databases
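To make the first item on that list concrete, here’s a minimal cache-aside sketch, assuming the redis-py client and a placeholder fetch_from_backend call standing in for the real system of record. The point is that Redis is purely an optimisation: an unavailable cache becomes a cache miss, never a failed request.

```python
import json
import logging

import redis  # redis-py client

logger = logging.getLogger(__name__)

# Hypothetical connection details; this could be the in-cluster Redis today
# or ElastiCache Redis later, and the calling code shouldn't care which.
cache = redis.Redis(host="redis.digital.svc.cluster.local", port=6379, socket_timeout=0.5)

CACHE_TTL_SECONDS = 300


def fetch_from_backend(customer_id: str) -> dict:
    """Placeholder for the real backend call (the system of record)."""
    raise NotImplementedError


def get_customer(customer_id: str) -> dict:
    key = f"customer:{customer_id}"

    # 1. Try the cache, but treat any Redis failure as a simple miss.
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.exceptions.RedisError:
        logger.warning("Redis unavailable, falling back to backend for %s", key)

    # 2. Cache miss (or Redis down): go to the backend system.
    data = fetch_from_backend(customer_id)

    # 3. Best-effort write-back; again, a Redis failure must not fail the request.
    try:
        cache.setex(key, CACHE_TTL_SECONDS, json.dumps(data))
    except redis.exceptions.RedisError:
        pass

    return data
```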

Conclusions

Since those heady days of architectural design, heavily leaning on the shoulders of Paul Hopkins, the architect and mastermind for the Digital DevOps team, things have moved on and the first couple of steps along the road to multi-region active-active k8s clusters have been taken. The move to using DynamoDB has started, with control of the service in the hands of the local DevOps team and deployment handled by Terraform into the AWS account. Code changes have started to be implemented to move any stateful requirements out of Redis and into DynamoDB. But the biggest success has been the final decommissioning of the old Weblogic stack, removing one of the biggest obstacles to moving forward to the bright blue-green pastures of multi-region resilience.

Just as an addendum, we had looked at moving to Cognito as part of this design, but it was rejected due to its biggest issue (at the time of writing): the lack of multi-region functionality. The way it currently works, the one-way encryption of user passwords uses different keys in different regions, which means failing over is not possible; if you ever did fail over, customers would need to create new passwords. That would be a suboptimal outcome. Also, running active-active you’d want a Cognito in each region anyway. There are solutions to this issue (mainly around using lambdas to send data to both instances at the same time to keep them in sync), so we may be revisiting that product in the future.
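For completeness, the lambda-based workaround mentioned above would look roughly like the sketch below: a post-confirmation trigger on the primary user pool mirrors each new user into a pool in the other region. This is an assumption-heavy illustration rather than anything we’ve built; the pool ID, regions and attribute handling are hypothetical, and as noted above the password hashes can’t be copied, so a real failover would still involve a password reset.

```python
import os

import boto3

# Hypothetical IDs/regions for the secondary (mirror) user pool.
SECONDARY_REGION = os.environ.get("SECONDARY_REGION", "eu-west-1")
SECONDARY_POOL_ID = os.environ["SECONDARY_USER_POOL_ID"]

cognito = boto3.client("cognito-idp", region_name=SECONDARY_REGION)


def handler(event, context):
    """Cognito post-confirmation trigger: mirror the new user into the other region.

    Note: the password hash cannot be copied between pools, so a real failover
    would still need a password reset (or a sign-in flow that sets a new one).
    """
    attributes = [
        {"Name": name, "Value": value}
        for name, value in event["request"]["userAttributes"].items()
        # 'sub' and the cognito:* attributes are pool-specific and cannot be set manually.
        if name != "sub" and not name.startswith("cognito:")
    ]

    cognito.admin_create_user(
        UserPoolId=SECONDARY_POOL_ID,
        Username=event["userName"],
        UserAttributes=attributes,
        MessageAction="SUPPRESS",  # don't send an invite email from the mirror pool
    )

    # Cognito triggers must return the event unchanged.
    return event
```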