Change Management at Scale and Speed
How Centrica Manages Change Across Its Cloud Estate
Head of Cloud Engineering
15 April 2021 - 4 min read
As an SRE team responsible for managing hundreds of cloud accounts for Centrica, it goes without saying that we rely heavily on automation to operate efficiently at this scale. But automation alone does not provide all the answers to operating at scale. Let’s take a look at how we manage change within the Centrica Cloud Engineering SRE team.
Let’s start by addressing the elephant in the room when talking about Change Management. Change Management exists to reduce the risks of making changes to business-critical IT systems, but in so many cases it focuses only on the implementation of those changes in the production environments. For a more rounded conversation we need to think about the bigger Risk Management picture. Change Management processes exist to help manage complex changes to interdependent systems whose implementation can be both unreliable and unpredictable.
In the Cloud Engineering team our experience has taught us that no amount of process can fully protect changes of this nature. Instead, our approach is to make simple changes to loosely coupled (non-interdependent) systems in an automated manner, so that the implementation is both repeatable and predictable.
Let’s explore each of these three points in turn.
The more complex a change is, the greater the risk of it going wrong. More importantly, the more complex a change is, the harder it is to determine the cause when things do go wrong. In order to “manage” complex changes, companies have a tendency to add more process, especially approvals: the more people who have to look at a change, the greater the chance of one of them spotting a potential problem or conflict. The downside of adding more process is that it slows down the release cycle. To compensate, development teams and product managers try to fit more changes into a single release, increasing both the complexity and the statistical chance of a problem occurring - and so change management becomes the ultimate self-fulfilling prophecy.
We break our changes down into many small individual changes. Not only does this decrease the risk of any individual change, but it means that should a change prove problematic it is trivial to pinpoint the specific change that caused it. And because changes are deployed individually, any change can easily be rolled back without also removing other changes.
Making many small changes does of course mean that we may need to make many changes to deliver an individual feature to our customers; in fact we make, on average, between 1,200 and 1,500 successful changes to production systems each week. This means we need accurate and robust systems in place for recording and reporting on changes, so we can pinpoint changes to the exact second they were made. How we do that is the subject for another blog post.
Any system that depends on another system to provide its functionality is at the mercy of that secondary system, both in terms of reliability and its rate of change. Entire release cycles need to be carefully planned and coordinated, and fundamental changes to one system may be delayed, or vetoed entirely, by another dependent system.
Cloud Engineering adopts the “design for failure” principle that takes an approach that “assumes that there will be a hardware or system failure”, and so our systems must be designed to handle this scenario. Whilst this leads to some more challenges during the development stage, the payback is operational simplicity and stability.
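To illustrate the “design for failure” principle, here is a minimal sketch of the kind of defensive pattern it implies: retrying a transient failure with jittered backoff. The function and its parameters are our own illustration, not Centrica’s actual code, and it assumes the wrapped operation is idempotent so retrying is safe.

```python
import random
import time

def call_with_retries(operation, max_attempts=3, base_delay=0.05):
    """Retry an unreliable operation with jittered exponential backoff.

    Sketch only: assumes the operation is idempotent, so retrying
    after a transient failure cannot corrupt state.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            # Jitter spreads retries out and avoids synchronised retry storms.
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

Designing callers this way keeps individual failures contained, which is where the operational simplicity comes from.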
The Cloud Engineering team puts a huge amount of time and effort into ensuring we have the right system designs in place for our needs. It is a simple thing to state that systems should be “designed for failure”, and yet this understates the time, effort and, in many cases, prior failures required to actually deliver it.
Our automation does more than simply deliver change at scale in a repeatable way. It also executes a comprehensive suite of tests against these changes, both in isolation and integrated with the rest of our services.
It is not uncommon for one of our Engineers to spend more time writing the test cases than the actual code. However once written these tests can be run again and again on every subsequent code deployment to validate the systems are still functioning as designed.
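As a sketch of what this looks like in practice, here is a trivial (entirely hypothetical) helper alongside the kind of tests that get written for it. The function name and behaviour are our illustration, not real Centrica code; the point is that the tests, once written, re-run on every subsequent deployment for free.

```python
def normalise_tags(tags):
    """Lower-case keys and strip whitespace from account tags
    (a hypothetical helper, not Centrica's actual code)."""
    return {key.strip().lower(): value.strip() for key, value in tags.items()}

# Tests often take longer to write than the function itself, but they
# validate the behaviour on every subsequent code deployment.
def test_normalise_tags_cleans_keys_and_values():
    assert normalise_tags({" CostCentre ": " 1234 "}) == {"costcentre": "1234"}

def test_normalise_tags_handles_empty_input():
    assert normalise_tags({}) == {}
```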
Failure of any test (after retries, where appropriate) causes the pipeline execution to fail. If the pipeline fails at any stage, the only remediation available to our engineers is to modify the code at source and re-run the entire pipeline. It is not possible to force the pipeline to skip stages or ignore failed tests.
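The fail-fast behaviour described above can be sketched in a few lines. This is an illustrative model, not our actual pipeline engine; the key property is that there is deliberately no parameter for skipping a stage or ignoring a failure.

```python
def run_pipeline(stages):
    """Run named stages in order; the first failure aborts the run.

    Deliberately offers no way to skip a stage or ignore a failed
    test - the only remediation is to fix the code and re-run.
    (Illustrative sketch, not our real pipeline tooling.)
    """
    for name, stage in stages:
        if not stage():
            raise RuntimeError(f"stage '{name}' failed: fix at source and re-run")
    return "released"
```

Because the abort path is the only path, engineers never have to debate whether a given failure is “safe enough” to override.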
Pipeline failures are surprisingly common - we are only human after all. But because none of those failures ever makes its way into production, the changes that do reach production enjoy a very high success rate. Every one of these pipeline failures represents a potential incident, or an inadvertently introduced bug or undesirable behaviour, that has been avoided.
Because the entire process is automated and requires no human interaction, we can run it multiple times with minimal additional engineering effort. There is therefore little or no overhead to any given change, allowing us to make many small changes for the same effort as a single large change.
Our customers (internal DevOps teams) require clear visibility of all the changes we have made to their environments. For this reason we keep a full audit trail of every change made, both at the code level in our source code repository and in our internal ITSM systems, where we can correlate changes in key performance metrics with changes to environments.
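A minimal sketch of what such an audit record might look like - timestamped to the second so metric shifts can be lined up against a specific change. The field names here are our illustration, not Centrica’s actual ITSM schema.

```python
from datetime import datetime, timezone

def record_change(audit_log, commit_sha, environment, summary):
    """Append a change record timestamped to the second, so shifts in
    key performance metrics can be correlated with a specific change.
    Field names are illustrative, not a real ITSM schema."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "commit": commit_sha,
        "environment": environment,
        "summary": summary,
    }
    audit_log.append(entry)
    return entry
```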
As long as every change is delivered via our automation, our SRE engineers are empowered to release code to production environments whenever they feel it is appropriate to do so. We do not have an approvals process for releases to live. We believe there is little or no additional value a human review/approval can add, except to confirm what the automation has already proven.