EBS Volume Migration in Action

How we migrated our entire AWS estate from GP2 to GP3 (without breaking anything)

Christopher Livermore

Head of Cloud Engineering

16 June 2021 - 6 min read

A few weeks ago, one of my engineers wrote about a scalable solution for migrating an entire AWS organisation from gp2 to gp3 volumes.

We’ve now been running this solution in a number of our AWS Organizations for long enough to have migrated all of the gp2 storage to gp3, and we have learnt a number of things in the process.

The migration service has been deliberately designed to be cautious. It attempts to migrate individual disks one at a time, rather than instructing AWS to migrate everything en masse.

The initial deployment of the lambda saw it run every 15 minutes, giving it a maximum capacity of 4 migrations an hour (96 per day) for each Availability Zone (AZ) in every account. As we have multiple AWS accounts, each utilising 3-16 regions with multiple AZs, our total conversion rate was significantly higher, ranging from 250-500 volumes per day.
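To make the shape of that loop concrete, here is a minimal sketch of the kind of logic involved, written with boto3. The function and field names are illustrative rather than taken from our production code.

```python
import boto3

ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    """Convert at most one gp2 volume per invocation (cautious by design)."""
    # List gp2 volumes in this region (pagination omitted for brevity).
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "volume-type", "Values": ["gp2"]}]
    )["Volumes"]
    if not volumes:
        return {"migrated": None}

    # Pick a single candidate and request the type change. EBS performs the
    # modification online, so attached instances keep running throughout.
    candidate = volumes[0]
    ec2.modify_volume(VolumeId=candidate["VolumeId"], VolumeType="gp3")
    return {"migrated": candidate["VolumeId"]}
```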

The following graphs show the reduction in number of gp2 volumes for 2 of our AWS Organizations at a high level, and then broken down by AWS account.

Default disk order

The initial version of the code made no effort to sort the list of gp2 devices and always picked the first from the list. This led to us converting the newest disks first. In the majority of cases the order wasn’t important, but in a small number of cases where we build large, short-lived clusters we observed that the solution would convert these in preference to long-lived disks, often failing to complete the conversions before the cluster was destroyed. In the following graph the blue line represents gp2 volumes in a single AWS account. You can clearly see a cluster of volumes being provisioned and destroyed on a daily cycle. You can also observe the change in behaviour from June 10th onwards, when the conversion lambda was deployed in this account. You will also notice that every day the disks were destroyed before the lambda had time to convert more than about 25% of them.

On June 13th we changed the frequency of the lambda to run every 5 minutes instead of 15. If you look at the above graph you will observe no noticeable difference in behaviour, whereas the following graph for a different account under the same payer shows a noticeable change in behaviour around this time.

We can conclude from this that the disks in the first account took around 15 minutes to complete their conversion, so increasing the frequency had no impact because the lambda will not start a new migration if it detects one is already in progress.
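That in-progress check can be made against the EBS volume modifications API. A minimal sketch, assuming the lambda simply skips a run when any modification in the region is still active (whether to also wait out the "optimizing" phase is a design choice, not something stated above):

```python
def migration_in_progress(ec2):
    """Return True if any EBS volume modification is still active in this region."""
    mods = ec2.describe_volumes_modifications(
        Filters=[{
            "Name": "modification-state",
            # Assumption: both the modifying and optimizing phases count as busy.
            "Values": ["modifying", "optimizing"],
        }]
    )["VolumesModifications"]
    return len(mods) > 0
```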

Largest disk first?

At this point we started discussing whether we should change the sort order of the results to prioritise either the oldest or the largest disk.

Our observation is that in our environment disks are either short-lived (less than 12 hours, as part of an automated batch job) or long-lived (in excess of 24 hours, as part of a more stable service tier). One theory we have not yet tested is whether targeting disks older than 24 hours in preference to newer disks would have allowed the bulk of the estate to be converted more quickly.
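If we did test that theory, the change would amount to sorting the volume list before picking a candidate; something along these lines (again an illustrative sketch, not our production code):

```python
from operator import itemgetter

def pick_candidate(volumes):
    """Prefer the oldest gp2 volume, so stable long-lived disks are converted
    before the short-lived cluster disks described above."""
    if not volumes:
        return None
    # CreateTime is returned by describe_volumes as a timezone-aware datetime.
    return sorted(volumes, key=itemgetter("CreateTime"))[0]
```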

Logically, prioritising larger disks appears to make sense, as the benefit of the conversion is lower cost and cost is based on the size of the disk. We have not tested this theory either; however, in discussions with AWS they highlighted that gp2 and gp3 disks manage IOPS in slightly different ways.

The amount of provisioned IOPS for gp2 is related to the disk size: as the size increases, so do the IOPS, and the customer has no control over that. gp3 allows you to manage IOPS independently of disk size, with a default of 3,000 IOPS.

Any gp2 disk over 1 TB in size will be provisioned with more than 3,000 IOPS due to the relationship between IOPS and disk size. Converting it to gp3 could lead to the IOPS defaulting to 3,000, which in some circumstances could lead to throughput bottlenecks.
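To put numbers on that relationship: gp2 provisions a baseline of 3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000, so anything above roughly 1,000 GiB already exceeds the gp3 default. A small sketch of the check:

```python
def gp2_baseline_iops(size_gib):
    """gp2 baseline performance: 3 IOPS per GiB, floor of 100, ceiling of 16,000."""
    return min(max(3 * size_gib, 100), 16000)

def naive_conversion_loses_iops(size_gib, gp3_default_iops=3000):
    """True if converting to gp3 at the default IOPS would reduce provisioned IOPS."""
    return gp2_baseline_iops(size_gib) > gp3_default_iops

# Example: a 2 TiB (2,048 GiB) gp2 volume has a 6,144 IOPS baseline, so a
# default gp3 conversion would roughly halve its provisioned IOPS.
```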

It is of course possible to increase the IOPS of a gp3 volume during the conversion, but this adds extra cost that may not be necessary. We could also look not only at the provisioned IOPS of a gp2 volume but at the CloudWatch metrics showing actual consumed IOPS, and build some logic into our conversion code to choose a sensible value. For now, we have chosen to exclude disks larger than 1 TB from the conversion process as we have so few of them. If this is a use case that applies to you, some consideration as to the best way to approach this problem will be required.
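For anyone who does need to tackle this, one hedged sketch of the idea mentioned above is to read recent CloudWatch data for the volume and size the gp3 IOPS from observed peaks. The snippet below is illustrative only; it looks at reads, and a real version would also include VolumeWriteOps and more history.

```python
import boto3
from datetime import datetime, timedelta, timezone

def peak_read_iops(volume_id, days=3):
    """Estimate a volume's recent peak read IOPS from CloudWatch.

    VolumeReadOps is a per-period count, so dividing the busiest 5-minute
    window by 300 seconds gives an average IOPS figure for that window.
    """
    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    datapoints = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName="VolumeReadOps",
        Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=300,
        Statistics=["Sum"],
    )["Datapoints"]
    return max((dp["Sum"] / 300 for dp in datapoints), default=0)
```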

Another reason we have chosen to ignore these volumes is the EBS modification cooldown period, which prevents further changes being made to a volume for 6 hours after a modification.

Multi AZ

We’ve had a number of discussions, internally and with AWS, about our current approach of converting volumes in multiple AZs in parallel. One argument against doing this is that many services utilise multiple AZs for failover and resilience, and in this scenario a better approach might be to bulk upgrade one AZ before moving to the next. However, we also have a number of single-AZ services, for example batch jobs that are built using automation, run for a short period of time and then terminate. These do not require multi-AZ strategies because they can simply be re-run in the event of a failure.

We have seen no performance or availability issues caused to our services by the gp3 migration activities we have been performing. The majority of our internal DevOps Engineers are probably not even aware that it is an ongoing process. Whilst we can theorise about the optimal way to perform these conversions, I am confident our current approach, whilst not perfect, is delivering outcomes that work for our use cases.

Cost benefits

We’ve been able to track the number of gp2 and gp3 volumes throughout our entire AWS estate, but what has been the financial impact of this conversion? We know that gp3 is up to 25% cheaper than gp2, so we ought to be able to see a genuine reduction in cost.
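As a rough back-of-the-envelope illustration (using approximate us-east-1 list prices of $0.10 per GB-month for gp2 and $0.08 per GB-month for gp3; your region, and any additional gp3 IOPS or throughput you provision, will change the numbers):

```python
GP2_PRICE_PER_GB_MONTH = 0.10  # illustrative us-east-1 list price
GP3_PRICE_PER_GB_MONTH = 0.08  # illustrative us-east-1 list price

def monthly_storage_saving(total_gb):
    """Headline storage saving from converting everything to gp3 at default
    performance, ignoring any extra IOPS/throughput provisioned on gp3."""
    return total_gb * (GP2_PRICE_PER_GB_MONTH - GP3_PRICE_PER_GB_MONTH)

# Example: 100,000 GB of gp2 storage would save roughly $2,000 a month,
# i.e. around a 20% reduction in the storage line item.
```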

The following figures are from 2 of our AWS payer accounts, but have been altered by a consistent scale factor and so do not represent actual spend.

As you can see from the graphs, the rate at which we have been able to enable the conversions varies, but the overall impact remains the same. The overall cost of gp2 and gp3 storage has decreased by approximately 20% to date for the payer account that has completed the migration, and is moving towards this figure for the other payer account, which completed recently; we will need to wait for month end for comparable figures from the billing system.

As you will observe from the earlier graphs, we continue to convert a number of disks on a daily basis as new gp2 disks are created by automation. We are working with our DevOps teams to migrate their automated scripts and templates to use gp3, however many of the images (AMIs) we use are still provided with gp2 volumes. We envisage the need to maintain the conversion service for a significant period of time to ensure we identify and convert any newly created gp2 disks, potentially for as long as AWS continues to offer gp2 as a volume type.

We are considering allowing DevOps teams to add a tag to their volumes to opt out of the conversion process, for example if they are running commercial software that may not have been certified on gp3. That brings with it the concern that opting out becomes the easy, safe choice, which means we would not realise the cost savings, and we would need to build governance around auditing which services have opted out. Thus far we have decided not to do this. Opting out is not an option.

I’d like to extend my gratitude to our AWS Enterprise Support TAM (Technical Account Manager), James Waggott, who has been a key advocate of the benefits of upgrading from gp2 to gp3 and has been actively involved in all of the discussions around the design and implementation of this process across the whole of Centrica.